Every mixture-of-experts (MoE) model has a router. It's the small neural network that looks at an input — a token, an image patch, a video frame — and decides which "expert" sub-network should handle it. Every major MoE paper for the past six years has treated that router as a necessary cost. A team at Florida State University and SUNY Buffalo just called that assumption wrong.
LiME, a paper posted to arXiv on Feb. 1, 2026, introduces zero-parameter routing: a method that eliminates the learned router entirely. Instead of training a gating network to choose experts, LiME computes routing decisions from the structure of frozen and adapted representations already present in the model. The result is a system that the paper's authors say achieves comparable accuracy to full MoE baselines while using up to 4x fewer trainable parameters, and training up to 29 percent faster on the same hardware.
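The paper's exact routing rule is its own contribution, but the general shape of "zero-parameter" routing can be sketched: assign each token to experts by similarity against fixed vectors derived from the frozen model's own representations, with nothing trained. In the minimal sketch below, the function name, the use of cosine similarity, and the prototype construction are illustrative assumptions, not LiME's method.

```python
import numpy as np

def zero_param_route(hidden, prototypes, top_k=1):
    """Route tokens to experts with no learned gate.

    hidden:     (n_tokens, d) token representations from the frozen backbone.
    prototypes: (n_experts, d) fixed vectors derived from frozen features
                (e.g., per-expert feature means) -- nothing here is trained.
    Returns the top_k expert indices per token.
    """
    h = hidden / np.linalg.norm(hidden, axis=-1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=-1, keepdims=True)
    scores = h @ p.T                            # cosine similarity, (n_tokens, n_experts)
    return np.argsort(-scores, axis=-1)[:, :top_k]

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))
protos = rng.normal(size=(4, 16))
print(zero_param_route(tokens, protos).shape)   # one expert index per token
```

The point of the sketch is the parameter count: the only quantities involved are representations the model already computes, so removing the gate removes its weights from both training and the optimizer state.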
Learned routing has been one of the defining architectural decisions in modern MoE design. Mixtral, the Sparse Mixture of Experts model from Mistral AI, uses a router. GPT-4 is widely believed to use MoE routing internally. The assumption built into these systems is that choosing which expert handles which input requires a learned component that must itself be trained and that adds parameters and compute to every forward pass. LiME's argument is that this is an artifact of how routing has been formulated, not a mathematical necessity.
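For contrast, the learned router these systems rely on is conventionally a small trained gating matrix followed by a top-k softmax. The sketch below shows that standard formulation in its simplest form; it is not Mixtral's or any specific model's implementation, and `W_gate` is exactly the trained component LiME argues can be removed.

```python
import numpy as np

def learned_router(hidden, W_gate, top_k=2):
    """Conventional MoE gating with a trained weight matrix.

    hidden: (n_tokens, d) token representations.
    W_gate: (d, n_experts) -- learned parameters, updated every training step.
    Returns chosen expert indices and their softmax mixing weights.
    """
    logits = hidden @ W_gate                        # (n_tokens, n_experts)
    top = np.argsort(-logits, axis=-1)[:, :top_k]   # top-k expert ids per token
    picked = np.take_along_axis(logits, top, axis=-1)
    weights = np.exp(picked - picked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # normalize over chosen experts
    return top, weights
```

Every forward pass multiplies by `W_gate`, and every backward pass computes its gradients; that is the per-step cost that learned routing builds into the architecture.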
To test the claim at scale, the authors built MMT-47, a multimodal multi-task benchmark spanning 47 tasks across text, image, and video. The benchmark is itself a contribution: prior MoE evaluation has often relied on fragmented benchmarks covering a single modality or a narrow set of tasks. MMT-47 is designed to pressure-test whether efficiency gains hold across heterogeneous inputs, the kind of mixed-modality workload that production AI systems actually handle. On MMT-47, LiME's shared PEFT (parameter-efficient fine-tuning) module, modulated by lightweight expert vectors, performs comparably to MoE-PEFT baselines that have significantly more trainable parameters. Across configurations, LiME trains between 0.02M and 0.57M parameters, depending on the setup.
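The shared-adapter design can be sketched as a single low-rank update whose output is modulated per token by a lightweight expert vector. The elementwise gating used below is an assumed scheme for illustration, not the paper's formulation, but it shows where the parameter savings come from: one shared adapter plus a vector per expert, instead of a full adapter per expert.

```python
import numpy as np

def shared_adapter(hidden, A, B, expert_vecs, expert_ids):
    """One shared low-rank PEFT module, modulated by per-expert vectors.

    hidden:      (n_tokens, d) frozen-backbone representations.
    A: (d, r), B: (r, d) -- the single shared low-rank adapter.
    expert_vecs: (n_experts, d) lightweight trainable vectors, one per expert.
    expert_ids:  (n_tokens,) expert assignment for each token (from routing).
    """
    delta = hidden @ A @ B            # shared low-rank update for every token
    gates = expert_vecs[expert_ids]   # (n_tokens, d) per-token modulation (assumed scheme)
    return hidden + delta * gates

# Trainable parameters under this layout: 2*d*r + n_experts*d,
# versus roughly n_experts * 2*d*r for one adapter per expert.
```

With d = 768, r = 8, and 8 experts, that is about 18K parameters here versus about 98K for per-expert adapters, which is the kind of gap behind the paper's "up to 4x fewer trainable parameters" claim (the exact configurations are the paper's, not this sketch's).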
The paper is a preprint posted to arXiv, which means it has not yet undergone peer review. The authors provided theoretical proofs alongside their empirical results, arguing that zero-parameter routing is not just a heuristic but follows from how frozen and adapted representations already encode routing-relevant information.
The paper demonstrates training efficiency; the inference question is separate. Zero trainable parameters does not mean zero compute: the routing decision must still be evaluated on every forward pass, and a mechanism that is cheap to train may still add latency during serving, when models are queried millions of times per day and every millisecond of router overhead compounds.
The MMT-47 benchmark raises a methodological flag common in new AI research: the team that built the system also created the test. The authors argue MMT-47 is built to stress multimodal generalization, not to flatter their architecture, but independent evaluation would be needed to confirm the results hold on other benchmarks. Treat the 4x parameter reduction and 29 percent training speedup as results on their benchmark until that confirmation arrives.
If the core insight survives scrutiny, the implications are concrete. A smaller trainable-parameter footprint means lower memory requirements during fine-tuning, which directly affects who can afford to adapt frontier-class models to specialized tasks.
The broader MoE research community has treated the router as load-bearing infrastructure for several years. LiME's authors suggest it may be an artifact. Whether that holds under independent scrutiny is the next question, and it is worth asking before the result gets folded into the next wave of efficiency claims.
The paper is at https://arxiv.org/abs/2604.02338.