Speculating Experts Accelerates Inference for Mixture-of-Experts
Running a mixture-of-experts model in production has a quiet bottleneck that benchmark papers rarely discuss: the CPU-GPU transfer. With Qwen3-30B-A3B deployed on an A6000 GPU, 84 to 88 percent of inference time goes not to computation but to shuttling expert weights between CPU memory and GPU memory. The model is fast. The bus is not.
A new paper from Ashwinee Panda and colleagues at the University of Maryland proposes a fix: speculative expert execution. Instead of waiting for the correct experts to arrive, run the experts that are already on the GPU now and correct the result later if the guess was wrong. The paper, posted to arXiv on March 20, reports a 5-14% reduction in time per output token (TPOT) across the tested architectures, and the code is open source.
The architecture of the bottleneck
Mixture-of-experts models like Qwen3, Mixtral, and DeepSeek route each token to a small subset of specialized sub-networks (experts) rather than activating the full model. This makes them parameter-efficient at inference — only a fraction of weights are active per forward pass — but creates a scheduling problem. Which experts will be needed for the next token? In a GPU-memory-constrained deployment, those experts may need to be fetched from CPU RAM, and that fetch takes time.
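To make the routing step concrete, here is a minimal sketch of top-k expert selection in plain Python. The function name, the eight-expert layout, and the choice to apply softmax over only the selected logits are illustrative assumptions; real models differ in how they normalize router scores.

```python
import math

def top_k_route(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    # Pick the k experts with the largest router logits.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    # (Assumption: some architectures normalize before selecting instead.)
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical router scores for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.9]
routing = top_k_route(logits, k=2)
active = [i for i, _ in routing]
print(active)  # [1, 4]
```

Only the two selected experts need their weights on the GPU for this token. The other six can stay in CPU RAM, which is exactly why the next token's routing decision turns into a scheduling question.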
The standard approach is to predict which experts are likely needed next and prefetch them before they're required. Several competing systems from 2025 — including a pre-attention linear predictor achieving 93-97% expert prediction accuracy — work this way. Higher prediction accuracy means fewer wasted fetches.
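A toy accounting of what prediction accuracy buys a prefetcher might look like the following. The function and the per-step expert lists are made up for illustration and do not come from any of the systems mentioned. A missed expert has to be fetched on demand while the GPU waits, and a wrongly predicted expert is a transfer that was paid for and never used.

```python
def prefetch_stats(needed_per_step, predicted_per_step):
    """Count on-demand fetches (stalls) and unused prefetches (wasted)."""
    stalls = wasted = 0
    for needed, predicted in zip(needed_per_step, predicted_per_step):
        needed, predicted = set(needed), set(predicted)
        stalls += len(needed - predicted)   # not prefetched: GPU must wait
        wasted += len(predicted - needed)   # prefetched but never used
    return stalls, wasted

# Hypothetical trace: 4 decoding steps, 2 experts per step, one wrong guess.
needed    = [[1, 4], [2, 4], [0, 7], [1, 3]]
predicted = [[1, 4], [2, 5], [0, 7], [1, 3]]
stalls, wasted = prefetch_stats(needed, predicted)
print(stalls, wasted)  # 1 1
```

In this trace the predictor gets 7 of 8 experts right, and the single miss costs both a stall and a wasted transfer. This simplified model ignores caching across steps, which a real system would exploit.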
Panda's paper takes a different stance. Rather than optimizing prediction accuracy, it argues for executing speculatively with whatever experts are already resident in GPU memory. If the speculation is wrong, the cost is a cheap correction step — not a full re-fetch. The paper uses a "default vector" concept recycled from Panda's own NeurIPS 2025 training paper, applying it here to represent a default expert for speculative execution.
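The article does not spell out the mechanism, so the following is only one possible reading, not the paper's algorithm: compute with the routed experts that are already resident, and substitute a fixed default vector for any routed expert that is not. Every name here (speculative_moe, default_vector, the toy scaling experts) is hypothetical, and the correction step is deliberately left out because its details are not described here.

```python
def speculative_moe(x, routing, resident, experts, default_vector):
    """Weighted expert sum that never waits on a CPU-to-GPU transfer.

    routing:  [(expert_index, weight)] chosen by the router
    resident: set of expert indices whose weights are already on the GPU
    Returns the speculative output and the experts that were substituted.
    """
    out = [0.0] * len(x)
    missing = []
    for idx, weight in routing:
        if idx in resident:
            y = experts[idx](x)      # real expert, weights already on GPU
        else:
            y = default_vector       # assumed stand-in, no transfer needed
            missing.append(idx)
        out = [o + weight * v for o, v in zip(out, y)]
    return out, missing

# Toy experts: expert i scales its input by (i + 1).
experts = {i: (lambda x, s=i + 1: [s * v for v in x]) for i in range(4)}
x = [1.0, 2.0]
routing = [(1, 0.75), (3, 0.25)]
out, missing = speculative_moe(x, routing, resident={0, 1},
                               experts=experts, default_vector=[0.0, 0.0])
print(out, missing)  # [1.5, 3.0] [3]
```

The list of missing experts is what a correction step would have to account for. How cheaply that can be done, and how far the speculative output drifts from the exact one, is the question the paper's accuracy results address.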
Where it works and where it doesn't
The results are architecture-dependent, and the paper is honest about this. GPT-OSS models — OpenAI open-weight models released in August 2025 — show clean performance across math, coding, and commonsense reasoning tasks. Speculative execution accuracy holds and the efficiency gains materialize without accuracy regression.
Qwen3-30B-A3B is messier. The model's early layers exhibit high representational drift: the internal representations used to predict future experts are unstable in layers 1-2. This hurts speculative execution on math tasks specifically, and AIME24 and GSM8K scores fall when speculation is applied naively. The authors offer a mitigation, skipping speculative execution for early-layer tokens, but it adds complexity to deployment.
The headline 14% TPOT reduction is the best case, observed on architectures that handle speculative execution cleanly. Across models and hardware configurations, the range is 5-14%.
The crowded field
There are at least four competing papers from 2025 attacking the same CPU-GPU transfer problem for MoE inference. The pre-attention linear predictor approach achieves higher raw prediction accuracy (93-97%). The bet Panda's paper implicitly makes is that prediction accuracy matters less if you can execute and correct cheaply. Whether that bet holds at scale, across more diverse architectures, and against the best competing approaches is not yet settled.
The open-source code repository makes this tractable to evaluate, which is the right call. Results that can be reproduced are worth more than results that can't.
Who's behind this
Panda is a postdoc at UMD's AxoNN group, which is building YALIS as a systematic HPC inference research platform. The trajectory is interesting: early work in adversarial ML and safety, then a pivot into MoE optimization over the last year. The default vector concept linking this paper to the NeurIPS 2025 training work suggests a longer research arc, not a one-off result.
Bottom line
If 84-88% of your inference time is in memory transfers, a 5-14% TPOT reduction by attacking that bottleneck is legitimate progress — even if it's not the final word. The architecture sensitivity (clean on GPT-OSS, degraded on Qwen3 math tasks) is the honest limit of the current implementation, and the competing approaches with higher prediction accuracy will force a real comparison.
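A back-of-the-envelope check shows how the two headline numbers relate. It assumes compute time is unchanged and that the whole TPOT reduction comes out of transfer time. It also pairs figures that may come from different configurations (the 84-88% share was measured for Qwen3-30B-A3B on an A6000), so treat the result as a rough bound rather than a number from the paper.

```python
def implied_transfer_savings(tpot_reduction, transfer_fraction):
    """Share of transfer time that must disappear to explain a TPOT
    reduction, assuming compute time stays the same."""
    return tpot_reduction / transfer_fraction

low = implied_transfer_savings(0.05, 0.88)   # smallest gain, largest share
high = implied_transfer_savings(0.14, 0.84)  # largest gain, smallest share
print(f"{low:.1%} to {high:.1%}")  # 5.7% to 16.7%
```

Under those assumptions, the reported gains correspond to removing roughly 6 to 17 percent of the transfer time, which leaves most of the bottleneck still standing.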
The paper sits at the intersection of systems and ML research, which is where the practically useful inference work happens. Worth watching as the field consolidates around which approach to the prefetching problem actually wins in production.
The paper is available at https://arxiv.org/abs/2603.19289.