Training Bug in Popular AI Reasoning Models Limits Search Power
Test-time search is becoming the default way to scale LLM capability: let a model propose many answers, then select the best. But there is a catch. When you train a model to output the single best answer, it learns to concentrate probability mass on a narrow band of high-reward responses. The model gets good at one kind of right answer at the expense of every other valid approach. When that model is deployed inside a search loop, the candidates it produces are near-duplicates, so there is little diversity left to select from, and the search hits a ceiling no matter how many samples it runs.
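A toy calculation makes the ceiling concrete. The sketch below is not from the paper: the three strategies, their probabilities, and the assumption that only strategy C solves the problem are all hypothetical, chosen only to show why extra samples cannot help once a policy has put no mass on a solving strategy.

```python
def hit_probability(solving_mass: float, k: int) -> float:
    """Chance that at least one of k independent samples uses a solving strategy.

    solving_mass is the total probability the policy assigns to strategies
    that actually solve the problem (an illustrative quantity, not a metric
    taken from the paper).
    """
    return 1.0 - (1.0 - solving_mass) ** k


# Hypothetical policies over three strategies for one hard problem.
# Assumption: only strategy "C" solves it.
collapsed = {"A": 1.0, "B": 0.0, "C": 0.0}  # all mass on one strategy
diverse = {"A": 0.6, "B": 0.3, "C": 0.1}    # some mass left on the rare strategy

for k in (1, 10, 100):
    print(
        f"k={k:>3}  collapsed={hit_probability(collapsed['C'], k):.3f}"
        f"  diverse={hit_probability(diverse['C'], k):.3f}"
    )
```

Under these assumptions the collapsed policy stays at zero for every sample budget, while the diverse policy's chance of a hit climbs toward one as k grows.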
That is the finding from a new paper by researchers at MIT and Sakana AI, published May 21. The paper calls it a fundamental mismatch between how reasoning models are trained and how they are actually used. The culprit, the authors argue, is the post-training method that became the default for nearly every reasoning model released since DeepSeek-R1: Group Relative Policy Optimization, or GRPO. DeepSeek confirmed its use of GRPO in the R1 technical report. For other frontier systems, including OpenAI's o1 and o3, the approach has been widely inferred but not confirmed: their system cards name RL without specifying the advantage estimator. The paper's claim is that this standard method is exactly what causes the collapse.
The proposed fix is called Vector Policy Optimization, or VPO. Where GRPO compresses multiple dimensions of quality into one scalar reward signal, VPO preserves a vector of rewards across independent quality dimensions — multiple valid approaches, each good at different things. In code generation, each test case passed is an independent quality dimension; collapsing those into a single score during training is what causes the model to converge on one strategy. The paper describes VPO as a drop-in replacement for the GRPO advantage estimator, requiring no infrastructure changes — a claim that would need verification on production GRPO checkpoints before treating it as confirmed.
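To see what scalarization throws away, consider a minimal sketch. It is not the paper's VPO estimator, which the article does not spell out. It only shows the commonly described GRPO-style group-relative advantage (reward minus group mean, divided by group standard deviation) next to a plain Pareto-front check on the per-test vectors. The four sampled programs and their test results are hypothetical, and the choice of population standard deviation is an assumption; implementations differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each scalar reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def dominates(u, v):
    """True if u is at least as good as v everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_front(vectors):
    """Indices of reward vectors that no other vector in the group dominates."""
    return [
        i for i, v in enumerate(vectors)
        if not any(dominates(w, v) for j, w in enumerate(vectors) if j != i)
    ]


# Hypothetical group of four sampled programs, each scored on four unit tests.
test_results = [
    (1, 1, 0, 0),  # strategy X: passes tests 1 and 2
    (0, 0, 1, 1),  # strategy Y: passes tests 3 and 4
    (1, 1, 0, 0),  # strategy X again
    (1, 0, 0, 0),  # weaker variant of X
]

scalar_rewards = [sum(t) / len(t) for t in test_results]
advantages = group_relative_advantages(scalar_rewards)
front = pareto_front(test_results)

print("scalar rewards:", scalar_rewards)
print("advantages:    ", [round(a, 3) for a in advantages])
print("pareto front:  ", front)
```

In this toy group, strategies X and Y receive identical scalar rewards and therefore identical advantages, even though they pass disjoint sets of tests. The vector view keeps them apart: both sit on the Pareto front, and only the weaker variant is dominated.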
The empirical results are concrete, and the paper states its sharpest finding directly: "for evolutionary search, VPO models unlock problems that GRPO models cannot solve at all." The scalar-trained GRPO policy collapses onto a single strategy; VPO preserves multiple strategies across the Pareto frontier. The mechanism is documented in the paper's introduction: "policy gradient methods like GRPO drive the policy toward a narrow set of high-probability responses" and "after training, the diversity required for effective test-time search disappears, as additional samples become near-duplicates." On LiveCodeBench, a VPO-trained Qwen2.5-Coder-7B-Instruct improves both pass@k and best@k over a matched-compute GRPO checkpoint, with the gap widening as the candidate budget grows. The paper evaluates across four settings spanning code generation, multi-hop question answering, logic reasoning, navigation, and tool use. All four evaluations come from the VPO authors themselves; independent replication remains the open gap.
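For readers unfamiliar with pass@k, the sketch below uses the widely cited unbiased estimator from the original Codex evaluation work: draw n samples, count the c that pass, and estimate the chance that at least one of k randomly chosen samples passes. Whether the VPO paper computes pass@k exactly this way is an assumption, and best@k is left out because its definition is not given here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random subset of k samples contains no correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every subset of size k has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 20 samples per problem, 2 of them correct.
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(20, 2, k):.3f}")
```

The estimator also shows why diversity matters for the metric: if no sample is correct (c = 0), pass@k is zero for every k, so a larger budget only helps when the sampler still reaches a correct answer at least occasionally.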
The real-world stakes become concrete when you look at a named system. AlphaEvolve, Google DeepMind's Gemini-powered evolutionary coding agent for algorithm design, runs exactly this kind of search workload at scale. The paper evaluates VPO inside the OpenEvolve search loop, an open-source implementation of the AlphaEvolve approach, and finds the same diversity collapse pattern: GRPO-trained models exhaust their candidate pool before the search finds a solution. The failure is not hypothetical: in the evolutionary search setting, GRPO models cannot solve certain problems "at all" regardless of sample budget, because the diversity collapse happens during training, not during inference. For anyone running a GRPO-trained model inside a search loop today, the paper's mechanism predicts the model may be limiting what the search can discover. Whether teams have observed this ceiling in production is not yet documented in public literature; the paper demonstrates it in a controlled setting, not in deployed systems.
An independent researcher has flagged one important caveat. NovaSarc, an AI researcher who commented on the work, noted that VPO is only as good as the reward decomposition. If the reward vector is artificial or its components are highly correlated, the model may produce fake diversity that looks good under the objective but does not actually help search. Designing reward decompositions that produce useful search diversity rather than cosmetic reward-space coverage is an open problem the paper does not fully resolve.
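One rough way to probe that caveat is to check how correlated the reward components are across a batch of samples. The sketch below is an illustration only and is not a procedure from the paper: the sample vectors are hypothetical, the 0.95 threshold is arbitrary, and Pearson correlation is a crude proxy that cannot tell whether the remaining dimensions are useful for search.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (nan if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def redundant_pairs(reward_vectors, threshold=0.95):
    """Pairs of reward dimensions whose correlation meets the threshold.

    The threshold is a hypothetical default. Constant dimensions yield nan,
    which never satisfies the comparison and is therefore skipped.
    """
    columns = list(zip(*reward_vectors))
    pairs = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            if pearson(columns[i], columns[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical batch: dimensions 0 and 1 always agree, dimension 2 does not.
samples = [
    (1, 1, 0),
    (0, 0, 1),
    (1, 1, 1),
    (0, 0, 0),
    (1, 1, 0),
]

print("redundant dimension pairs:", redundant_pairs(samples))
```

If two dimensions always move together, a reward vector with both of them carries little more information than a vector with one, which is the kind of cosmetic coverage the caveat warns about.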
Prior work has documented the GRPO diversity collapse risk from multiple angles. Researchers have shown that KL-regularized RL is structurally designed to produce mode collapse, that RLHF consistently reduces output diversity relative to base models, and that output diversity collapses at specific points in the post-training pipeline. The VPO paper adds the specific mechanism — reward scalarization during training — and the specific consequence for test-time search, but the underlying concern about GRPO's effect on response distribution has been flagged in independent work.
GRPO post-training underpins not just DeepSeek-R1 but a generation of reasoning models released since 2024. Sebastian Raschka's newsletter on the state of LLM reasoning-model training documents the breadth of GRPO's adoption, noting that reinforcement learning post-training has become standard for reasoning models across multiple providers. If VPO delivers on its claims, those models may be systematically mis-trained for the test-time compute regimes their designers intended them to use: a ceiling baked in during post-training that test-time compute cannot overcome. Labs already running GRPO pipelines are best positioned to adopt VPO first, since the paper says it requires no infrastructure change; labs locked into proprietary training stacks face a higher migration cost. That asymmetry shapes the competitive picture. The teams that can adopt earliest are the ones already on open GRPO infrastructure, and if VPO holds under replication, those early adopters compound their advantage while others wait for verified implementations.
The fix, if it holds under independent replication, is a code change, not an infrastructure rebuild. That is what makes it worth watching: not a new model, but a training bug in models already deployed at scale.