MaxProof's Olympiad Scores Are the Wrong Place to Start
The interesting move in the new arXiv preprint is not 35/42 on IMO 2025 or 36/42 on USAMO 2026. It is that one model now runs the entire proof search.
The paper, MaxProof, posted to arXiv this month, reports that a system from the MiniMax-M3 series scored 35 out of 42 on the 2025 International Mathematical Olympiad and 36 out of 42 on the 2026 USAMO; the authors say both scores clear the human gold-medal threshold. The numbers are striking. They are also the wrong place to start.
What separates MaxProof from earlier olympiad solvers is structural. The system runs the same underlying M3 model in four roles: it generates candidate proofs, judges which candidates are correct, repairs the ones that are not, and ranks the survivors. Search proceeds across a population of candidate proofs rather than a single chain, and the final output is selected by tournament. The headline score is a property of that mechanism, not the other way around.
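To make the shape of that loop concrete, here is a minimal sketch in Python. It is not the paper's implementation. Every name in it (`Candidate`, `generate`, `judge`, `repair`, `tournament`, `search`) is hypothetical, and a single `quality` number stands in for what would really be a model call in each of the four roles. The acceptance threshold, population size, and round count are likewise invented for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    quality: float  # hypothetical stand-in for proof soundness, in [0, 1]


def generate(rng: random.Random) -> Candidate:
    # Role 1: propose a candidate proof (here, just a random quality score).
    return Candidate("draft", rng.random())


def judge(c: Candidate, threshold: float = 0.8) -> bool:
    # Role 2: accept or reject a candidate (assumed threshold, not from the paper).
    return c.quality >= threshold


def repair(c: Candidate, rng: random.Random) -> Candidate:
    # Role 3: revise a rejected candidate; modeled as a small quality gain.
    return Candidate(c.text + "+fix", min(1.0, c.quality + rng.uniform(0.0, 0.3)))


def tournament(survivors: list[Candidate], rng: random.Random) -> Candidate:
    # Role 4: single-elimination ranking; an odd candidate out gets a bye.
    pool = list(survivors)
    rng.shuffle(pool)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if a.quality >= b.quality else b)
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


def search(population_size: int = 8, rounds: int = 3, seed: int = 0):
    rng = random.Random(seed)
    population = [generate(rng) for _ in range(population_size)]
    for _ in range(rounds):
        # Keep accepted candidates as they are; send the rest through repair.
        population = [c if judge(c) else repair(c, rng) for c in population]
    survivors = [c for c in population if judge(c)]
    # Abstain (return None) when nothing passes the judge.
    return tournament(survivors, rng) if survivors else None
```

The point of the sketch is the control flow: the search operates on a whole population, repair only touches what the judge rejected, and the final answer is whatever wins a pairwise tournament among the survivors, or nothing at all if no candidate passes.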
The training side mirrors the inference side. M3 was built around three proof-oriented capabilities: proof generation, proof verification, and critique-conditioned proof repair, combined into a single released checkpoint. The verifier is described as a defense-in-depth generative model engineered for a low false-positive rate. That choice matters. In a population search, the verifier is what prevents a system from confidently shipping a wrong proof, and a permissive verifier would inflate scores without doing real work.
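A back-of-the-envelope calculation shows why the false-positive rate carries so much weight. The sketch below applies Bayes' rule to a hypothetical verifier. The function name `accepted_precision` and all of the rates are illustrative assumptions, not figures from the paper.

```python
def accepted_precision(p_correct: float, true_pos: float, false_pos: float) -> float:
    """Probability that a proof the verifier accepts is actually correct.

    p_correct: fraction of generated candidates that are correct (assumed)
    true_pos:  chance the verifier accepts a correct proof (assumed)
    false_pos: chance the verifier accepts an incorrect proof (assumed)
    """
    accepted_good = p_correct * true_pos
    accepted_bad = (1.0 - p_correct) * false_pos
    return accepted_good / (accepted_good + accepted_bad)


# Hypothetical hard problem: only 5% of candidates are correct.
strict = accepted_precision(p_correct=0.05, true_pos=0.9, false_pos=0.01)
permissive = accepted_precision(p_correct=0.05, true_pos=0.9, false_pos=0.2)
print(f"strict verifier:     {strict:.2f}")
print(f"permissive verifier: {permissive:.2f}")
```

Under these made-up rates, a verifier that wrongly accepts 1% of bad proofs leaves roughly 83% of accepted proofs correct, while one that wrongly accepts 20% leaves about 19%. When most candidates in the population are wrong, small changes in the false-positive rate dominate what the search ends up shipping.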
The architecture also explains why the two scores sit close together despite very different problem sets. Both olympiads reward complete, checkable proofs rather than numerical answers, which is exactly the structure population search with a generative verifier is built to exploit. The system is doing the thing it was built to do.
The result is also a vendor self-report in an arXiv preprint, with no peer review and no independent reproduction cited. There is no disclosed compute, no public trace of the search, and no contamination audit. The "MiniMax-M3 series" framing in the abstract has not been independently checked against released model cards. A reader who treats 35/42 and 36/42 as established fact is reading past the provenance.
The bar this kind of result has to clear is also a moving target. DeepMind's AlphaProof, working in a formal Lean environment with far heavier search and paired with AlphaGeometry 2, reached silver-medal standard at IMO 2024, one point short of gold, and did so with a different shape of argument: a slow loop between a learned policy and a formal proof checker. MaxProof's generative verifier is a softer instrument, and the population search substitutes scale of candidates for tightness of grounding. Both are real advances. Neither is a finished account of how AI does mathematics.
What the paper does establish, even before any external check, is that the bottleneck for olympiad-class performance has moved. Earlier work treated proof generation and proof checking as separate problems with separate models. MaxProof folds them into a single trained system, then bets the result on scale of search across a population. The score is what that bet returns. The architecture is what makes the bet legible.
The wider question is whether olympiad-style proof is the right thing to scale. Competition problems are short, self-contained, and admit a clean binary judgment of correctness, which is exactly the regime where a population search with a learned verifier can win. Open mathematical research is longer, more conversational, and rarely settles into a single checkable artifact. A system that wins two olympiads has earned a real place in the reasoning-AI timeline. It has not, on this evidence, earned a place in the conversation about what mathematics is.