The University Lab That Beat Google and OpenAI at Their Own Math Benchmark
Four Georgia Tech researchers just beat Google DeepMind and OpenAI at their own math benchmark.
The team, led by PhD candidate Zelin Zhao with co-authors Bo Yuan, Jaemoo Choi, and Yongxin Chen, published a system called RMA that solved eight out of ten research-level mathematics problems on a benchmark called First Proof. DeepMind's Aletheia model solved six, according to the arXiv paper. OpenAI's GPT-5.2R solved fewer still. The First Proof problems were not toy puzzles. They were drawn from unpublished research by expert mathematicians in algebraic combinatorics, stochastic analysis, and symplectic geometry, among other fields. And the researchers who designed the benchmark also entered it.
That detail is load-bearing. Google and OpenAI scientists co-created First Proof and entered it expecting to win. They lost to a four-person Georgia Tech project with no dedicated compute cluster and no frontier lab parent. "This was not supposed to happen" is the polite way to put it.
The result, posted to arXiv on May 20, is a preprint. The code and full evaluation have not yet been released publicly. The authors say they will release the code once the paper is accepted at a venue. Those are the caveats, and they are real. Self-reported benchmarks from teams with a paper to publish require skepticism. But the architecture behind RMA is worth taking seriously regardless.
RMA stands for Reasoning for Mathematics Agents. Unlike systems designed to compete in math Olympiads, it targets open-ended research problems: the kind that require literature search, long-horizon reasoning across multiple proof strategies, and iterative refinement of candidate solutions. The system uses a multi-role, multi-round workflow in which three agent types (an initializer, a proposer, and a verifier) cycle through candidate proofs and check each other's work. The verifier is particularly important. It applies formal verification checks that can catch logical errors a human reviewer might miss.
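Since the RMA code has not been released, the following is only a rough sketch of what an initializer, proposer, and verifier loop could look like in general, not the authors' implementation. Every name in it (Candidate, Result, run_rounds, the toy_ functions, the max_rounds parameter) is hypothetical, and the toy roles use string matching where a real system would call a model or a proof checker.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Candidate:
    """A candidate proof plus the verifier feedback gathered so far."""
    text: str
    feedback: List[str] = field(default_factory=list)


@dataclass
class Result:
    candidate: Candidate
    accepted: bool
    rounds: int  # number of revision rounds that were run


# Hypothetical role signatures; a real system would wrap model calls here.
Initializer = Callable[[str], Candidate]
Proposer = Callable[[str, Candidate], Candidate]
Verifier = Callable[[str, Candidate], Optional[str]]  # None means "no objection"


def run_rounds(problem: str, initializer: Initializer, proposer: Proposer,
               verifier: Verifier, max_rounds: int = 5) -> Result:
    """Draft once, then alternate verification and revision."""
    candidate = initializer(problem)
    for round_index in range(max_rounds):
        objection = verifier(problem, candidate)
        if objection is None:
            return Result(candidate, accepted=True, rounds=round_index)
        # Pass the objection to the proposer so the next draft can address it.
        candidate.feedback.append(objection)
        candidate = proposer(problem, candidate)
    # Round budget exhausted; the last revision was never accepted.
    return Result(candidate, accepted=False, rounds=max_rounds)


# Toy stand-ins for the three roles (assumptions, for illustration only).
def toy_initializer(problem: str) -> Candidate:
    return Candidate(text="Let a = 2m and b = 2n.")


def toy_verifier(problem: str, candidate: Candidate) -> Optional[str]:
    # A real verifier would check the logic; this one only looks for a key step.
    if "2(m + n)" not in candidate.text:
        return "missing: a + b = 2(m + n)"
    return None


def toy_proposer(problem: str, candidate: Candidate) -> Candidate:
    # Append whatever step the most recent objection says is missing.
    step = candidate.feedback[-1].split("missing: ", 1)[1]
    return Candidate(text=candidate.text + " Then " + step + ".",
                     feedback=list(candidate.feedback))


if __name__ == "__main__":
    outcome = run_rounds("Show that the sum of two even integers is even.",
                         toy_initializer, toy_proposer, toy_verifier)
    print(outcome.accepted, outcome.rounds, outcome.candidate.text)
```

The point of the sketch is the control flow: the verifier's objection is fed back to the proposer rather than discarded, and the loop stops either on acceptance or when the round budget runs out.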
The authors ran ablation studies to understand which components drove the gains. The answer: none of them alone. Performance improvements came from the interaction between structured reasoning modules, iterative refinement, and verifier-based feedback. Remove any one of those three and the system degrades substantially. This matters because every major lab currently building autonomous research agents is effectively betting that some variant of this architecture will scale. RMA is evidence that the bet may pay off, but also that there is no single secret ingredient to harvest.
What does this mean for the humans who do mathematical research for a living? That is the question the paper does not answer, and perhaps cannot.
The philosophical framing writes itself: if AI can search literature, connect ideas across domains, refine proofs through iterative checking, and verify results through formal methods, what exactly was the irreducibly human contribution to mathematical research? The honest answer is that we do not know yet. The two problems RMA failed to solve may point to a systematic gap, perhaps in situations where the relevant literature is sparse, or where the problem requires a genuinely novel definition rather than a clever combination of existing ones. Or they may be random noise. The paper does not say.
There is a competing body of evidence worth noting. A paper called Hallucination Stations, published by Vishal Sikka and a collaborator in mid-2025, claimed to mathematically prove that large language models cannot reliably handle complex computational and agentic tasks. OpenAI published its own research in September 2025 showing that all tested models, including ChatGPT, continued to make up fake dissertation titles when asked simple citation questions. These are not niche results. They come from practitioners inside the field who have built the systems and are reporting what they found.
RMA's authors are aware of the contradiction. Their paper cites Sikka's work and notes that their system was designed to address exactly the kind of grounding failures those papers describe. Whether it succeeds is the empirical question. The benchmark result suggests something is working. The two unsolved problems suggest something still is not.
The broader implication is not that mathematicians are unemployed. It is that the pipeline from literature to proof may be substantially more mechanizable than the field assumed. If that is true, the human mathematician's role shifts toward choosing which problems matter, interpreting what the results mean, and asking the questions that formal verification cannot yet formulate. That is not a smaller job. It may be a different one.
What to watch: the First Proof benchmark will face independent evaluation as other teams attempt the problems. If RMA's solutions to the eight solved problems hold up under external review, the result stands. If the two unsolved cases reveal a coherent failure mode, that is the more interesting signal. The gap between what AI can and cannot do in pure mathematics is still being drawn. This chapter of that story was written by four people in Georgia.