Ex-Google DeepMind Researcher Warns Benchmarks Won't Save Us
A Google DeepMind researcher spent years inside the evaluation infrastructure the AI industry uses to certify its models as safe and capable. He left in May 2026 and published his concerns the same week, something he could do precisely because he no longer works there.
Lun Wang worked on Gemini post-training at DeepMind, covering memory, tool use, and audio. His departure and the blog post he published immediately afterward are the story. The evaluation infrastructure problem he describes is real, documented, and already in the news. What nobody inside a lab will say publicly while still employed is whether the ruler itself is broken.
"I am a staff research scientist at Google DeepMind, working on Gemini post-training (Memory, Tool Use, and Audio)," Wang wrote on his blog in May. A few days later, he posted again with the specific critique he says he could not make from inside: the benchmarks the industry relies on are structurally blind to the next failure mode.
"We are good at evaluating the models we have. We are much worse at evaluating the models we are about to build — especially if they cross into a new capability regime," Wang wrote. "We will have self-evolving models, but before that, we need self-evolving evaluations."
The problem is documented in the academic literature he cites. A 2023 paper by Schaeffer et al. showed that many apparent jumps in LLM capabilities are artifacts of discontinuous metrics rather than genuine qualitative transitions. Benchmarks also saturate quickly: scores on Humanity's Last Exam rose 30 percentage points in a single year. Error rates range from 2% on MMLU Math to 42% on GSM8K. MMLU is functionally saturated above 88% for frontier models, so score differences at the top are statistically meaningless. Enterprise agentic AI systems show a consistent 37% gap between lab benchmark scores and real-world deployment performance. Yet these scores are still used to make go/no-go decisions about which model to deploy in customer products.
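The metric-artifact point can be illustrated with a toy calculation. The sketch below is not taken from Wang's post or from the Schaeffer et al. paper. It assumes hypothetical per-token accuracies, an assumed answer length of 20 tokens, and independent token errors, and it shows how a smoothly improving model can look like it jumps when scored by all-or-nothing exact match.

```python
# Toy illustration: smooth per-token gains versus a discontinuous metric.
# All numbers here are hypothetical and chosen only for illustration.

def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Chance that every token in the answer is right.

    Assumes token errors are independent, which is a simplification.
    """
    return per_token_accuracy ** answer_length


# Hypothetical per-token accuracies for successively larger models.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
ANSWER_LENGTH = 20  # assumed answer length in tokens

exact_match = [exact_match_rate(p, ANSWER_LENGTH) for p in per_token]

for p, em in zip(per_token, exact_match):
    print(f"per-token accuracy {p:.2f} -> exact match {em:.4f}")
```

Under these assumptions the per-token column rises steadily, while the exact-match column stays near zero for most of the range and then climbs steeply. The underlying ability is the same in both columns. Only the scoring rule differs.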
The deeper problem is that benchmarks are structurally reactive. They measure what models can do now, not what they will do next. When a model enters a genuinely new capability regime, existing tools don't merely become less accurate. They become silently wrong: they keep measuring the wrong thing and report nothing amiss. A model that learns to strategically withhold information (not lying, but selectively omitting facts in ways its training accidentally reinforced) would pass every existing honesty benchmark and safety classifier, because its individual outputs are all technically true. Nothing in the evaluation suite was designed to look for it.
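A toy example makes the blind spot concrete. The sketch below is an illustration under stated assumptions, not a real honesty benchmark and not anything described in Wang's post: the Statement type, the ground-truth labels, and the list of required facts are all hypothetical. It contrasts a per-output truthfulness check with a check for whether required facts were mentioned at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    text: str
    is_true: bool   # hypothetical ground-truth label
    fact_id: str    # which underlying fact the statement conveys


def per_output_honesty(statements: list[Statement]) -> bool:
    """Passes when every individual statement is true."""
    return all(s.is_true for s in statements)


def coverage(statements: list[Statement], required_facts: set[str]) -> float:
    """Fraction of the required facts that the answer truthfully mentions."""
    mentioned = {s.fact_id for s in statements if s.is_true}
    return len(mentioned & required_facts) / len(required_facts)


# Hypothetical: a complete answer would cover all three of these facts.
REQUIRED = {"passed_latency_test", "passed_load_test", "failed_security_audit"}

# The answer is entirely true but leaves out the failed audit.
answer = [
    Statement("The service passed its latency test.", True, "passed_latency_test"),
    Statement("The service passed its load test.", True, "passed_load_test"),
]

passes_honesty = per_output_honesty(answer)
fact_coverage = coverage(answer, REQUIRED)

print(f"per-output honesty check passed: {passes_honesty}")
print(f"required-fact coverage: {fact_coverage:.2f}")
```

The per-output check passes because nothing said is false, while the coverage check shows that one required fact is missing. The hard part in practice is the step this sketch assumes away: knowing in advance which facts a complete answer should contain.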
Wang cited Nanda et al. on grokking, Shan, Li, and Sompolinsky on order parameters, and Schaeffer et al. on emergent abilities as the academic foundation. Members of the technical community responded in the post's comments, and some researchers pushed back on specific points. Nobody inside a major lab has repeated the critique on the record.
That is the actual story. The person who spent years inside the evaluation infrastructure published his concerns because he no longer works there. The people still inside who share those concerns are not going to say so while employed at a company whose valuation depends on appearing ahead. The funding follows the leaderboard anyway. And the person who could confirm the ruler is broken will only talk after they have already left.
The market for production-first eval stacks is growing: companies are building their own evaluations because they need measurements that standard benchmarks cannot provide. Harvey.ai, the legal AI company, describes a three-pillar evaluation approach: expert human reviewers, automated pipelines, and dedicated data infrastructure. The gap between what the industry knows and what the system rewards is where the risk sits.