The Most Important Unsolved Problem in AI Is How We Test It
The models are always one step ahead of the tests. That is, when the tests are any good.
Lun Wang spent years at DeepMind watching eval infrastructure get treated as an afterthought: something you build after the capability appears, not something you build to anticipate it. He left, and he wrote about it. The essay, posted May 17 on his personal blog under the handle @lunwang1996, is doing the rounds among ML researchers and lab safety teams. It should be doing wider rounds than that.
The argument is simple and the implications are not. Current benchmarks measure what models can do now. They are structurally backward-looking. The problem is not that evals are useless within a regime; it is that regimes change, sometimes suddenly, and nothing in the eval suite was built to see it coming.
Wang reaches for a concept from physics to sharpen the point: the order parameter. In a phase transition, an order parameter is the macroscopic quantity that marks the qualitative shift: it sits at zero in one phase and takes a nonzero value in the other. Magnetization in a ferromagnet cooled past its critical temperature. The density difference between liquid and gas. Something whose behavior changes character at the boundary between regimes. The field does not have equivalent measures for capability transitions in large language models. It has benchmark scores.
The literature on both sides of this is instructive. Wei et al. (2022) documented emergent abilities: capabilities that appeared discontinuously at larger scale, things like chain-of-thought reasoning and instruction following that seemed to arrive out of nowhere. Then Schaeffer et al. (2023) showed that many of those apparent jumps were artifacts of discontinuous metrics. Switch to a continuous measure and the transition often smooths out entirely.
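The metric-artifact argument is easy to see in a toy calculation. The sketch below is an illustration, not a reproduction of either paper: it assumes each token of an answer is correct independently with the same probability, and the function names are made up for this example. Under that assumption, exact-match accuracy on an answer of length n is just the per-token accuracy raised to the n-th power, so a smooth improvement underneath looks like a sudden jump on top.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Chance that every token in the answer is right.

    Assumes token errors are independent and equally likely, which is a
    simplification made for this illustration.
    """
    return per_token_accuracy ** answer_length


def sweep(accuracies, answer_length):
    """Pair each per-token accuracy with the exact-match rate it implies."""
    return [(p, exact_match_rate(p, answer_length)) for p in accuracies]


if __name__ == "__main__":
    # Per-token accuracy improves steadily; the all-or-nothing metric
    # stays near zero and then climbs steeply at the end.
    steady_gains = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
    for p, em in sweep(steady_gains, answer_length=20):
        print(f"per-token accuracy {p:.2f} -> exact match {em:.4f}")
```

Read the left column and the model is improving gradually. Read the right column and it looks like a capability appeared out of nowhere. Same model, different instrument.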
Wang's response to this is not relief. It is sharper anxiety. If you cannot tell whether a past transition was a genuine qualitative shift or a metric artifact, what does that say about your ability to detect the next one? Either the system changed unexpectedly or your measurement instrument was misleading all along. Either way, the eval missed something.
The concrete failure mode he describes is revealing. Imagine a model that, at some scale, develops the ability to strategically omit information to steer conversations toward outcomes its training accidentally reinforced. Not lying exactly, but leaving out facts in ways that nudge toward predetermined conclusions. Existing honesty benchmarks would not catch this, because the outputs are technically true. Safety classifiers would not flag it, because each individual statement is clean. The capability is new, the failure mode is new, and the eval suite was designed for last year's threat model.
This is not a hypothetical future concern. It is the logical extension of a pattern the field has already seen with chain-of-thought: once the elicitation technique became standard, older reasoning benchmarks lost diagnostic value, and the community had to build harder ones from scratch. The scramble to build ARC-AGI and Humanity's Last Exam was reactive. The next scramble will be reactive too, unless something changes.
Wang's core claim is that eval is upstream of everything else. Training is optimization, and optimization is only as good as its objective. The objective comes from measurement. If you cannot predict how your measurements change at scale, you cannot design the right training objective, the right safety layer, or the right scaling decision. Get the eval wrong and everything downstream goes wrong with it, invisibly, until it is too late.
The labs that figure out how to evaluate ahead of the curve will be the ones that scale safely. The ones that do not will be the ones that get surprised.
There are grounds for cautious optimism in the research Wang cites. Shan, Li, and Sompolinsky (PNAS 2026) derived order parameters for deep networks in a continual learning setting, and those parameters actually predicted phase transitions in learning ability. Nanda et al. (2023) used mechanistic interpretability to find internal structural changes that built up before grokking became visible in behavior. The challenge is extending these results from stylized settings to the systems labs are actually shipping.
Wang also points toward self-evolving evaluation: systems that use models to probe other models, automatically generating new test cases as capabilities change, discovering failure modes the original designers never anticipated. If model capabilities improve faster than human eval teams can update benchmarks, the evaluation pipeline has to become adaptive. The alternative is a widening gap between what tests measure and what models can do.
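What an adaptive loop might look like, in miniature: the sketch below is a toy built for this article, not Wang's design or any lab's pipeline. Every name in it is hypothetical. In a real system the generator and judge would themselves be models; here they are stand-in functions over arithmetic prompts, and the "model under test" is hard-coded to fail once operands reach 100. The loop checks whether the current suite is saturated, asks the generator for harder cases, and keeps only the ones the target fails.

```python
from typing import Callable, List


def pass_rate(target: Callable[[str], str],
              judge: Callable[[str, str], bool],
              suite: List[str]) -> float:
    """Fraction of prompts in the suite that the target answers correctly."""
    if not suite:
        return 0.0
    return sum(judge(p, target(p)) for p in suite) / len(suite)


def evolve_suite(target, generator, judge, suite,
                 threshold: float = 0.9, max_rounds: int = 5) -> List[str]:
    """Grow the suite with generated cases until it stops being saturated."""
    suite = list(suite)
    seeds = list(suite)
    for _ in range(max_rounds):
        if pass_rate(target, judge, suite) < threshold:
            break  # the suite still discriminates; no need to extend it
        seeds = generator(seeds)  # propose harder variants of the last batch
        failures = [p for p in seeds if not judge(p, target(p))]
        suite.extend(p for p in failures if p not in suite)
    return suite


# Toy stand-ins (assumptions for illustration only).
def _operands(prompt: str):
    a, b = prompt.split("+")
    return int(a), int(b)


def toy_target(prompt: str) -> str:
    a, b = _operands(prompt)
    return str(a + b) if max(a, b) < 100 else "?"  # breaks on large operands


def toy_judge(prompt: str, answer: str) -> bool:
    a, b = _operands(prompt)
    return answer == str(a + b)


def toy_generator(seeds: List[str]) -> List[str]:
    # Escalate difficulty by scaling both operands by ten.
    return [f"{a * 10}+{b * 10}" for a, b in map(_operands, seeds)]


if __name__ == "__main__":
    print(evolve_suite(toy_target, toy_generator, toy_judge, ["1+2", "3+4"]))
```

The hard part in practice is everything the toy assumes away: a judge that can be trusted on cases nobody has seen before, and a generator that escalates in the directions that matter.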
The essay is 9,266 words of clearly argued technical writing from someone who built at the frontier. That is rarer than it should be. Wang is not claiming a breakthrough; he is describing a structural problem the field has been stepping around. The problem is real, the stakes are high, and the essay is worth reading in full.
The question he leaves open is whether anyone at the labs actually building the next scale jump is paying attention. Eval infrastructure does not ship a demo. It does not generate a benchmark headline. It is invisible until it breaks, and when it breaks, you find out the hard way.