AI benchmarks are not broken. The metric is. That is the core of Noam Brown's new essay and No Priors podcast appearance, and it changes what a smart buyer or policymaker can read off a leaderboard today.
Brown, an OpenAI research scientist, argues that the field has spent two years collapsing three distinct variables into a single number. Base model quality, the inference-time compute budget a model is allowed to spend on a question, and the dollar cost of that budget are now tangled together on every scoreboard. A higher number can mean a smarter model, or it can mean a model that was simply allowed to think longer at a higher cost. "Performance on a benchmark as a single number doesn't really even make sense anymore," Brown said, as quoted by OfficeChai. Until evaluators report performance-versus-cost or performance-versus-time curves, the score a reader is looking at is a mislabeled package.
The cases Brown uses to make the point are stark. On the ARC-AGI abstract-reasoning benchmark, figures cited from his essay put OpenAI's o3 system at the highest published score, at a reported cost of roughly $30,000 per question, while a competing small model reached 24 percent at roughly $0.20 per question. The two results sit about five orders of magnitude apart in cost per question, yet a single ranked list simply places one above the other. Brown said leaderboard rankings of that kind have "already become meaningless".
A second example lands closer to the current frontier. On the MRC v2 benchmark, which tests reasoning over million-token contexts, figures cited in coverage of Brown's argument put GPT-5.4 Pro at 36.6 percent and GPT-5.5 Pro at 74.0 percent, a near-doubling that reads like a generational model jump. The two systems were not priced alike. API listings cited in the same coverage put GPT-5.5 Pro at roughly $5 per million input tokens and $30 per million output tokens, against $30 and $180 for GPT-5.4 Pro, a roughly six-times price delta. A buyer reading that as a flat quality delta is paying for reasoning budget they did not see on the chart.
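The arithmetic behind that comparison can be checked directly. The following is a minimal sketch in Python that uses only the figures as cited above, which are secondhand; the variable and function names are illustrative. It compares per-token list prices only, which is not the same thing as cost per question, since that also depends on how many tokens each system spends.

```python
# Figures as cited in coverage of Brown's argument; treat them as unverified.
scores = {"GPT-5.4 Pro": 36.6, "GPT-5.5 Pro": 74.0}          # percent on MRC v2
input_price = {"GPT-5.4 Pro": 30.0, "GPT-5.5 Pro": 5.0}      # USD per million input tokens
output_price = {"GPT-5.4 Pro": 180.0, "GPT-5.5 Pro": 30.0}   # USD per million output tokens


def ratio(table, numerator, denominator):
    """Return table[numerator] / table[denominator]."""
    return table[numerator] / table[denominator]


# How much higher the GPT-5.5 Pro score is than the GPT-5.4 Pro score.
score_ratio = ratio(scores, "GPT-5.5 Pro", "GPT-5.4 Pro")

# How far apart the cited list prices are (larger price over smaller price).
input_ratio = ratio(input_price, "GPT-5.4 Pro", "GPT-5.5 Pro")
output_ratio = ratio(output_price, "GPT-5.4 Pro", "GPT-5.5 Pro")

print(f"score ratio:        {score_ratio:.2f}x")
print(f"input price ratio:  {input_ratio:.1f}x")
print(f"output price ratio: {output_ratio:.1f}x")
```

On the cited numbers, the score gap is about twofold and the list-price gap is sixfold on both input and output tokens. Neither ratio alone says how much reasoning budget was spent per question.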
The underlying mechanism is well studied but only recently absorbed by the benchmark ecosystem. In a widely cited 2024 paper, Snell and colleagues at Google showed that scaling inference-time compute optimally can be more effective than scaling model parameters. Brown's case is that leaderboards have not caught up with that paper. Andrej Karpathy's published experiments and the UK AI Safety Institute's continuing cyber evaluations past 100 million inference tokens both show a log-linear pattern: each multiplicative increase in inference compute buys a roughly constant additive gain in capability. A static score freezes one point on that curve and calls it the model.
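A log-linear relationship of this kind can be written as score = a + b * log10(tokens). The sketch below fits that form by ordinary least squares using only the Python standard library. The data points are invented for illustration and do not come from Karpathy, the UK institute, or Brown; the function names are likewise hypothetical.

```python
import math


def fit_log_linear(points):
    """Least-squares fit of score = a + b * log10(tokens).

    points: list of (tokens, score) pairs with tokens > 0.
    Returns the pair (a, b).
    """
    xs = [math.log10(tokens) for tokens, _ in points]
    ys = [score for _, score in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # points gained per tenfold increase in tokens
    a = mean_y - b * mean_x  # intercept at log10(tokens) == 0
    return a, b


# Hypothetical runs, invented for illustration: (inference tokens, score in percent).
hypothetical_runs = [(1e4, 20.0), (1e5, 30.0), (1e6, 40.0), (1e7, 50.0)]
a, b = fit_log_linear(hypothetical_runs)


def predict(tokens):
    """Extrapolate the fitted line to a new token budget."""
    return a + b * math.log10(tokens)


print(f"gain per 10x tokens: {b:.1f} points")
print(f"extrapolated score at 1e8 tokens: {predict(1e8):.1f}")
```

In this made-up data every tenfold increase in tokens adds the same number of points, which is what a log-linear pattern means. A single leaderboard entry corresponds to one (tokens, score) pair and discards the slope.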
Brown also points the finger at himself. The reasoning-model pattern traces back to the launch of OpenAI's o1 in late 2024, and the approach has since spread to Anthropic's Claude extended thinking mode and Google's Gemini Deep Think. Inference-time compute is no longer a research trick. It is the default product surface. In a follow-up thread, Brown argues that safety evaluations and recursive-self-improvement limits inherit the same gap: a preparedness framework or Responsible Scaling Policy threshold that conditions on a static benchmark score is conditioning on a moving target, because the same base model looks more or less capable depending on how much reasoning time is allowed.
His proposed fix is deliberately modest. Report performance-versus-cost or performance-versus-time curves alongside the score, much as timed exams like the SAT and competitions like the International Mathematical Olympiad hold the time limit constant for every participant. Treat inference budget as part of the model specification, not as a hidden multiplier. The ICLR 2026 workshop program on evaluation methodology, which Brown references, is one venue where the curve idea is being argued in public.
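One way to picture curve-based reporting is to store each model's result as a list of (cost, score) points and rank models only at a stated budget. The following is a minimal sketch under that assumption, not a format Brown or any benchmark has specified; the class, function names, model names, and numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CurvePoint:
    cost_per_question_usd: float
    score_pct: float


def best_score_within(curve, budget_usd):
    """Best score among points costing no more than the budget, or None if none fit."""
    affordable = [p.score_pct for p in curve if p.cost_per_question_usd <= budget_usd]
    return max(affordable) if affordable else None


def rank_at_budget(curves, budget_usd):
    """Rank models by the best score each reaches within the budget.

    Models with no point inside the budget are left out of the ranking.
    """
    scored = {name: best_score_within(curve, budget_usd) for name, curve in curves.items()}
    eligible = [(name, score) for name, score in scored.items() if score is not None]
    return sorted(eligible, key=lambda item: item[1], reverse=True)


# Hypothetical curves, invented for illustration.
curves = {
    "model_a": [CurvePoint(0.10, 20.0), CurvePoint(1.00, 35.0), CurvePoint(100.00, 70.0)],
    "model_b": [CurvePoint(0.10, 28.0), CurvePoint(1.00, 40.0), CurvePoint(10.00, 55.0)],
}

for budget in (1.00, 100.00):
    print(f"budget ${budget:.2f} per question: {rank_at_budget(curves, budget)}")
```

In this invented example the ranking flips between the two budgets: model_b leads at $1 per question and model_a leads at $100. That flip is the information a single-number leaderboard cannot show.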
The honest caveat is that this is a researcher's argument, not a company position. Brown publishes the critique as a personal essay and on X, and OpenAI has not, to date, formally endorsed cost-curve reporting as an evaluation standard. The specific dollar figures attached to ARC-AGI and to the GPT-5.5 versus GPT-5.4 API comparison are circulating through secondary outlets that paraphrase his essay and his threads, and they should be checked against the primary text before being used as exact numbers in a procurement decision. The structural point survives even if any individual figure moves: the leaderboard number was never the unit of measurement, and the people buying frontier AI are the ones paying for that confusion.
What to watch next is whether the major evaluation suites, including ARC-AGI, the MMLU successors, and the long-context reasoning benchmarks, adopt cost-normalized reporting before the next model release cycle, and whether the next round of AI safety frameworks explicitly conditions its thresholds on inference budget rather than on raw score.