The easy AI benchmarks are dying. The hard ones are just getting started.
There is a number that sounds like proof of AI's relentless progress: 96.88 percent, GPT-5.2's score on AIME 2025, a math competition benchmark. The number is real. But what it measures is narrower than it looks. AIME tests short-horizon reasoning on problems that, once a model has seen enough of them, can be solved by pattern-matching rather than genuine mathematical insight. When a model scores near-perfect on AIME, it has largely solved the benchmark, not the frontier of mathematics.
The harder question is whether the benchmarks themselves are built to survive. A team at Epoch AI has been answering that question with a new class of evaluation designed around tasks that cannot be solved by memorization or short-horizon pattern matching. Their project, MirrorCode, tests whether an AI system can reimplement a real command-line program given only a specification and a black-box reference implementation it can query but not read. The tasks range from small utilities to programs spanning tens of thousands of lines of code. The design principle is that difficulty scales with the complexity of the real-world task, not the cleverness of the test.
Disclosure: Epoch AI co-developed MirrorCode with METR, an AI safety research organization.
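To make the setup concrete, an evaluation of this kind boils down to differential testing: the agent may execute the reference program and compare behavior, but never inspect its source. The sketch below is illustrative only; the binary paths, helper names, and test invocations are assumptions for a hypothetical tool, not MirrorCode's actual harness.

```python
# Illustrative sketch of a black-box differential check; paths, commands,
# and test cases are hypothetical, not MirrorCode's actual harness.
import subprocess

REFERENCE = "./reference/tool"   # binary the agent may run but not read
CANDIDATE = "./candidate/tool"   # the agent's reimplementation

def run(binary, args, stdin_data=b""):
    """Run a binary and capture its exit code and stdout."""
    proc = subprocess.run(
        [binary, *args],
        input=stdin_data,
        capture_output=True,
        timeout=60,
    )
    return proc.returncode, proc.stdout

def behaves_identically(args, stdin_data=b""):
    """The candidate passes an invocation if its observable behavior
    (exit code and stdout) matches the reference's."""
    return run(REFERENCE, args, stdin_data) == run(CANDIDATE, args, stdin_data)

# Hypothetical invocations; a real harness would cover every documented command.
cases = [
    (["--help"], b""),
    (["sort", "--reverse"], b"beta\nalpha\n"),
]
print(all(behaves_identically(args, data) for args, data in cases))
```

The only way to pass a check like this is to reproduce the program's behavior in full, which is exactly the long-horizon work that recall-style benchmarks never required.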
The results are genuinely new. Claude Opus 4.6, Anthropic's most capable model, successfully reimplemented gotree, a bioinformatics toolkit with roughly 16,000 lines of Go code and more than 40 commands. Epoch estimates the task would take a human engineer without AI assistance two to 17 weeks; the model solved it autonomously. Anthropic's separate C compiler project illustrates the same dynamic from a different angle: 16 Claude Opus 4.6 agents built a C compiler in Rust from scratch in two weeks, for roughly $20,000, and the result passes 99 percent of the GCC torture test suite. AI can now handle software engineering tasks previously assumed to require months of human work.
The structural distinction matters. Knowledge benchmarks saturate fast because they measure recall and short-horizon reasoning. Long-horizon coding benchmarks resist saturation because the difficulty lies in the complexity of the task itself, not in how clever the test is. As a result, established knowledge benchmarks, with MMLU-Pro above 90 percent and AIME at near-perfect scores, now measure less than their numbers suggest. SWE-bench's trajectory, from 60 percent to near-perfect in one year, is a better signal of sustained coding capability than any single knowledge benchmark.
"Benchmarks are less useful than they used to be," said Tom Adamczewski, a senior research engineer at Epoch who develops new benchmarks. "But there is still additional information, and people are going to continue to release new benchmarks." His colleague Greg Burnham, who leads Epoch's benchmarking team, put it more bluntly: "I think we are living through a golden age of benchmarking."
Their disagreement with benchmark pessimism is not mere optimism; the data supports a structural reading. GPQA, a graduate-level science benchmark, took two years to saturate, in part because it was designed with expert review and adversarial testing. MirrorCode's hardest task, reimplementing Pkl, a configuration language from Apple with roughly 61,000 lines of code, remains unsolved even with a billion-token inference budget. The frontier is not a wall. It is a moving boundary that recedes as compute scales, and some benchmark designs stay ahead of it longer than others.
The Stanford AI Index 2026 quantifies the pressure: frontier models gained 30 percentage points on Humanity's Last Exam within a single testing cycle. A popular math benchmark has a 42 percent error rate. These are not isolated failures. They are symptoms of a structural mismatch between what knowledge benchmarks measure and what capability actually means at the frontier.
The practical consequence is that the field is splitting into two evaluation regimes. Public knowledge benchmarks are increasingly theater: labs score near-perfect, reporters cite the scores, and the actual capability signal migrates to proprietary long-horizon evaluations that cannot be independently verified. This is not just a measurement problem. It is an accountability problem. When the scoreboard becomes private, researchers, regulators, and the public lose the ability to track AI development independently.
The benchmark is not doomed. The easy ones are.