For years, AI coding benchmarks have measured the wrong thing. They give models a few minutes, a few dollars, and a tidy problem, then declare victory when the model passes. MirrorCode, a new benchmark released this week by the AI research organization Epoch AI and the model-evaluation nonprofit METR (Model Evaluation and Threat Research), tries something different. It hands an AI agent a real software project, locks the source code away, gives it no human to ask questions of, and lets it run for weeks.
The point is not to crown a winner. The point is to find out what today's models can actually build when no one is helping.
MirrorCode asks AI systems to rebuild 25 real-world programs from scratch: software spanning bioinformatics tools, Unix utilities, cryptography libraries, and language interpreters. The model never sees the original source code, has no access to the maintainers, and gets no human feedback during the run. Epoch AI's preliminary results writeup frames the exercise as an attempt to measure the "upper bound of current autonomous software-engineering capability," the frontier of what AI can do entirely on its own.
The most striking number in the June 26 Epoch Brief is also the simplest to explain. The largest task in the benchmark required roughly 19 days of continuous autonomous work and about $2,600 of inference compute (the cost of running the model) for a single run. By comparison, most existing software-engineering benchmarks cap inference at roughly $1 to $10 per task and run for minutes to hours, according to an independent engineering commentary on the benchmark. That puts the largest MirrorCode run at roughly 260 to 2,600 times the per-task budget of those benchmarks. MirrorCode's whole premise is giving AI real runway.
The current leader is Anthropic's Claude Opus 4.7, the strongest AI coding model tested so far, which solved 56% of the tasks. That sounds impressive until you flip it. The leading model still failed on the remaining 44% of the real-world programs it was asked to rebuild, even with weeks of compute and no human interference. The honest read of that number is not "AI is almost a software engineer." It is "AI can carry a meaningful slice of weeks-long engineering work, and a meaningful slice is still beyond it."
That distinction matters because MirrorCode is built to be a yardstick, not a leaderboard. Existing software-engineering benchmarks measure whether a model can fix a small bug or write a function in a single afternoon. MirrorCode measures whether a model can sustain the kind of multi-week, multi-file, multi-decision work that a human engineer does when building or rebuilding a real codebase. Labs, enterprise buyers, and policymakers can all use that gap, the distance between minutes-long evals and weeks-long ones, to think more honestly about deployment timelines.
The benchmark is also a deliberate stress test of autonomy. With no source code visible and no one to ask, the model has to plan, debug, recover from its own mistakes, and decide when a task is done. Those are exactly the skills that turn a code-completion tool into something closer to a junior contributor, and they are the ones the prior generation of benchmarks largely skipped.
The near-term read is bounded co-piloting, not replacement. Even Claude Opus 4.7, the strongest model tested, leaves 44% of the tasks unsolved. The useful question MirrorCode lets the field ask next is not "which model is winning" but "which kinds of weeks-long tasks are still out of reach, and how fast is that frontier moving." The Epoch AI MirrorCode page and the preliminary results blog are the places to watch as more labs submit their systems and the failure modes get classified.
The interesting months ahead are the ones where the 56% goes up, and where the 44% stops looking like a list of edge cases and starts looking like the shape of the work itself.