The chatbot benchmark is being retired. Its replacement counts weeks of work per AI run.

For most of the public AI conversation, the yardstick has been conversational. How well did the model answer the question? How politely? Did it pass the bar exam, the medical boards, the coding interview? Those tests produced familiar leaderboards and a familiar story: AI is getting smarter, fast.

That framing is now misaligned with what the most capable AI systems actually do. The shift is showing up first at the independent evaluation shops that governments and serious buyers have come to rely on. They include METR, a nonprofit that times how long real software tasks take human engineers and then asks AI to do them; the UK government's AI Security Institute (AISI); the cross-occupation comparison called GDPval; and the research group Epoch. All four now converge on the same question: how many hours, days, or weeks of human work does a single AI run replace, and what does that run cost?

That reframing is the actual content of what Ethan Mollick calls the twilight of the chatbots in his One Useful Thing column. It is not an obituary for conversational AI. It is a description of a quiet measurement crisis, in which the evaluative infrastructure (the benchmarks, the procurement criteria, the productivity metrics) is now several steps behind the capability jump it is meant to track.

The data points are concrete. Mollick cites a recent Epoch finding in which Opus 4.7, an Anthropic model, worked autonomously for roughly fourteen hours and shipped a software package that he estimates is equivalent to two to seventeen weeks of human engineering, for $251 in tokens. Anthropic's earlier Fable model ran unsupervised for about nine hours on complex software projects. These are not chatbot interactions. They are project deliveries.
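
To make the scale of that Epoch example concrete, here is a rough back-of-the-envelope sketch in Python. The $251, fourteen-hour, and two-to-seventeen-week figures are the ones quoted above; the 40-hour working week and the function names are assumptions added for illustration, not part of Mollick's or Epoch's analysis.

```python
# Figures quoted in the article (Mollick's estimate of an Epoch finding).
TOKEN_COST_USD = 251.0
RUN_HOURS = 14.0          # "roughly fourteen hours" of autonomous work
HUMAN_WEEKS_LOW = 2.0     # low end of the human-equivalent estimate
HUMAN_WEEKS_HIGH = 17.0   # high end of the human-equivalent estimate

# Assumption for illustration only: a 40-hour engineering week.
HOURS_PER_WEEK = 40.0


def cost_per_human_week(token_cost_usd: float, human_weeks: float) -> float:
    """Token spend divided by the weeks of human work the run stands in for."""
    return token_cost_usd / human_weeks


def human_hours_per_run_hour(
    human_weeks: float,
    run_hours: float,
    hours_per_week: float = HOURS_PER_WEEK,
) -> float:
    """Estimated human working hours replaced per hour the model ran."""
    return human_weeks * hours_per_week / run_hours


for weeks in (HUMAN_WEEKS_LOW, HUMAN_WEEKS_HIGH):
    cost = cost_per_human_week(TOKEN_COST_USD, weeks)
    ratio = human_hours_per_run_hour(weeks, RUN_HOURS)
    print(
        f"{weeks:g} human-weeks: ${cost:.2f} per human-week, "
        f"{ratio:.1f} human hours per run hour"
    )
```

Under those assumptions, the token cost works out to somewhere between about $15 and about $126 per human-week of output, depending on which end of the estimate is used. The width of that range is itself a reminder of how uncertain the human-equivalent figure is.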

The escalation shows up across multiple independent evaluations running in parallel. METR, Mollick reports, measures the time it takes a human professional to complete real software tasks, then asks an AI to attempt the same tasks, and tracks the task length the AI can finish end to end. The curve, according to Mollick's synthesis of recent results, is rising at a faster-than-exponential rate. GDPval, a comparison in which human experts and AI work side by side and professionals judge the output, points in the same direction. The UK AI Security Institute's parallel work on autonomous cyber capability tracks the same pattern in a narrower, more consequential domain. Multiple measurements, run independently, point the same way.
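
The task-length idea described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not METR's actual procedure: the `TaskResult` type, both function names, and every task and timing in the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    name: str
    human_hours: float   # how long a human professional took
    ai_completed: bool   # whether the AI finished the task end to end


def longest_completed_hours(results: list[TaskResult]) -> float:
    """Length of the longest task the AI finished, ignoring any failures."""
    return max((r.human_hours for r in results if r.ai_completed), default=0.0)


def reliable_horizon_hours(results: list[TaskResult]) -> float:
    """Longest task length with no AI failure at or below it."""
    horizon = 0.0
    for result in sorted(results, key=lambda r: r.human_hours):
        if not result.ai_completed:
            break
        horizon = result.human_hours
    return horizon


# Hypothetical results, invented purely to show the calculation.
results = [
    TaskResult("fix-typo", 0.1, True),
    TaskResult("add-endpoint", 1.0, True),
    TaskResult("refactor-module", 4.0, False),
    TaskResult("port-library", 8.0, True),
    TaskResult("build-package", 40.0, False),
]

print(longest_completed_hours(results))  # longest success in the sample
print(reliable_horizon_hours(results))   # longest unbroken run of successes
```

The two numbers can differ sharply on the same results, which is one reason a single headline task length should be read with care.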

Three things follow. First, the capabilities are real but unevenly distributed. Mollick explicitly retains his "jagged frontier" framing: the new systems are very capable at some tasks and surprisingly poor at others, and average-case benchmark scores can mislead about specific deployments. Second, the new measurements rest on old foundations. Many of the underlying evaluation infrastructures were originally built for chatbots and have not caught up to agents that run for hours or days.

Third, and most under-reported, the shift is creating a new mismatch between what is measurable and what regulators and large customers can actually use. In mid-June 2026, Anthropic disabled Claude Fable 5 and Mythos 5 after a US export-control order, according to Forbes' report on Anthropic's customer notice and Wired's coverage. OpenAI then limited the rollout of GPT-5.6 after a government request, according to TechCrunch, while publicly arguing restrictions should not become the norm. The two most capable recent American frontier models are now restricted to narrower audiences than their benchmarks implied they should reach.

That squeeze opens a fault line the headline numbers do not show directly. Behind the closed US frontier of Anthropic, OpenAI, and Google sits a separate ecosystem of "open weights" models: systems whose trained parameters are released publicly so anyone can run or modify them. According to Mollick, that ecosystem runs roughly six to twelve months behind on its own exponential curve. For organizations whose procurement decisions now hinge on which model can credibly replace a week of analyst or engineering work, the practical capability gap is months rather than years.

The policy picture complicates everything. The Conversation argues the US move to disable Claude Mythos reflects an attempt to keep the most capable American systems out of adversary hands while leaving the closed frontier intact. That logic assumes the capability advantage compounds faster than adversaries can substitute. Mollick's reading suggests the opposite: the closed-frontier advantage is real, but the open-weights curve runs in parallel, and Chinese open models are catching up on their own trajectory.

The reader's stake here is not whether the chatbots are dying. They are not. Conversational AI will keep answering questions. The stake is whether the people deciding what to buy, what to build, and what to regulate are still grading these systems on a quiz they were originally designed to pass, while the actual work they can do has shifted from seconds to weeks. The new scoreboard is already in use at METR, AISI, GDPval, and Epoch. Most procurement decks have not noticed.
