The best AI systems in the world just got destroyed on a benchmark designed to test whether they can think. Not perform, not match a pattern, not optimize for a reward function. Think.
ARC-AGI-3, launched publicly at Y Combinator's San Francisco headquarters on March 25, is the third iteration of a test created by François Chollet, an AI researcher who has spent years arguing that benchmarks measuring statistical pattern matching are not the same as measuring intelligence. The new version is the first fully interactive entry in the series — instead of selecting from a fixed set of outputs, AI systems must navigate more than a hundred original turn-based environments with no instructions, no rules, and no stated goals, the same conditions an untrained human faces.
The results are an indictment. Frontier AI models scored between zero and four-tenths of a percentage point on the benchmark. Gemini 3.1 Pro Preview hit 0.37 percent. GPT 5.4 reached 0.26 percent. Opus 4.6 managed 0.25 percent. Grok-4.20 scored zero. Humans — the second-best of ten first-time players per environment — score 100 percent. The median time for a human to attempt an ARC-AGI-3 session was 7.4 minutes.
The gap would be alarming on its own. But the more instructive finding came from a separate evaluation conducted by Duke University with Opus 4.6, Anthropic's most capable model at the time of testing. When given a hand-crafted harness — essentially a scaffolding layer written by researchers who already knew the specific environment — Opus 4.6 scored 97.1 percent on a known environment and dropped to zero on an unfamiliar one. Ninety-seven percent to zero. The model did not learn to reason. It memorized a route.
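To make concrete what a hand-crafted harness can encode, consider a purely illustrative sketch: a scaffolding layer that wraps each observation in hints the researchers already know about one specific environment, and has nothing to offer on any other. The environment names and interface below are invented for illustration, not the Duke team's code.

```python
# Illustrative only: what environment-specific scaffolding can look like.
# All names and the prompt format are invented for this example.
KNOWN_ENV_HINTS = {
    "env_known": {
        "goal": "push the block onto the marked tile",
        "useful_actions": ["up", "down", "left", "right"],
    },
    # No entry exists for an unfamiliar environment: the scaffolding's
    # knowledge runs out, and the model is back to raw exploration.
}

def build_prompt(env_id: str, observation: str) -> str:
    """Wrap the raw observation in whatever the researchers already know
    about this specific environment before handing it to the model."""
    hints = KNOWN_ENV_HINTS.get(env_id)
    if hints is None:
        return observation  # unfamiliar environment: no help to give
    return (
        f"Goal: {hints['goal']}\n"
        f"Actions that matter: {', '.join(hints['useful_actions'])}\n"
        f"Current state: {observation}"
    )
```

The knowledge lives in the scaffolding, not the model, which is why the score collapses the moment the environment changes.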
"The question is not whether the model can solve this specific problem," Chollet said at the launch event, in a fireside conversation with OpenAI CEO Sam Altman. "The question is whether it can solve any new problem of this type. That is a fundamentally different capability."
The scoring methodology reinforces the distinction. ARC-AGI-3 uses Relative Human Action Efficiency, or RHAE: for each level, the number of actions a human needs is divided by the number the AI takes, the ratio is squared, and the result is capped at 1.0. A model that takes 100 actions on a level a human solves in 2 scores (2/100)² = 0.0004 there. Getting credit requires not just solving the task but solving it about as efficiently as an unbriefed person would. The benchmark heavily penalizes brute-force trial and error relative to solutions that mirror human reasoning.
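As a rough illustration of how that scoring behaves, here is a minimal sketch in Python, assuming the per-level score is simply the human-to-AI action ratio squared and capped, as the description above implies; the function names and the averaging across levels are this article's shorthand, not the foundation's code.

```python
# A minimal sketch of RHAE scoring, reconstructed from the description above
# (ratio of human to AI actions, squared per level, capped at 1.0). Not the
# ARC Prize Foundation's official implementation.

def rhae_level(human_actions: int, ai_actions: int) -> float:
    """Score one level: (human actions / AI actions) squared, capped at 1.0."""
    if ai_actions <= 0:
        return 0.0  # assumption: an unsolved or unattempted level scores zero
    return min(1.0, (human_actions / ai_actions) ** 2)

def rhae_environment(levels: list[tuple[int, int]]) -> float:
    """Average per-level scores across one environment.
    The aggregation method is an assumption; the article specifies only
    the per-level ratio, square, and cap."""
    return sum(rhae_level(h, a) for h, a in levels) / len(levels) if levels else 0.0

print(rhae_level(2, 100))    # 0.0004 -- the article's example
print(rhae_level(10, 10))    # 1.0    -- matching human efficiency earns full credit
print(rhae_level(10, 1000))  # 0.0001 -- brute force is punished quadratically
```

The quadratic penalty is what makes flailing so expensive: taking ten times as many actions as a human costs not 90 percent of the credit but 99 percent.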
Chollet has argued for years that what the field calls "AGI" is often a system that performs well on tasks humans have already trained it on, with extensive human scaffolding. General intelligence, in his framing, means the ability to acquire new skills in unfamiliar contexts without task-specific help. A system that requires a human-written harness to reach 97 percent on known problems and zero on new ones is not displaying general intelligence. It is displaying a very sophisticated form of memorization.
The benchmark carries a $2 million prize pool, with a $700,000 grand prize for a system that reaches 100 percent — matching the human baseline. The structural incentive is deliberately misaligned with how AI labs currently compete. The prize rewards generalization without scaffolding. The research and deployment culture rewards performance with human engineers in the loop.
That misalignment is the point. The ARC Prize Foundation was co-founded by Mike Knoop and François Chollet specifically to create an incentive that points in a different direction than compute and parameters. The 135 environments in ARC-AGI-3 were designed by a team of game designers led by Hunter Henry, whom Chollet hired to build tasks that feel trivial to humans but are impossible to pattern-match against training data.
What does this mean for the industry? Every public claim that frontier AI is approaching human-level reasoning on open-ended tasks deserves skepticism when the underlying evidence is performance on known tasks with extensive human support. The 97-to-zero gap is not an edge case. It is the baseline behavior of the most capable models when the scaffolding is removed.
The counterargument is that ARC-AGI-3 tests a narrow form of fluid reasoning that may not matter for most real-world applications. A model that scores 0.26 percent on this benchmark can still write coherent code, summarize documents, and pass professional exams. The question is whether those capabilities are building toward generalization or whether they represent a ceiling.
Chollet's answer, backed by years of this specific line of research, is that the distinction matters enormously for what comes next. If the ceiling is memorized competence, the path to genuine generalization requires a different approach — and the billion-dollar question is whether anyone is actually building toward it, or whether the industry is simply measuring the wrong thing in a race where the prize is revenue, not reasoning.
The next ARC-AGI-3 evaluation window opens later this year. The prize remains unclaimed.