0.37%: The Humbling Score for AI's Best Models
Two days before François Chollet published ARC-AGI-3, Jensen Huang went on the Lex Fridman podcast and said, plainly: we have achieved AGI. The benchmark that dropped on March 26 suggests otherwise — by a wide margin.
Every frontier model failed. According to The Decoder's coverage of the ARC Prize Foundation's release, Gemini 3.1 Pro Preview scored 0.37 percent, GPT 5.4 reached 0.26 percent, Opus 4.6 managed 0.25 percent, and Grok-4.20 scored 0.00 percent on the private test set. Humans, given the same tasks with no prior knowledge and no instructions, solved all 135 environments. The best AI agent during the month-long developer preview hit 12.58 percent. Frontier models tested through the official API — text in, text out, no custom tooling — could not crack 1 percent.
Chollet built this to hurt. The ARC Prize Foundation, which he co-founded with Mike Knoop, set up an in-house game studio and created 135 original interactive environments from scratch, designed to resist the strategies that killed ARC-AGI-1 and ARC-AGI-2. Those earlier versions were saturation casualties: labs threw compute and test-time training at them until the benchmarks were effectively dead. ARC-AGI-2 lasted about a year before Gemini 3.1 Pro hit 77.1 percent and the test lost its teeth. Chollet has said publicly that labs are paying far more attention to V3 than to earlier versions — which tells you he knows what he is up against.
The number that should make anyone pause: researchers at Duke University built a custom harness that gave Opus 4.6 visual input instead of the standard text-in API. On a single known environment variant called TR87, the harness scored 97.1 percent. Opus 4.6's official overall score on ARC-AGI-3 through the standard API remained 0.25 percent. One environment. Same model. Different interface. The raw perceptual machinery was apparently not the problem.
Chollet has said the benchmark tests reasoning, not perception. If you give the model direct visual access and it solves the task, you have not demonstrated general intelligence — you have demonstrated that the model can see. The official leaderboard does not use harnesses — text in, text out, through the API. The question the benchmark is asking is whether the model can take a sparse textual description of an interactive grid world and infer the underlying rule — not whether it can navigate the interface well.
Whether that distinction holds is exactly where the interesting argument lives. ARC tasks are visual by design: colored grids, spatial transforms, object persistence. Describing them in language is lossy. A model that can solve the task when it can see the grid, but cannot when given a textual encoding, is exhibiting a real limitation — but is it a reasoning limitation or a representation limitation? Chollet says reasoning. The Duke result complicates that story in ways the ARC Prize Foundation has not fully resolved in public.
The $2 million prize — across three tracks, with all winning solutions required to be open-sourced — is calibrated to attract serious attempts. That open-source requirement is the interesting competitive bet. If someone cracks the benchmark with a publicly documented approach, every lab benefits. If no one does, the private test set stays private long enough to serve as a genuine measurement tool rather than a training set.
A total of 486 human players established the performance baseline across thousands of first-run attempts. The scoring formula takes the squared ratio of human actions to AI actions on each level, capped at the human baseline, so it penalizes brute-force approaches that use far more steps than a human would. You can solve the task and still score close to zero if you solve it wastefully. The AI also gets a maximum number of attempts per task: five times the human attempt count. Beyond that, the task is marked unresolved.
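To make the penalty concrete, here is a minimal Python sketch of the scoring rule as this article describes it. It is not the ARC Prize Foundation's implementation. The function names, the reading of "capped at the human baseline" as a per-level ceiling of 1.0, and the treatment of unsolved levels as zero are all assumptions made for illustration.

```python
def level_score(human_actions: int, ai_actions: int, solved: bool = True) -> float:
    """Hypothetical per-level score: squared ratio of human to AI actions.

    Assumption: the cap at the human baseline means the ratio cannot
    exceed 1.0, so beating the human action count earns no bonus.
    Assumption: an unsolved level scores 0.0.
    """
    if not solved or ai_actions <= 0:
        return 0.0
    ratio = min(1.0, human_actions / ai_actions)
    return ratio ** 2


def within_attempt_budget(ai_attempts: int, human_attempts: int, factor: int = 5) -> bool:
    """Hypothetical check for the attempt limit of five times the human count."""
    return ai_attempts <= factor * human_attempts


# A human baseline of 10 actions against increasingly wasteful agents.
for ai_actions in (10, 20, 100):
    print(ai_actions, level_score(10, ai_actions))
```

Under these assumptions, an agent that needs twice as many actions as the human baseline keeps a quarter of the level's credit, and one that needs ten times as many keeps about one percent. Squaring the ratio is what makes the drop so steep.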
The broader context is the claim-to-reality gap that benchmarks like this are trying to expose. MMLU has been saturated for years. GSM8K is homework. HumanEval is a coding interview. When Jensen Huang says AGI is here and frontier models score below 1 percent on the most carefully constructed test of fluid reasoning that Chollet and Knoop know how to build, one of those claims is wrong. The ARC-AGI-3 numbers do not resolve the argument — but they make it considerably harder to dismiss as nitpicking.
What to watch next: whether frontier labs treat ARC-AGI-3 as a target or a nuisance. Previous versions were gamed. Chollet has designed protections, but the Duke result suggests the interface is a lever. If labs start shipping models with multimodal wrappers purpose-built for ARC-AGI-3 — visual input, action history, state tracking — the benchmark will face the same saturation pressure as its predecessors. The question is how many rounds that takes.
The prize deadline and leaderboard updates will be the clearest signal of whether the benchmark is holding. Until then, 0.37 percent is the number.