A neuro-symbolic reasoning system built by CoreThink AI has achieved 24.4% on ARC-AGI-2, a benchmark specifically designed to punish brute-force approaches, by separating perception from rule induction rather than scaling up test-time compute. The result raises a pointed question for the industry's dominant orthodoxy: if a four-stage symbolic pipeline can extract 8.4 percentage points of performance from a weak base model without any fine-tuning, what exactly is all that extra inference-time compute buying?
ARC-AGI-2, the second generation of a benchmark introduced by François Chollet in 2019, evaluates fluid intelligence: the ability to infer abstract transformation rules from a handful of examples and apply them to a novel test case. Unlike conventional supervised learning benchmarks, it offers no training distribution to exploit and explicitly bars tasks susceptible to brute-force enumeration. It is, by design, hostile to scale.
The benchmark's public evaluation set contains 120 tasks. Current frontier models handle it well: Google Gemini 3.1 Deep Think scores 85%, OpenAI GPT-5.4 Pro reaches 83%, and Anthropic Claude Opus 4.6 lands at 69%, according to a March 2026 leaderboard maintained by BraCAI. Below them the curve drops steeply. Grok-4, the strongest open-access model on the board, manages 16%. DeepSeek v3.2 scores 4%. Qwen 3 scores 1%. Llama 4 Maverick scores 0%.
Most of the field is effectively guessing. Yet the industry's response has been to throw more compute at inference time. Test-time scaling, extended reasoning chains, massive sampling, and self-consistency voting have produced the headline numbers at the frontier. They have done almost nothing for the long tail of weaker models.
CoreThink AI, a startup with researchers from Stanford University, took a different path. Their system, posted to arXiv on April 2, implements a four-stage pipeline that treats perception and reasoning as separate problems. The first stage parses each ARC grid into a structured symbolic scene graph: it identifies connected components, computes bounding boxes and centroids, builds color histograms, and detects enclosed cavities. No neural network is involved. It is deterministic image processing.
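The deterministic parsing the first stage performs can be sketched in plain Python. The sketch below is illustrative only, assuming a grid represented as a list of lists of color integers with 0 as background; the function and field names are assumptions, not CoreThink's actual code, and cavity detection is omitted for brevity.

```python
from collections import Counter, deque

def parse_scene(grid):
    """Parse an ARC grid into a simple symbolic scene description:
    connected components with bounding boxes and centroids, plus a
    color histogram. Illustrative sketch, not the paper's code."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    objects = []
    for r in range(h):
        for c in range(w):
            if grid[r][c] == 0 or seen[r][c]:
                continue
            # Flood-fill one 4-connected component of same-colored cells.
            color, cells, queue = grid[r][c], [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and grid[ny][nx] == color):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            ys, xs = [y for y, _ in cells], [x for _, x in cells]
            objects.append({
                "color": color,
                "bbox": (min(ys), min(xs), max(ys), max(xs)),
                "centroid": (sum(ys) / len(ys), sum(xs) / len(xs)),
                "size": len(cells),
            })
    histogram = Counter(v for row in grid for v in row if v != 0)
    return {"objects": objects, "histogram": histogram}
```

Everything here is classical image processing: no learned weights, no sampling, deterministic output for a given grid.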
The second stage uses a neural prior to propose candidate transformation programs, but those programs are drawn from a fixed domain-specific language (DSL) of 22 atomic Unit Patterns. Think of these as a vocabulary of visual operations: primitives for symmetry detection, filling, rotation, reflection, pattern propagation, and relational constraints between objects. The neural model does not generate arbitrary pixel output. It ranks plausible compositions of these primitives.
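A DSL of this kind amounts to a small library of grid-to-grid functions plus a composition operator. The toy version below stands in for the paper's 22 Unit Patterns; the primitive names and signatures are assumptions made for illustration, not the actual DSL.

```python
# A toy stand-in for a grid-transformation DSL. Primitive names are
# illustrative assumptions, not the paper's 22 Unit Patterns.

def rotate90(grid):
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def reflect_h(grid):
    """Mirror the grid left-to-right."""
    return [row[::-1] for row in grid]

def recolor(old, new):
    """Return a primitive that replaces one color with another."""
    def apply(grid):
        return [[new if v == old else v for v in row] for row in grid]
    return apply

def compose(*prims):
    """Chain primitives into a candidate transformation program."""
    def program(grid):
        for p in prims:
            grid = p(grid)
        return grid
    return program

# A neural prior would rank compositions like this one by plausibility
# rather than generating raw pixel output:
candidate = compose(rotate90, recolor(1, 2))
```

The point of the restriction is that the search space is compositions of a fixed vocabulary, so every candidate the neural model proposes is already a well-formed, executable program.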
The third stage applies cross-example consistency filtering. If a candidate program explains training example one and three but fails on example two, it is discarded. The pipeline keeps only programs that jointly satisfy all demonstrations. It then selects the simplest such program by primitive count, a form of Occam preference that discourages overfit explanations.
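The filtering step is simple to state in code. This sketch assumes candidates arrive as (program, primitive_count) pairs and demonstrations as (input, expected output) pairs; the function name and interface are illustrative, not the paper's implementation.

```python
def filter_consistent(candidates, demos):
    """Keep only programs that reproduce every demonstration pair,
    then prefer the shortest by primitive count (Occam preference).
    `candidates`: list of (program, primitive_count) tuples.
    `demos`: list of (input_grid, expected_output) pairs.
    Illustrative sketch, not the paper's implementation."""
    surviving = [
        (prog, n) for prog, n in candidates
        if all(prog(inp) == out for inp, out in demos)
    ]
    if not surviving:
        return None
    # Occam preference: the simplest jointly-consistent program wins.
    return min(surviving, key=lambda pair: pair[1])[0]
```

A program that fits two of three demonstrations is rejected outright, which is what makes the filter a hard constraint rather than a soft score.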
The fourth stage generates the test output, either by executing the symbolic program directly or by conditioning an LLM on the retained symbolic hints. A self-consistency vote over multiple LLM samples provides robustness. When combined with outputs from ARC Lang Solver via a meta-classifier, the ensemble reaches 30.8% on the public set.
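Self-consistency voting over grid outputs is a majority vote after hashing each sampled grid into a comparable form. The sketch below shows the general technique, not CoreThink's exact code.

```python
from collections import Counter

def self_consistency_vote(samples):
    """Majority vote over multiple sampled output grids.
    Each grid is keyed by its tuple form so it can be counted.
    Sketch of the general technique, not CoreThink's exact code."""
    keyed = Counter(tuple(map(tuple, g)) for g in samples)
    winner, _ = keyed.most_common(1)[0]
    return [list(row) for row in winner]
```

The vote buys robustness against any single bad sample, but, as the ablations below show, it contributes less than the symbolic hints it is conditioned on.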
The paper's own ablation numbers are the most telling part of the result. Removing symbolic hints while keeping self-consistency voting drops performance from 24.4% to 17.5%, a loss of 6.9 percentage points. Removing self-consistency while keeping hints drops it from 24.4% to 20.5%, a loss of 3.9 points. The symbolic preprocessing layer is doing the heavy lifting. It adds negligible runtime. The LLM sampling that dominates runtime cost contributes the smaller gain.
What this means for the compute-as-progress story is worth sitting with. Test-time scaling proponents have argued that throwing more inference compute at a problem produces better results, full stop. The CoreThink result does not refute this at the frontier, where Gemini 3.1 Deep Think at 85% is not using a neuro-symbolic pipeline. But for the broad middle of the capability distribution, the result suggests that structural inductive bias matters more than stochastic sampling. A model at 16% can reach 24.4% with the right scaffolding around it. That scaffolding is cheap. The LLM inference is expensive.
The paper makes the point explicitly: separating perception, neural-guided transformation proposal, and symbolic consistency filtering "reduces reliance on brute-force search and sampling-based test-time scaling." That is a direct challenge to the industry's dominant investment thesis, framed not as a philosophical argument but as an empirical result on a public benchmark.
There are legitimate reasons for skepticism. The 22-primitive DSL is hand-engineered, which means domain knowledge was injected by humans rather than discovered from data. The pipeline was evaluated on ARC-AGI-2, a specific task family, and it is unclear how well the approach generalizes to other reasoning domains. The 30.8% ensemble figure combines the neuro-symbolic reasoner with an LLM-based solver, making it harder to attribute gains precisely. And the gap to frontier performance remains enormous: the architecture that helped Grok-4 climb from 16% to 24% has not closed the 60-point chasm between Grok-4 and Gemini 3.1 Deep Think.
Whether the approach scales with the frontier or merely narrows the gap for weaker models is an open question. The authors note that at the time of its release in November 2025, their standalone Compositional Reasoner held state of the art under their experimental setup, before frontier models had reached their current heights. That qualifier matters. The result is real. The generalization claim is not yet proven.
The code is open-source at github.com/CoreThink-AI/arc-agi-2-reasoner. The paper is arXiv:2604.02434, submitted April 2 by Anugyan Das, Omkar Ghugarkar, and Vishvesh Bhat from CoreThink AI, alongside Asad Aali from Stanford University.
What the CoreThink result suggests, cautiously, is that for tasks requiring genuine compositional generalization, the architecture of reasoning matters independently of the scale of the model doing it. The 24.4% figure is not going to make frontier labs reconsider their roadmaps. But for anyone building systems that need to generalize from few examples rather than interpolate from massive datasets, it is a proof of concept worth examining.