The Four-Benchmark Minimum: Why Your LLM Leaderboard Can't Tell #1 From #2
A stereological reading of three public leaderboards says the suite sees the world through a 3- to 5-dimensional lens — and that a principled 4-benchmark core recovers the ranking with a published guarantee.
Pick a public LLM leaderboard — Open LLM v2, the extended 12-benchmark suite, or LiveBench. Now hold this in your head: the visible score gap between the model at #1 and the model at #2 on that board is smaller than the geometry of the suite itself can resolve. The ranking wobbles. A new theory paper by Jason Z. Wang, "The Evaluation Blind Spot: A Stereological Theory of Benchmark Coverage for Large Language Models" (arXiv:2606.05169, v1, April 2026), shows why — and offers a fix.
The wobble, in one number
Wang measures the effective dimensionality d_eff of three independent leaderboards on their competitive frontier and finds it falls in [2.86, 4.80]. That figure is the punchline: even when a suite lists 12 named benchmarks, as the extended suite does, the information it carries about frontier models behaves as if it lives in roughly 3 to 5 independent directions. A nominally 12-axis instrument that is effectively 3- to 5-dimensional cannot, by construction, separate two models whose visible score gap is smaller than the suite's intrinsic resolution.
The numbers that follow from that one fact are blunt. The structural blind spot (the maximum hidden capability gap compatible with identical published scores) exceeds the observed runner-up score gap on the same board by roughly two orders of magnitude, and dominates statistical noise on the leaderboard by 52× to 127× (Wang, arXiv:2606.05169). The field has been reading tiny deltas as news; the suite's own geometry says they fall below what it can resolve, a limit far larger than sampling noise.
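For readers who want a feel for what an effective dimensionality measures, here is a minimal sketch using the participation ratio of the score covariance, which is one common definition. The article does not say which estimator Wang uses, so the estimator, the function names, and the toy score table below are all assumptions made for illustration.

```python
def covariance(scores):
    """Sample covariance between benchmark columns; each row is one model."""
    n, k = len(scores), len(scores[0])
    means = [sum(row[j] for row in scores) / n for j in range(k)]
    centered = [[row[j] - means[j] for j in range(k)] for row in scores]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(k)]
            for a in range(k)]


def participation_ratio(cov):
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues).

    For a symmetric matrix the eigenvalue sum equals the trace, and the sum of
    squared eigenvalues equals the sum of squared entries, so no eigensolver
    is needed. This estimator is an assumption, not necessarily the paper's.
    """
    trace = sum(cov[i][i] for i in range(len(cov)))
    frob_sq = sum(x * x for row in cov for x in row)
    return trace * trace / frob_sq


# Hypothetical toy table: 4 models (rows) x 4 benchmarks (columns).
# Columns 3 and 4 duplicate columns 1 and 2, so four named benchmarks
# carry only two independent directions.
toy_scores = [
    [1, 1, 1, 1],
    [-1, 1, -1, 1],
    [1, -1, 1, -1],
    [-1, -1, -1, -1],
]
d_eff = participation_ratio(covariance(toy_scores))  # approximately 2
```

The toy table is the point of the sketch: adding benchmarks that restate existing ones raises the axis count without raising d_eff.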
What the wobble looks like in practice
Wang runs a chi-squared projection model to simulate the worst case. Across six hidden-capability priors and four ambient dimensions, the simulated half-split swap rate of the top two models stays in [0.38, 0.49], essentially a coin flip. Then he runs the harder experiment: 500 trials in which he randomly partitions the benchmarks into a visible set and a held-out set, recomputes the ranking, and asks how often the top-1 model survives. The answer, per the paper, is that 92% of trials swap the #1 ranking, with an average of 2.83 of the top 5 models changing per split. The "ranking" you read this morning is one draw out of a distribution.
This is the single most legible number in the paper. If nine out of ten reasonable splits of the same suite give you a different #1, the suite is not telling you who is #1.
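A split experiment of this kind is easy to sketch. The version below is one plausible reading, not the paper's exact protocol: it assumes each model is ranked by its mean score, splits the benchmarks into two random halves, and counts how often the two halves disagree about the top model. The function names and the toy scores are hypothetical.

```python
import random


def top_model(scores, benchmarks):
    """Model with the highest mean score over the given benchmark indices."""
    return max(
        scores,
        key=lambda m: sum(scores[m][b] for b in benchmarks) / len(benchmarks),
    )


def half_split_swap_rate(scores, trials=500, seed=0):
    """Fraction of random half-splits in which the two halves name a different #1."""
    rng = random.Random(seed)
    n_bench = len(next(iter(scores.values())))
    swaps = 0
    for _ in range(trials):
        order = list(range(n_bench))
        rng.shuffle(order)
        half = n_bench // 2
        visible, held_out = order[:half], order[half:]
        if top_model(scores, visible) != top_model(scores, held_out):
            swaps += 1
    return swaps / trials


# Made-up scores for three hypothetical models on four benchmarks.
toy = {
    "model_a": [0.81, 0.62, 0.70, 0.55],
    "model_b": [0.79, 0.66, 0.68, 0.58],
    "model_c": [0.60, 0.50, 0.52, 0.40],
}
rate = half_split_swap_rate(toy)
```

Running the same function on a real score table is a quick way to check whether a published #1 survives a reshuffle of the suite.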
The constructive payoff: the irreducible four
Here is the part evaluators should print out and tape to a monitor. Wang formulates benchmark selection as a submodular coverage problem and applies a greedy algorithm with the Nemhauser–Wolsey–Fisher (1 − 1/e) approximation guarantee. On the 12-benchmark extended suite, the algorithm finds a stable core of 4 benchmarks: an irreducible set, in the sense that removing any one of them collapses coverage of the capability surface. Seven of the twelve benchmarks suffice for ~90% coverage, and that 7-benchmark subset transfers across temporal quarters with 93–97% retention, meaning the chosen core is not a quirk of one snapshot of the leaderboard.
Translated into a question you can put to a benchmark vendor tomorrow: which four? If they cannot name four irreducible benchmarks and tell you which directions of capability they pin down, the suite they are selling you is decorative.
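The greedy selection step can be sketched in a few lines. The example below uses plain set coverage as a stand-in for the paper's coverage function, which the article does not spell out; set coverage is monotone and submodular, so the same (1 − 1/e) guarantee applies under a cardinality budget. The benchmark names and capability labels are hypothetical.

```python
def greedy_core(coverage, budget):
    """Pick up to `budget` benchmarks, each time adding the one that covers
    the most capabilities not yet covered (largest marginal gain)."""
    chosen, covered = [], set()
    for _ in range(budget):
        best = max(
            (b for b in coverage if b not in chosen),
            key=lambda b: len(coverage[b] - covered),
            default=None,
        )
        # Stop early once no remaining benchmark adds anything new.
        if best is None or not coverage[best] - covered:
            break
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered


# Hypothetical benchmarks mapped to the capability labels each one probes.
toy_coverage = {
    "bench_math": {"arithmetic", "algebra", "proof"},
    "bench_code": {"proof", "debugging"},
    "bench_reasoning": {"debugging", "planning", "logic", "analogy"},
    "bench_quiz": {"arithmetic"},
}
core, covered = greedy_core(toy_coverage, budget=4)
# core == ["bench_reasoning", "bench_math"]; the other two add nothing new.
```

In this toy case the greedy pass stops at two benchmarks even with a budget of four, which is the small-scale analogue of a suite whose remaining benchmarks restate what the core already covers.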
Does the theory predict reality?
The risk with a result this clean is that it could be a mathematical artifact. Wang runs a counterfactual validation against 12 internal benchmarks and 27 public Chatbot Arena categories. The eigenstructure he derives predicts, with Spearman ρ = −0.69 (p = 0.013), which evaluations are irreplaceable — removing them from the suite causes the largest disruption to the recovered ranking — and predicts, with ρ = +0.38, which external evaluations bring genuinely new information rather than restating what the core already covers (arXiv:2606.05169). The geometry is not just internally consistent; it lines up with which benchmarks actually move the needle.
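For readers less familiar with the statistic quoted above, here is a minimal sketch of how a Spearman rank correlation is computed: rank both lists, then take the Pearson correlation of the ranks. The helper names are hypothetical, the inputs are left to the reader, and nothing here reproduces the paper's data.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    return pearson(ranks(x), ranks(y))
```

Because only ranks enter the calculation, the statistic asks whether the predicted ordering of benchmarks matches the observed ordering, not whether the magnitudes agree.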
A second, independent result
As a separate theoretical contribution, the paper resolves Gardner's Problem 1.5 (1995) for C² support functions, establishing a minimax rate of Θ(R / (κ m^{2/(D−1)})) in general dimension via optimal recovery theory on the sphere S^{D−1} (Wang, arXiv:2606.05169). The Gardner result is not required reading for the LLM-evaluation story, but it does mean the bounds Wang uses are not ad hoc — they sit on a clean optimal-recovery footing that predates the LLM era by three decades.
What to do with this
Stop reading the #1–#2 score gap on a 12-benchmark suite as a meaningful signal. Treat deltas below the published structural blind spot as noise. The suite is a 3- to 5-dimensional object, not a 12-axis one, and the field has been spending its attention on the 12 axes. Wang's constructive contribution — a 4-benchmark irreducible core, a 7-benchmark 90% set, and a counterfactually validated eigenstructure — gives evaluators a way out: pick the core, publish its d_eff, and stop racing to add benchmarks when the geometry says the racing was never resolving anything in the first place.
The paper is a single-author preprint and has not been peer reviewed; the 92% swap rate, the ~100× ratio, and the 4-benchmark core should be cited to Wang (2026) / arXiv:2606.05169, not asserted as established consensus. But the constructive test the result invites is sharp and falsifiable: ask your next leaderboard which four benchmarks form its irreducible core, and ask it to publish d_eff. If it cannot, the leaderboard is a scoreboard — and scoreboards, the paper argues, are not the same thing as evaluations.