Wrong Date vs. Fabricated Ruling: The Hidden Shape Inside AI's Mistakes


A model that confidently cites a court ruling that does not exist, and a model that gets a date wrong: on a standard error-rate benchmark, both register as a single "error." In a new preprint, Jason Z Wang's ERRORQUAKE argues that this collapse is not a minor reporting quirk but a structural blind spot in how the field evaluates open-weight large language models — and that any leaderboard that reports only a scalar accuracy is, in effect, refusing to answer the question every high-stakes deployer actually cares about: not just how often a model is wrong, but how bad the wrongness gets.

The paper, posted to arXiv on 15 April 2026 as arXiv:2606.05170, introduces a benchmark called Errorquake-10k: 10,000 queries scored on a continuous 0–4 severity scale across eight domains — BIO, LAW, HIST, GEO, SCI, TECH, FIN, CULT — and five difficulty tiers, with per-model severity distributions fit for 21 open-weight LLMs from 10 families spanning roughly 3B to 37B active parameters. The headline tool is a Gutenberg–Richter-style "b" index — the slope of the upper tail of the severity distribution — reported with 95% bootstrap confidence intervals. A "matched-accuracy" pair is defined as two models whose scalar error rates differ by less than 0.05 on human-consensus scoring, the criterion the author uses to hold the rate constant and ask what severity shape looks like underneath.
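To make the b index concrete, here is a minimal Python sketch of one way to estimate a Gutenberg–Richter-style tail slope with a percentile bootstrap interval. The article does not say which estimator the paper uses, so the Aki-style maximum-likelihood formula, the threshold `s_min`, the function names, and the synthetic data are all illustrative assumptions, not the author's method. Under the usual Gutenberg–Richter convention, a smaller b means a heavier tail, that is, relatively more high-severity errors.

```python
import numpy as np

def b_index(severities, s_min=2.0):
    """Aki-style maximum-likelihood tail slope for scores at or above s_min.

    Assumes exceedances over s_min are roughly exponential; s_min=2.0 is an
    illustrative threshold, not a value taken from the paper.
    """
    scores = np.asarray(severities, dtype=float)
    tail = scores[scores >= s_min]
    if tail.size < 2 or tail.mean() <= s_min:
        return float("nan")  # too little tail data to estimate a slope
    return float(np.log10(np.e) / (tail.mean() - s_min))

def bootstrap_ci(severities, s_min=2.0, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for b_index, resampling items with replacement."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(severities, dtype=float)
    draws = [b_index(rng.choice(scores, size=scores.size, replace=True), s_min)
             for _ in range(n_boot)]
    low, high = np.nanpercentile(draws, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(low), float(high)

def synthetic_scores(scale, n=5000, seed=0):
    """Toy severity scores on the 0-4 scale: exponential draws clipped at 4."""
    rng = np.random.default_rng(seed)
    return np.clip(rng.exponential(scale, size=n), 0.0, 4.0)

if __name__ == "__main__":
    heavy = synthetic_scores(scale=1.0, seed=1)  # more mass at high severity
    light = synthetic_scores(scale=0.5, seed=2)  # errors mostly mild
    for name, scores in [("heavy", heavy), ("light", light)]:
        print(name, b_index(scores, s_min=1.0), bootstrap_ci(scores, s_min=1.0))
```

Because the severity scale is capped at 4, the exponential assumption is only approximate near the top of the range, and a real analysis would need to check how sensitive b is to the choice of threshold.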

The empirical claim is sharp. Of 210 such matched-accuracy pairs drawn from the 21 models, 85 have disjoint 95% confidence intervals on b — meaning that, even when two models are essentially tied on the rate-of-being-wrong axis, their tails behave as if they belong to different systems. The worked example Wang names is deepseek-v3.2 versus ministral-14b: at a matched error rate of ε = 0.586, deepseek-v3.2 has b = 0.655 while ministral-14b has b = 1.122 — a Δb of 0.467 with non-overlapping 95% confidence intervals. The framing is the author's own and should be read as such, not as field consensus.
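The pair-counting logic described above can be sketched in a few lines of Python. The error rates and b values for the two named models come from the article; the confidence-interval bounds are placeholders invented for illustration, since the article reports only that the intervals do not overlap. The `ModelFit` type and the function names are hypothetical.

```python
from itertools import combinations
from math import comb
from typing import NamedTuple

class ModelFit(NamedTuple):
    name: str
    error_rate: float
    b: float
    ci_low: float
    ci_high: float

def is_matched(m1, m2, tol=0.05):
    """Matched-accuracy criterion: error rates differ by less than tol."""
    return abs(m1.error_rate - m2.error_rate) < tol

def cis_disjoint(m1, m2):
    """True when the two confidence intervals on b do not overlap."""
    return m1.ci_high < m2.ci_low or m2.ci_high < m1.ci_low

def disjoint_matched_pairs(fits, tol=0.05):
    """Names of all matched-accuracy pairs whose b intervals are disjoint."""
    return [(x.name, y.name) for x, y in combinations(fits, 2)
            if is_matched(x, y, tol) and cis_disjoint(x, y)]

# b and error rate are from the article; the CI bounds are PLACEHOLDERS.
fits = [
    ModelFit("deepseek-v3.2", 0.586, 0.655, 0.555, 0.755),
    ModelFit("ministral-14b", 0.586, 1.122, 1.022, 1.222),
]

if __name__ == "__main__":
    print(disjoint_matched_pairs(fits))
    print("delta b:", round(fits[1].b - fits[0].b, 3))
    print("unordered pairs among 21 models:", comb(21, 2))
```

For scale, 21 models yield 210 unordered pairs in total, so the filter in `disjoint_matched_pairs` is applied to at most that many candidates.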

To back the b index, Wang ran a 519-item three-rater human study. Inter-rater reliability was reported at ICC(2, k=3) = 0.85 (95% CI [0.83, 0.87]), and LLM-judge versus human agreement at ρ = 0.89 (p < 0.001) across 15 models. A dense-model scaling check returned ρ_s = −0.86 on human data, stronger than the judge-based ρ_s = −0.56. These are author-attributed validation numbers on a single 519-item set, not independently replicated results, and any downstream claim should carry that caveat.
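For readers unfamiliar with the reliability statistic, ICC(2, k) is the two-way random-effects, average-measures intraclass correlation in the Shrout and Fleiss scheme. The Python sketch below computes it from an items-by-raters matrix using the standard mean-squares formula. The synthetic ratings are an assumption made purely for illustration; they are not the paper's data and are not tuned to reproduce 0.85.

```python
import numpy as np

def icc2k(ratings):
    """ICC(2, k): two-way random effects, average of k raters.

    `ratings` is an (n_items, k_raters) array with no missing values.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)  # one mean per rated item
    col_means = x.mean(axis=0)  # one mean per rater
    ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)
    resid = x - row_means[:, None] - col_means[None, :] + grand
    ms_err = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return float((ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n))

def toy_ratings(n_items=519, k_raters=3, noise_sd=0.5, seed=0):
    """Synthetic 0-4 severity ratings: a shared true score plus rater noise."""
    rng = np.random.default_rng(seed)
    truth = rng.uniform(0.0, 4.0, size=(n_items, 1))
    return truth + rng.normal(0.0, noise_sd, size=(n_items, k_raters))

if __name__ == "__main__":
    print(icc2k(toy_ratings()))
```

The statistic approaches 1 when raters agree closely relative to the spread between items, and it falls as rater noise or systematic rater offsets grow.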

The theoretical spine is what Wang calls a Non-Reducibility Theorem: the conditional mutual information between model identity and the b index, given accuracy, is reported as I(b; model | ε) = 1.56 bits, with roughly 64.5% of cross-model variance in b left unexplained by ε. In the author's framing, severity profile and error rate are informationally non-redundant — knowing one does not let a reader skip the other. The exact independence and sample-size assumptions behind that 1.56-bit figure are not fully resolved in the source material, so the number is best reported as the paper's own claim rather than as an established lower bound.
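For reference, the quantity named here has a standard textbook definition. Writing M for model identity and assuming b and ε have been discretized (the article does not describe the paper's binning or estimator, so this is the generic form rather than the paper's exact construction), the conditional mutual information in bits is:

```latex
I(b; M \mid \varepsilon)
  = \sum_{e} p(e) \sum_{b,\, m} p(b, m \mid e)\,
    \log_2 \frac{p(b, m \mid e)}{p(b \mid e)\, p(m \mid e)}
  = H(b \mid \varepsilon) - H(b \mid M, \varepsilon)
```

The quantity is non-negative and equals zero exactly when b and M are conditionally independent given ε, which is why a clearly positive value is read as severity carrying information that the error rate does not.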

The mechanism story is the part most likely to be useful to builders. Wang's severity taxonomy, reported with inter-rater κ = 0.83, finds that low-severity errors are dominated by retrievals (71% within that bin) while high-severity errors are led by fabrications (39% within that bin), and that the composition shifts with model size at p < 0.0001. The percentages are compositions within severity bins, not a partition of all errors, and should not be read as covering the whole error space. Read carefully, the claim is narrower and more interesting: as models scale, the mix of "got the wrong date" versus "invented a plausible-looking ruling" tilts in measurable ways, and that tilt is invisible on a scalar leaderboard.
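The within-bin point is easy to get wrong, so here is a small Python sketch of how such shares are computed: each percentage is normalized inside its own severity bin, so shares from different bins do not add up to anything meaningful. The toy counts are chosen only to reproduce the two reported shares, and the "other" label is a catch-all assumption, not a category from the paper's taxonomy.

```python
from collections import Counter, defaultdict

def composition_within_bins(errors):
    """Map each severity bin to the share of each error type inside that bin."""
    counts = defaultdict(Counter)
    for severity_bin, error_type in errors:
        counts[severity_bin][error_type] += 1
    return {
        bin_name: {etype: n / sum(c.values()) for etype, n in c.items()}
        for bin_name, c in counts.items()
    }

# Toy data (NOT the paper's counts): 100 labeled errors per bin.
toy_errors = (
    [("low", "retrieval")] * 71 + [("low", "other")] * 29
    + [("high", "fabrication")] * 39 + [("high", "other")] * 61
)

if __name__ == "__main__":
    for bin_name, shares in composition_within_bins(toy_errors).items():
        print(bin_name, shares)
```

Each bin's shares sum to 1 on their own, which is why the 71% and 39% figures describe two separate distributions and not slices of one.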

The constructive lever Wang proposes is simple to state and uncomfortable for the field: report the severity distribution alongside accuracy, not in place of it. The 85-of-210 result is the empirical case that accuracy alone cannot resolve which model is safer to deploy; the non-reducibility figure is the information-theoretic case that severity carries signal the rate cannot; the mechanism shift is the case that the signal is not noise but a learnable property of the model. None of that is a claim that severity reporting "solves" hallucination, and the paper does not argue that. It argues that any evaluator who cares about the difference between a wrong date and a fabricated court ruling is currently using a ruler that cannot see the difference.

The caveats that travel with the source matter. arXiv:2606.05170 is a sole-author v1 preprint in cs.LG, not peer-reviewed, with no companion press release, lab blog, or co-author institution visible in the source set. The precise independence assumptions behind the Non-Reducibility Theorem, and any independent replication or critique since 15 April 2026, remain open questions rather than reasons to dismiss the central claim. A reader who wants to use b as a knob should still treat the headline numbers as author-attributed and the framing as the author's, not as a settled industry position.

The practical upshot is a new question to put to any model card, any leaderboard, and any procurement memo: not just what is the error rate, but what is the tail — and which kind of wrong populates it. ERRORQUAKE's contribution is to give that question a number, a validation study, and a theorem saying the number cannot be recovered from accuracy alone. Whether the field adopts the number is a separate question. The conflation it names is already here.
