AI models keep scoring differently on the same test. A new reporting standard wants to fix that.

When the same AI model posts two wildly different scores on the same benchmark, that gap is usually an accountability problem, not a measurement quirk. LLaMA 65B has been recorded at both 63.7 and 48.8 on MMLU, the standard multiple-choice knowledge test for large AI models, depending on who ran the evaluation. The roughly 15-point swing is what happens when benchmarks are run by many labs in many ways and the underlying settings are rarely published.

In February 2026, a coalition of researchers and a major model hub set out to make that gap harder to ignore. The EvalEval Coalition launched its Every Eval Ever, or EEE, JSON schema, a single shared format for recording who ran an evaluation, which model, how it was accessed, what generation settings were used, and what the metric actually means. Hugging Face followed with Community Evals, a feature that lets anyone push benchmark scores into the small YAML files that live on each model's Hub page. A converter now syncs EEE results into those YAML fields, so the same number does not have to be typed in twice.
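As a rough illustration of what a record in such a format might hold, here is a minimal Python sketch. The class name, field names, evaluator label, and generation settings are all hypothetical, chosen only to mirror the categories described above. They are not taken from the actual EEE schema.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class EvalRecord:
    """Hypothetical evaluation record; field names are illustrative, not the real EEE schema."""
    evaluator: str       # who ran the evaluation
    model: str           # which model was evaluated
    access_method: str   # how the model was accessed (local weights, hosted API, ...)
    benchmark: str       # which test was run
    metric_name: str     # what the reported number means
    score: float         # the reported number itself
    generation_settings: dict = field(default_factory=dict)  # decoding and prompt settings


record = EvalRecord(
    evaluator="example-lab",                  # placeholder, not a real reporter
    model="LLaMA 65B",
    access_method="local weights",            # assumed for illustration
    benchmark="MMLU",
    metric_name="accuracy (%)",               # assumed for illustration
    score=63.7,                               # one of the two scores cited in the article
    generation_settings={"temperature": 0.0}, # placeholder setting
)

# Serialize to JSON, the kind of text a shared format would exchange.
serialized = json.dumps(asdict(record), indent=2)
print(serialized)
```

The point of the sketch is that the score travels with its context: a second record for the same model and benchmark with different settings would be distinguishable rather than contradictory.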

Together the two pieces add up to one bet: if reporting is standardized, the field will at least know what it is comparing.

The scale of the underlying mess is concrete. According to the project, the EEE datastore, now hosted on Hugging Face, holds roughly 229,000 evaluation results across more than 22,000 models and 2,200 benchmarks, pulled together from 31 different reporting formats. The coalition estimates that reproducing the underlying runs from scratch would cost hundreds of thousands of dollars. That is the scattered record the schema is trying to knit into something a buyer, a safety researcher, or a policymaker could actually compare.

EEE is a reporting schema, not a new benchmark, and that distinction matters. According to the coalition's project page, the schema records what was reported. It does not generate new scores. The schema accepts results from harness logs, leaderboard scrapes, and published paper numbers. The LLaMA 65B swing is one of the cases the project uses to argue why standardized record-keeping matters: same model, same test, and the published numbers diverge until the evaluation settings are pinned down.

What the standard does not do is verify anything. A schema records the claim; it does not grade the run. Schema compliance means the run is documented in the same shape as every other documented run. It does not mean the documentation matches reality. The field still has to police benchmark gaming, prompt contamination, and cherry-picked reporting settings. The schema makes those problems easier to audit, not impossible to commit.
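The gap between compliance and verification can be sketched in a few lines of Python. The required fields below are hypothetical stand-ins, not the real EEE schema, and the inflated score is deliberately made up to show that a shape check cannot tell an honest record from a false one.

```python
# Hypothetical required fields and types; not the actual EEE schema.
REQUIRED_FIELDS = {
    "evaluator": str,
    "model": str,
    "benchmark": str,
    "metric_name": str,
    "score": float,
}


def is_well_formed(record: dict) -> bool:
    """Return True if every required field is present with the expected type.

    This checks only the shape of the record. It says nothing about
    whether the reported score matches what the run actually produced.
    """
    return all(
        isinstance(record.get(name), expected)
        for name, expected in REQUIRED_FIELDS.items()
    )


reported = {
    "evaluator": "example-lab",  # placeholder
    "model": "LLaMA 65B",
    "benchmark": "MMLU",
    "metric_name": "accuracy (%)",
    "score": 48.8,               # one of the two scores cited in the article
}

# Same shape, but with an invented score: the check still passes.
inflated = {**reported, "score": 99.9}

# Missing the metric definition: the check fails.
incomplete = {k: v for k, v in reported.items() if k != "metric_name"}

print(is_well_formed(reported), is_well_formed(inflated), is_well_formed(incomplete))
```

Both the reported and the inflated record pass, while only the incomplete one fails. A format check catches missing documentation, not false documentation.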

Adoption is voluntary and, so far, narrow. Community Evals ships on one hub. EEE contributors include researchers and policy researchers, per the project's preprint and arXiv listing, but the coalition has not published a roster of participating organizations, and no competing model hub has announced an equivalent integration. The schema is open source on GitHub, which lowers the barrier to adoption, but a free download is not the same as a mandate.

The right way to read the launch is also the narrow way. It is reporting infrastructure for a field that has been grading itself in private. Whether it becomes accountability infrastructure depends on whether outside labs, smaller teams, and the model pages that already exist start filling in the missing fields.

The next concrete signal to watch is whether the YAML metadata on a sample of the more than 22,000 model pages actually gets populated. If the schema attaches and the fields fill in, the LLaMA 65B number pair goes from an anecdote to a queryable range. If the fields stay blank, the launch is a new front door to the same mess.
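What a "queryable range" could look like in practice, as a minimal Python sketch: the two scores are the LLaMA 65B figures cited above, while the record layout, the `source` labels, and the `score_ranges` helper are hypothetical and not part of EEE or the Hub.

```python
from collections import defaultdict

# The two LLaMA 65B MMLU scores cited in the article.
# The "source" labels are placeholders, not real attributions.
results = [
    {"model": "LLaMA 65B", "benchmark": "MMLU", "score": 63.7, "source": "report A"},
    {"model": "LLaMA 65B", "benchmark": "MMLU", "score": 48.8, "source": "report B"},
]


def score_ranges(rows):
    """Group scores by (model, benchmark) and summarize the reported range."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["model"], row["benchmark"])].append(row["score"])
    return {
        key: {
            "min": min(scores),
            "max": max(scores),
            "spread": round(max(scores) - min(scores), 1),
            "n": len(scores),
        }
        for key, scores in grouped.items()
    }


ranges = score_ranges(results)
print(ranges[("LLaMA 65B", "MMLU")])
# {'min': 48.8, 'max': 63.7, 'spread': 14.9, 'n': 2}
```

With only two records the summary just restates the anecdote. The query becomes useful once many reports for the same model and benchmark share one shape.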
