He stopped trusting AI benchmarks. He built 240 tests of his own.

The next time a vendor publishes a benchmark gain, the question is: better at their test, or better at yours? A working engineer who tried to answer that question found the gap was large enough to change how they pick models entirely.

The starting point was a familiar pattern in late 2025 model releases. According to the post that started the conversation, Moonshot's Kimi K2.7 Code arrived reporting a +21.8% jump on Kimi Code Bench v2, a +11% gain on Program Bench, and a +31.5% leap on MLS Bench Lite. All three are Moonshot's own benchmarks. Zhipu's GLM-5.2 hit a score of 51 on the Artificial Analysis Intelligence Index, a third-party relative ranking service, though the model parameters behind that score are self-reported. ByteDance's Seed 2.1 had thin public information and no third-party leaderboard entry the author could find, so "Seed 2.1 is good" was not yet a verifiable claim in either direction.

The author's argument is not that any of these numbers are wrong. It is that vendor benchmarks answer one question ("are we better at our own test") rather than the other ("are we better at your workload"), and that the gap between the two is structural, not a rounding error. The rest of the piece is built on that gap.

The method the author landed on is short enough to copy. They froze 240 tasks sampled from their own production traffic, the actual inputs their product sends to a model. They routed every candidate model through a single shim, in this case GPTProto, a model routing service, so that prompts, order, cost, and latency all came back in one log schema. They then recorded four numbers per model: pass rate, latency, token cost, and a per-area subjective quality score. The frozen set meant the input distribution stopped drifting. The single shim meant the test stopped rewarding whichever model was easiest to integrate that day.
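The method above can be sketched in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not the author's harness: `Task`, `Reply`, `run_eval`, the `shim` callable, and the `quality` rater are all hypothetical names, and the shim is modeled as a plain function rather than any real routing service's API.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Task:
    """One frozen task from production traffic (hypothetical schema)."""
    task_id: str
    area: str                      # product area, used for per-area quality
    prompt: str
    check: Callable[[str], bool]   # pass/fail check on the model's output


@dataclass(frozen=True)
class Reply:
    """What the shim returns for every model, in one shared schema."""
    text: str
    latency_s: float
    cost_usd: float


# Assumption: the shim is any callable (model_name, prompt) -> Reply.
Shim = Callable[[str, str], Reply]


def run_eval(
    models: List[str],
    tasks: List[Task],
    shim: Shim,
    quality: Callable[[Task, str], float],  # e.g. a lookup of human ratings
) -> Dict[str, dict]:
    """Run every model over the same tasks, in the same order, via one shim."""
    report = {}
    for model in models:
        replies = [(task, shim(model, task.prompt)) for task in tasks]

        # Group subjective quality scores by product area.
        by_area = defaultdict(list)
        for task, reply in replies:
            by_area[task.area].append(quality(task, reply.text))

        report[model] = {
            "pass_rate": mean(1.0 if t.check(r.text) else 0.0 for t, r in replies),
            "mean_latency_s": mean(r.latency_s for _, r in replies),
            "total_cost_usd": sum(r.cost_usd for _, r in replies),
            "quality_by_area": {a: mean(s) for a, s in by_area.items()},
        }
    return report
```

Because every model goes through the same `shim` signature and the same ordered task list, differences in the four numbers reflect the models rather than differences in integration effort.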

The result was humbling. The public-leaderboard winners did not always win on the author's set, and the gap between first and second place on those 240 tasks was much smaller than press releases imply. That is the part of the post that lands: the same models that look like generational leaps in vendor posts can look nearly interchangeable on a real workload, and the reverse happens too.
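One way to read a small first-to-second gap is to compare it with the sampling noise of a 240-task set. The post does not report such a check, so the sketch below is an illustrative assumption: it uses a simple two-proportion standard error that treats the two models' results as independent, which is only an approximation when both models run on the same tasks. The pass counts in the usage note are made up.

```python
import math


def gap_and_standard_error(passes_a: int, passes_b: int, n_tasks: int):
    """Return (pass-rate gap, approximate standard error of that gap).

    Assumption: each model's passes are treated as independent binomial
    draws over n_tasks; a paired analysis would be more precise.
    """
    p_a = passes_a / n_tasks
    p_b = passes_b / n_tasks
    se = math.sqrt(p_a * (1 - p_a) / n_tasks + p_b * (1 - p_b) / n_tasks)
    return p_a - p_b, se
```

With hypothetical counts of 180 and 174 passes out of 240, the gap is 2.5 percentage points while the standard error is about 4 points, so a lead of that size would not separate the two models on this set alone.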

The structural reason is that a vendor benchmark is a frozen test set the vendor has had months to study, characterize, and optimize against. A held-out third-party benchmark narrows the gap by moving the test outside the vendor's reach; DeepSWE is the one example the post credits with producing a meaningful spread between code models. Academic work has been tracking this problem. A March 2025 preprint, "Peeking Behind Closed Doors: Risks of LLM Evaluation by Private Data Curators," argues that private, vendor-curated test sets create exactly the conditions that make ranking claims fragile.

A working counterexample is the SWE-Bench Pro private leaderboard hosted by Scale Labs (Scale Labs: SWE-Bench Pro (Private Dataset)). The test set is held out. Vendors do not see it. That makes the ranking harder to game and the spread between models more informative. The test set is still a proxy for the real distribution rather than the real distribution itself, but it is closer to the kind of evidence a buyer can act on.

For teams that want to build their own, the open-source template is straightforward. OpenAI's public evals repository on GitHub gives a starting structure for a custom eval harness: a frozen task set, a consistent model interface, and a small set of comparable metrics. The engineering work that follows is the part the post actually does: choose 200 to 300 tasks from real production traffic, freeze them, route every model through the same shim, and read the four numbers side by side.
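The "choose and freeze" step can be sketched with the Python standard library alone. This is one possible approach, not the post's implementation: `freeze_sample` and `fingerprint` are hypothetical helpers, the record fields are placeholders, and the SHA-256 digest is an added safeguard for detecting later drift in the task set.

```python
import hashlib
import json
import random
from typing import List, Tuple


def fingerprint(tasks: List[dict]) -> str:
    """Stable SHA-256 digest of a task list (key order normalized)."""
    payload = json.dumps(tasks, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def freeze_sample(records: List[dict], k: int, seed: int) -> Tuple[List[dict], str]:
    """Sample k records from production traffic with a fixed seed.

    The same records, k, and seed always yield the same sample, and the
    digest lets later runs confirm the task set has not changed.
    """
    rng = random.Random(seed)          # local RNG, leaves global state alone
    chosen = rng.sample(records, k)    # sampling without replacement
    return chosen, fingerprint(chosen)
```

A usage pattern under these assumptions: call `freeze_sample(records, 240, seed)` once, store the tasks and the digest together, and have every later evaluation run assert that `fingerprint(tasks)` still equals the stored digest before it sends anything to a model.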

The watch items from here are whether more vendors start publishing numbers against held-out, third-party coding benchmarks like DeepSWE, and whether third-party indexes like Artificial Analysis tighten their public reporting on parameter counts and training-data provenance. The other thing worth watching is whether more practitioners post their own 240-task results publicly, the way this engineer did. A handful of frozen, on-distribution leaderboards from real products would do more for model selection than another year of vendor self-reporting.
