Bloomberg built a system that can write database queries from plain English questions. On Monday the company announced new results on a widely used AI benchmark. The problem: researchers showed three months ago that most of the benchmark's questions have wrong answers.
The system, called PExA, achieved 70.2 percent execution accuracy on Spider 2.0, a dataset of 632 real-world SQL problems used to test how well AI systems translate English questions into database queries. That score puts PExA eighth on the Spider 2.0 leaderboard. But a study published in January found that 62.8 percent of Spider 2.0's reference answers are incorrect, which means the benchmark no longer reliably measures what it was designed to test. The same study found that the annotation errors shift systems' leaderboard positions by as many as nine places, and that scores on the flawed test correlate only weakly with scores on the corrected version, with a Spearman correlation coefficient of 0.32.
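To see why a weak Spearman correlation matters for a leaderboard, consider a toy example. The scores below are invented for illustration, not the real Spider 2.0 numbers; the point is only that even a modest positive correlation between the flawed and corrected test sets leaves plenty of room for systems to swap positions.

```python
# Toy illustration: how annotation errors can reshuffle a leaderboard.
# All scores below are invented; they are not real Spider 2.0 results.

def rank(values):
    """Rank values descending (1 = best); assumes no ties, as in this toy data."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation for untied data: 1 - 6*sum(d^2) / (n*(n^2-1))."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical execution-accuracy scores for five systems on the flawed
# test set versus a corrected one.
flawed = [70.2, 69.8, 68.5, 66.0, 64.1]
corrected = [61.0, 57.0, 64.0, 55.0, 59.0]

rho = spearman(flawed, corrected)          # weak positive correlation
shifts = [abs(a - b) for a, b in zip(rank(flawed), rank(corrected))]
```

In this made-up example the correlation is 0.3, close to the 0.32 the study reports, and the "leader" on the flawed set drops to second on the corrected one while another system climbs two places. The rankings look related but are far from interchangeable.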
PExA's architecture uses a Planner that generates semantic test cases in parallel, then feeds their results back into the final SQL-generation step, according to the paper, accepted at ACL 2026. Rather than committing to a single interpretation of an English question, the system explores multiple interpretations simultaneously, which Bloomberg says makes it more robust to ambiguous database schemas. That design choice may help PExA handle noisy, real-world data better than competitors do, or it may simply mean PExA performs well on the specific queries where the annotations are wrong.
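The described loop can be sketched in pseudocode-like Python. This is a minimal sketch of the idea as the paper describes it, not Bloomberg's implementation: every function name here is invented, and the model calls are replaced with stubs.

```python
# Hypothetical sketch of a planner-style text-to-SQL loop: explore several
# interpretations of the question in parallel, turn each into a semantic test
# case, then condition the final SQL generation on the test results.
# All names are invented; PExA's actual implementation is not public.
from concurrent.futures import ThreadPoolExecutor

def interpret(question, schema, variant):
    """Stub: one candidate reading of the question (a real system would call an LLM)."""
    return f"reading-{variant} of {question!r}"

def make_test_case(interpretation):
    """Stub: a semantic test case checking whether a query satisfies this reading."""
    return {"interpretation": interpretation, "passes": len(interpretation) % 2 == 0}

def generate_sql(question, schema, test_results):
    """Stub: final SQL generation, conditioned on the test-case outcomes."""
    surviving = [t for t in test_results if t["passes"]]
    return f"SELECT ... /* consistent with {len(surviving)} of {len(test_results)} readings */"

def answer(question, schema, n_variants=4):
    # Explore multiple interpretations simultaneously rather than committing to one.
    with ThreadPoolExecutor() as pool:
        interps = list(pool.map(lambda v: interpret(question, schema, v), range(n_variants)))
        tests = list(pool.map(make_test_case, interps))
    # Feed the semantic test results back into the final generation step.
    return generate_sql(question, schema, tests)
```

The structural point is the feedback edge: the test cases are not an afterthought run on the finished query but an input to the generation step itself, which is what would let ambiguity in the schema surface before a single interpretation is locked in.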
The gap between first place and eighth on the Spider 2.0 leaderboard may be entirely noise. What would settle the question is an evaluation on a corrected test set, which does not yet exist publicly.
Until then, every SOTA claim on Spider 2.0 comes with an asterisk: the test itself is wrong on most of its cases. The question for anyone using these benchmarks to make build-or-buy decisions: would you hire a database engineer based on a test where most of the questions have wrong answers?