0.337: The Best an AI Science Agent Can Do, Tested in a Clean Room

When frontier AI agents are tested in a clean room, with no shortcuts to web data that might leak the answers, the best of them still reaches a factual F1 of only 0.337 at synthesizing scientific conclusions. That number, reported in a new paper on arXiv introducing SciConBench, is not a passing grade. It is also not the most important finding in the work. The most important finding is that the field finally has a way to measure this problem honestly.

The benchmark pairs 9,110 questions with conclusions drawn from systematic reviews, the kind of expert-written summaries that medical and scientific communities already treat as the gold standard. Each conclusion is decomposed into atomic facts, and agents are scored on whether their syntheses include the right facts (recall) and avoid the wrong ones (precision). The scale and provenance of the questions, detailed in the SciConBench paper, are what make the resulting number credible. These are not trivia items scraped from the open web. They are claims that human experts have already vetted in published reviews.
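To make the scoring concrete, here is a minimal sketch of a fact-level F1 in Python. It is an illustration, not the paper's scoring code: the function name `factual_f1` is hypothetical, and matching facts by normalized exact string comparison is an assumption made to keep the example runnable. A real evaluation would need a semantic matcher or judge to decide whether a predicted fact agrees with a reference fact. The F1 itself is the standard harmonic mean of precision and recall.

```python
def factual_f1(predicted_facts, reference_facts):
    """Return (precision, recall, f1) over atomic facts.

    Assumption: two facts match only if they are identical after
    lowercasing and trimming whitespace. This stands in for whatever
    semantic matching a real benchmark would use.
    """
    predicted = {fact.strip().lower() for fact in predicted_facts}
    reference = {fact.strip().lower() for fact in reference_facts}

    # With no predicted or no reference facts, nothing can be matched.
    if not predicted or not reference:
        return 0.0, 0.0, 0.0

    matched = predicted & reference
    precision = len(matched) / len(predicted)  # share of stated facts that are right
    recall = len(matched) / len(reference)     # share of reference facts recovered

    if precision + recall == 0:
        return 0.0, 0.0, 0.0

    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example facts, for illustration only.
reference = [
    "Drug A reduces mortality",
    "Drug A increases nausea",
    "Evidence certainty is low",
    "No effect on hospital stay",
]
predicted = [
    "Drug A reduces mortality",
    "Evidence certainty is low",
    "Drug A improves sleep",
]

precision, recall, f1 = factual_f1(predicted, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case the synthesis states three facts, two of which appear in the four-fact reference, so precision is 2/3 and recall is 1/2. The harmonic mean penalizes the weaker of the two, which is why an agent cannot reach a high score by being either exhaustive but sloppy or careful but incomplete.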

The clean-room piece is what gives the score its weight. The companion tool, SciConHarness, controls an agent's web access during evaluation so that the model cannot simply look up the answer or read the source review directly. Under those conditions, the eight frontier models and deep research agents evaluated, the same systems that produce confident, polished summaries in unconstrained settings, drop measurably in performance. The 0.337 figure is best read as the most rigorous floor the field has for this task: not a typical user experience, but the number a model earns when you remove the ability to cheat.

The same paper audits two consumer products that real people use to read science: Google AI Overview and OpenEvidence. Both produce incomplete, and in places contradictory, scientific conclusions even when the underlying systematic reviews are available as ground truth. That audit is what most readers can act on. If you have ever asked an AI to summarize a research question and felt uneasy about the answer, the benchmark explains why. In the cleanest evaluation the field has built, the underlying capability is not there yet.

The constructive case is the harness itself. SciConBench plus SciConHarness, introduced together in the SciConBench paper, is a shared measurement tool that labs, integrators, and reviewers can run. It does not require vendor cooperation, it does not depend on a particular model, and it does not require trusting a leaderboard curated by the company whose model is being measured. That makes it a yardstick the field can actually iterate on, and a way for "AI can synthesize science" to stop being a marketing claim and become a falsifiable one.

The open question is whether the clean-room gap closes. If next year's best agent clears 0.5 in SciConHarness, the field will know it has moved. If the number stays flat, the field will know that the bottleneck is the agent loop, the retrieval, or the verification, and not the underlying model. Either outcome is more useful than the current state, in which synthesis quality is mostly asserted and rarely measured.

Sources