The boring researcher problem: AI agents and the reproducibility crisis
The first boring researcher is here, and it's not going to win any awards for insight.
A paper published last week by researchers at ETH Zurich tested a question that sounds arcane but has direct implications for anyone who makes product, investment, or policy decisions based on published social science: can AI agents reproduce research findings by reading a paper and its data, without ever seeing the original code? The answer, at least for the best-performing agents, is yes — most of the time, and with remarkably little variance.
The study, titled "Read the Paper, Write the Code: Agentic Reproduction of Social-Science Results" and posted to arXiv April 23, ran four AI agent scaffolds against four frontier language models across 48 papers that had been manually verified as reproducible by the Institute for Replication (arXiv preprint). The best system — GPT-5.4 guided by the OpenCode scaffold — reproduced coefficients that matched the sign of the original findings more than 85 percent of the time, and landed within the original 95 percent confidence intervals roughly 70 to 80 percent of the time. Claude Opus 4.6 running inside Codex, OpenAI's own coding agent, came in close behind.
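Concretely, the two headline metrics reduce to two checks per reported coefficient: does the agent's estimate carry the same sign as the published one, and does its point estimate fall inside the published 95 percent confidence interval? Here is a minimal sketch of that comparison in Python, with a hypothetical data structure and made-up numbers rather than anything taken from the study:

```python
from dataclasses import dataclass

@dataclass
class Estimate:
    """A reported coefficient with its 95 percent confidence interval."""
    coef: float
    ci_low: float
    ci_high: float

def sign_match(original: Estimate, reproduced: Estimate) -> bool:
    # Looser criterion: the reproduced coefficient points the same way.
    return (original.coef > 0) == (reproduced.coef > 0)

def within_original_ci(original: Estimate, reproduced: Estimate) -> bool:
    # Stricter criterion: the reproduced point estimate lands inside
    # the original paper's 95 percent confidence interval.
    return original.ci_low <= reproduced.coef <= original.ci_high

# Hypothetical example: a published effect of 0.42 [0.10, 0.74],
# reproduced by an agent as 0.51.
published = Estimate(coef=0.42, ci_low=0.10, ci_high=0.74)
agent_run = Estimate(coef=0.51, ci_low=0.18, ci_high=0.84)

print(sign_match(published, agent_run))          # True
print(within_original_ci(published, agent_run))  # True
```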
Those numbers deserve context. The landmark 2016 Camerer et al. study that defined the reproducibility crisis in experimental economics found that independent teams could replicate published results only 61 percent of the time, with effect sizes coming in at roughly two-thirds of the original (Science). AI agents on pre-verified papers are performing at or above that human baseline — consistently, without fatigue, and without the option to quit when the analysis gets tedious.
The meaningful distinction, though, isn't whether AI is better than human economists. It's what AI is optimizing for. Human teams produced a wide spread of answers from the same data, including some deeply wrong ones and some that spotted effects no one else saw. The AI agents cluster tightly around the median. No outliers in either direction.
That sounds like a bug, but it's actually the feature. For the VC deciding whether to back a product built around a behavioral economics finding, or the product manager weighing whether a study on nudges applies to their user base, the appeal of AI isn't that it will discover something surprising. It's that it will reliably produce the expected answer, every time, without the tail risk that comes from a human analyst who might be having a bad week or a creative one.
The Kohler et al. paper makes the point only indirectly: we have never had a cheap, scalable instrument for measuring intellectual contribution. What AI agents now provide is a baseline. If a published finding falls outside what a competent agent would reproduce from the paper and data alone, that's worth knowing before you structure your entire launch around it.
The failure analysis is where it gets uncomfortable for researchers. When AI agents couldn't reproduce a result, the errors split roughly evenly between two sources. The first is exactly what you'd expect: the agents misread the methods or wrote buggy code. The second is less convenient: the papers themselves underspecified what they did. The methods section didn't contain enough information for an agent — or, presumably, a human reader — to reproduce the analysis from scratch.
That second category is a quiet indictment of how economics and social science papers are written. If a result can only be reproduced with access to the original code, the paper's methods description isn't doing its job. The AI didn't fail to understand the research; the research, as one independent commentator put it, failed to explain itself.
The 70-to-80 percent confidence-interval hit rate also comes with an asterisk. The 48 papers were selected because they had already been verified as reproducible by the Institute for Replication team, meaning they were relatively clean cases with complete replication packages. Real-world published papers, where you don't know in advance whether the methods are specified well enough to reproduce, are presumably harder. An AI agent running against a random paper from a mid-tier journal might do considerably worse.
There's also an open question the paper doesn't resolve: whether the findings AI agents suppress — the extreme results, the minority positions, the surprising effects that don't replicate — include the stuff that actually matters. Science has historically advanced through the productive disagreement of analysts who looked at the same data and reached different conclusions. Tightening the spread might eliminate noise. It might also eliminate the surprising wrongness that precedes a paradigm shift.
For now, the practical use case is clear. Before betting product direction on a behavioral economics paper, before structuring a pricing experiment around a nudge finding, before trusting any empirical claim that will drive a significant decision: run the paper and its data through an agentic system and see if it comes back with the same answer. If it does, you've bought yourself a cheap sanity check. If it doesn't, you've discovered something — either about the paper or about your confidence in it — before it discovers you.
That's not a breakthrough. It's just boring, and reliable, and worth something.