Science has a quality-control problem, and AI may be turning into the first cheap tool that can actually do something about it.
That is why Ethan Mollick's recent paper-reconstruction examples matter. Mollick, a Wharton professor who writes the One Useful Thing newsletter, has shown that an AI agent can take a published paper's methods and data archive, rebuild the analysis, and check whether the published numbers survive the trip. In one documented case, he wrote that Claude read a 2020 MIT economics paper, opened the archive, converted the authors' statistical code from Stata to Python, and reproduced the findings without step-by-step help.
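For readers who want the shape of that workflow rather than the anecdote: a first-pass reconstruction audit is not exotic. Here is a minimal sketch, with everything invented for illustration: the file name, the variable names, the published coefficient, and the tolerance are assumptions, not Mollick's or Claude's actual pipeline.

```python
# Hypothetical first-pass replication audit: re-run a paper's main
# regression from its data archive and compare the result against the
# published coefficient. All names and numbers here are placeholders.
import pandas as pd
import statsmodels.formula.api as smf

PUBLISHED_ESTIMATE = 0.042   # headline coefficient as reported in the paper
TOLERANCE = 0.005            # divergence we tolerate before flagging

df = pd.read_csv("replication_archive/main_dataset.csv")

# Re-estimate the paper's main specification (as if translated from Stata).
model = smf.ols("employed_full_time ~ eligible + C(year) + C(state)", data=df).fit()
reproduced = model.params["eligible"]

gap = abs(reproduced - PUBLISHED_ESTIMATE)
status = "REPRODUCED" if gap <= TOLERANCE else "FLAG FOR HUMAN REVIEW"
print(f"published={PUBLISHED_ESTIMATE:.4f} reproduced={reproduced:.4f} "
      f"gap={gap:.4f} -> {status}")
```

The point is not the regression. It is that the output is a flag, not a verdict: anything the rerun cannot reproduce gets routed to a human.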
That sounds like a technical parlor trick until you remember what science has failed to fix for years. A 2015 PLOS Biology paper estimated that irreproducible preclinical research alone was costing the United States about $28 billion a year. Replication work is slow, dull, and terrible for status. Labs get rewarded for finding something new, not for spending months checking whether the last famous result survives contact with its own data.
So the live question is whether science has stumbled into a useful outsider, one that can redo the audit work without caring whose paper it embarrasses.
Serafin Grundl, an economist at the Federal Reserve Board, tested part of that question in an April 2026 paper. He gave three AI agent systems and 146 human research teams the same causal inference task, one that required estimating a cause-and-effect relationship from a shared dataset rather than just describing a pattern. The case was the effect of DACA eligibility on full-time employment. Human teams had broad discretion in how they approached the problem; the AI systems got the same assignment.
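Grundl's paper does not hinge on any single specification, and nothing below is drawn from it. But to make "causal inference task" concrete, here is the kind of estimate both the humans and the agents were producing, sketched as a difference-in-differences regression with assumed variable names and a made-up data file.

```python
# Illustrative difference-in-differences estimate of DACA eligibility on
# full-time employment. One of many defensible specifications; the file,
# the variable names, and the clustering choice are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("acs_sample.csv")  # hypothetical individual-level survey extract

# eligible: meets DACA criteria; post: observed after the 2012 policy change.
model = smf.ols(
    "full_time ~ eligible * post + C(year) + C(state)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["state"]})

# Under parallel trends, the interaction term carries the causal estimate.
print(model.params["eligible:post"], model.bse["eligible:post"])
```

Every choice in that block, which controls to include, how to define eligibility, how to cluster errors, is a researcher degree of freedom. That discretion is exactly what the 146 human teams were exercising.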
The median AI estimate and the median human estimate landed in roughly the same range. The stranger result came after that. Grundl asked other AI systems to act as peer reviewers, reading the code and writeups from groups of submissions and ranking which work was strongest. Across tasks and reviewer models, the AI submissions ranked above the human ones. As Marginal Revolution wrote, the ordering was GPT-5.4 first, then GPT-5.3-Codex, then Claude Opus 4.6, then human researchers.
That is not a clean victory lap for the machines. When economist Nicolai Foss reviewed the paper in The Economist Will See You Now, he pulled out the caveat that matters most: the same AI system could produce estimates with opposite signs across different runs. One run is not a verdict. It is one draw from a distribution.
Humans have distributions too, and Grundl's paper found that human estimates had wider tails than the AI ones. But human error comes bundled with incentives. Researchers know what journals like, what prior literature found, and what kind of result attracts attention. AI can still be wrong. It just tends to be wrong for different reasons.
That difference is what makes this more than another "AI can do white-collar work" story. Replication is one of the few scientific jobs where not having a career, a reputation, or a preferred answer may actually be an advantage.
The economic logic is almost rude in its simplicity. Systematic paper replication barely happens because it is expensive and thankless. A graduate student can burn months reproducing a paper and get little credit for it. If an AI agent can do a first-pass reconstruction in an afternoon and flag where the numbers diverge from the published result, the cost structure changes immediately.
The catch is that nobody should treat a single model run as ground truth. The right frame is closer to polling than revelation: run several agents, run them more than once, and look at the spread. That is inconvenient if you want a magic machine. It is perfectly normal if you want an audit system.
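The polling frame is easy to operationalize. A sketch, with made-up numbers standing in for agent runs:

```python
# Treating agent replications like a poll: aggregate many runs instead of
# trusting one. The estimates below are invented placeholders.
import statistics

runs = {
    "agent_a": [0.031, 0.044, 0.038],
    "agent_b": [0.052, 0.047, 0.049],
    "agent_c": [-0.008, 0.041, 0.035],  # note the sign flip in one run
}

estimates = [e for run in runs.values() for e in run]
median = statistics.median(estimates)
quartiles = statistics.quantiles(estimates, n=4)  # Q1, median, Q3
share_positive = sum(e > 0 for e in estimates) / len(estimates)

print(f"median={median:.3f} IQR=({quartiles[0]:.3f}, {quartiles[2]:.3f}) "
      f"share positive={share_positive:.0%}")
```

A tight spread with full sign agreement says something very different from a wide spread with one flipped run, which is the whole argument for auditing in batches.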
What is still missing is infrastructure. Right now AI paper reconstruction is mostly a researcher trick, not a standard layer in publishing. That probably will not last. The economics point too strongly in one direction. If automated replication audits become routine, someone will build the software, someone will set the standards, and someone will end up with unusual power over what counts as trustworthy research.
That is where this stops being a science-curiosity story and turns into a platform story. A research world that could not afford to check itself may soon be checked by systems built by a handful of AI companies. That could clean up a lot of bad science. It could also create a new gatekeeper class in the middle of the scientific record.
For now, the important shift is simpler. Science has spent years treating reproducibility as a moral problem, then an institutional problem, then a statistical problem. AI suggests it may also be a labor problem, with a fairly brutal fix: hand the audit work to something that does not need credit and does not care who looks foolish when the spreadsheet is rerun.