Companies and researchers are increasingly turning to large language model agents to stand in for real people in online spaces. The agents write product reviews, simulate user reactions, and populate synthetic focus groups for everything from moderation tooling to marketing tests. The practice is already common, but the evidence that it actually works has been thin. A new preprint called MiroBench, posted to arXiv this month by Yaoning Yu, Ye Yu, Haojing Luo, and Haohan Wang, is the first cross-domain attempt to measure the gap between simulated and real discussions statistically rather than by gut feel. (MiroBench: Benchmarking Realism in Agentic Simulation of Real-world Discussions)
The team built their test from 4,292 real Reddit threads, drawn from five topical domains — Credit Card, Laptop, Cellphone, Camera, and Headphones — and asked five language model agents to simulate those discussions from scratch. They then compared the agents' output against the real threads across four properties that any reader would recognize from a few minutes on a forum. The first is repetition and uniformity: do the simulated users all sound the same? The second is narrative shape: does the conversation arc like a real one, with shifts in tone and topic? The third is toxicity and aggression: do the agents reproduce the friction of real disagreement, or sand it off? The fourth is structural complexity: do the reply trees look like the messy, branching, sometimes dead-ended trees of genuine human conversation? (MiroBench: Benchmarking Realism in Agentic Simulation of Real-world Discussions)
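To make two of those axes concrete, here is a minimal Python sketch of how repetition and reply-tree structure could be put into numbers. It is an illustration under assumptions, not the paper's method: a distinct-n ratio stands in for repetition, and depth, branching, and dead ends stand in for structural complexity. The function names `distinct_n` and `tree_shape`, the parent-map input format, and the sample thread are all hypothetical.

```python
from collections import Counter


def distinct_n(comments, n=2):
    """Share of word n-grams in a thread that are unique.

    Lower values mean the commenters repeat each other more.
    Assumption: whitespace tokenization is good enough for a sketch.
    """
    ngrams = []
    for text in comments:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def tree_shape(parents):
    """Summarize a reply tree.

    `parents` maps each comment id to its parent id (None for a
    top-level comment). Returns (max depth, max replies to a single
    comment, number of comments that received no reply).
    """
    reply_counts = Counter(p for p in parents.values() if p is not None)

    def depth(comment_id):
        d = 1
        while parents[comment_id] is not None:
            comment_id = parents[comment_id]
            d += 1
        return d

    max_depth = max((depth(c) for c in parents), default=0)
    max_branch = max(reply_counts.values(), default=0)
    dead_ends = sum(1 for c in parents if c not in reply_counts)
    return max_depth, max_branch, dead_ends


# Hypothetical thread, for illustration only.
sample_comments = [
    "great laptop for the price",
    "great laptop for the price",
    "battery died after a year",
]
sample_parents = {"a": None, "b": "a", "c": "a", "d": "b", "e": None}

print(distinct_n(sample_comments))
print(tree_shape(sample_parents))
```

Computing numbers like these for every real thread and every simulated one yields two distributions per metric, which is what makes a statistical comparison possible in the first place.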
Across all four axes, the statistical fingerprints of the simulated threads did not match the real ones. The gap held across all five domains the team tested, suggesting the failure is a property of how the simulators behave rather than a quirk of a particular topic. (MiroBench: Benchmarking Realism in Agentic Simulation of Real-world Discussions)
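One way to picture a mismatch in "statistical fingerprints" is to compare the distribution of a single per-thread metric across real and simulated threads. The sketch below computes the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical distribution functions. This is a stand-in chosen for illustration; this article does not specify which tests the paper uses, and the function name `ks_statistic` and the sample depth values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(real, simulated):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute difference between the empirical
    CDFs of the two samples: 0.0 means the samples are distributed
    identically, 1.0 means they do not overlap at all.
    """
    a, b = sorted(real), sorted(simulated)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # Both empirical CDFs are step functions that only change at
    # observed values, so checking those points is sufficient.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical per-thread maximum reply depths, for illustration only.
real_depths = [1, 2, 2, 3, 5, 7]
simulated_depths = [2, 2, 2, 3, 3, 3]

print(ks_statistic(real_depths, simulated_depths))
```

In this made-up example the simulated threads cluster at middling depths while the real ones include both shallow and very deep threads, which is the kind of difference a distribution-level comparison can surface and a comparison of averages can hide.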
The more interesting finding is what happened when the researchers tried to fix it. They applied a lightweight prompt-engineering intervention — a few paragraphs of instruction telling the agents to vary their voices and mimic the texture of real disagreement — and re-ran the comparison. The needle barely moved. The most generous reading is that the surface prompt was the wrong level of intervention. The stingier reading is that current simulators do not yet have a representation of how a real thread unfolds, and that closing the gap will require changes in training or architecture, not just better instructions. The paper is closer to the second reading, and the result is the part of the work most worth taking seriously. (MiroBench: Benchmarking Realism in Agentic Simulation of Real-world Discussions)
There are two ways to read what MiroBench is for. The skeptical read is that it is a public demonstration that AI cannot yet faithfully simulate a Reddit thread, and that the result should give pause to anyone using synthetic users for product research or policy modeling. The constructive read, and the one the authors are advancing, is that the field has been missing a yardstick. Without a shared benchmark, every vendor could claim its own simulator was realistic, and no two claims were comparable. MiroBench turns realism into a measurable claim with a fixed dataset, a fixed set of models, and four axes that can be tracked over time. The first version of the benchmark already produces a result: the gap shows up on every axis, and a prompt-level fix does not close it. The next versions, and the next round of submissions, are where the field gets useful signal.
The limitations are real and worth naming. The paper is a preprint, not peer-reviewed, and the methodology has not been challenged in public. The four axes are choices, and a different team could pick different properties of real conversation. The dataset is Reddit, which is convenient because threads are public and topic-grounded, but Reddit is not the whole of online social life, and a benchmark that starts with Reddit will need extensions to other platforms before its results generalize. The authors are explicit about the Reddit scope. The question for the next year is how the benchmark grows.
What to watch is straightforward. The first wave of responses will be other labs running their own models through the same protocol, and the question is whether any of them move the four-axis scores in a way the prompt-engineering fix did not. The second wave is targeted diagnosis: which axis breaks first for a given model, and which interventions move which axis. The third is the harder one. If no published model closes the gap on MiroBench in the next year, the field will have to decide whether simulation of real online discussion is a research target at all, or whether the more useful direction is agents that work alongside real people rather than in their place. The benchmark does not answer that question, but it makes the question answerable.