Every researcher who has stared down a reporting checklist for a simulation model, wondering whether the hours ahead will be worth it, has had the same daydream: what if a machine could do the first pass, and a human could check the work?
That daydream now has a measured answer, courtesy of a preprint by Peer-Olaf Siebers and Christopher Frantz. The researchers tested four large language models against a real published simulation paper, using one of the field's underused reporting standards. The result is not a vision statement. It is a working rule for when to let the model draft, and when to take the keyboard back.
Agent-based modelling, the field at the center of the study, is a way of running virtual experiments. Researchers program many autonomous "agents" (people, firms, hospitals, viruses) to interact inside a computer, then watch what patterns emerge. It is the workhorse behind epidemic forecasts, market simulations, and crowd-dynamics studies. The trouble is reproducibility. Without a clear, written account of how a model was built and what its creators actually did, outside researchers cannot tell whether a striking result is a real finding or a quirk of the code.
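For readers who have never seen one, a toy agent-based model fits in a couple of dozen lines. The sketch below is purely illustrative and is not taken from the paper under study: the function name, the pairing rule, and every parameter value are assumptions. It does show why documentation matters, because the outcome depends entirely on choices (population size, transmission probability, random seed) that a reader cannot recover unless someone writes them down.

```python
import random


def run_contagion(n_agents=50, n_steps=20, p_transmit=0.3, seed=0):
    """Toy contagion model (hypothetical, for illustration only).

    Each step, every agent meets one randomly chosen agent. If exactly one
    of the pair is infected, the infection spreads with probability
    p_transmit. Returns the infected count after each step.
    """
    rng = random.Random(seed)  # a fixed seed makes the run repeatable
    infected = [False] * n_agents
    infected[0] = True  # assumption: a single initial case
    history = [1]

    for _ in range(n_steps):
        next_state = infected[:]  # update from a snapshot of the current step
        for i in range(n_agents):
            j = rng.randrange(n_agents)  # random meeting partner
            if infected[i] != infected[j] and rng.random() < p_transmit:
                next_state[i] = True
                next_state[j] = True
        infected = next_state
        history.append(sum(infected))

    return history


if __name__ == "__main__":
    print(run_contagion())
```

Two runs with the same seed and parameters produce the same curve; change any of them without recording it and the curve is no longer reproducible, which is the gap reporting checklists are meant to close.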
To keep simulations honest, the community has built a stack of voluntary reporting checklists, as Siebers and Frantz catalogue. The familiar ones are ODD for describing the model itself, and TRACE and EABSS for documenting the process. None of these are perfect. RAT-RS, the standard for rigour, transparency, and reproducibility reporting, has been sitting on the shelf for years with almost no uptake. Researchers know it is there. Almost nobody fills it out.
Siebers and Frantz wanted to know whether a large language model could change that. They took one published agent-based modelling paper and asked four LLMs to read it and produce a RAT-RS report. The models were not graded on whether they were clever. They were graded on whether their output matched what a careful human would write.
The result, as the authors describe it, splits cleanly down the middle. Descriptive extraction is reliable. When a checklist item asks what the model is, what its agents do, or what data it uses, the LLMs produced answers that a domain expert could check off with light editing. The four models disagreed with each other less than the authors expected, and most of the friction came down to formatting, not substance.
Explanatory and evaluative extraction is not reliable. When a checklist item asks why a design choice was made, or whether the model's behavior is credible, the LLMs produced prose that sounded fluent and meant very little. The authors flag a recurring failure mode: the models paraphrase until the original claim is gone, or quietly over-claim, asserting things the source paper does not actually establish. A human reader who knows the field will catch this. A skim reader will not.
The constructive payload of the paper is a division of labor. The model drafts the parts of a reporting checklist that ask for a description of what is in the code or the data. The human owns the parts that ask for a judgment. Put differently: let the machine write the nouns, and write the verbs yourself.
The paper's heuristics, simplified for use in your own work, are roughly these. Trust the model when the checklist item asks for a fact that is sitting in the paper in front of you. Rewrite it yourself when the item asks for an interpretation, a justification, or an assessment of credibility. Treat any model output that sounds unusually confident on a judgment question as a flag, not a draft.
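As a rough sketch, those three rules can be written down as a small triage function. Everything here is a hypothetical illustration, not the authors' tooling: the `DraftAnswer` type, the two item kinds, the cue-word list, and the return labels are all assumptions, and a keyword match is only a crude stand-in for a human noticing that an answer sounds too sure of itself.

```python
from dataclasses import dataclass

# Hypothetical item kinds; these labels are illustrative, not RAT-RS terms.
DESCRIPTIVE = "descriptive"  # asks for a fact stated in the paper
JUDGMENT = "judgment"        # asks for a why, a justification, or credibility

# Hypothetical cue words that suggest over-confident phrasing.
CONFIDENT_CUES = ("clearly", "undoubtedly", "demonstrates", "proves")


@dataclass
class DraftAnswer:
    item: str  # the checklist question
    kind: str  # DESCRIPTIVE or JUDGMENT, assigned by a human
    text: str  # the model's drafted answer


def triage(answer: DraftAnswer) -> str:
    """Return what a human reviewer should do with a drafted answer."""
    if answer.kind == DESCRIPTIVE:
        # Facts in the paper: keep the draft, verify it against the source.
        return "check against paper"
    lowered = answer.text.lower()
    if any(cue in lowered for cue in CONFIDENT_CUES):
        # Confident-sounding prose on a judgment item is a flag, not a draft.
        return "flag"
    # All other judgment items are rewritten by the human.
    return "rewrite"


if __name__ == "__main__":
    draft = DraftAnswer(
        item="Why was this data source chosen?",
        kind=JUDGMENT,
        text="The data clearly proves the design is valid.",
    )
    print(triage(draft))
```

Note that the human still classifies each item as descriptive or judgment; the sketch only makes the routing explicit, so that no judgment answer reaches the final report without a person rewriting it.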
The honest scope matters. The study covers a single published paper, one reporting standard, and four language models. The authors do not claim that LLMs will solve the broader documentation problem in computational science. They argue, more modestly, that for underused standards like RAT-RS, supervised extraction is a feasible way to lower the barrier. The bigger payoff, if there is one, is in the pattern: documentation debt is a general problem in any field where a model is too complex to fit inside a methods section, and the supervised-assistant approach is portable.
The finding is also a quiet warning. Documentation debt is the kind of work that gets skipped not because it is unimportant, but because no one has time for it. A model that drafts the descriptive half of a checklist well can free up a researcher's afternoon. A model that paraphrases the judgment half into meaninglessness can quietly degrade trust in published science, one checklist at a time, if no one is watching.
The next step, the authors suggest, is community-scale testing. Run the same kind of supervised extraction across more papers, more standards, and more models, and see whether the split holds. If it does, the cost of better documentation drops, and so does the cost of catching the cases where the model is filling space rather than telling the truth.