Microsoft's agent benchmark found one safe harbor: Python
Most companies buying so-called agentic AI want to hand off messy office work, not just get one answer from a chatbot. A new Microsoft Research preprint on arXiv suggests that handoff is still brittle: in a benchmark of long document-editing workflows, frontier models quietly damaged the work over time, and Python code was the only domain where a majority of tested models met the paper's readiness bar for delegated use.
That sharpens the pressure on every company selling AI coworkers as reliable stand-ins for human knowledge work. Microsoft Research tested 19 large language models across 310 work environments in 52 professional domains. The paper found that even top models, including Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4, corrupted an average of 25 percent of document content by the end of long workflows, while Python was the only domain where most models looked ready for delegation.
According to the arXiv preprint, those failures got worse as the simulated work stretched on, documents grew longer, and distractor files piled up. In the paper's full HTML version, the authors say average degradation across all tested models reached 50 percent by the end of the 20-interaction simulation. They also found that giving models agentic tool use did not improve performance.
The commercial problem is not just that a model can make one bad edit. Delegated work becomes harder to trust when damage can accumulate quietly inside a document that still looks plausible. That matters because the current agent pitch is about turning supervision-heavy office work into something a model can carry for long enough that a human only checks the result at the end.
The public release is substantial enough that the benchmark is harder to dismiss as a vague teaser. The DELEGATE-52 GitHub repository is live, and the Hugging Face dataset card says the public package includes 234 work environments across 48 domains, with the rest withheld because the underlying seed documents could not be redistributed. The same dataset card also carries the paper's most important caveat: the authors say DELEGATE-52 is not well suited for drawing conclusions about human-AI collaboration without proper human studies.
That caveat should stay attached to any broader industry claim. This is a benchmark, not a field study of real companies running agents in production. It does not prove that every workplace agent will fail the same way, nor that reliability automatically follows wherever software can verify the output. What it does show is narrower and still uncomfortable: in this benchmark, Python was the only domain where most models cleared the paper's readiness threshold, while document-heavy work degraded much faster.
An independent analysis by Cekrem helps make the failure mode easier to see. Long-horizon agents do not mainly fail by refusing the task. They fail by slowly introducing damage while still looking useful. That is exactly the sort of error enterprise buyers should worry about, because the checking cost does not disappear. It just comes due later, after the model has already touched more of the artifact.
So the pressure here is not about one ugly benchmark score; it falls on the sales pitch behind delegated AI work. If Python is the only domain where most tested models look ready, then vendors still have a long way to go before they can claim the same reliability for messy document-heavy work. The next thing to watch is whether agent vendors build better verification around those messier workflows, or keep selling trust before they can measure it.