AI Models Can Audit Computer-Use Agents — But Disagree on Complex Tasks

A new study reveals that vision-language models can reliably audit computer-use agents on straightforward tasks — but start diverging significantly when the work gets messier.
Researchers Marta Sumyk and Oleksandr Kosovan published "CUAAudit" on arXiv (March 11, 2026), evaluating five VLMs as autonomous judges of task completion across three widely used benchmarks covering macOS, Windows, and Linux environments. The results: strong accuracy and well-calibrated confidence scores overall, but notable degradation in complex or heterogeneous settings.
The core tension
Computer-use agents (CUAs) are increasingly able to execute multi-step tasks in desktop environments — filling forms, organizing files, navigating interfaces. But evaluating whether they actually succeeded has been a pain point. Traditional methods rely on static benchmarks with rule-based success checks, or human reviewers. Both are expensive and brittle.
Using VLMs as autonomous auditors seemed promising. These models can observe a final environment state and determine whether the agent accomplished its goal — no human in the loop needed.
And the auditors aren't dumb. Across the three benchmarks, state-of-the-art VLMs achieved strong accuracy and their confidence estimates tracked actual performance well. When they said they were unsure, they were usually right to be unsure.
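"Well-calibrated" here means a model's self-reported confidence tracks its empirical accuracy: claims made at 90% confidence should be right about 90% of the time. The paper's exact metric isn't stated here, but one standard way to measure this property is expected calibration error (ECE), sketched below.

```python
from typing import Sequence

def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Standard ECE: bin predictions by confidence, then take the
    weighted average gap between mean confidence and accuracy per bin."""
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins (lo, hi]; the first bin also catches exact zeros.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / total) * abs(avg_conf - accuracy)
    return ece

# An auditor that says 0.9 and is right 9 times out of 10 is well calibrated:
print(expected_calibration_error([0.9] * 10, [True] * 9 + [False]))  # → 0.0
```

A low ECE is what lets you take an auditor's "unsure" at face value; a high one means the confidence scores are decoration.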
Where it breaks down
But here's the catch: when tasks involved more complex or heterogeneous environments — think varied operating systems, multi-window workflows, or ambiguous success criteria — performance dropped across the board. More troubling, even the top-performing models disagreed with each other frequently.
In practical terms, this means if you deploy two different VLMs to audit the same CUA task, they might reach opposite conclusions about whether the work was done correctly. That's a problem if you're building automated quality assurance into agent pipelines.
Why it matters
The authors frame this as a fundamental limitation of current model-based auditing. We're not yet at the point where you can hand off evaluation entirely to an LLM and trust the judgment unconditionally — especially in real-world settings where environments vary and success isn't always binary.
For builders deploying CUAs in production, the takeaway is nuanced: VLMs can handle the straightforward cases, but you'd be wise to build in explicit uncertainty handling and variance accounting. Don't treat any single auditor's verdict as ground truth.
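The paper doesn't prescribe an aggregation policy, but a conservative sketch of the advice above is straightforward: treat a verdict as trustworthy only when multiple auditors agree with high confidence, and escalate everything else to a human. The `Verdict` type and `aggregate_verdicts` function below are illustrative names, not anything from the paper.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class Verdict:
    model: str
    passed: bool
    confidence: float  # auditor's self-reported confidence in [0, 1]

def aggregate_verdicts(verdicts: Sequence[Verdict],
                       min_confidence: float = 0.8) -> str:
    """Accept only a unanimous, high-confidence verdict;
    any disagreement or low confidence goes to human review."""
    if not verdicts:
        return "escalate"
    outcomes = {v.passed for v in verdicts}
    all_confident = all(v.confidence >= min_confidence for v in verdicts)
    if len(outcomes) == 1 and all_confident:
        return "pass" if outcomes.pop() else "fail"
    return "escalate"

# Two auditors disagree on the same CUA task — exactly the failure
# mode the study observed — so the policy refuses to auto-decide:
print(aggregate_verdicts([Verdict("judge-a", True, 0.95),
                          Verdict("judge-b", False, 0.90)]))  # → escalate
```

The threshold and the unanimity requirement are tuning knobs: looser settings cut human-review load but reinstate the single-auditor-as-ground-truth problem the study warns about.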
The paper adds to a growing body of work questioning whether AI systems can reliably evaluate other AI systems — a meta-problem that's only going to get more important as agents become more capable.
This article synthesizes the arXiv preprint, connecting the technical findings to real-world implications for AI practitioners building and deploying computer-use agents.
