Microsoft Research Open-Sources AgentRx to Debug AI Agent Failures

Microsoft Research has released AgentRx, a new framework designed to automatically pinpoint where and why AI agents fail, addressing one of the biggest challenges in deploying autonomous systems.
As AI agents evolve from simple chatbots to systems that manage cloud incidents, navigate complex web interfaces, and execute multi-step workflows, debugging becomes harder. When an agent fails ten steps into a fifty-step task, identifying the exact cause is often an "arduous, manual process," according to Microsoft.
"Traditional success metrics like 'Did the task finish?' don't tell us enough," Microsoft noted. "To build safe agents, we need to identify the exact moment a trajectory becomes unrecoverable."
AgentRx treats agent execution like a system trace that needs validation. Instead of relying on a single LLM to guess the error, it uses a structured four-stage pipeline: it normalizes heterogeneous logs, synthesizes executable constraints from tool schemas and domain policies, evaluates those constraints step by step, and then uses an LLM judge to identify the "critical failure step," defined as the first unrecoverable error.
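To make the pipeline concrete, here is a minimal Python sketch of the first three stages plus a placeholder for the fourth. It is not AgentRx's actual API: every name (`Step`, `Constraint`, `normalize`, `constraints_from_schema`, `evaluate`, `first_violation_step`), the log field names, and the `refund` tool are hypothetical. The only constraint type shown is "required argument present," and the final function simply picks the earliest violated step where the real framework is described as using an LLM judge.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Step:
    """One normalized step of an agent trajectory (hypothetical shape)."""
    index: int
    tool: str
    args: dict[str, Any]


@dataclass
class Constraint:
    """An executable check that applies to steps calling one tool."""
    name: str
    applies_to: str
    check: Callable[[Step], bool]  # True means the step satisfies the check


def normalize(raw_log: list[dict[str, Any]]) -> list[Step]:
    """Map log records with differing field names onto a single Step shape."""
    steps = []
    for i, rec in enumerate(raw_log):
        tool = rec.get("tool") or rec.get("action") or "unknown"
        args = rec.get("args") or rec.get("parameters") or {}
        steps.append(Step(index=i, tool=tool, args=dict(args)))
    return steps


def constraints_from_schema(tool: str, required: list[str]) -> list[Constraint]:
    """Derive one 'required argument present' constraint per schema key."""
    return [
        Constraint(
            name=f"{tool}.requires.{key}",
            applies_to=tool,
            # Bind key as a default so each lambda checks its own key.
            check=lambda step, key=key: key in step.args,
        )
        for key in required
    ]


def evaluate(steps: list[Step], constraints: list[Constraint]) -> list[tuple[int, str]]:
    """Check every applicable constraint at every step, in order."""
    violations = []
    for step in steps:
        for c in constraints:
            if c.applies_to == step.tool and not c.check(step):
                violations.append((step.index, c.name))
    return violations


def first_violation_step(violations: list[tuple[int, str]]) -> Optional[int]:
    """Placeholder for the LLM judge: return the earliest violated step."""
    return min((idx for idx, _ in violations), default=None)


# Two log dialects ("tool"/"args" and "action"/"parameters") in one trace.
raw = [
    {"tool": "lookup_order", "args": {"order_id": "A1"}},
    {"action": "refund", "parameters": {"order_id": "A1"}},  # no "amount"
    {"tool": "refund", "args": {"order_id": "A1", "amount": 5}},
]
steps = normalize(raw)
constraints = constraints_from_schema("refund", ["order_id", "amount"])
violations = evaluate(steps, constraints)
print(violations)                        # [(1, 'refund.requires.amount')]
print(first_violation_step(violations))  # 1
```

Keeping constraint checking separate from the final judgment means the deterministic checks can be re-run and audited on their own, whatever decides the critical step afterwards.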
The team also released a benchmark with 115 manually annotated failed trajectories across three domains: τ-bench (API workflows), Flash (incident management), and Magentic-One (multi-agent web tasks). They derived a nine-category failure taxonomy covering issues like "Plan Adherence Failure," "Invention of New Information" (hallucination), and "Invalid Invocation."
According to the reported results, AgentRx improved failure localization by 23.6% and root-cause attribution by 22.9% over prompting baselines.
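The article does not say how these two metrics are computed. As an illustration only, the sketch below assumes exact-match scoring: localization counts a prediction as correct when its critical step index equals the annotated one, and attribution when its failure category equals the annotated one. The `Label` type, the `exact_match_rates` function, and the sample data are hypothetical and are not benchmark results; the category strings are the three named in the article.

```python
from dataclasses import dataclass


@dataclass
class Label:
    """Critical failure step and failure category for one trajectory."""
    critical_step: int
    category: str


def exact_match_rates(predicted: list[Label], annotated: list[Label]) -> tuple[float, float]:
    """Return (localization rate, attribution rate) under exact matching."""
    if len(predicted) != len(annotated) or not annotated:
        raise ValueError("need equally sized, non-empty label lists")
    n = len(annotated)
    pairs = list(zip(predicted, annotated))
    step_hits = sum(p.critical_step == a.critical_step for p, a in pairs)
    category_hits = sum(p.category == a.category for p, a in pairs)
    return step_hits / n, category_hits / n


# Made-up labels for four trajectories, for illustration only.
annotated = [
    Label(3, "Plan Adherence Failure"),
    Label(7, "Invalid Invocation"),
    Label(2, "Invention of New Information"),
    Label(5, "Invalid Invocation"),
]
predicted = [
    Label(3, "Plan Adherence Failure"),
    Label(7, "Plan Adherence Failure"),
    Label(4, "Invention of New Information"),
    Label(5, "Plan Adherence Failure"),
]
print(exact_match_rates(predicted, annotated))  # (0.75, 0.5)
```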
The framework and benchmark are open-source.
