The Audit Trail That Separates Real Agent Infrastructure From Fancy Chatbots
Enterprise AI has an accuracy problem. Not the kind that makes headlines — the kind that quietly breaks compliance audits.
A new preprint from Yonyou AI Lab, published April 8, 2026, introduces a framework called LOM-action that takes a direct shot at what the authors call "illusive accuracy": a model's ability to produce correct-sounding outputs while consistently selecting the wrong tools. Their solution is an event-simulation-decision pipeline where every action traces back to a deterministic graph mutation driven by real business events — and all decisions derive exclusively from the evolved simulation graph.
The numbers are stark. On 11 enterprise graph reasoning tasks, LOM-action hit 93.82% accuracy and a 98.74% tool-chain F1 score. Doubao-1.8 and DeepSeek-V3.2, both respectable models, landed at roughly 80% accuracy but managed only 24–36% on tool-chain F1. Accuracy within about 14 points of LOM-action, yet a completely different operational reality. The authors quantify this gap as IA(M) = Acc(M) − F1_chain(M), their "Illusive Accuracy" index. The higher the number, the more a model's outputs look right while its tool selections go wrong.
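To make the index concrete, here is a small sketch that applies IA(M) = Acc(M) − F1_chain(M) to the figures quoted above. The baseline inputs are this article's rounded numbers (roughly 80% accuracy, 24–36% tool-chain F1), not exact per-model results, and the function name is illustrative rather than taken from the paper.

```python
def illusive_accuracy(acc: float, f1_chain: float) -> float:
    """IA(M) = Acc(M) - F1_chain(M), with both inputs as fractions in [0, 1]."""
    return acc - f1_chain

# Figures as quoted in the article; the baseline values are rounded ranges.
lom_action = illusive_accuracy(0.9382, 0.9874)
baseline_best = illusive_accuracy(0.80, 0.36)   # upper end of the F1 range
baseline_worst = illusive_accuracy(0.80, 0.24)  # lower end of the F1 range

print(f"LOM-action IA: {lom_action:+.4f}")
print(f"Baseline IA range: {baseline_best:.2f} to {baseline_worst:.2f}")
```

LOM-action's index comes out slightly negative because its tool-chain F1 exceeds its answer accuracy, while the rounded baseline figures land at roughly 0.44 to 0.56.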
Yonyou's researchers recommend deploying simulation-sensitive enterprise AI only when Tool-Chain F1 hits 0.90 or above and Illusive Accuracy stays below 0.30. By that standard, the frontier models they're benchmarking against aren't ready for production.
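The recommendation can be read as a simple two-condition gate. The sketch below is one way to encode it, assuming "0.90 or above" is inclusive and "below 0.30" is strict; the class and function names are hypothetical, and the baseline row uses the rounded figures from this article rather than exact per-model scores.

```python
from dataclasses import dataclass

F1_CHAIN_MIN = 0.90  # tool-chain F1 must be at or above this value
IA_MAX = 0.30        # Illusive Accuracy must stay below this value


@dataclass(frozen=True)
class ModelScores:
    name: str
    accuracy: float
    f1_chain: float

    @property
    def illusive_accuracy(self) -> float:
        # IA(M) = Acc(M) - F1_chain(M)
        return self.accuracy - self.f1_chain


def deployment_ready(m: ModelScores) -> bool:
    """Both conditions must hold; failing either one blocks deployment."""
    return m.f1_chain >= F1_CHAIN_MIN and m.illusive_accuracy < IA_MAX


candidates = [
    ModelScores("LOM-action", 0.9382, 0.9874),
    # Rounded from the article: ~80% accuracy, best-case 36% tool-chain F1.
    ModelScores("baseline (best-case F1)", 0.80, 0.36),
]
for m in candidates:
    print(f"{m.name}: ready={deployment_ready(m)}")
```

Even at the favorable end of the reported range, the baseline fails both conditions, which is the basis for the "not ready for production" reading.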
What the architecture actually does
LOM-action's design has two modes. Skill mode handles registered tool calls — predictable, contract-bound operations that the system can audit. Reasoning mode handles novel computations where no pre-registered skill applies. Both modes feed into the same deterministic graph engine: business events trigger scenario conditions encoded in an enterprise ontology, which drive graph mutations in an isolated sandbox. The working copy evolves into a "simulation graph" — call it G_sim — and all downstream decisions derive exclusively from that graph. No model output is trusted directly. Everything traces back to the graph.
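A toy version of the event, simulation, decision flow might look like the following. This is a sketch under heavy assumptions: the article does not describe Yonyou's ontology format or engine API, so the graph schema, the condition and mutation functions, and the payment rule are all invented for illustration. The skill and reasoning modes are not modeled here.

```python
import copy

# Hypothetical enterprise graph: node id -> attributes. Not Yonyou's schema.
enterprise_graph = {
    "vendor:acme": {"status": "active"},
    "invoice:1001": {"vendor": "vendor:acme", "amount": 5000, "state": "pending"},
}


def vendor_flagged(graph, event):
    """Scenario condition: fires when a fraud alert names a known vendor."""
    return event["type"] == "fraud_alert" and event["vendor"] in graph


def suspend_vendor(graph, event):
    """Deterministic mutation: suspend the vendor and hold its pending invoices."""
    graph[event["vendor"]]["status"] = "suspended"
    for node_id in sorted(graph):  # sorted for a stable mutation order
        node = graph[node_id]
        if node.get("vendor") == event["vendor"] and node.get("state") == "pending":
            node["state"] = "on_hold"


# Hypothetical ontology: ordered (condition, mutation) pairs.
ONTOLOGY = [(vendor_flagged, suspend_vendor)]


def simulate(graph, events):
    """Evolve an isolated working copy into G_sim; the source graph is untouched."""
    g_sim = copy.deepcopy(graph)
    for event in events:
        for condition, mutation in ONTOLOGY:
            if condition(g_sim, event):
                mutation(g_sim, event)
    return g_sim


def decide_payment(g_sim, invoice_id):
    """The decision reads only from G_sim, never from raw model output."""
    invoice = g_sim[invoice_id]
    vendor = g_sim[invoice["vendor"]]
    ok = vendor["status"] == "active" and invoice["state"] == "pending"
    return "approve" if ok else "hold"


events = [{"type": "fraud_alert", "vendor": "vendor:acme"}]
g_sim = simulate(enterprise_graph, events)
print(decide_payment(g_sim, "invoice:1001"))
print(enterprise_graph["vendor:acme"]["status"])
```

The deep copy stands in for the isolated sandbox: mutations land on the working copy, so the live graph stays as it was while the decision is computed from the evolved state.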
The practical benefit is replayability. Every decision has a full audit log. If the system approved a vendor payment that later turns out to be fraudulent, the decision trace shows exactly which event triggered which graph mutation, and why the model recommended approval. That's a meaningfully different failure mode from a chatbot that just says "approved."
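Replayability of this kind depends on mutations being deterministic and logged. The sketch below shows one generic way to do that, by recording a hash of the graph before and after each event and then re-running the log to check that it reproduces. It is not taken from the paper; the event shape, the `apply_event` rule, and the digest scheme are assumptions made for illustration.

```python
import copy
import hashlib
import json


def graph_digest(graph):
    """Stable fingerprint of a graph state (sorted keys keep it deterministic)."""
    payload = json.dumps(graph, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def apply_event(graph, event):
    """Hypothetical deterministic mutation: set one attribute on one node."""
    graph.setdefault(event["node"], {})[event["field"]] = event["value"]


def run_with_trace(initial_graph, events):
    """Apply events to a working copy, logging each mutation with digests."""
    graph = copy.deepcopy(initial_graph)
    trace = []
    for step, event in enumerate(events):
        before = graph_digest(graph)
        apply_event(graph, event)
        trace.append({"step": step, "event": event,
                      "before": before, "after": graph_digest(graph)})
    return graph, trace


def replay_matches(initial_graph, trace):
    """Re-run the logged events and confirm every recorded digest is reproduced."""
    graph = copy.deepcopy(initial_graph)
    for entry in trace:
        if graph_digest(graph) != entry["before"]:
            return False
        apply_event(graph, entry["event"])
        if graph_digest(graph) != entry["after"]:
            return False
    return True


initial = {"vendor:acme": {"status": "active"}}
events = [
    {"node": "vendor:acme", "field": "bank_account_changed", "value": True},
    {"node": "vendor:acme", "field": "status", "value": "under_review"},
]
final_graph, trace = run_with_trace(initial, events)
print(replay_matches(initial, trace))
```

If a logged event is altered after the fact, the replayed digests stop matching the recorded ones, which is what lets an auditor trust the trace instead of the model's summary of it.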
Who Yonyou is and why it matters
All eight authors are from Yonyou Network Technology, a major Chinese enterprise software vendor — think SAP-adjacent, ERP-tier. No independent academic co-authors, no third-party validation in the paper itself. This matters for calibration: Yonyou has an incentive to show that their tooling beats DeepSeek and Doubao, and they're measuring on tasks that align with their product portfolio. Independent replication isn't here yet.
That said, the finding is structurally sound, and the framing of illusive accuracy as a deployment-readiness filter is a genuinely useful concept regardless of who invented it. The specific thresholds (tool-chain F1 of 0.90 or above, IA below 0.30) are Yonyou's recommendations, not industry consensus. Treat them as starting points, not certification criteria.
The broader pattern this fits into: enterprise AI buyers are increasingly waking up to the fact that benchmark accuracy doesn't predict operational reliability. A model that hallucinates 15% of the time but always uses the right tools is often easier to audit than one that gets 90% of answers right but calls the wrong API. LOM-action is Yonyou's bet on that tradeoff.
What this means for agent infrastructure
For builders: the tool-chain F1 metric is worth tracking separately from accuracy. If your agent system is making downstream tool calls — not just generating text — then the fidelity of those calls is a first-class reliability concern, not a footnote.
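As a starting point for tracking the two numbers side by side, here is a minimal sketch. This article does not spell out how the paper computes tool-chain F1, so the set-based definition below (precision and recall over tool names per task, averaged across tasks) is an assumption, and the tool names and run records are invented.

```python
def tool_chain_f1(predicted, reference):
    """Set-based F1 between predicted and reference tool calls for one task."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0  # no tools expected and none called
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical evaluation records: answer correctness plus the tools called.
runs = [
    {"answer_correct": True, "predicted": ["get_invoice", "check_vendor"],
     "reference": ["get_invoice", "check_vendor"]},
    {"answer_correct": True, "predicted": ["search_docs"],
     "reference": ["get_invoice", "check_vendor"]},
    {"answer_correct": False, "predicted": ["get_invoice"],
     "reference": ["get_invoice", "check_vendor"]},
]

accuracy = sum(r["answer_correct"] for r in runs) / len(runs)
mean_f1 = sum(tool_chain_f1(r["predicted"], r["reference"]) for r in runs) / len(runs)
print(f"accuracy={accuracy:.2f} tool_chain_f1={mean_f1:.2f}")
```

The second record is the case the metric is meant to catch: the answer is marked correct, but none of the expected tools were called, so accuracy counts it as a success while tool-chain F1 scores it zero.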
For investors: the simulation-graph approach adds infrastructure overhead. You're running a graph engine alongside the model. That overhead is justified in high-compliance environments (financial services, healthcare, procurement) where the cost of a decision that cannot be audited exceeds the cost of the extra compute. In lower-stakes automation, it's probably not worth it.
For the broader agent infra ecosystem: Yonyou's framing — simulation-first, then decision — is one architecture for making agents auditable. LangGraph, AutoGen, and other frameworks are solving the same problem from different angles. The competition isn't just model-vs-model. It's architectural approach-vs-approach, measured on operational outcomes, not leaderboard scores.
The preprint is live on arXiv. Yonyou's press release with the deployment-readiness thresholds is on PR Newswire.