Who Decides What a Good AI Agent Is?
IBM Research and Hugging Face published a benchmark. Their choices about what to measure and how will become the industry's default definition of success — whether they say so or not.
IBM Research and Hugging Face released a leaderboard this week. The more consequential thing they released is a set of assumptions about what a good AI agent does, and who gets to define that. The Open Agent Leaderboard, announced May 18, is a full-factorial evaluation harness testing five agent architectures against five backbone language models across six benchmarks. It is also, in a way the announcement does not advertise, a statement about which properties of AI agents ought to be measured, rewarded, and replicated — and by extension, which ones will get funded, deployed, and built around.
The numbers are the first reason to pay attention. The researchers found that agent architecture — the scaffolding that surrounds a language model, including how it plans, calls tools, and recovers from errors — swings performance by as much as 12 percentage points on the same backbone model, according to the arXiv paper (2602.22953) revised May 11. On the leaderboard itself, the top three agents all use the same underlying model but different agent systems. The model is not the variable. The agent is.
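For readers who want to see what a full-factorial design and an "architecture swing" mean in practice, here is a minimal Python sketch. Every name in it is a hypothetical placeholder and every score is made up for illustration; none of it comes from the leaderboard, the arXiv paper, or the Exgentic code. It only shows how the grid is enumerated and how a per-backbone spread across architectures could be computed.

```python
from itertools import product

# Hypothetical placeholder labels, not the names used on the leaderboard.
ARCHITECTURES = [f"agent_{i}" for i in range(1, 6)]   # five agent architectures
BACKBONES = [f"model_{i}" for i in range(1, 6)]       # five backbone models
BENCHMARKS = [f"benchmark_{i}" for i in range(1, 7)]  # six benchmarks


def full_factorial_grid():
    """Return every (architecture, backbone, benchmark) combination."""
    return list(product(ARCHITECTURES, BACKBONES, BENCHMARKS))


def architecture_spread(scores):
    """Gap, in percentage points, between the best and worst architecture
    for each backbone.

    `scores` maps (architecture, backbone) to a success rate in percent.
    """
    by_backbone = {}
    for (_arch, backbone), score in scores.items():
        by_backbone.setdefault(backbone, []).append(score)
    return {b: max(vals) - min(vals) for b, vals in by_backbone.items()}


# Made-up scores for illustration only; these are not leaderboard results.
example_scores = {
    ("agent_1", "model_1"): 40.0,
    ("agent_2", "model_1"): 52.0,
    ("agent_1", "model_2"): 30.0,
    ("agent_2", "model_2"): 35.0,
}

print(len(full_factorial_grid()))           # 150 cells: 5 x 5 x 6
print(architecture_spread(example_scores))  # {'model_1': 12.0, 'model_2': 5.0}
```

Holding the backbone fixed and varying only the agent is what lets a spread like this be attributed to architecture rather than to the model.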
This is a direct challenge to how most teams currently build and buy AI agents. The dominant assumption for the past two years has been: pick the right foundation model, then wrap it in whatever agent logic the task requires. The IBM data suggests that calculation is incomplete. On four of the six benchmarks tested (SWE-Bench Verified, BrowseComp+, AppWorld, and tau2-Bench across airline and retail domains), general-purpose agents performed indistinguishably from heavily customized domain-specific agents. Custom per-domain tuning, the data implies, may not be where the marginal gains are.
The failed-run economics are equally pointed. IBM's blog notes that failed agent runs cost 20–54% more than successful ones — a figure the arXiv paper does not independently verify, and which should be read as an IBM Research claim rather than field-validated data. But the directional signal is not in dispute: failed agents are expensive in ways that successful ones are not, and architecture choices that reduce failure rates have direct economic value.
The third finding may be the most durable. The researchers identified what they call "generality sinks": specific combinations of benchmark and architecture where open-weight models collapse in ways that frontier closed-source models do not. This failure mode is not visible in overall leaderboard scores. Spotting it requires looking at architecture-distinctive error signatures, which the Exgentic evaluation framework was specifically designed to surface.
That framework, open-sourced on GitHub the same day, is the real infrastructure bet. Exgentic is not only a benchmark; it is a proposed evaluation protocol that the authors hope becomes a shared standard the way ImageNet became a shared standard for computer vision. The first credible open evaluation framework sets the default assumptions. Late entrants align to it or explicitly opt out.
This is the governance question the announcement raises and does not answer: what did IBM and Hugging Face choose to measure, and what did they leave out? Agent tasks are open-ended, environment-dependent, and increasingly social in ways that resist standardized scoring. The six benchmarks in Exgentic, spanning code generation, web browsing, and airline, retail, and telecom tasks, cover significant ground but are a subset of what agents actually do in production. Which task categories were excluded, and why? Were the frontier labs whose agents are being evaluated consulted on evaluation design, and if so, does that create a conflict of interest between evaluator and evaluated? How reproducible are the adversarial protocols as agent architectures evolve? A static benchmark that cannot keep pace with rapidly changing agent capabilities is a snapshot, not a standard.
IBM and Hugging Face have positioned themselves as neutral arbiters of agent quality — a credibility play in a space crowded with vendor-published benchmarks that say whatever the vendor needs them to say. That role is valuable and probably genuine. It also comes with agenda-setting power: what Exgentic measures becomes what the field optimizes for.
The leaderboard is real. The methodology is documented. The 12-point architecture effect is a genuine empirical finding, not marketing. But for anyone building or buying agent systems right now, the question worth asking is not just which agent ranks where. It is whose definition of "good" is encoded in the ranking — and whether that definition matches the properties your deployment actually needs.
The Open Agent Leaderboard is worth watching. So is who decides what goes on it.