The First AI Benchmark Built for Agents, Not Single Prompts

The first independent benchmark built for chained, multi-step AI workloads just published its opening results, and the leaderboard looks lopsided on purpose.

Artificial Analysis, the third-party measurement firm that has become a regular reference point for AI infrastructure buyers, has released the first round of results for AgentPerf, which it calls the industry's first benchmark designed for agentic AI infrastructure. On the headline metric of rack-scale efficiency, the winner is NVIDIA's Blackwell GB300 NVL72, a 72-GPU rack platform. According to NVIDIA's announcement of the results, the rack runs 20 times as many agents per megawatt as the company's previous-generation HGX H200 system.

That "agents per megawatt" figure is the part that matters, and it is a deliberate departure from how the AI industry has talked about speed for the last three years. The old yardsticks, including tokens per second and time to first token, measure a single call to a large language model, the kind of sprint you get when you ask a chatbot a question. Agentic work is a relay. An "agent," in the sense the benchmark is using, is a system that breaks a goal into many LLM calls, interleaves them with tool calls such as code execution, database lookups, and web search, and runs until the task completes. The context the model sees grows with every hop, so each later call has to handle everything accumulated before it. The work therefore compounds across the chain rather than adding up step by step, which is the key reason a single-call benchmark under-measures this class of work.
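The growth in work across a chain can be sketched with simple arithmetic. The example below rests on a deliberately simple assumption that is not drawn from the AgentPerf methodology: every step appends a fixed number of tokens to the context, and every LLM call reads the full context so far, with no reuse of previously processed tokens. The function name and the token counts are hypothetical.

```python
def total_prompt_tokens(steps: int, base_tokens: int, tokens_added_per_step: int) -> int:
    """Sum the context size the model reads across all calls in one agent run.

    Assumption (illustrative only): each call reads the whole context, and the
    context grows by a fixed amount after every step.
    """
    total = 0
    context = base_tokens
    for _ in range(steps):
        total += context                    # this call reads everything so far
        context += tokens_added_per_step    # model output and tool results are appended
    return total


# Hypothetical numbers: a 2,000-token starting prompt, 1,500 tokens added per hop.
single_call = total_prompt_tokens(1, 2_000, 1_500)
chained_run = total_prompt_tokens(20, 2_000, 1_500)

print(single_call)                 # 2000
print(chained_run)                 # 325000
print(chained_run / single_call)   # 162.5
```

Under these made-up numbers, a 20-step run reads 162.5 times as many prompt tokens as one call, not 20 times as many. Real systems can reduce that cost by reusing earlier context, so the figure illustrates the shape of the growth and is not a measurement.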

The first published round of AgentPerf is narrow in a way the headline does not advertise. It tests one model, DeepSeek V4 Pro, a mixture-of-experts (MoE) design that activates only a fraction of its parameters on any given token and is representative of large production models. It reports results at two service-level objectives (SLOs): 20 tokens per second per agent, and 60 tokens per second per agent. Those SLOs are the throughput targets the system is asked to hold while running the chained workload, and they matter because the efficiency ranking can shift depending on which one you set. The platforms compared are both NVIDIA: the new Blackwell GB300 NVL72 rack, and the prior-generation HGX H200. No accelerators from AMD or Intel appear on the leaderboard, nor do in-house chips from Google, Amazon, or Microsoft, nor any hyperscaler-internal systems, per the same announcement.
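A small sketch shows how an agents-per-megawatt ranking can flip between SLOs. It assumes, hypothetically, that the metric is the largest number of concurrent agents a system sustains while each agent still meets the SLO, divided by power draw in megawatts. The source does not spell out AgentPerf's scoring rule, so that definition, the `LoadPoint` type, and every number below are illustrative assumptions, not benchmark data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadPoint:
    concurrent_agents: int            # agents running at once on the system
    tokens_per_sec_per_agent: float   # throughput each agent observes at this load
    power_kw: float                   # system power draw at this load


def agents_per_megawatt(points: list[LoadPoint], slo_tokens_per_sec: float) -> float:
    """Largest agent count that still meets the SLO, divided by power in MW."""
    passing = [p for p in points if p.tokens_per_sec_per_agent >= slo_tokens_per_sec]
    if not passing:
        return 0.0  # the system cannot hold this SLO at any measured load
    best = max(passing, key=lambda p: p.concurrent_agents)
    return best.concurrent_agents * 1000.0 / best.power_kw


# Hypothetical load sweeps for two made-up systems. NOT AgentPerf results.
system_a = [LoadPoint(100, 80.0, 100.0), LoadPoint(400, 45.0, 110.0), LoadPoint(900, 22.0, 120.0)]
system_b = [LoadPoint(100, 95.0, 60.0), LoadPoint(300, 62.0, 70.0), LoadPoint(500, 18.0, 75.0)]

for slo in (20.0, 60.0):
    a = agents_per_megawatt(system_a, slo)
    b = agents_per_megawatt(system_b, slo)
    print(f"SLO {slo:.0f} tok/s: A={a:.0f} B={b:.0f} agents/MW")
```

With these invented sweeps, system A leads at the 20 tokens-per-second SLO (7,500 versus roughly 4,286 agents per megawatt) and system B leads at 60 (roughly 4,286 versus 1,000). A system that packs in many slow agents looks best under a loose target and can fall behind once the target tightens.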

Treating the result as a general ranking of agentic infrastructure would be premature. It is a single-vendor comparison from the first published round of a new benchmark, and the vendor that built the winning system is also the party announcing the result. Artificial Analysis is the publisher, and its methodology, workload list, and scoring rules will determine how durable the ranking turns out to be. The next rounds are where the picture gets sharper: broader model coverage, cross-vendor participation, and hyperscaler testing. Until those land, the cleanest read is that a measurement framework for chained, tool-using, multi-step AI work now exists, the first numbers favor one specific NVIDIA rack on one specific efficiency metric, and the field of competitors has not yet shown up to be measured.

The trade-off worth naming is that faster agentic throughput is not unambiguously better. The benchmark rewards systems that push many chained calls through a rack at high efficiency per watt, which is exactly the axis that matters for a data center running thousands of agents. It says less about latency on any individual step, cost per token, how long a context the system can carry, or how snappy a tool call feels. Buyers running interactive assistants will care about a different mix of those numbers than buyers running overnight batch automation, and AgentPerf in its first round is calibrated toward the latter.
