Before AI Agents Can Claim to Be Efficient, Someone Has to Define 'Efficient'
A new paper proposes the right unit for agentic energy measurement. The problem: the infrastructure to actually use it doesn't exist yet.
Ask an infrastructure team what their AI agent costs to complete a task, and you'll get a rough answer per token or per API call. Ask what a successful goal costs, and you'll get a blank look. That gap is the subject of a new paper from Deepak Panigrahy, an independent researcher, and Aakash Tyagi at Texas A&M University. Their work — Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems, posted to arXiv on May 20, 2026 — is the most serious attempt yet to articulate what agentic AI energy measurement should actually look like. It also inadvertently reveals how far the field is from being able to do it.
The paper introduces two complementary concepts. EpG, or Energy per Successful Goal, is an absolute unit: joules consumed across an entire workflow, normalized by the number of goals actually completed. This includes failed attempts and retries, which the authors argue are not noise but signal — they are part of what agentic systems do. The second concept is the Orchestration Overhead Index, or OOI: a ratio comparing agentic workflow energy to linear execution energy on identical task instances under identical success criteria. Where EpG is a standalone number, OOI requires a baseline, making it a comparative instrument.
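The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Attempt` record, the function names, and the choice to compute OOI as a ratio of two EpG values (rather than, say, an average of per-instance ratios) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One execution attempt at a goal (hypothetical record layout)."""
    goal_id: str
    joules: float
    succeeded: bool


def energy_per_successful_goal(attempts):
    """EpG: total workflow energy divided by the number of goals completed.

    Energy from failed attempts and retries stays in the numerator.
    """
    total_joules = sum(a.joules for a in attempts)
    completed = {a.goal_id for a in attempts if a.succeeded}
    if not completed:
        raise ValueError("no successful goals; EpG is undefined")
    return total_joules / len(completed)


def orchestration_overhead_index(agentic, linear):
    """OOI: agentic EpG relative to linear EpG on the same task instances.

    Assumption: computed here as a ratio of the two aggregate EpG values.
    """
    return energy_per_successful_goal(agentic) / energy_per_successful_goal(linear)


# Made-up numbers for illustration only: goal g1 fails once, then succeeds on retry.
agentic_run = [
    Attempt("g1", 30.0, False),
    Attempt("g1", 50.0, True),
    Attempt("g2", 40.0, True),
]
linear_run = [
    Attempt("g1", 20.0, True),
    Attempt("g2", 20.0, True),
]

print(energy_per_successful_goal(agentic_run))  # 60.0 J per successful goal
print(orchestration_overhead_index(agentic_run, linear_run))  # 3.0
```

Note that the failed 30 J attempt raises the agentic EpG even though it completed nothing, which is the behavior the authors argue for.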
On reasoning tasks — factual QA, arithmetic, multi-step logical reasoning — the results are stark. Across 827 agentic goals spanning five task families, the researchers found a mean OOI of 4.33x: agentic workflows consumed 888.1 joules per successful goal versus 205.3 joules for linear baselines. The mechanism is not what you might expect. The overhead is not driven by inference compute — it is driven by orchestration. Retries, intermediate planning steps, and failure-recovery cycles accumulate energy costs that inference-level accounting simply does not capture.
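As a quick check on the arithmetic, and assuming the reported mean OOI corresponds to the ratio of the two per-goal figures quoted above (the paper may instead average per-instance ratios), the numbers line up:

```latex
\mathrm{OOI} \;=\; \frac{E_{\text{agentic}}}{E_{\text{linear}}}
\;=\; \frac{888.1\ \text{J}}{205.3\ \text{J}} \;\approx\; 4.33
```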
But there is a twist. For tool-augmented tasks — workflows where an agent can call external tools rather than generating tokens from its own inference budget — OOI inverts below 1.0x. Agentic dispatch replaces costly token generation. The agent is cheaper than the linear baseline, not more expensive. The authors are careful to note that this confirms OOI responds to orchestration structure rather than imposing a fixed upward bias. It is a meaningful result, and a counterintuitive one: the claim that agents always cost more energy is simply wrong for a real and growing class of workflows.
The contribution the paper actually makes is identifying what EpG needs in order to be meaningful — and the distance between that and what exists.
Three conditions must be met. First: an agreed definition of what counts as "agentic." The paper uses a working definition (multi-step orchestration, tool use, failure recovery), but no community standard exists. Without an agreed boundary, any two EpG measurements may be counting different things. Second: an agreed criterion for what constitutes a successful goal. This is harder than it sounds. Success conditions vary by task type, user expectation, and evaluation methodology. The paper's own success criteria are specific to its experimental design. Third: independent measurement infrastructure. A-LEMS, the measurement framework the authors use, is a research tool, not a certified benchmark. The authors can measure their own systems. Nobody can yet measure a vendor's production agentic deployment with independent, reproducible tooling.
None of these three pieces exist at community scale. This is not a knock on the paper — it is the point. The paper is a diagnosis and a proposal, not a solution.
The timing matters. Regulatory pressure is arriving before the measurement infrastructure. Most of the EU AI Act's remaining obligations take effect August 2, 2026. California's SB 253 requires Scope 1 and Scope 2 greenhouse gas emissions reporting for companies with more than $1 billion in US revenue doing business in California, with first disclosures covering fiscal year 2025 due August 10, 2026. Agentic deployments, which typically consume more energy per interaction than single-turn inference, are a growing part of data center demand trajectories. Organizations will soon be required to account for this energy in ways that are reproducible, comparable, and aligned with actual task outcomes. Energy-per-inference satisfies none of those requirements for agentic workloads.
What the paper establishes is that EpG and OOI are the correct vocabulary, that A-LEMS is a plausible measurement methodology, and that the 4.33x finding is real — for the specific task families and hardware the authors used. What it does not establish is that the metric is ready for procurement comparisons, vendor benchmarking, or regulatory filing. Those applications require infrastructure that does not yet exist.
The paper's three-hash reproducibility protocol — binding measurements to a hardware fingerprint, software environment, and execution state — is the right idea. It is also a research protocol, not an audit standard. The gap between the two is not technical; it is institutional. No standards body has adopted EpG. No hyperscaler has published EpG figures. No independent lab offers EpG certification. When a vendor claims their agentic product is more energy-efficient than a competitor's, there is still no agreed method for testing that claim.
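To make the three-hash idea concrete, here is a rough Python sketch of how a measurement might be bound to hardware, software, and execution descriptors. The article does not describe the paper's actual protocol in detail, so everything here is an assumption for illustration: the function names, the use of SHA-256 over canonical JSON, and the contents of each descriptor are hypothetical.

```python
import hashlib
import json


def digest(payload: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators).

    Assumption: canonical JSON is one plausible way to get a stable hash;
    the paper may use a different encoding or hash function.
    """
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def bind_measurement(joules: float, hardware: dict, software: dict, execution: dict) -> dict:
    """Attach three independent hashes to one energy reading."""
    return {
        "joules": joules,
        "hardware_hash": digest(hardware),
        "software_hash": digest(software),
        "execution_hash": digest(execution),
    }


def same_conditions(a: dict, b: dict) -> bool:
    """Two readings are comparable only if all three hashes match."""
    keys = ("hardware_hash", "software_hash", "execution_hash")
    return all(a[k] == b[k] for k in keys)


# Hypothetical descriptors; real ones would be collected from the machine.
hardware = {"cpu": "example-cpu", "ram_gb": 32}
software = {"runtime": "llama_cpp", "os": "linux"}
execution = {"seed": 1, "task_family": "arithmetic"}

first = bind_measurement(10.0, hardware, software, execution)
second = bind_measurement(12.0, hardware, software, execution)
print(same_conditions(first, second))  # True: same conditions, different energy
```

A scheme like this makes two readings mechanically comparable, but it does nothing to establish who collected the descriptors or whether they are honest, which is the institutional gap rather than the technical one.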
The practical implication for type0 readers is specific. Procurement teams evaluating agentic platforms should ask what a successful goal costs in joules — and then ask which EpG measurement methodology was used, who validated it, and against what success criteria. Compliance teams preparing for EU AI Act and SB 253 obligations should note that energy-per-inference is insufficient for agentic workloads and begin tracking when goal-level measurement standards emerge. Infrastructure architects should understand that OOI is not a fixed tax — tool-augmented workflows with good dispatch design can reduce energy costs relative to linear execution.
The paper is a preprint, not yet peer-reviewed. Its experimental results use local llama_cpp inference, which has different energy characteristics than API-based agentic deployments on commercial cloud infrastructure. The 4.33x finding, while well-motivated, should not be treated as a general industry benchmark. These are standard caveats for pre-publication technical work — and they are also precisely the reasons the paper cannot yet serve as the measurement foundation it proposes to build.
EpG is the right idea at the right time. The infrastructure to operationalize it is missing. That gap — not the metric itself — is what matters now.