The Per-Token Tax on AI Agents Finally Has a Challenger
Xiaomi's new open-source model can build a complete compiler in 4.3 hours. The question is whether the cost savings are real.
When a model can write a full compiler from scratch in a single afternoon, the question is no longer whether AI agents can handle serious engineering work. It is whether you can afford to let them run long enough to finish.
Xiaomi's MiMo-V2.5-Pro, released April 27 under a permissive MIT license, is a 1.02-trillion-parameter Mixture-of-Experts model that activates 42 billion parameters per token; each token therefore touches roughly 4 percent of the network's weights, which keeps per-token compute closer to a 42-billion-parameter dense model than to a trillion-parameter one. The headline claim is a 40 to 60 percent reduction in tokens consumed per task compared to Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro at equivalent capability on the ClawEval benchmark for agentic tasks. If that number holds, it reprices the economics of every long-horizon agent deployment.
The most direct evidence Xiaomi offers is a demo, not a press release. MiMo-V2.5-Pro was given the specification for a complete SysY compiler, the kind of multi-week assignment handed to computer science students in Peking University's Compiler Principles course, and asked to build it autonomously. It finished in 4.3 hours, using 672 tool calls, and scored 233 out of 233 on the hidden test suite. The model did not complete the task by luck. It first scaffolded the full pipeline, then worked through each layer systematically: lexer and parser, then the Koopa IR stage, then the RISC-V backend, then performance optimization. When a refactoring pass introduced regressions at step 512, it diagnosed the failures and corrected them before proceeding. Xiaomi calls this "harness awareness," the model's capacity to manage its own working context across long trajectories. The number that matters is not the 4.3 hours. It is the 233 out of 233 on tests the model had never seen.
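For readers who have not taken a compilers course, here is a toy sketch of the same four-layer structure in Python: lexer, parser, a Koopa-style intermediate representation, and a RISC-V backend (stubbed in a comment). It compiles a single arithmetic expression and is purely illustrative; nothing below is Xiaomi's code, and a real SysY compiler handles a full C-like language with operator precedence, control flow, and register allocation.

```python
import re
from itertools import count

def lex(src: str) -> list[str]:
    """Lexer: split source text into number and operator tokens."""
    return re.findall(r"\d+|[+*]", src)

def parse(tokens: list[str]):
    """Parser: fold tokens into a left-associative AST (toy: no precedence)."""
    node: object = int(tokens[0])
    for i in range(1, len(tokens), 2):
        node = (tokens[i], node, int(tokens[i + 1]))
    return node

def lower(node, ir: list[str], temps) -> str:
    """IR stage: emit Koopa-flavored SSA instructions, return a value name."""
    if isinstance(node, int):
        return str(node)
    op, lhs, rhs = node
    l = lower(lhs, ir, temps)
    r = lower(rhs, ir, temps)
    name = f"%{next(temps)}"
    ir.append(f"{name} = {'add' if op == '+' else 'mul'} {l}, {r}")
    return name

# A real backend would then map each IR instruction to RISC-V assembly
# (e.g. "add t0, t1, t2") and allocate registers; omitted in this toy.
ir: list[str] = []
lower(parse(lex("1 + 2 * 3")), ir, count(1))
print("\n".join(ir))
# %1 = add 1, 2
# %2 = mul %1, 3
```

The demo's achievement is not any one of these layers, which are undergraduate material. It is keeping all of them consistent with each other across 672 tool calls without a human checking the seams.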
Xiaomi ran two other long-horizon demos. The model produced an 8,192-line desktop video editor across 1,868 tool calls in 11.5 hours of autonomous work. It also designed and optimized a graduate-level analog circuit, an FVF-LDO regulator in TSMC's 180nm process, by running it through an ngspice simulation loop and iteratively improving four key metrics by an order of magnitude over its own initial attempt.
The token efficiency claim is specific and measurable. On ClawEval, MiMo-V2.5-Pro achieved a 64 percent Pass@3 score using approximately 70,000 tokens per trajectory. The company's benchmark charts place it in the upper-left corner of the cost-capability plane: high score, low token consumption. For teams running agents that plan, call tools, write code, and recover from errors over extended sessions, that corner is where the math starts to work.
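Those two figures can be folded into a single rough metric. The sketch below assumes every task runs the full three attempts at roughly 70,000 tokens each, then divides total tokens by the fraction of tasks that pass. Both assumptions are ours for illustration, not anything Xiaomi has published; real harnesses stop early on success, which would lower the figure.

```python
# Rough "tokens per successful task" from the article's ClawEval figures.
# Assumption (ours, not Xiaomi's): all 3 attempts always run, each
# consuming ~70k tokens.

TOKENS_PER_TRAJECTORY = 70_000
ATTEMPTS = 3
PASS_AT_3 = 0.64

def tokens_per_success(tokens: int, attempts: int, pass_rate: float) -> float:
    """Total tokens spent, spread over the tasks that actually pass."""
    return tokens * attempts / pass_rate

print(f"{tokens_per_success(TOKENS_PER_TRAJECTORY, ATTEMPTS, PASS_AT_3):,.0f}")
# ≈ 328,125 tokens burned per solved task, under these assumptions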
API pricing confirms the direction. MiMo-V2.5-Pro costs $1 per million input tokens and $3 per million output tokens for context windows up to 256,000 tokens, with input costs dropping to as little as $0.20 per million on cache hits. A comparable capability level on Claude Opus 4.7 runs $5 per million input and $25 per million output. The VentureBeat pricing table, which compiled rates across more than 20 providers, puts MiMo-V2.5-Pro in the bottom quartile of cost for top-tier agentic performance.
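Putting those prices next to the trajectory length makes the gap concrete. The sketch below prices one roughly 70,000-token trajectory at each vendor's list rates; the 80/20 input/output split is an assumption for illustration, and holding the token count constant across both models is deliberately conservative, since Xiaomi's claimed 40 to 60 percent token reduction would widen the gap further.

```python
# Price one agent trajectory at the list rates quoted in the article.
# Assumption (illustrative): 80% of the ~70k tokens are input, 20% output.

TOKENS = 70_000
INPUT_SHARE = 0.8

def trajectory_cost(in_price: float, out_price: float) -> float:
    """Dollar cost of one trajectory at per-million-token rates."""
    in_tok = TOKENS * INPUT_SHARE
    out_tok = TOKENS * (1 - INPUT_SHARE)
    return (in_tok * in_price + out_tok * out_price) / 1_000_000

mimo = trajectory_cost(1.00, 3.00)   # MiMo-V2.5-Pro, no cache discount
opus = trajectory_cost(5.00, 25.00)  # Claude Opus 4.7 list prices

print(f"MiMo-V2.5-Pro: ${mimo:.3f}  Claude Opus 4.7: ${opus:.3f}  "
      f"ratio: {opus / mimo:.1f}x")
# MiMo-V2.5-Pro: $0.098  Claude Opus 4.7: $0.630  ratio: 6.4x
```

Pennies per trajectory either way, but agent fleets run trajectories by the thousands, and that multiplier compounds on every retry.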
"The key benchmark signal is not just accuracy, but tokens per successful task," Pareekh Jain, CEO of Pareekh Consulting, told InfoWorld. "Frontier models often reach higher success rates on complex coding benchmarks, but do so with massive reasoning overhead. MiMo-V2.5 is designed for token efficiency, meaning it achieves comparable results with significantly fewer input and output tokens."
Ashish Banerjee, a senior principal analyst at Gartner, was more direct about the structural implication. "When tasks stretch into millions of tokens, metered proprietary APIs stop looking like a convenience and start looking like a tax on iteration," he said. "By contrast, MiMo's MIT license, open weights, 1M-token context window, and relatively low pricing make private-cloud or self-hosted deployment strategically credible."
That is the economics argument. The asterisk is the same one that has followed every Chinese AI release this cycle: origin. Lian Jye Su, chief analyst at Omdia, noted that adoption may face challenges because Chinese-origin models can trigger concerns in regulated Western organizations. Financial services, healthcare, and government contractors operate under compliance frameworks that make vendor selection a non-trivial question. MIT licensing resolves the legal ambiguity around commercial use. It does not resolve the political risk that regulators or enterprise security teams may treat a Xiaomi model the way they treated Huawei.
The benchmark caveat deserves its own paragraph. Every number Xiaomi has published about MiMo-V2.5-Pro is Xiaomi's own measurement. The 40 to 60 percent token reduction, the 233 out of 233 on the SysY compiler, the SWE-bench Pro score of 57.2 percent that edges past Claude Opus 4.6's 53.4 percent, the ClawEval 64 percent Pass@3: all of it comes from Xiaomi's internal evaluation suite. Independent third-party testing has not yet replicated these results. The economics angle lives or dies by the token-reduction number. If the reduction is real, the story is that the per-token API model has its first credible open-source challenger at the agentic capability frontier. If the benchmarks are substantially inflated, the story is another vendor claiming parity at a discount with self-reported data.
The model is real and the weights are downloadable from HuggingFace. The architecture is documented in full on the model card. Teams that want to verify the claims can do so without asking permission or paying for an API key. That is a meaningful difference from closed models, where the only verification path is running the same prompts against the vendor's API and hoping the results match the marketing.
What Xiaomi has not disclosed is what the model fails at. Every autonomous demo succeeded. No regression data, no failure-mode analysis, no side-by-side comparison against a frontier model on the same task where one finishes and the other does not. Real agent deployments encounter prompts that break models, sessions that overrun the context window, and failure cascades that require human intervention. The demos Xiaomi published are not that.
The closest thing to a verifiable data point is the model card on HuggingFace, which includes base-model evaluation results on standard benchmarks, including MMLU, HumanEval+, and LiveCodeBench. The numbers are consistent with Xiaomi's claims about general capability, but these are not agentic tasks. They measure whether a model can answer questions or write code snippets. They do not measure whether a model can sustain coherence across 1,868 tool calls while building a video editor.
For teams evaluating what to run in production today, the honest summary is this: MiMo-V2.5-Pro has the architecture, the license, and the pricing to be a credible agentic workhorse. It has a plausible token efficiency advantage over closed frontier models on the benchmarks that matter for long-horizon tasks. It has not been independently verified. The MIT license means you can run it without per-token billing. The Chinese origin means your legal and compliance team will have questions. The self-reported benchmarks mean you should test it yourself before betting your agent pipeline on the numbers.
What Xiaomi has done is lower the price floor for "good enough" agentic capability. That floor just moved. Whether the model lives up to it is a question for engineers, not press releases.
Xiaomi released MiMo-V2.5 and MiMo-V2.5-Pro on April 22 and April 27, 2026, respectively. The models are available via API and as open weights on HuggingFace.