Microsoft Says Small Browser Agents Can Beat Frontier Systems. The Benchmarks Haven't Been Verified.
Microsoft released a browser agent this week that, by its own numbers, beats OpenAI's Operator and Google's Gemini 2.5 Computer Use at a fraction of the compute cost. The reported scores are striking. The catch is that nobody except Microsoft has verified them.
The core claim: Microsoft's Fara1.5-27B scores 72 percent on Online-Mind2Web, a benchmark covering 300 tasks across 136 live websites (Microsoft Research). On the same test, OpenAI's Operator scores 58.3 percent and Google's Gemini 2.5 Computer Use scores 57.3 percent (MarkTechPost). Yutori's Navigator n1, a focused competitor, reaches 64.7 percent. The predecessor model, Fara-7B, scored 34.1 percent; Fara1.5-9B nearly doubled that at 63.4 percent.
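For readers who want to check the arithmetic behind those comparisons, here is a minimal Python sketch. The scores are the ones reported above; the helper names are illustrative, and the conversion from percentage points to a task count assumes a single run in which all 300 tasks are weighted equally, which the sources do not confirm.

```python
# Scores as reported in this article (percent on Online-Mind2Web).
scores = {
    "Fara1.5-27B": 72.0,
    "Navigator n1": 64.7,
    "Fara1.5-9B": 63.4,
    "Operator": 58.3,
    "Gemini 2.5 Computer Use": 57.3,
    "Fara-7B": 34.1,
}

TASKS = 300  # benchmark size as reported


def lead_in_points(a: str, b: str) -> float:
    """Gap between two models in percentage points (hypothetical helper)."""
    return round(scores[a] - scores[b], 1)


def approx_task_gap(a: str, b: str) -> int:
    """Rough number of tasks the gap represents.

    Assumption: one run, every task weighted equally.
    """
    return round(lead_in_points(a, b) * TASKS / 100)


def ratio(a: str, b: str) -> float:
    """How many times larger model a's score is than model b's."""
    return round(scores[a] / scores[b], 2)


if __name__ == "__main__":
    for rival in ("Operator", "Gemini 2.5 Computer Use", "Navigator n1"):
        print(rival, lead_in_points("Fara1.5-27B", rival), "points")
    print("Fara1.5-9B vs Fara-7B ratio:", ratio("Fara1.5-9B", "Fara-7B"))
```

Under those assumptions, the 13.7-point lead over Operator works out to roughly 41 of the 300 tasks, and the 9B model's score is about 1.86 times its predecessor's, which is where "nearly doubled" comes from.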
Before the numbers get treated as settled fact, the credibility problem: Browserbase, the evaluation firm that partnered with Microsoft on Fara-7B and has called LLM judging unreliable for computer-use benchmarks, found that UI-Tars 1.5-7B "performed far worse than its published numbers" in independent testing (Browserbase). When Browserbase ran human-verified checks against LLM-judged results, they found that "many tasks reported as 'successful' were incomplete or incorrectly executed, including cases where an LLM judge passed trajectories that a human evaluator would not consider correct" (Browserbase). The variation stems from differing prompt templates, threshold settings, and model versions across evaluation runs: the same trajectory can be ruled correct or incorrect depending on which judge evaluates it. Websites in live benchmarks also break, redesign, or rate-limit, producing inconsistent results across runs.
Browserbase has not published independent results for Fara1.5-27B. Microsoft's 72 percent figure comes from the company's own evaluation stack. The solver that scores 83 percent on WebJudge is GPT-5.4, the same frontier model Fara1.5 is meant to beat (MarkTechPost). OpenAI and Google did not respond to requests for comment before publication.
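To make the judging problem concrete, here is a small Python sketch of how a human-verified check can be compared against LLM-judged results. It is not Browserbase's or Microsoft's method; the `Verdict` record, the `summarize` helper, and the five sample verdicts are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    task_id: str
    llm_pass: bool    # what the LLM judge ruled
    human_pass: bool  # what a human evaluator ruled


def summarize(verdicts: list[Verdict]) -> dict:
    """Compare LLM-judged and human-verified outcomes on the same trajectories."""
    n = len(verdicts)
    return {
        # Success rate a leaderboard would report if it trusted the LLM judge.
        "llm_success_rate": sum(v.llm_pass for v in verdicts) / n,
        # Success rate after human verification.
        "human_success_rate": sum(v.human_pass for v in verdicts) / n,
        # Share of tasks where both judges reached the same ruling.
        "agreement": sum(v.llm_pass == v.human_pass for v in verdicts) / n,
        # Tasks the LLM judge passed but a human would not.
        "false_passes": [
            v.task_id for v in verdicts if v.llm_pass and not v.human_pass
        ],
    }


# Hypothetical sample data, invented purely for illustration.
sample = [
    Verdict("t1", True, True),
    Verdict("t2", True, False),
    Verdict("t3", True, True),
    Verdict("t4", True, False),
    Verdict("t5", False, False),
]

if __name__ == "__main__":
    print(summarize(sample))
```

In this made-up sample, the LLM judge reports an 80 percent success rate while human review supports only 40 percent. A gap of that kind is what independent, human-verified testing of Fara1.5-27B would either confirm or rule out.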
The models use Qwen3.5 as a base in 4B, 9B, and 27B configurations (Microsoft Research). Training ran on roughly two million samples, 60 percent from real web trajectories (Microsoft Research). Browserbase has said that small computer-use models like Fara-7B may meaningfully change the unit economics of browser agents: they are engineered for lower deployment cost, with sub-second inference that in principle enables running multiple agents in parallel (Browserbase). Actual per-task cost data is not publicly available, but the architecture suggests a cost differential that independent testing could verify.
The practical implication for non-specialists: today, automating a browser task through a frontier model API — filling out a form, navigating a multi-step workflow, extracting data from a dynamic web page — costs money per operation. If a 27B open-weight model genuinely competes with frontier products on these tasks, the cost structure for browser automation could shift. Tasks that currently require expensive API calls per action could become local, parallelizable operations — which changes what automation products can cost to build and who can afford to build them.
That's the story beyond the leaderboard. If the benchmark numbers hold, the implication is straightforward: browser automation, a core use case for the most expensive AI products on the market, could become commodity infrastructure rather than a premium capability. Frontier labs that have positioned computer use as a reason to pay premium API prices would need to demonstrate a capability lead that matches the price premium. If the benchmarks don't hold, and LLM judging inflated these scores the way it inflated the published numbers for UI-Tars 1.5-7B, the competitive pressure on OpenAI and Google evaporates along with Fara1.5's apparent advantage.
Short version: the economics thesis is plausible and the benchmark scores are unverified. Both of those things are true at the same time, and that's the story.