Google DeepMind Cuts Web Agent Failures by 10 Percentage Points
The hard problem in long-horizon agent design isn't getting started — it's staying on track.

Web navigation agents routinely set off toward a goal, encounter something unexpected halfway through, and quietly veer into failure without any visible signal that they've lost the thread. A preprint posted to arXiv on March 20 from five researchers at Google DeepMind names this failure mode explicitly — and proposes two separate mechanisms for fighting it.
The paper, authored by Taiyi Wang, Sian Gooding, Florian Hartmann, Oriana Riva, and Edward Grefenstette, introduces what they call a subgoal-driven framework. It's really two separate tools bundled under a common argument: one for improving proprietary models at inference time, another for training open models via reinforcement learning. The distinction matters because these aren't interchangeable — they solve different parts of the same problem at different points in the pipeline.
The inference-time piece uses a planner model to decompose the final goal into subgoals in real time. When the agent gets new information — a dynamic page load, an unexpected form, a redirect — the planner reorients rather than letting the agent drift. The researchers tested this against Gemini (Google DeepMind's proprietary large language model) on WebArena-Lite, a benchmark for autonomous web navigation tasks. The result is roughly a 10 percentage point absolute improvement in success rate. That's not a trivial lift, but it's also not something you can replicate unless you're running Gemini; this technique is explicitly designed around proprietary model access.
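The paper doesn't publish this loop, but the mechanism it describes — decompose, act, and re-plan when the page surprises you — can be sketched roughly as follows. `call_planner`, `call_actor`, and `get_observation` are hypothetical stand-ins for the LLM and browser calls, stubbed here so the control flow runs on its own:

```python
# Illustrative sketch of an inference-time subgoal loop; the function
# names and signatures are invented, not from the paper.

def call_planner(goal, history):
    # Stub: a real planner would prompt an LLM (e.g., Gemini) to
    # decompose the goal into an ordered list of subgoals.
    return [f"subgoal for {goal}"]

def call_actor(subgoal, observation):
    # Stub: a real actor would emit a browser action (click, type, ...).
    return {"action": "noop", "unexpected": False}

def run_agent(goal, get_observation, max_steps=20):
    subgoals = call_planner(goal, history=[])
    history = []
    step = 0
    while subgoals and step < max_steps:
        current = subgoals[0]
        obs = get_observation()
        result = call_actor(current, obs)
        history.append((current, obs, result))
        if result.get("unexpected"):
            # Re-plan instead of drifting: the planner sees the new
            # state via history and emits a fresh subgoal list.
            subgoals = call_planner(goal, history)
        else:
            subgoals = subgoals[1:]  # subgoal satisfied, move on
        step += 1
    return history
```

The key design point is that re-planning is triggered by the observation, not by a fixed schedule — the agent only pays the planner-call cost when the page deviates from expectations.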
The open-model story is different, and more interesting to anyone who ships agent infrastructure. The researchers introduce MiRA — short for Milestoning your Reinforcement Learning Enhanced Agent — an offline RL fine-tuning framework that replaces sparse end-task rewards with milestone-based dense rewards. The idea is simple enough: if the agent only gets a signal when it succeeds or fails the whole task, it has very little information about which steps mattered. Milestone rewards inject signals at intermediate checkpoints, giving the training process something to work with across the full action sequence.
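MiRA's actual reward pipeline isn't public, but the contrast between a sparse end-task reward and milestone-based dense rewards can be shown with a toy sketch. The milestone predicates, bonus weight, and string-valued states below are invented for illustration and are not MiRA's formulation:

```python
# Toy contrast between sparse and milestone-shaped rewards for an
# episode; all names and values here are illustrative assumptions.

def sparse_reward(trajectory, task_succeeded):
    # Baseline: one signal at the end of the whole episode.
    return [0.0] * (len(trajectory) - 1) + [1.0 if task_succeeded else 0.0]

def milestone_reward(trajectory, milestones, task_succeeded, bonus=0.25):
    # Dense variant: each step earns partial credit for any milestone
    # (a predicate on the observed state) it completes for the first time.
    rewards = []
    remaining = list(milestones)
    for state in trajectory:
        hit = [m for m in remaining if m(state)]
        for m in hit:
            remaining.remove(m)  # each milestone pays out once
        rewards.append(bonus * len(hit))
    rewards[-1] += 1.0 if task_succeeded else 0.0
    return rewards
```

For a trajectory `["home", "search_results", "item_page", "cart"]` with milestones for reaching the search results and the cart, the sparse scheme yields `[0.0, 0.0, 0.0, 1.0]` while the milestone scheme yields `[0.0, 0.25, 0.0, 1.25]` — the training process now sees which intermediate steps contributed, which is the point of the dense signal.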
Applied to Gemma3-12B, Google DeepMind's open-weights model, MiRA lifts the WebArena-Lite success rate from 6.4% to 43.0%. That's a substantial jump. The paper benchmarks against WebRL (38.4%) from Tsinghua University's THUDM group and frames the result as a new open-model state of the art. But WebAgent-R1, from an Amazon team accepted at EMNLP 2025, hit 44.8% on the same benchmark using a Llama-3.1-8B model — a result that appears in the paper's related-work section, not its headline comparison. At 43.0%, MiRA sits below WebAgent-R1, so the state-of-the-art framing depends on what counts as a comparable baseline: different training data, different base models, different evaluation conditions. WebRL's code is open source on GitHub; MiRA's is not.
The failure analysis embedded in the paper is arguably its most durable contribution. The researchers quantify specific failure modes by name — "mid-task stuck" patterns where agents stop making progress on subgoals — and show that roughly 50 percent of Gemini-2.5-Pro failures without the subgoal framework fall into this category, with over 30 percent observed even for the SFT-tuned Gemma3-12B. A named taxonomy for how agents fail is more useful to practitioners than a benchmark number, because it's actionable regardless of which model you're running. "My agent hits mid-task-stuck patterns" is something you can debug. "My agent doesn't score 43% on WebArena-Lite" isn't.
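To make the "actionable" claim concrete: a mid-task-stuck condition is easy to instrument once you log subgoal progress per step. The window size and progress signal below are arbitrary choices for illustration, not taken from the paper:

```python
# Minimal heuristic detector for a "mid-task stuck" pattern: no new
# subgoal completed within a recent step window. Window size and the
# choice of progress signal are illustrative assumptions.

def detect_stuck(progress_log, window=5):
    """progress_log: per-step cumulative count of completed subgoals."""
    if len(progress_log) < window:
        return False  # not enough steps to judge
    recent = progress_log[-window:]
    return recent[0] == recent[-1]  # no progress across the window
```

A monitor like this can abort or re-plan a run long before the end-of-task failure signal arrives, which is exactly the gap the paper's taxonomy exposes.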
For context on where this sits in the broader deployment picture: Google DeepMind launched its Gemini 2.5 Computer Use model in October 2025, which powers Project Mariner (a web agent product) and several Firebase tooling integrations. That model scores around 75 percent on aggregate UI task benchmarks — but drops to 36 percent on open-ended web navigation tasks like the ones WebArena-Lite targets. The inference-time subgoal technique in this paper is directly relevant to closing that gap in production systems.
What's missing is code. The WebRL baseline they're competing against has a public repository. This paper does not. For the MiRA technique to move from interesting academic result to usable infrastructure, Google DeepMind would need to release training code, model weights, or both. The Gemma3-12B base is open, but fine-tuning MiRA requires the milestone-based reward pipeline, which currently lives only in a preprint.
The subgoal decomposition contribution for proprietary models is immediately useful to anyone running Gemini in an agentic context — and Google is almost certainly already deploying a version of this internally. The RL training contribution for open models is promising but not yet reproducible. That's an honest summary of where the infrastructure actually stands.

