Google Built a Benchmark for AI Coding Agents. It Used the Company's Own Bugs.

PREVIEWGoogle Built a Benchmark for AI Coding Agents. It Used the Company's Own Bugs. · MD

Jules is Google's asynchronous AI coding agent: it clones a repository, reads the codebase, and proposes code changes while the developer does something else. On June 22, the team behind it published a developers blog post and a companion arXiv preprint arguing that the field has been measuring the wrong thing about coding agents.

Existing public benchmarks such as SWE-Bench, a leaderboard that grades AI coding agents on narrowly defined bug fixes, reward whether a model can complete a discrete task. The Google team's claim is that production coding agents have already passed that bar. The work that now matters is goal-level: figuring out what the developer is trying to build, exploring the codebase, and surfacing findings without being told where to look. The team calls the first stage autonomy and the second proactivity, and argues that no public benchmark yet grades the second.

This is also where the tension lives. The benchmark Google is publishing is built from 705 bugs and 1,178 individual code changes (called CLs in Google's internal terminology) drawn from internal Google codebases. The vocabulary that frames the entire proposal, including insight policy, aspirational goals, temporal proximity, and semantic similarity, comes from Google's own paper. The paper's venue and peer-review status are not yet confirmed in the publicly available source material, and no outside lab has replicated the measurement. For a category whose product surface is whether an agent interrupts at the wrong moment, the standard for "good" is being written by the same company shipping the assistant.

Proactivity, in the framing Google uses in the paper and the blog, is a different kind of problem from task execution. An autonomous agent runs the task you hand it. A proactive agent decides what deserves your attention, what evidence supports that judgment, and whether to interrupt you, ask a question, draft a change, or stay silent. The proposed architecture treats the codebase as a stream of context, maintains a model of what the developer is doing, and emits one of four insight types: a notification, a question, a draft, or silence. The choice between those four is what the Google team calls an "insight policy," and the argument is that a good policy can be trained from how developers respond to each emission.

Jules, the product this measurement is meant to evaluate, graduated to general availability earlier in 2026. The unit framing the measurement question is Google Labs, Google's applied-AI research group. A practical third-party walkthrough describes the developer workflow: hand Jules a goal, let it run against the repo, and review the resulting pull request. The question the Google team is now naming is what happens before that pull request: whether the agent noticed the right thing, asked the right question, or correctly stayed quiet.

The next test is external validation. SWE-Bench earned credibility the way public benchmarks usually do, by being run, contested, and revised by outside teams. Until an independent research group, a competing vendor, or a developer team running a coding agent on real production code publishes results against this benchmark, proactivity remains a category whose standard of good is being defined inside one company. The benchmark Google just published is the first version of that standard. Whether outside teams adopt, contest, or fork it will determine whether proactivity becomes a measurable capability or a marketing term.

Google Built a Benchmark for AI Coding Agents. It Used the Company's Own Bugs. — type0 | type0

Google Built a Benchmark for AI Coding Agents. It Used the Company's Own Bugs.

Sources