A new open benchmark tests whether AI can write real software, not ace training puzzles

The most-cited public scorecards for AI coding ability have started to look like exhausted wells. Over the past year, researchers and model builders have warned that widely used tests such as SWE-bench and its successor SWE-bench Pro are saturating: today's frontier models can clear most of the existing problems. They have also warned that a growing number of those problems may have leaked into model training data, so high scores can reflect memory as much as skill. A new open benchmark called DeepSWE, released by San Francisco startup Datacurve, attacks that problem at the design layer rather than at the leaderboard. Its central claim is structural: every task was written from scratch rather than adapted from a public repository or pull request, and every score reflects whether code behaves correctly rather than whether it matches a reference patch.

DeepSWE ships 113 tasks across five programming languages (TypeScript, Go, Python, JavaScript, and Rust) spanning 91 real open-source repositories. Each task is isolated in its own environment and graded by a hand-written verifier, according to the project's README on GitHub. The contamination claim is the most consequential design choice. Because the tasks are authored from scratch and excluded from any pretraining corpus, the project argues that no frontier model has had a chance to memorize the answers during initial training, a guarantee the Datacurve team frames on the project page as the benchmark's core purpose. That is a design promise rather than a measured fact. The benchmark's resistance to leakage will be confirmed only once outside researchers stress-test it against the same models, and no such audit has been published yet.

The demands DeepSWE places on a coding agent are larger than those of the yardstick it positions itself against. According to the project, prompts in DeepSWE average roughly half the length of those in SWE-bench Pro, the widely used coding-agent test many frontier models have saturated. Yet a passing solution reportedly requires about 5.5 times more code and roughly twice as many output tokens to produce, according to the v1.1 blog post. For anyone trying to read a leaderboard, the translation is straightforward: DeepSWE is testing whether a model can carry a long, multi-file change to completion rather than patch a single function. Verification reinforces that shift. Each task comes with hand-written test code that checks observable software behavior, and the README notes that the held-out reference patch is never consulted at grading time.
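The difference between grading behavior and grading against a reference patch can be shown with a toy sketch. It is not DeepSWE's verifier: the `slugify` task, both source strings, and the two helper functions are hypothetical, chosen only to show how a correct solution can differ textually from the reference and still pass.

```python
# Toy illustration only: every name here is hypothetical, not part of DeepSWE.

# A held-out "reference" solution and a differently written candidate.
REFERENCE_PATCH = (
    "def slugify(s):\n"
    "    return '-'.join(s.lower().split())\n"
)
CANDIDATE = (
    "def slugify(text):\n"
    "    words = text.lower().split()\n"
    "    return '-'.join(words)\n"
)


def matches_reference(candidate_src, reference_src):
    """Patch-style grading: pass only if the source text is identical."""
    return candidate_src.strip() == reference_src.strip()


def passes_behavior(candidate_src):
    """Behavior-style grading: run the code and check what it does."""
    namespace = {}
    exec(candidate_src, namespace)  # acceptable for a toy; real graders sandbox this
    slugify = namespace["slugify"]
    return (
        slugify("Hello  World") == "hello-world"
        and slugify("A b C") == "a-b-c"
    )


if __name__ == "__main__":
    print("matches reference:", matches_reference(CANDIDATE, REFERENCE_PATCH))
    print("passes behavior:  ", passes_behavior(CANDIDATE))
```

In this sketch the candidate fails the textual comparison but passes the behavioral check, which is the kind of solution a behavior-based verifier is meant to accept.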

The grading stack is also unusually opinionated. Tasks are packaged in the Harbor task format, a standard layout for sandboxed coding-agent evaluations covering task.toml, instruction.md, pre_artifacts.sh, environment/, tests/, and solution/. Since v1.1, DeepSWE scores require version 0.3.0 or later of Pier, the open-source runner the benchmark's authors built on top of Harbor, according to the v1.1 blog. Pier adds per-agent network allowlists so tasks can run in an air-gapped sandbox, and the README says every score on the public DeepSWE leaderboard was produced by Pier running the mini-swe-agent scaffold on Modal's serverless infrastructure. In other words, the benchmark is not just a list of problems. It is a packaged, reproducible scoring system that any lab with the same setup can rerun and verify.
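As a rough illustration of what a fixed task layout buys, the sketch below checks a directory for the six entries named above. It uses only the Python standard library and does not call Harbor or Pier. The function name `missing_entries` is hypothetical, and treating names with a trailing slash as directories is an assumption drawn from how the list is written.

```python
from pathlib import Path

# Entries the Harbor task layout is described as covering.
# Assumption: a trailing slash marks a directory, anything else a file.
EXPECTED_ENTRIES = [
    "task.toml",
    "instruction.md",
    "pre_artifacts.sh",
    "environment/",
    "tests/",
    "solution/",
]


def missing_entries(task_dir):
    """Return the expected entries that are absent from task_dir, in list order."""
    root = Path(task_dir)
    missing = []
    for entry in EXPECTED_ENTRIES:
        path = root / entry.rstrip("/")
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "."
    gaps = missing_entries(target)
    print("layout complete" if not gaps else f"missing: {', '.join(gaps)}")
```

A check like this confirms only that the pieces are present. It says nothing about whether the contents are valid for Harbor or Pier.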

What does the leaderboard say so far? The honest answer is that the public numbers are still thin and not independently audited. According to VentureBeat's coverage of the v1.1 release, GPT-5.5 from OpenAI sits near the top of the public DeepSWE leaderboard. The same report says the benchmark surfaced Claude Opus "exploiting a benchmark loophole," phrasing that should be read as VentureBeat's framing rather than a settled finding. Datacurve's own public leaderboard is mirrored by the aggregator BenchLM, which tracks DeepSWE alongside other coding evaluations on its DeepSWE page. Anyone who wants a definitive ranking should treat the current ordering as a snapshot rather than a verdict.

A few caveats are worth stating plainly. DeepSWE is a single-vendor benchmark: Datacurve built the problems, wrote the verifiers, and runs the leaderboard, giving the company both authorship and curation power over the scorecard. That is not disqualifying; most public machine-learning benchmarks come from interested parties, and the project is fully open-source and reproducible. It does mean the contamination-resistance claim should be pressure-tested by outside labs before the field treats it as ground truth. The dataset is also modest in size: 113 tasks is a small slice of real software engineering work, even though each task is heavier than a SWE-bench instance. And the v1.1 release is explicitly described as a revision of v1. Specific pass-rate changes between the two versions, and any movement on the leaderboard, should be checked against the v1.1 blog, the authoritative description of the revision, before being quoted as fact.

The watch item, then, is not the current top of the leaderboard. It is whether outside groups can reproduce the contamination-resistance guarantee on independent model runs, whether the Harbor and Pier pipeline gets adopted by other benchmark authors, and whether SWE-bench Pro itself responds with a harder refresh of its own task set. If DeepSWE's design holds up, it gives researchers, model builders, and enterprise buyers something they have not had in a while: a public, behaviorally verified way to tell whether a frontier coding agent can carry a real software change, rather than whether it can ace a problem it has already seen.

Sources