A new public benchmark called SWE-Pro puts 102 real software performance tasks in front of large language models, the AI systems that power coding assistants like GitHub Copilot, and the early score is near zero. Human-written fixes run the code 15.5 times faster on average and cut its memory use by a factor of 171.3. The best AI attempts barely move the needle.
That gap is the headline. SWE-Pro, built by researchers at the Technical University of Munich's software engineering group, is the first standardized yardstick for a task the field has not had a serious way to measure: making existing code faster and leaner, not just functionally correct. The benchmark, described in an arXiv preprint, is built from 102 optimizations that experienced engineers wrote for real open-source projects, with parameterized tests and noise-aware measurement so the numbers mean what they say.
On runtime, the human solutions won 91.2 percent of the 102 tasks. On peak memory, they won 65.7 percent. The current generation of large language models produced "negligible" runtime gains and "near zero" memory improvements, according to the paper.
Why does the gap look so large? Two reasons stand out. The first is that the task is multi-objective. A good optimization cannot just make the program faster; it has to keep memory bounded, and it has to keep doing so across varied input data and execution conditions. That is a different problem from getting a function to pass a unit test, which is what earlier coding benchmarks like SWE-Bench were designed for. The second reason is that the human baseline the models are measured against is not amateur work. Each baseline fix was hand-written by an engineer who knew the codebase and often rewrote significant portions of the code to win on both axes at once.
Methodology matters here. The TUM team scores each task on three things: runtime, peak memory, and a third measure called Time-Weighted Memory Usage, which tracks how much memory the program uses over the life of a run rather than at a single worst moment. Tests are parameterized, so each solution has to be correct across multiple inputs. Measurement is noise-aware, because speed numbers from a single run can lie. That structure is what makes SWE-Pro uncomfortable for the models: a 15 percent speedup on one input that flunks a different input does not count.
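To make those three ideas concrete, here is a minimal Python sketch. It is not the benchmark's harness: the function names are hypothetical, and the definition of Time-Weighted Memory Usage as the time integral of sampled memory is an assumption based only on the description above. Taking the median of repeated runs is likewise one common way to reduce timing noise, not necessarily the one the authors use.

```python
import statistics
import time
from typing import Callable, Iterable, Sequence


def time_weighted_memory(samples: Sequence[tuple[float, float]]) -> float:
    """Integrate memory over time (byte-seconds) with the trapezoidal rule.

    `samples` holds (timestamp_seconds, memory_bytes) pairs sorted by time.
    Assumption: this integral stands in for Time-Weighted Memory Usage.
    """
    total = 0.0
    for (t0, m0), (t1, m1) in zip(samples, samples[1:]):
        total += (t1 - t0) * (m0 + m1) / 2.0
    return total


def median_runtime(fn: Callable[[], object], repeats: int = 5) -> float:
    """Time `fn` several times and return the median, so one noisy run cannot decide the result."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def correct_on_all(candidate: Callable, reference: Callable, inputs: Iterable) -> bool:
    """A candidate only counts if it matches the reference on every parameterized input."""
    return all(candidate(x) == reference(x) for x in inputs)


# Made-up memory trace: 100 bytes at t=0, a 300-byte spike at t=1, back to 100 at t=3.
trace = [(0.0, 100.0), (1.0, 300.0), (3.0, 100.0)]
peak = max(m for _, m in trace)          # 300.0 bytes at the worst moment
weighted = time_weighted_memory(trace)   # 600.0 byte-seconds over the whole run
```

In this toy trace the peak captures only the spike, while the time-weighted figure also reflects how long the program stayed above its baseline. A "fast" candidate that returns the wrong answer on any one input fails `correct_on_all` and would be discarded before its timing is considered.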
The framing matters too. SWE-Pro arrives while the marketing around AI coding assistants has shifted from "writes the code for you" to "speeds up your code." Most public benchmarks in the field are still tuned for the first claim. They check whether the model can produce a program that passes a test, not whether the program is any good at runtime. A benchmark that asks the second question, with a public dataset, is closer to the test engineering teams need when they decide whether to trust an AI tool on performance-sensitive work.
There are caveats. The paper is an arXiv preprint and has not been peer reviewed at the time of writing. The quantitative gap between experts and models is the authors' framing, and the 15.5x and 171.3x figures are aggregate averages across 102 tasks, not guarantees on any individual one. The benchmark's choice of which optimizations count as "expert" is also a design decision; another author team might pick differently. None of those caveats change the underlying point that the current public measurement, on a yardstick designed for the question, is near zero.
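The point about aggregate averages can be illustrated with a small Python example. The four speedup values are invented for illustration and are not SWE-Pro data, and the use of an arithmetic mean is an assumption, since how the paper aggregates is not stated here.

```python
import statistics

# Hypothetical per-task speedups, invented for illustration (not SWE-Pro data).
speedups = [1.0, 1.1, 1.2, 58.7]

# Assumption: an arithmetic mean; the paper's aggregation method may differ.
mean_speedup = statistics.mean(speedups)
median_speedup = statistics.median(speedups)

print(f"mean: {mean_speedup:.1f}x, median: {median_speedup:.2f}x")
# prints "mean: 15.5x, median: 1.15x"
```

In this made-up set, one large win pulls the mean to 15.5x while the typical task improves by about 15 percent, which is why an average across tasks says little about any single one.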
What to watch next: which model labs submit runs against SWE-Pro, whether vendors start publishing their own numbers against it, and how the benchmark evolves as someone closes part of the gap. The yardstick exists now. The interesting story is the first team that posts a non-negligible score and shows their work.