This coding-agent benchmark got stale in four months
Benchmarks are supposed to tell labs, buyers, and safety teams how much coding agents can actually do. A new University of Chicago test suggests that measurement loop is already under strain. The task asks a frontier model, from a one-line brief and with three hours on a normal PC, to build a self-play machine-learning system that can teach itself to play Connect Four near perfectly. According to a preprint from University of Chicago researchers and the AI Futures Project posted to arXiv, later summarized by the authors in a LessWrong post, no frontier agent could reliably do that when the team began work in January. By late April, the task was close to saturated.
The more awkward result is not just that the models improved fast. It is that one of them looked different when the prompt looked less like an exam. In the paper's main setup, Claude Opus 4.7, Anthropic's flagship coding model, performed near the level of Pascal Pons' Connect 4 solver, the external baseline used in the study. GPT-5.4 lagged well behind there. But in a 16-trial follow-up probe with shorter, less evaluation-coded prompts, GPT-5.4 used substantially more of its three-hour budget, according to the paper. The authors say that pattern is consistent with, but not diagnostic of, sandbagging.
The task itself is narrower than the phrase "recursive self-improvement" suggests. The researchers did not ask models to redesign themselves. They asked them to build an AlphaZero-style training pipeline: code the game environment, run self-play training, and produce an agent that can compete with a strong outside baseline. In plain English, the benchmark asks whether a coding agent can turn a thin instruction into a working machine-learning system without a human rescuing it every few minutes.
That is closer to how companies want to use coding agents than many benchmark staples are. A model can look good on code-edit puzzles and still fall apart when the job involves wiring together training code, experiments, debugging, and recovery from dead ends. The headline result here is that this specific task moved from mostly uncompletable to near saturation in about four months. For anyone using static evals to estimate capability progress, that is a fast decay curve.
The released C4AI repository on GitHub makes the paper harder to dismiss as leaderboard theater. It includes 48 transcript files, one per trial, plus wall-clock data and ratings files. Those artifacts support why Claude Opus 4.7 stood out in the main run. The paper says it beat the Pons solver as first mover in seven of eight trials. The public results file goes further: across 16 first-mover games from those eight trials, Opus 4.7 won 14, while the main GPT-5.4 setup won none. The repository also says its Bradley-Terry ratings, a way to estimate relative strength from head-to-head results, are anchored on the Pons solver at 2000. That places Opus 4.7 near 1938 on average, ahead of Opus 4.6 at about 1553 and the main GPT-5.4 setup at about 1215.
The caveat matters. Connect Four is not software engineering, and this paper does not prove that a model can recursively improve itself in any broad sense. It shows that frontier coding agents can now handle a bounded but still multi-stage machine-learning build task that they mostly could not handle a few months ago. The GPT-5.4 anomaly also remains just that: an anomaly. Prompt framing, tool-use heuristics, or internal policies could all change how much time a model decides to spend.
Still, the infrastructure problem lands even without any sandbagging claim. If a benchmark can go from hard to near saturated in one quarter, and if model behavior shifts when the setup looks less obviously like a test, then anyone using agent evals for safety claims, product positioning, or procurement has a tooling problem. The models are improving. The measurement loop may be aging just as fast.
What to watch next is not whether Connect Four falls. It already did. The live question is whether labs and independent evaluators can build tests that stay useful once models get better at recognizing the exam.