The Robot Coding Benchmark That Puts VLAs on Notice
When Will Knight gave his OpenClaw agent a robot arm last week, the WIRED headline called it a breakthrough. The arm itself cost $100. The motors inside cost $15 each. (WIRED)
That gap, between the announcement and the artifact, is where robot coding actually lives right now. But a new benchmark from NVIDIA, Berkeley, Stanford, and Carnegie Mellon suggests the gap between what coding agents can do and what the industry has been betting on may be even larger. (CaP-X Project Page)
The benchmark is called CaP-X. Published in March 2026, it is the most systematic attempt so far to measure whether language models can write robot control code that actually works. The results should make executives at Physical Intelligence, Figure, and 1X uncomfortable. (arXiv preprint, cs.RO)
On perturbed manipulation tasks, where objects are displaced or instructions altered from training conditions, state-of-the-art vision-language-action models scored near zero. OpenVLA and π0 both returned 0% on the LIBERO-PRO suite. The best VLA tested, π0.5, reached 13%. A training-free coding agent called CaP-Agent0, which generates executable code rather than end-to-end learned policies, scored 18% on the same test. (arXiv preprint, cs.RO)
Eighteen percent is not a victory. But it is five points ahead of the best VLA tested and well clear of the two that scored zero, all of them products of an approach the industry has spent hundreds of millions building.
The benchmark also quantified the frontier gap. Gemini 3 Pro, the strongest model tested, achieved 32.3% average success on robot coding tasks. Human performance on the same tasks sits at 88.5%. The best coding agents still fail more often than they succeed. (CaP-X Project Page)
The one result that changes the math: when researchers applied reinforcement learning directly to the coding agent itself (not to a learned policy, but to the code the agent wrote), a 7-billion-parameter model jumped from 20% to 72% success in simulation. The learned behaviors transferred to a real Franka Emika robot with a minimal sim-to-real gap, reaching 84% on cube lifting and 76% on stacking. That is approaching human-expert performance on narrow tasks, achieved by optimizing code generation rather than policy networks. (arXiv preprint, cs.RO)
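For readers who want to check the arithmetic, the short Python sketch below recomputes the gaps from the figures quoted in this article. It introduces no new data; the variable names are illustrative only, and the percentages are copied from the text above rather than from the paper's tables.

```python
# Success rates (percent) exactly as quoted in this article.
perturbed = {"OpenVLA": 0.0, "pi0": 0.0, "pi0.5": 13.0, "CaP-Agent0": 18.0}

# Best vision-language-action score on the perturbed suite.
best_vla = max(score for name, score in perturbed.items() if name != "CaP-Agent0")

# Lead of the coding agent over the best VLA, in percentage points.
agent_lead_pts = perturbed["CaP-Agent0"] - best_vla

# Frontier gap: human performance minus the strongest model tested.
human, gemini_3_pro = 88.5, 32.3
frontier_gap_pts = round(human - gemini_3_pro, 1)

# Reinforcement learning on the 7B coding agent, in simulation.
rl_before, rl_after = 20.0, 72.0
rl_gain_pts = rl_after - rl_before

print(f"Coding agent lead over best VLA: {agent_lead_pts:.1f} points")
print(f"Human vs. strongest model:       {frontier_gap_pts:.1f} points")
print(f"RL gain on the 7B model:         {rl_gain_pts:.1f} points")
```

Percentage points are used throughout because two of the VLA baselines are at zero, which makes ratio comparisons against them undefined.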
The WIRED demo illustrated the current state precisely. Knight used Codex and OpenClaw to configure a LeRobot SO-101 arm, calibrate its joints, and write a Python script for gripping a ball. Hallucinations introduced bugs, particularly around hardware communication. Human judgment remained in the loop throughout. The arm, an open-source HuggingFace project called SO-101 built around $15 Feetech motors with asymmetric gear ratios, worked, slowly, with help. (WIRED; GitHub / LeRobot; LeRobot SO-101 Docs)
Ken Goldberg, the Berkeley roboticist whose group co-developed CaP-X, put the trade-off plainly: AI-powered coding has the potential to bridge conventional engineering, which is reliable but does not generalize, and vision-language-action models, which generalize but are not yet reliable. Neither extreme is sufficient for real deployment. The question is which path closes the gap first. (WIRED)
Code-as-Policy is not new. The foundational paper dates to September 2022 (arXiv). What is new is the scale of the evaluation and the specificity of the results: VLAs collapse under perturbation, coding agents hold up somewhat better, and scaling reinforcement learning on code generation closes the gap substantially even on small models.
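To make the idea concrete, here is a minimal sketch of what "code as policy" means in practice: the model emits a short program that calls robot primitives, and that program is then executed. Everything in it is an assumption made for illustration. `ToyWorld`, `detect`, `move_to`, `grasp`, `release`, and the hard-coded `GENERATED_PROGRAM` string are hypothetical stand-ins, not the CaP-X or LeRobot API, and the program text is written by hand here rather than produced by a model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyWorld:
    """Hypothetical stand-in for a simulator; positions are (x, y, z) tuples."""
    objects: dict
    gripper: tuple = (0.0, 0.0, 0.3)
    held: Optional[str] = None

    def detect(self, name):
        # Perception primitive: return the object's current position.
        return self.objects[name]

    def move_to(self, xyz):
        # Motion primitive: teleport the gripper; a held object moves with it.
        self.gripper = tuple(xyz)
        if self.held is not None:
            self.objects[self.held] = self.gripper

    def grasp(self):
        # Pick up whichever object sits exactly at the gripper position.
        for name, pos in self.objects.items():
            if pos == self.gripper:
                self.held = name
                return True
        return False

    def release(self):
        self.held = None


# In a real system this text would come from a language model (assumption).
GENERATED_PROGRAM = """
cube = detect("cube")
plate = detect("plate")
move_to(cube)
grasp()
move_to((plate[0], plate[1], plate[2] + 0.05))
release()
"""


def run_policy(program, world):
    # Expose only the robot primitives to the generated code.
    # exec() on model output is unsafe outside a sandbox; this is a sketch only.
    api = {"detect": world.detect, "move_to": world.move_to,
           "grasp": world.grasp, "release": world.release}
    exec(program, {"__builtins__": {}}, api)
    return world


nominal = run_policy(GENERATED_PROGRAM, ToyWorld(
    {"cube": (0.1, 0.2, 0.0), "plate": (0.5, 0.5, 0.0)}))
displaced = run_policy(GENERATED_PROGRAM, ToyWorld(
    {"cube": (0.3, -0.1, 0.0), "plate": (0.5, 0.5, 0.0)}))
print(nominal.objects["cube"], displaced.objects["cube"])
```

In this toy, the same program succeeds after the cube is displaced because it looks up positions at run time instead of memorizing them. That is one plausible intuition for why generated code might tolerate perturbation, offered here as an illustration rather than as a finding of the benchmark.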
The conflict of interest is worth stating plainly. The CaP-X results come from institutions — NVIDIA, Berkeley, Stanford, CMU — that benefit if the industry shifts toward coding stacks and away from end-to-end learned policies. The 0% VLA scores on perturbed tasks are real, but they are a single benchmark on a specific kind of perturbation. Real-world deployment data for code-as-policy beyond a Franka lab in Berkeley remains thin.
Physical Intelligence, which has raised close to $700 million on the VLA thesis, has not publicly responded to the CaP-X results. Neither have Figure or 1X. That silence is itself informative. A clean rebuttal would be easy; the absence of one suggests either quiet concern or the absence of a good answer.
The WIRED headline promised a breakthrough. The CaP-X benchmark suggests the breakthrough, if it exists, will be measured differently — in perturbation robustness and code-generated policies rather than in demos of arms waving at red balls.