CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation
The robots are not ready to program themselves. Not yet.
A new benchmark study from researchers at UT Austin, Stanford, UC Berkeley, and NVIDIA, with Fei-Fei Li, Ken Goldberg, Shankar Sastry, Jiajun Wu, and Linxi Fan among the authors, systematically tested whether code-generating AI agents could autonomously control a robot arm. Fan, a Distinguished Scientist and Director of AI at NVIDIA, leads Project GR00T and the GEAR Lab there. Across 12 frontier language and vision-language models and 187 manipulation tasks, the result was consistent: the models worked when a human engineer had already done the hard part.
The study, published on arXiv March 23 as CaP-X, is the first systematic evaluation of Code-as-Policy agents for robot manipulation. The framework has three parts: an evaluation environment (CaP-Gym), a benchmark (CaP-Bench), and two derived agent approaches (CaP-Agent0 and CaP-RL). The core finding is a kind of scaffolding dependency: every model tested performs well when given human-crafted abstractions, meaning high-level primitives like stack_objs_in_order(), but degrades sharply when those priors are removed and the model must generate low-level control code from scratch. The gap between model-generated code and human expert code remains large.
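To make the distinction concrete, here is a minimal sketch of what a high-level primitive hides. Only the name stack_objs_in_order() comes from the study. The ToyArm class, its methods, and the block size are hypothetical stand-ins invented for illustration, not the CaP-Gym interface.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Vec3 = Tuple[float, float, float]
BLOCK_HEIGHT = 0.05  # metres; assumed block size for this toy world


@dataclass
class ToyArm:
    """Hypothetical low-level robot API (not the real CaP-Gym interface)."""
    positions: Dict[str, Vec3]          # object name -> (x, y, z)
    gripper: Vec3 = (0.0, 0.0, 0.5)
    held: Optional[str] = None

    def move_to(self, xyz: Vec3) -> None:
        # Teleport the gripper; a held object travels with it.
        self.gripper = xyz
        if self.held is not None:
            self.positions[self.held] = xyz

    def close_gripper(self) -> None:
        # Grasp whichever object sits exactly at the gripper position, if any.
        for name, pos in self.positions.items():
            if pos == self.gripper:
                self.held = name
                return

    def open_gripper(self) -> None:
        self.held = None


def stack_objs_in_order(arm: ToyArm, names: List[str]) -> None:
    """High-level primitive: stack names[1:] on top of names[0], in order.

    This is the part a human engineer would normally write in advance.
    """
    base_x, base_y, base_z = arm.positions[names[0]]
    for level, name in enumerate(names[1:], start=1):
        arm.move_to(arm.positions[name])   # reach the object
        arm.close_gripper()                # grasp it
        arm.move_to((base_x, base_y, base_z + level * BLOCK_HEIGHT))
        arm.open_gripper()                 # release on top of the stack


arm = ToyArm(positions={
    "red": (0.0, 0.0, 0.0),
    "green": (0.2, 0.0, 0.0),
    "blue": (0.4, 0.0, 0.0),
})

# With the primitive available, the model's whole program is one call.
stack_objs_in_order(arm, ["red", "green", "blue"])
```

With the primitive removed, the model has to produce the body of that function itself: the reach, grasp, lift, and release sequence, plus the height arithmetic. In a real environment that also means perception, motion planning, and error handling, none of which this toy world models.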
The researchers call this the designer scaffolding problem. The agent in the loop is not the robot — it is the human engineer who decomposed the task in advance. Strip away that scaffolding, and the models struggle.
What closes the gap? Test-time compute. The team found that scaling agentic computation during execution — through multi-turn interaction, structured execution feedback, visual differencing, automatic skill synthesis, and ensembled reasoning — substantially improved robustness even when agents operated over low-level primitives. In other words: give the model more time to think, try again when it fails, and have it write helper functions on the fly, and performance recovers significantly.
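The retry-with-feedback idea can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: run_with_retries, scripted_model, and the success check are hypothetical names, and the "model" is a scripted stub rather than a real language model call.

```python
import traceback
from typing import Callable, List


def run_with_retries(generate: Callable[[str, List[str]], str],
                     check: Callable[[dict], bool],
                     task: str,
                     max_turns: int = 3) -> dict:
    """Multi-turn loop: generate code, execute it, feed failures back."""
    feedback: List[str] = []
    for turn in range(1, max_turns + 1):
        code = generate(task, feedback)
        env: dict = {}
        try:
            exec(code, env)  # run the candidate program in a fresh namespace
        except Exception:
            # Structured execution feedback: hand the error to the next turn.
            feedback.append(traceback.format_exc(limit=1))
            continue
        if check(env):
            return {"success": True, "turns": turn, "feedback": feedback}
        feedback.append("ran without error but the success check failed")
    return {"success": False, "turns": max_turns, "feedback": feedback}


def scripted_model(task: str, feedback: List[str]) -> str:
    # Stand-in for an LLM call: the first draft has a bug, the revision fixes it.
    if not feedback:
        return "height = block_height * 2"  # block_height is undefined here
    return "block_height = 0.05\nheight = block_height * 2"


def stacked_height_ok(env: dict) -> bool:
    # Hypothetical verifier: the program must compute a 0.10 m stack height.
    return abs(env.get("height", 0.0) - 0.10) < 1e-9


result = run_with_retries(scripted_model, stacked_height_ok, "stack two blocks")
print(result["success"], result["turns"])
```

The first attempt raises a NameError, the error text goes back to the generator, and the second attempt passes the check. The other techniques the study lists, such as visual differencing, skill synthesis, and ensembling, would add further signals or candidates to the same loop.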
From this, the researchers derived CaP-Agent0, a training-free framework that recovers near human-level reliability on several manipulation tasks by layering multi-turn scaffolding onto existing code-generating models. They also introduce CaP-RL, in which reinforcement learning with verifiable rewards is applied to the coding agent itself. It improves success rates and transfers from simulation to a real robot with minimal sim-to-real gap.
The framing matters. The robotics field has watched code-generating agents succeed in software — SWE-Bench, where models debug real codebases — and asked whether the same approach could work for physical tasks. The answer from CaP-Bench is qualified: yes, but not without the human engineer still in the loop. The scaffolding has not gone away; it has moved from training time to inference time.
For the deployment question of whether robots can be deployed and reprogrammed by non-experts in the field, the benchmark provides an honest measurement. Current models need either significant scaffolding or significant compute at runtime. Neither is a showstopper, but neither is the autonomous flexibility the field is hoping for. The designer scaffolding problem is real, and it is not solved. It is measured.