Ten months ago, in August 2025, Anthropic asked employees with no robotics expertise to drive an off-the-shelf robotic quadruped around a warehouse and complete a list of tasks, with Claude as their copilot. The Claude-enabled team finished faster and did more than an internet-only control team, but the model itself could not even connect to the robot unaided. On 2026-06-18, Anthropic's Frontier Red Team published Project Fetch: Phase Two, repeating the experiment with a newer Claude model, Opus 4.7. This time, by the authors' scoring, the model ran the full task suite solo and completed the work in roughly one-twentieth the time of the fastest human team in the August 2025 run.
That headline number is Anthropic's own. The same paper names what the model still cannot do. Claude Opus 4.7 completed the overall task suite but could not reliably execute the precise "fetch" task at the heart of the experiment: picking up a beach ball and carrying it to a marked spot. The experiment also stops well short of low-level motor control, actuation policies, or generalization across different robot bodies. The result is real, and so is the asterisk.
What changed in ten months is which layer in the embodied-AI stack is the binding constraint. In August 2025, with Opus 4.1, the binding constraint was the language model's ability to interface with the robot at all; the model got stuck on the preliminary step of establishing a connection. By June 2026, with Opus 4.7, the model can sequence the higher-level plan, dispatch subtasks, and recover from missteps inside a curated task suite. The robots are the same off-the-shelf quadrupeds. The integrator expertise, sensor stack (cameras and laser-based depth sensors called lidar), and warehouse environment are the same. The only variable is the model. The Frontier Red Team frames this as part of a "help humans, then help models, then have the models do it themselves" pattern that they say is already visible in cybersecurity and coding agents and has now crossed into physical robotics.
That arc is doing more work than the experiment itself. Embodied robotics deployment has historically been gated on hardware maturation, integrator expertise, and certification timelines measured in years. The ten-month gap between Phase One and Phase Two shows the cadence at which the model layer is iterating. Peer review, regulatory approval, insurance underwriting, and the principal-of-record question (who is on the hook when a robot in a warehouse harms a person or damages property) are still running on multi-year clocks. The stack dependency has flipped. Physical-robot deployment is no longer gated on the slow layers. It is now gated on language-model iteration speed, and that clock is running at months, not years.
The original August 2025 run included a moment the authors report with some relief: a runaway quadruped nearly rammed one of the human teams. That detail matters because it is the kind of failure mode the experiment does not really test. The next binding constraint to watch is the layer beneath language-model planning: the low-level control policies that translate a sequenced plan into safe, recoverable motion across unfamiliar bodies and environments. Project Fetch does not reach that layer. Phase Three, when it comes, will be the test of whether the arc keeps going down the stack, or whether the next ten months expose a new gap between what a frontier model can plan and what an off-the-shelf robot can be trusted to do.