The Hard Part of AI Is No Longer Seeing the Future. It’s Deciding What to Do Next.
World models have become alarmingly good at predicting what comes next. A robot arm nudges a coffee cup. A car approaches an intersection. A robotic gripper reaches for a drawer handle. Give a world model enough video of how the physical world works, and it will tell you what happens next with increasing accuracy. That problem, it turns out, is largely solved.
The harder problem is what happens after that.
GRASP, a new planner from Berkeley Artificial Intelligence Research (BAIR) and Meta described on the BAIR blog this week, targets exactly that gap between seeing futures and deciding what to do about them. The paper, first posted to arXiv in January, describes a gradient-based planning algorithm that holds up with learned world models at horizons where prior methods fall apart. Yann LeCun is a co-author.
GRASP stands for Gradient RelAxed Stochastic Planner. The name is also the point: it grabs hold of a world model’s learned physics and plans across long sequences of actions without the optimization blowing up.
The core problem is this. A world model takes a current state and a proposed action and predicts the next state. Roll that forward fifty steps and you have a simulated trajectory. To plan, you adjust the action sequence by gradient descent, backpropagating through the rollout, to minimize the distance between the final predicted state and your goal. This is gradient-based planning, and it works reasonably well at short horizons. At long ones, it breaks down in two ways.
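A minimal sketch of that loop, under loudly stated assumptions: the "world model" here is a hypothetical one-dimensional linear map, s_next = A*s + B*a, with arbitrary coefficients A = 0.9 and B = 0.5, standing in for a learned network, and the gradients are written out by hand instead of coming from an autodiff library. The names `world_model`, `rollout` and `plan` are illustrative, not taken from the GRASP code.

```python
# Toy gradient-based ("shooting") planner. Assumption: the world model is a
# hypothetical scalar linear map standing in for a learned network.
A, B = 0.9, 0.5  # hypothetical dynamics coefficients


def world_model(s, a):
    """Predict the next state from the current state and action."""
    return A * s + B * a


def rollout(s0, actions):
    """Chain the world model over the action sequence; returns [s0, ..., sT]."""
    states = [s0]
    for a in actions:
        states.append(world_model(states[-1], a))
    return states


def plan(s0, goal, horizon, steps=200, lr=0.1):
    """Gradient descent on the actions to minimize (sT - goal)**2."""
    actions = [0.0] * horizon
    for _ in range(steps):
        states = rollout(s0, actions)
        grad_s = 2.0 * (states[-1] - goal)  # dLoss/ds_T
        grads = [0.0] * horizon
        # Backpropagate through every model application, last step first.
        for t in reversed(range(horizon)):
            grads[t] = grad_s * B  # ds_{t+1}/da_t = B
            grad_s *= A            # ds_{t+1}/ds_t = A
        actions = [a - lr * g for a, g in zip(actions, grads)]
    return actions


actions = plan(s0=0.0, goal=3.0, horizon=10)
print(rollout(0.0, actions)[-1])  # close to the goal of 3.0
```

The inner backward loop is the part to watch: the gradient that reaches action t has been multiplied by A once for every later step in the rollout.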
First, backpropagating the error through fifty successive applications of a learned model multiplies fifty Jacobians together, so the gradients tend to explode or vanish exponentially. The optimization signal that tells action zero whether it contributed to the final error gets garbled by the time it arrives. Second, long-horizon tasks are often non-greedy: the optimal trajectory moves away from the goal temporarily before reaching it. A planner that only knows how to go straight toward the goal will fail tasks that require going around an obstacle first.
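The first failure mode can be seen with the same kind of hypothetical scalar model, s_next = A*s + B*a. By the chain rule, the sensitivity of the final state to the first action is B times one factor of A per remaining step, so over 50 steps it is B * A**49: roughly 9e-6 for A = 0.8 and roughly 3.8e3 for A = 1.2, with B = 0.5. In a real learned model each factor is a Jacobian matrix rather than a scalar, but the compounding works the same way. The function name is illustrative.

```python
# Sensitivity of the final state to the first action, d s_T / d a_0, for the
# hypothetical toy model s_next = A*s + B*a. The chain rule contributes one
# factor of A for every step after the first.
def first_action_sensitivity(A, B, horizon):
    grad = B  # d s_1 / d a_0
    for _ in range(horizon - 1):
        grad *= A  # multiply by d s_{t+1} / d s_t
    return grad


for A in (0.8, 1.0, 1.2):
    print(f"A={A}: d s_50 / d a_0 = {first_action_sensitivity(A, 0.5, 50):.3g}")
```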
The authors’ fix has three ingredients. The first is lifting: treating the intermediate states in the trajectory as optimization variables rather than quantities derived by rolling the model forward, so each step of the world model is evaluated independently instead of being chained into one deep computation graph. This lets the steps be computed in parallel and sidesteps the exploding-gradient problem. The second is noise: adding stochastic perturbations to the state variables during optimization helps the planner escape local minima that trap greedy solutions. The third is the key innovation, and it comes from an adversarial-robustness argument.
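A minimal sketch of the first two ingredients, again on a hypothetical scalar model s_next = A*s + B*a. One assumption matters here: once the intermediate states are free variables, something must keep them consistent with the world model, and this sketch does that with a squared penalty on each step’s prediction error, s_{t+1} - f(s_t, a_t). Each penalty term touches only one (state, action, next state) triple, so the terms could be evaluated in parallel; the Gaussian noise on the states is annealed to zero halfway through. This is a generic lifted, multiple-shooting-style formulation with illustrative names and settings, not GRASP’s exact objective or schedule.

```python
import random

A, B = 0.9, 0.5  # hypothetical toy world-model coefficients


def world_model(s, a):
    return A * s + B * a


def rollout(s0, actions):
    states = [s0]
    for a in actions:
        states.append(world_model(states[-1], a))
    return states


def lifted_plan(s0, goal, horizon, iters=2000, lr=0.1, noise0=0.3, seed=0):
    """Optimize states s_1..s_T and actions jointly, with annealed state noise.

    Assumed loss: sum_t (s_{t+1} - f(s_t, a_t))**2 + (s_T - goal)**2.
    """
    rng = random.Random(seed)
    states = [s0] * (horizon + 1)  # states[0] stays fixed at s0
    actions = [0.0] * horizon
    for k in range(iters):
        g_s = [0.0] * (horizon + 1)
        g_a = [0.0] * horizon
        # Each residual depends only on (s_t, a_t, s_{t+1}): no deep chain.
        for t in range(horizon):
            r = states[t + 1] - world_model(states[t], actions[t])
            g_s[t + 1] += 2 * r
            g_s[t] -= 2 * r * A  # through the model's state input
            g_a[t] -= 2 * r * B  # through the model's action input
        g_s[horizon] += 2 * (states[horizon] - goal)
        for t in range(1, horizon + 1):
            states[t] -= lr * g_s[t]
        for t in range(horizon):
            actions[t] -= lr * g_a[t]
        # Stochastic perturbation of the state variables, annealed to zero.
        sigma = noise0 * max(0.0, 1.0 - k / (iters / 2))
        for t in range(1, horizon + 1):
            states[t] += rng.gauss(0.0, sigma)
    return states, actions


states, actions = lifted_plan(s0=0.0, goal=3.0, horizon=10)
print(rollout(0.0, actions)[-1])  # close to 3.0 once the penalties reach zero
```

Rolling the optimized actions back through the model is a useful check: it only lands on the goal if the per-step penalties have actually been driven to zero.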
Deep learning models behave well on the data distribution they were trained on but can be exploited along directions that leave that distribution. Tiny nudges to a state vector, in directions that never occur in real experience, can push the model’s output almost anywhere the optimizer wants. The authors call this the sensitivity of state-input gradients. Their fix is to stop gradients from flowing into the state input of the world model entirely, while keeping the gradient signal through the action input, which is lower-dimensional and more densely covered by training data. The world model’s gradients then steer the actions, never the states.
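In an autodiff framework this is typically a one-line change, for example wrapping the state argument in `jax.lax.stop_gradient` or calling PyTorch’s `.detach()` on it before it enters the model. The sketch below shows the idea with hand-written gradients on a hypothetical scalar model and an assumed penalty-based lifted objective: the update simply omits the term that would flow back through the world model’s state input, so the model’s prediction is adjusted only through its action input, while each state variable still receives the direct gradient from its own residual and from the goal term. A one-dimensional linear toy has no off-distribution directions to exploit, so this shows the mechanics only; it is one reading of the description, not the authors’ code.

```python
import random

A, B = 0.9, 0.5  # hypothetical toy world-model coefficients


def world_model(s, a):
    return A * s + B * a


def rollout(s0, actions):
    states = [s0]
    for a in actions:
        states.append(world_model(states[-1], a))
    return states


def stop_grad_plan(s0, goal, horizon, iters=2000, lr=0.1, noise0=0.3, seed=0):
    """Lifted planning with no gradient into the world model's state input."""
    rng = random.Random(seed)
    states = [s0] * (horizon + 1)  # states[0] stays fixed at s0
    actions = [0.0] * horizon
    for k in range(iters):
        g_s = [0.0] * (horizon + 1)
        g_a = [0.0] * horizon
        for t in range(horizon):
            r = states[t + 1] - world_model(states[t], actions[t])
            g_s[t + 1] += 2 * r  # direct dependence on the next-state variable
            g_a[t] -= 2 * r * B  # gradient through the model's action input
            # Stop-gradient: no "g_s[t] -= 2 * r * A" term for the state input.
        g_s[horizon] += 2 * (states[horizon] - goal)
        for t in range(1, horizon + 1):
            states[t] -= lr * g_s[t]
        for t in range(horizon):
            actions[t] -= lr * g_a[t]
        # Annealed Gaussian noise on the state variables, as before.
        sigma = noise0 * max(0.0, 1.0 - k / (iters / 2))
        for t in range(1, horizon + 1):
            states[t] += rng.gauss(0.0, sigma)
    return states, actions


states, actions = stop_grad_plan(s0=0.0, goal=3.0, horizon=10)
print(rollout(0.0, actions)[-1])  # also converges near 3.0 on this toy
```

Because the update is no longer the exact gradient of the penalty objective, convergence is not automatic in general; on this toy it still settles at a trajectory that satisfies the dynamics and reaches the goal.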
On the Push-T benchmark, which requires a simulated robot to push a T-shaped block into a target position and orientation, GRASP substantially outperforms the cross-entropy method and vanilla gradient descent at horizons between 40 and 80 steps. At horizon 80, GRASP succeeds 10.4 percent of the time, with a median time of 58.9 seconds. The cross-entropy method succeeds 2.8 percent of the time in 132.2 seconds; vanilla gradient descent, 6.4 percent of the time in 161.3 seconds. At the longest horizon, then, GRASP’s edge is less about raw accuracy, since every method fails on the large majority of attempts, than about speed and reliability: it finishes in under half the time of either baseline.
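Turning the reported horizon-80 numbers into differences and ratios makes the trade-off concrete. The snippet only rearranges the figures quoted above and assumes the three times are the same statistic (the article specifies a median for GRASP). By this arithmetic, GRASP’s success rate is 7.6 points above the cross-entropy method’s and 4.0 points above vanilla gradient descent’s, and it runs roughly 2.2 and 2.7 times faster, respectively.

```python
# Horizon-80 Push-T figures quoted in the article: (success %, seconds).
RESULTS = {
    "GRASP": (10.4, 58.9),
    "CEM": (2.8, 132.2),
    "GD": (6.4, 161.3),
}


def compare(results, reference="GRASP"):
    """Return {baseline: (success-point gain, speedup)} relative to `reference`."""
    ref_success, ref_time = results[reference]
    return {
        name: (ref_success - success, seconds / ref_time)
        for name, (success, seconds) in results.items()
        if name != reference
    }


for name, (gain, speedup) in compare(RESULTS).items():
    print(f"GRASP vs {name}: {gain:+.1f} points, {speedup:.2f}x faster")
```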
This is a narrow benchmark. GRASP has not been tested on manipulation tasks requiring contact-rich dexterity, on locomotion, or on any real robot hardware. The paper was on arXiv for three months without triggering a wave of adoption or independent validation. Code is public on GitHub, which not every academic paper can claim, but it has not yet been incorporated into any major robotics framework.
What it does demonstrate, however, is a genuine fix for a specific known failure mode in the world-model-for-robotics stack. The robotics industry has spent the past two years placing large bets on learned simulators as a way around the data bottleneck in robot training. Rather than collecting millions of hours of physical robot demonstrations, a robot could train inside a world model that learned physics from video. The problem, as Bessemer Venture Partners noted in a recent landscape analysis, is that world models can predict well but struggle to generate reliable plans at the long horizons that real tasks require.
That is the gap GRASP is trying to close. World models learned to see. The question now is whether anyone can build the planner that makes acting on that vision reliable.