New Robot Model Lifts Cross-Task Transfer to 31.2% From Zero
The Robot That Cannot Generalize
There is a number buried in the Agentic-VLA paper that should make every robotics journalist reconsider the headlines they have been writing: zero. Zero percent cross-task transfer without task-specific demonstrations. Not nearly zero. Not asymptotically zero. Zero.
Agentic-VLA, a paper from researchers Ruofan Jin and Zaixi Zhang posted to arXiv on May 21, pushes that number to 31.2 percent. On the LIBERO benchmark, a standard test for robotic manipulation in simulated home and warehouse environments, the model achieves 12.3 percent better performance on long-horizon tasks, 28.5 percent better in one-shot learning scenarios, and converges 2.4 times faster than existing online adaptation methods, according to the paper. These are real numbers from a real preprint, and they represent genuine progress on a genuinely hard problem.
They also represent, depending on your timeline, either a breakthrough or a punchline.
Thirty-one percent means the robot fails almost seven times out of ten when it encounters a task it was not specifically trained to do. For manufacturers running high-mix low-volume operations, the stakes are especially sharp: even a 98 percent success rate fails at scale, as an analysis of VLA requirements for physical AI notes. For a field that has spent three years describing itself as the foundation of general-purpose robotics — the technology that will let machines fold laundry, restock shelves, and assist in surgeries across any environment — that failure rate is not a footnote. It is the headline.
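The arithmetic behind those two figures is easy to check. What follows is a minimal sketch in Python, assuming every attempt is independent. The 10,000-attempt volume is a hypothetical figure chosen for illustration; it does not come from the paper or from the cited analysis.

```python
def failure_rate(success_rate: float) -> float:
    """Fraction of attempts that fail, given a success rate between 0 and 1."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    return 1.0 - success_rate


def expected_failures(success_rate: float, attempts: int) -> float:
    """Expected number of failed attempts, assuming independent attempts."""
    return failure_rate(success_rate) * attempts


if __name__ == "__main__":
    # 31.2 percent cross-task transfer, as reported in the paper.
    print(f"failures per 10 attempts at 31.2%: {expected_failures(0.312, 10):.2f}")
    # 98 percent success; the 10,000-attempt volume is a hypothetical figure.
    print(f"failures per 10,000 attempts at 98%: {expected_failures(0.98, 10_000):.0f}")
```

At 31.2 percent the expected count is just under seven failures in ten attempts. At 98 percent the per-attempt failure rate looks small, but it still adds up to hundreds of failures once volumes reach the tens of thousands.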
The baseline nobody admits
VLA stands for vision-language-action model. The idea is elegant: take a large neural network already trained on images and text, add a layer that outputs robot motor commands, and you get a machine that can follow natural-language instructions without task-specific programming. VLAs have been called promising, transformative, and the next frontier in robotics. What nobody in the field has been saying out loud is that the cross-task version of that promise does not currently work. At all.
The architecture picture is fragmented. Other VLAs in the literature use different internal designs, according to a robotics center comparison guide: OpenVLA relies on discrete tokens while pi0 uses flow matching, and both differ fundamentally from the Agentic-VLA method. That diversity of approaches is itself a sign the field has not converged on a standard design.
The Agentic-VLA paper names this plainly: current VLA training methods suffer from poor generalization to novel environments and low training efficiency requiring extensive demonstrations. What that formulation omits, and what the zero-percent baseline makes unavoidable, is that the generalization failure is not a calibration problem or a data quality issue. It is categorical. Without task-specific fine-tuning, existing VLAs do not transfer across tasks. Not somewhat. Not poorly. Not at all.
The 31.2 percent number Agentic-VLA reports is therefore not a modest improvement on a working system. It is the first crack in a wall that was, until now, completely immovable.
Three things the robot learns to do differently
Agentic-VLA introduces three mechanisms to get there. Adaptive Reward Synthesis generates training signals automatically from language feedback rather than requiring hand-labeled success metrics: the robot grades its own work using natural-language reasoning about whether it succeeded. Language-Guided Exploration lets a language model suggest exploration strategies when the robot hits uncertainty, replacing random motion with directed hypothesis-testing. Experience Memory maintains a rolling buffer of past attempts that the robot can reference when adapting to a new task, approximating the cross-task intuition a human operator builds over years.
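To make the Experience Memory idea concrete, here is a minimal sketch of a rolling buffer in Python. It is not the authors' implementation, which has not been released. The class name, the fields, and the word-overlap lookup are all hypothetical stand-ins for whatever the paper actually uses.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass
class Attempt:
    task: str      # natural-language task description
    summary: str   # short note on what the robot tried
    reward: float  # score assigned to the attempt


class ExperienceMemory:
    """Hypothetical rolling buffer of past attempts (not the paper's code)."""

    def __init__(self, capacity: int = 100) -> None:
        # A deque with maxlen silently drops the oldest entry when full.
        self._buffer: deque[Attempt] = deque(maxlen=capacity)

    def add(self, attempt: Attempt) -> None:
        self._buffer.append(attempt)

    def __len__(self) -> int:
        return len(self._buffer)

    def recall(self, task: str, k: int = 3) -> list[Attempt]:
        """Return up to k stored attempts whose task text overlaps most."""
        query = set(task.lower().split())

        def overlap(attempt: Attempt) -> float:
            # Jaccard similarity over words: an assumed, deliberately crude
            # stand-in for a learned similarity measure.
            words = set(attempt.task.lower().split())
            union = query | words
            return len(query & words) / len(union) if union else 0.0

        return sorted(self._buffer, key=overlap, reverse=True)[:k]
```

The fixed capacity is what makes the buffer "rolling": new attempts push out the oldest ones, so the robot always consults its most recent experience. A real system would almost certainly swap the word-overlap score for something learned.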
The optimization backbone is GRPO (Group Relative Policy Optimization), a reinforcement learning technique borrowed from DeepSeekMath, where it was originally developed for mathematical reasoning, the paper explains. Think of it this way: instead of telling the robot the right answer, you let it compare its own attempts against each other and figure out what worked, even without a hand-labeled success signal. The authors apply that same logic to physical task completion. It is an unusual cross-domain port that nobody appears to have tried before in this direction.
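The "compare attempts against each other" step can be sketched in a few lines of Python. This shows only the group-relative scoring at the heart of GRPO as described in DeepSeekMath: each attempt's reward is normalized against the mean and standard deviation of its own group. The function name, the small epsilon guard, and the use of population rather than sample standard deviation are assumptions made for illustration. How Agentic-VLA wires this into robot training is not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Score each attempt relative to the other attempts in its group."""
    if not rewards:
        raise ValueError("need at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (an assumption)
    # eps avoids division by zero when every attempt earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical attempts at one task: two succeeded, two failed.
    for reward, adv in zip([1.0, 0.0, 0.0, 1.0],
                           group_relative_advantages([1.0, 0.0, 0.0, 1.0])):
        print(f"reward={reward:.1f} advantage={adv:+.3f}")
```

Attempts that beat the group average get a positive score and are reinforced; those below it get a negative score. When every attempt in a group earns the same reward, all scores are zero and the group contributes no learning signal, which is one reason the quality of the reward source matters so much.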
Whether that matters depends on what happens next.
The 31.2 percent question
LIBERO is a simulated benchmark. It measures performance in controlled environments on well-specified tasks. Real-world generalization (a robot encountering an unfamiliar kitchen, an unstructured warehouse aisle, or a task that shares semantic structure with training but differs in physical execution) may look very different from what LIBERO captures. Moritz Reuss, a researcher who has examined the LIBERO benchmark directly, notes in his analysis that most models reporting results on it simply train on the full dataset and do not actually perform the continual-learning task the benchmark was designed to measure. If the benchmark is compromised, the improvement numbers are harder to trust.
The paper is also three days old as of this writing. It has not been peer reviewed. The authors have not released code, though the arXiv submission includes a link to what appears to be a GitHub repository. Nobody has replicated the results.
These are not reasons to ignore the paper. They are reasons to read it carefully before writing a press release about the end of task-specific robot training.
What is true is that the problem Agentic-VLA is solving is real, and the problem is worse than the field's public framing has admitted. The VLA paradigm has been presented as a promising path toward general-purpose robots. The underlying data says the path has so far gone nowhere. Agentic-VLA is the first published work that credibly begins to move forward.
Whether it moves far enough to matter: that is the question the next twelve months of robotics research will answer.
The paper is at arXiv.org/abs/2605.22896.