The robotics industry is betting billions that scaling vision-language-action models will eventually produce generalist robots. A new position paper argues that bet has hit a wall — and identifies four specific "grounding" interfaces the field has been ignoring.
The paper, titled "Robots Need More Than VLA and World Models" and posted to arXiv on June 4, 2026 by authors from UCL, Stanford, ETH Zurich, TU Darmstadt, IIT, and the startup Motoniq.ai, surveys the dominant research trend of the past several years: train a large model on internet-scale data, give it robot action outputs, and scale up. Systems like RT-1 and RT-2 from Google, OpenVLA from a Stanford- and Berkeley-led collaboration, and π0 from Physical Intelligence have each pushed that playbook further. The authors don't dispute that these systems work; they argue the approach has a structural ceiling.
The Grounding Problem
The bottleneck, according to the paper, isn't model size. It's conversion: turning raw physical experience — how things feel when grasped, how forces propagate through a body, what success looks like in the real world — into supervision signals a robot model can actually use. The authors call these grounding failures, and they argue the field has been systematically underfunding four of them.
The first is data grounding: the problem of labeling unstructured behavior. A robot watching hours of human motion video sees movement, but not which movements are purposeful. Datasets like DROID, BridgeData V2, and RH20T have made progress here, but the paper argues that reliable autolabeling of physical behavior at scale remains unsolved. The second is embodiment grounding: translating human motion to a robot's specific kinematics and force profiles. Efforts like ALOHA, Dobb-E, and the BC-Z dataset take different tacks, but human-to-robot retargeting still requires significant manual intervention or domain-specific tuning.
The third is world-model grounding — giving models not just pattern recognition but physically plausible 3D reasoning. The authors note that current VLAs process pixels and language tokens but don't reliably represent collision geometry, contact forces, or causal physical sequences — something pure Diffusion Policy approaches also struggle with. The fourth is reward grounding — inferring task success from video and language without a hand-engineered reward function. SayCan and PaLM-E represent early attempts, but the paper argues that robust, generalizable success inference from raw observation is still an open problem.
What a Grounded Pipeline Would Look Like
Rather than just cataloging what's missing, the paper sketches an alternative pipeline. Start from broad physical experience — human motion capture, internet video of physical tasks, simulation, tactile sensing — then pass each stream through its corresponding grounding interface to produce robot-usable actions, contact models, and reward signals. The Open X-Embodiment dataset is cited as a step in this direction, though the authors argue it's a starting point, not a solution.
This framing distinguishes the paper from a standard literature review. The authors aren't just summarizing RT-1, OpenVLA, and π0 — they're arguing that the entire research agenda organized around scaling those systems has been asking the wrong question.
A Note on Disclosure
Three of the paper's authors (Karcini, Mehrban, and Nguyen) are co-founders of Motoniq.ai, and researcher Max Schwager also lists a Motoniq affiliation. The paper includes a formal conflict-of-interest disclosure stating these relationships. That's worth noting plainly: the paper's position, that the VLA scaling agenda is insufficient, also happens to align with a commercial interest in alternative approaches to robot learning. The disclosure doesn't disqualify the analysis, but it is part of the context for understanding how the paper's argument is constructed.
What This Means for the Field
The paper is a preprint, not a peer-reviewed result. Its claims about the limits of current VLA systems reflect the authors' interpretation of the field's state, not a consensus view. And it doesn't offer a working system that demonstrates its alternative pipeline at scale — it's a position and a research agenda, not a product or a benchmark.
That said, the institutional weight of the author list is unusual for a position paper. The combination of researchers from UK, Swiss, German, Indian, and US institutions — alongside a robotics startup — suggests the "grounding" critique reflects something circulating in the research community beyond this single paper. Whether it shifts the field's direction, or whether the VLA scaling agenda simply absorbs the critique and continues, is the more interesting question the paper leaves open.