Large language models have text tokens and computer vision has pixels, but embodied AI, the work of teaching robots to grasp, assemble, and manipulate real objects, has never agreed on a common basic unit. On June 24, 2026, Chinese startup RoboScience (机器科学) released Visics, a general embodied model built around what the company calls Object Trajectory (36kr). Object Trajectory is a 3D point-cloud trace of the object being manipulated, not a record of the robot's joint angles, and the company pitches it as the embodied-AI equivalent of a text token.
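For readers who want a concrete picture, an object-centric trace of this kind can be sketched as one point cloud of the object per timestep. The snippet below is an illustrative assumption, not RoboScience's format: the class name `ObjectTrajectory`, its fields, and the `centroid_path` helper are all hypothetical, and nothing in the announcement specifies how Visics actually stores its trajectories.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in meters


@dataclass
class ObjectTrajectory:
    """Hypothetical container: one point cloud of the object per timestep.

    Note what is absent: no joint angles, no gripper state, no robot model.
    """
    timestamps: List[float]    # seconds, one entry per frame
    frames: List[List[Point]]  # frames[t] = points sampled on the object at time t

    def centroid_path(self) -> List[Point]:
        """Reduce each point cloud to its centroid, giving a coarse motion path."""
        path = []
        for cloud in self.frames:
            n = len(cloud)
            path.append((
                sum(p[0] for p in cloud) / n,
                sum(p[1] for p in cloud) / n,
                sum(p[2] for p in cloud) / n,
            ))
        return path


# Toy example: a two-point "object" lifted 0.1 m along z over half a second.
traj = ObjectTrajectory(
    timestamps=[0.0, 0.5],
    frames=[
        [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0)],
        [(0.0, 0.0, 0.1), (0.2, 0.0, 0.1)],
    ],
)
print(traj.centroid_path())  # [(0.1, 0.0, 0.0), (0.1, 0.0, 0.1)]
```

The same record would describe the lift whether a parallel gripper, a dexterous hand, or a human performed it, which is the property the token analogy leans on.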
The bet is bigger than a benchmark result. Robotics has long suffered from representation fragmentation: policies trained on a Franka arm rarely transfer to a dexterous hand, and policies that master rigid peg insertion tend to fail on soft cables. RoboScience's wager is that an object-centric intermediate representation, one that ignores which machine is doing the work and instead tracks how the thing being worked on moves through space, can generalize across embodiments, objects, and tasks. It is, in effect, a format-war pitch, not a product launch.
The proposed spine of the model is an architecture the company labels VLOA, short for Vision-Language-Object-Action, which sits between two independently trained engines (36kr). The first is an embodied world model pre-trained on web video to learn object state, 3D trajectory, contact force, and physical causality. The second is a general manipulation model trained on physics-engine simulation data; it outputs robot control signals that, in principle, let a robot manipulate rigid, articulated, or deformable objects. Visics takes in visual, tactile, and force input and runs closed-loop control on top.
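The division of labor described here can be sketched as two interfaces joined by an object-level plan. This is a rough illustration under stated assumptions: the announcement does not publish an API, so every name below (`Observation`, `WorldModel`, `ManipulationModel`, `control_step`) is hypothetical, and the idea that the world model hands a predicted object trajectory to the manipulation model is an inference from the "Object" in VLOA, not a confirmed detail. The two toy classes are stand-ins, not learned models.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple

Point = Tuple[float, float, float]
Cloud = List[Point]


@dataclass
class Observation:
    """Hypothetical bundle of the three input modalities named in the article."""
    image: bytes
    tactile: List[float]
    force: List[float]
    object_cloud: Cloud  # current 3D points on the manipulated object


class WorldModel(Protocol):
    def predict_object_trajectory(self, obs: Observation, instruction: str) -> List[Cloud]: ...


class ManipulationModel(Protocol):
    def act(self, obs: Observation, target: Cloud) -> List[float]: ...


def control_step(world: WorldModel, manip: ManipulationModel,
                 obs: Observation, instruction: str) -> List[float]:
    """One closed-loop tick: re-plan the object's motion, then act toward the next waypoint."""
    plan = world.predict_object_trajectory(obs, instruction)
    return manip.act(obs, plan[0])


class ToyWorldModel:
    """Stand-in: always predicts the object rising 0.05 m."""
    def predict_object_trajectory(self, obs: Observation, instruction: str) -> List[Cloud]:
        return [[(x, y, z + 0.05) for (x, y, z) in obs.object_cloud]]


class ToyManipulationModel:
    """Stand-in: commands the centroid displacement between current and target cloud."""
    def act(self, obs: Observation, target: Cloud) -> List[float]:
        n = len(target)
        return [
            sum(p[i] for p in target) / n - sum(p[i] for p in obs.object_cloud) / n
            for i in range(3)
        ]


obs = Observation(image=b"", tactile=[], force=[],
                  object_cloud=[(0.0, 0.0, 0.0), (0.2, 0.0, 0.0)])
command = control_step(ToyWorldModel(), ToyManipulationModel(), obs, "lift the part")
print(command)  # [0.0, 0.0, 0.05]
```

In this sketch the only thing passed between the two halves is an object-level target, so either half could in principle be swapped without retraining the other. Whether Visics achieves that separation in practice is exactly what remains unverified.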
Founder and CEO Tian Ye (田野) is named as the architectural lead in the 36kr announcement and in prior coverage. A separate 163.com piece claims the company has an ICRA best-paper track record, but that claim has not been independently verified against conference proceedings and should be treated with caution.
The technical lineage is real, but the framing is not new. Researchers have been pursuing object-centric representations for manipulation for years. A separate, unrelated group at the National University of Singapore recently proposed T(R,O) Grasp, a graph-diffusion approach that learns the spatial relationship between a robot and the object it is grasping as a way to transfer grasping skills across embodiments. The two efforts share a sensibility, an insistence that the object, not the arm, should be the unit of analysis, but they are different teams, different papers, and different problems. Visics is not a continuation of that work.
Three caveats apply to the announcement. First, the company says Visics cuts single-trajectory data costs to between roughly 1/20 and 1/200 of conventional approaches; that figure is a company claim, not an independently reproduced benchmark. Second, the stated goal of achieving mass production in 2026 is a target, not a delivered result. Third, the demoed scenarios (furniture assembly, dexterous grasping, and dynamic assembly lines) are described rather than independently verified. English-language coverage of this specific release is also thin.
What to watch next is straightforward. The format-bet story turns on whether RoboScience publishes, or allows third parties to test, cross-embodiment transfer results that hold up outside the company's own evaluations. If Object Trajectory generalizes the way text tokens did, the architecture question will look obvious in hindsight. If it does not, the company will join a long list of foundation-model announcements whose most enduring contribution was the vocabulary.