A Chinese robotics startup says teaching robots to speak won't teach them to touch. It's training on physics instead.

A small Chinese robotics company says the dominant approach to general-purpose robots, grafting motion onto language models, has a structural ceiling. Its alternative: train on physics first.

Guangxiang Technology (光象科技), a Tsinghua University spin-out, is the first to publicly argue that VLA, the vision-language-action paradigm that has come to define embodied AI, borrows more from language than it can deliver to physical work. The argument comes with an industrial demo rather than just a paper: a four-wheeled robot called Phi-Bot X1 that the company says ran 21.5 hours without error on a car welding line at the 2026 ATC exhibition. Whether the demo proves the thesis or just packages it, it lands at a moment when VLA has become the default lane for almost every serious embodied-AI lab.

VLA stacks do their heavy lifting by attaching a motion head to a large vision-language model, the same kind of system that powers general chatbots. The strength is real: a robot inherits semantic generality for free, so "pick up the red mug" works even when the mug is unfamiliar. The company's pitch, attributed to CEO Zhang Tao, is that semantic generality is not physical generality. A VLA model remains a perception-to-action mapper that has to be fine-tuned per task. It does not know about mass, inertia, friction, deformation, or contact, the properties that decide whether a part seats correctly in a fixture or a weld lands where it should. Language scale cannot fix that, because language has nothing to say about it.

The other mainstream alternative Guangxiang is rejecting is the video-prediction world model. These systems learn to forecast the next frame. They are visually fluent and impressive in demos, but predicting pixels is not the same as predicting forces. A model trained to anticipate the appearance of a falling box still has no handle on the contact forces that will decide whether a gripper catches it.
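The gap between appearance and force can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only and is not drawn from Guangxiang's materials: it takes the simpler static case of holding a box rather than catching one, assumes a two-finger parallel gripper that relies on Coulomb friction, and uses two hypothetical boxes whose masses and friction coefficients are invented for the example.

```python
G = 9.81  # gravitational acceleration, m/s^2


def min_grip_force(mass_kg: float, mu: float) -> float:
    """Smallest normal force per finger (in newtons) needed to hold an
    object against gravity with a two-finger parallel gripper.

    Each of the two contact faces can supply at most mu * N of friction,
    so holding requires 2 * mu * N >= m * g.
    """
    if mu <= 0:
        raise ValueError("friction coefficient must be positive")
    return mass_kg * G / (2.0 * mu)


# Two hypothetical boxes with the same outward appearance.
# Mass and friction values are assumptions made up for this example.
boxes = {
    "light_grippy": {"mass_kg": 0.2, "mu": 0.5},
    "heavy_slick": {"mass_kg": 2.0, "mu": 0.2},
}

forces = {name: min_grip_force(**params) for name, params in boxes.items()}
for name, force in forces.items():
    print(f"{name}: at least {force:.2f} N per finger")
```

Under these assumed numbers the second box needs roughly 25 times the grip force of the first, a difference that comes entirely from mass and friction, neither of which is visible in a frame of video.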

Guangxiang's answer is a "physics-native base model," built on three pillars it calls Phi-RL Matrix (a reinforcement-learning algorithm stack), Phi-Space (a high-fidelity, interactive physics data asset), and Phi-Arch (a development platform for training and deployment). The model is meant to learn explicit physical laws like dynamics, contact, constraints, and conservation, alongside implicit ones like stochasticity and long-horizon consequences, by trial and error inside simulated and real environments. Reinforcement learning is treated not as a fine-tuning tool, the company says, but as the growth engine for the model itself.
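For readers unfamiliar with what learning physics by trial and error looks like, here is a deliberately tiny sketch. It is not Phi-RL Matrix or anything the company has published: it is a one-step epsilon-greedy bandit, the simplest reinforcement-learning setting, run against a toy simulator. The simulator, the function names `slide_distance` and `learn_push`, and every number in it are assumptions made for illustration.

```python
import random
from typing import Sequence

G = 9.81  # gravitational acceleration, m/s^2


def slide_distance(impulse: float, mass: float = 1.0, mu: float = 0.3) -> float:
    """Toy simulator: a block on a flat surface receives an impulse (N*s),
    starts moving at v = impulse / mass, and Coulomb friction stops it
    after d = v^2 / (2 * mu * g) metres."""
    v = impulse / mass
    return v * v / (2.0 * mu * G)


def learn_push(target: float, impulses: Sequence[float], episodes: int = 200,
               epsilon: float = 0.1, seed: int = 0) -> float:
    """Epsilon-greedy bandit: try pushes, track the mean reward of each,
    and return the impulse with the best estimate. The learner never sees
    mass or friction, only how far from the target the block stopped."""
    rng = random.Random(seed)
    value = [0.0] * len(impulses)  # estimated reward per action
    count = [0] * len(impulses)    # times each action was tried
    for _ in range(episodes):
        if rng.random() < epsilon:
            a = rng.randrange(len(impulses))                      # explore
        else:
            a = max(range(len(impulses)), key=value.__getitem__)  # exploit
        reward = -abs(slide_distance(impulses[a]) - target)
        count[a] += 1
        value[a] += (reward - value[a]) / count[a]  # incremental mean
    return impulses[max(range(len(impulses)), key=value.__getitem__)]


impulses = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
best = learn_push(target=0.7, impulses=impulses)
print(best)
```

With these assumed values the learner settles on the 2.0 N·s push, which the simulator says slides the block about 0.68 m. The point of the toy is only that friction enters through outcomes rather than labels; whether that mechanism scales from a six-option push to a welding line is exactly what the company's claim rests on.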

The argument comes from a credible team. Co-founder Li Shengbo (李升波) is a tenured professor at Tsinghua's School of Vehicle and Mobility and a reinforcement-learning researcher whose Google Scholar profile lists more than 30,000 citations across roughly 250 papers. CEO Zhang Tao previously led mass-production deployment of spatial perception and positioning technology across millions of vehicle terminals, and his commercialization team includes alumni of Alibaba, Tencent, Huawei, Kuka, and Geek+. Every member of the technical team holds a PhD from Tsinghua, Zhejiang, or a peer institution.

The welding demo is the company's main exhibit. Phi-Bot X1 is a four-steered-wheel omnidirectional platform with a force-controlled dual-arm upper body, designed for narrow factory aisles and continuous operation. According to the company, the 21.5-hour run spanned three days of welding-line loading with zero errors and zero interruptions, and a dual-hole alignment task held millimetre-level accuracy and 0.3° angular precision throughout. These are company-reported figures from a self-run exhibition demo, not independent benchmarks, and this article does not claim they generalize.

What the demo is meant to argue, the founders say, is that a physics-native stack, not a VLA stack, is what allowed the deployment to generalize to a real production task in weeks rather than months. That is the testable claim. The company is not leaning on its financing to support the thesis. Guangxiang has now completed cumulative angel-round funding of several hundred million yuan, building on a March 2026 tranche led by IDG Capital and adding investors including Zhuhai Sci-Tech Industry Group, Xingzheng Capital, Songhe Capital, Shunxi Fund, and listed firm Xingyun Tech. The latest total is described as cumulative across the angel program, not a single oversize check.

The open question is not whether Guangxiang wins. It is whether the mechanism behind the demo, RL-grown physics priors learned in interactive simulation, scales to the next task the way VLA scales to the next instruction. The company says a second industrial deployment is already in the pipeline. Watch whether the second task is closer to welding or further from it.
