Human Motion Training Breaks Robot Execution
A new framework called PhyGile tackles one of humanoid robotics' sneakiest problems: motions that look perfect on a human and faceplant on a machine.

The robots keep falling. Not literally — not always — but figuratively, in the gap between what a motion model generates and what a physical robot can actually execute. A new framework from researchers at ShanghaiTech called PhyGile proposes a solution that cuts this problem off at the source rather than patching it on the way out.
The core problem is subtle but genuinely hard. Most text-to-motion generation models are trained on motion capture data from humans — rich, expressive datasets that encode how a person throws a punch, cartwheels, does a backflip. The trouble is that all that data is calibrated to human biomechanics: our mass distribution, our joint torque limits, our particular relationship with gravity and contact. When you feed that learned motion into a retargeting pipeline to transfer it onto a humanoid robot, you get trajectories that are kinematically plausible — the joints are within range, the pose looks right on a screen — but physically broken. Wrong torques. Wrong center-of-mass dynamics. Contact timing tuned to feet with a different mass and inertia. The robot that watched a human do a backflip and then attempted one is the cautionary reel every demo team knows exists.
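A toy example makes the mismatch concrete. The single-joint pendulum model and every number below are illustrative, not from the paper: a retargeted trajectory can sit comfortably inside the robot's joint limits while the torque required to track it at human speed blows past a hypothetical actuator limit.

```python
# Illustrative only (not PhyGile code): a motion can be kinematically valid
# yet physically infeasible. We model one pendulum-like joint with assumed
# robot parameters and check a human-speed swing retargeted onto it 1:1.
import math

JOINT_RANGE = (-2.6, 2.6)   # rad, assumed robot joint limits
TORQUE_LIMIT = 25.0         # N*m, assumed actuator limit
LINK_INERTIA = 0.9          # kg*m^2 about the joint (robot link, heavier than human)
LINK_MASS, COM_DIST, G = 4.0, 0.25, 9.81

def required_torque(theta, theta_ddot):
    """Inverse dynamics for a 1-DoF link: tau = I*a + m*g*l*sin(theta)."""
    return LINK_INERTIA * theta_ddot + LINK_MASS * G * COM_DIST * math.sin(theta)

# Human mocap swing copied directly: same angles, same (fast) acceleration.
trajectory = [(0.3 * t, 18.0) for t in range(5)]  # (theta rad, accel rad/s^2)

kinematically_ok = all(JOINT_RANGE[0] <= th <= JOINT_RANGE[1] for th, _ in trajectory)
physically_ok = all(abs(required_torque(th, a)) <= TORQUE_LIMIT for th, a in trajectory)

print(kinematically_ok, physically_ok)  # True False
```

Every pose in the swing is within joint range, but the final frame demands about 25.3 N·m against a 25 N·m limit — exactly the kind of violation a screen-side kinematic check never surfaces.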
Jiacheng Bao, the lead author, has been working up to this. His 2024 CVPR paper OMG tackled open-vocabulary human motion generation using a mixture-of-controllers approach — 58 citations in two years. PhyGile is the natural next chapter: take the generative capability and make it actually work on hardware. Bao is part of ShanghaiTech's EO-Robotics group, the same team that published the EO-1 embodied foundation model last year alongside co-authors Dong Wang and Bin Zhao. This is not a one-off paper — it is a lab building a stack.
The PhyGile framework does two things in sequence. First, it trains a General Motion Tracking controller using a curriculum-based mixture-of-experts scheme, extending the GMT architecture published by researchers at UC San Diego last June. The curriculum matters: previous approaches collapsed on simpler motions when pushed to handle harder ones. PhyGile's GMT controller is then post-trained on unlabeled motion data to build robustness at scale.
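The curriculum idea can be sketched in a few lines. Everything here is an assumed simplification rather than the paper's method — the schedule, the difficulty scores, and the hard routing rule stand in for what would be learned components: the training pool starts with easy clips and widens, so experts specialized on low-dynamic motions are not overwritten when agile clips arrive.

```python
# Hedged sketch of a difficulty curriculum over a motion dataset with a
# mixture-of-experts gate. All schedules and scores are assumptions.
def curriculum_pool(motions, step, total_steps):
    """Sampling pool for this training step: easiest fraction grows to 1.0."""
    motions = sorted(motions, key=lambda m: m["difficulty"])
    frac = min(1.0, 0.25 + 0.75 * step / total_steps)  # assumed linear schedule
    return motions[: max(1, int(frac * len(motions)))]

def gate(motion, num_experts=4):
    """Toy hard gate: route by difficulty bucket (learned in practice)."""
    return min(num_experts - 1, int(motion["difficulty"] * num_experts))

motions = [{"name": n, "difficulty": d} for n, d in
           [("walk", 0.1), ("run", 0.35), ("kick", 0.6), ("backflip", 0.95)]]

pool_early = curriculum_pool(motions, step=0, total_steps=1000)
pool_late = curriculum_pool(motions, step=1000, total_steps=1000)
print([m["name"] for m in pool_early])  # ['walk']
print([m["name"] for m in pool_late])   # all four motions
print(gate(motions[-1]))                # backflip routes to the last expert: 3
```

The point of the widening pool is the collapse problem the article mentions: if training jumped straight to backflips, the gradient pressure on shared parameters would degrade the experts that handle walking.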
Then comes the distinctive move: physics-prefix adaptation. At inference time, PhyGile generates motions directly in a 262-dimensional robot skeletal space — the robot's own body representation — rather than generating human-format motion and retargeting it afterward. Physics-derived prefixes act as steering signals, constraining the output to motions that satisfy the robot's actual actuator limits and mass distribution. The retargeting step is eliminated, not improved. The artifacts that live in the gap between human-space generation and robot-space execution simply do not arise.
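The mechanics of prefix conditioning can be sketched as follows. Only the 262-dimensional robot motion space comes from the paper; the embedding width, the linear projections, and the stand-in generator are assumptions chosen to show the shape of the idea — physics parameters become prefix tokens prepended to the text context, and the model emits frames in robot space with no retargeting stage.

```python
# Hedged sketch of inference-time physics-prefix conditioning. Shapes and
# projections are assumed; the 262-dim robot space is from the paper.
import numpy as np

ROBOT_DIM = 262   # robot skeletal motion representation (from the paper)
D_MODEL = 64      # assumed embedding width
rng = np.random.default_rng(0)

def embed_physics_prefix(torque_limits, link_masses):
    """Map physics parameters to a prefix token via an (assumed) projection."""
    feats = np.concatenate([torque_limits, link_masses])
    W = rng.standard_normal((feats.size, D_MODEL)) / np.sqrt(feats.size)
    return (feats @ W)[None, :]          # (1, D_MODEL): one prefix token

def generate_motion(text_tokens, prefix, horizon=16):
    """Stand-in generator: conditions on [prefix; text] and emits frames
    directly in the 262-dim robot space -- no retargeting step exists."""
    context = np.concatenate([prefix, text_tokens], axis=0)
    W_out = rng.standard_normal((D_MODEL, ROBOT_DIM)) / np.sqrt(D_MODEL)
    return np.tanh(context.mean(axis=0) @ W_out)[None, :].repeat(horizon, axis=0)

text = rng.standard_normal((5, D_MODEL))   # a text prompt, already embedded
prefix = embed_physics_prefix(np.full(23, 80.0), np.full(23, 2.5))  # assumed 23 joints
motion = generate_motion(text, prefix)
print(motion.shape)  # (16, 262)
```

The architectural consequence is what the article describes: because the output lives in the robot's own representation and the prefix carries the robot's physical parameters, the human-to-robot conversion step where artifacts accumulate never exists in the pipeline.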
The project page and the preprint both claim real-robot validation alongside simulation results. The paper demonstrates stable tracking of agile whole-body motions that prior methods could not handle — territory beyond walking and low-dynamic movement, which is where most current humanoid demo reels actually live.
The timing is meaningful. PhysMoDPO, a concurrent paper from a separate team, dropped the same week taking a different angle: using Direct Preference Optimization to fine-tune diffusion-generated motions toward physical plausibility before they enter the retargeting pipeline. PhysMoDPO keeps retargeting but makes the input better; PhyGile removes retargeting entirely. Both papers appearing in the same week is the tell — retargeting is the active unsolved problem in humanoid motion right now, not a solved baseline everyone moved past.
The broader competitive landscape has been attacking physics-grounded humanoid motion from multiple directions. ASAP, published at RSS 2025 by Tairan He and collaborators at Carnegie Mellon University, NVIDIA, and UT Austin, uses a delta action model trained after reinforcement learning pretraining to bridge the sim-to-real gap at deployment rather than at generation time. HOVER, presented at ICRA 2025 by He et al., established multi-mode policy distillation as a whole-body control baseline that subsequent work — including PhysMoDPO — benchmarks against. The research question everyone is circling: at what layer do you enforce physics? After generation? During training? At inference time? PhyGile's bet is inference-time physics prefixes, which allows the generative model to stay expressive while the controller enforces what the robot can actually do.
What does this mean for teams building humanoid systems? The retargeting problem is not just an academic annoyance — it is a real constraint on what motion libraries are usable on hardware. Every training setup that leans on human mocap data inherits the physics mismatch unless something corrects for it. A framework that generates robot-native motions from natural language without a retargeting stage could meaningfully expand the range of behaviors available to text-controlled humanoid robots, and shrink the gap between what researchers show in simulation and what ships.
The caveat, as always, is that real-robot claims in a preprint need verification. The physics prefixes work when the physics model they are derived from is correct; model errors compound in dynamic maneuvers. The controlled conditions of an academic lab demo are not a warehouse floor. PhyGile presents lab results, not field results. But the problem it is solving is real, the author track record is credible, and the competitive timing suggests the field is converging on this layer as the next thing to fix.

