Tsinghua Lab Solves the Latency Problem Every Robot Builder Knows
The smarter your robot's policy, the staler its view of the world by the time it acts. Tsinghua's new framework finally breaks that tradeoff.

Tsinghua University researchers developed F2F-AP (Flow-to-Future Asynchronous Policy), a framework that predicts where objects will be when a robot's action executes rather than where they are when planning begins. The system uses SAM-based keypoint segmentation and heatmap flow prediction to generate temporally aligned synthetic future observations, training the policy encoder via contrastive learning to trust these predictions. This addresses the 300-400ms latency gap that causes real-world robots to act on stale visual information.
A robot that can see is useful. A robot that can see ahead is something else entirely.
Researchers at Tsinghua University's Department of Automation have built a system that gives manipulation policies the ability to anticipate. Their paper, published April 2, 2026 on arXiv (cs.RO), describes a framework called Flow-to-Future Asynchronous Policy — F2F-AP — that predicts where moving objects will be when the robot's planned action actually executes, rather than where they are when the policy starts thinking.
It's a deceptively simple idea wrapped in dense math. But the underlying problem is one every robotics engineer running real-world systems has felt: by the time a model finishes inferring what to do, the world has moved on.
Modern robotic manipulation increasingly relies on asynchronous inference pipelines. Rather than stopping to compute each action, the robot starts executing one action while already computing the next. This keeps arm movements smooth and throughput high.
But there is a cost. Total system latency — the combined delay from sensor acquisition, model inference, and controller execution — typically runs 300 to 400 milliseconds on a well-tuned system. During that window, the world keeps moving. The policy, meanwhile, is planning against stale information.
Existing approaches have tried to work around this. VLASH, a recent method, conditions the policy on anticipated future robot proprioceptive states — essentially telling the model where the robot's own joints will be at execution time. This fixes part of the problem: the actions are aligned to the right moment. But the visual input is still lagging. The policy sees the world as it was, not as it will be.
F2F-AP goes further. It predicts what the visual scene will actually look like when execution begins — not just where the robot will be, but where the objects being tracked will have moved to.
The core of the system is a flow predictor that tracks how objects move between frames. Rather than running a heavy video generation model to synthesize future images — too slow for real-time control — the team uses a heatmap-based predictor to estimate the displacement of key object points over the prediction horizon.
The architecture draws on SAM (Segment Anything Model) segmentation to identify stable keypoints on target objects, then predicts where those points will be after H timesteps, where H is calculated from the measured system latency. The predicted flow is rendered back onto the current observation frame, producing a synthetic future view that is temporally aligned with the action execution window.
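The paper's predictor is a learned, heatmap-based model; a minimal way to convey the idea is a constant-velocity extrapolation of tracked keypoints, sketched below. The function name and the straight-line motion assumption are illustrative, not the paper's method.

```python
import numpy as np

def extrapolate_keypoints(history: np.ndarray, H: int) -> np.ndarray:
    """Predict keypoint pixel positions H timesteps ahead.

    history: (T, K, 2) array of K keypoint coordinates over the last
    T frames. This uses a constant-velocity fit; F2F-AP's learned
    heatmap predictor replaces this with per-point flow estimates.
    """
    velocity = history[-1] - history[-2]   # (K, 2) last-frame displacement
    return history[-1] + H * velocity      # positions after H more steps

# Two tracked keypoints drifting +5 px/frame along x over 3 frames:
hist = np.stack([np.array([[0.0, 0.0], [10.0, 20.0]]) + t * np.array([5.0, 0.0])
                 for t in range(3)])
future = extrapolate_keypoints(hist, H=4)
print(future)  # [[30.  0.] [40. 20.]]
```

The predicted positions are then rendered back onto the current frame to produce the synthetic future view the policy actually consumes.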
A contrastive learning objective then trains the policy encoder to treat these flow-synthesized observations as equivalent to actual future observations. The policy learns to trust the prediction without needing ground truth at training time.
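One common way to implement such an alignment objective is an InfoNCE-style loss, where each flow-synthesized embedding's positive is the embedding of the matching real future frame. The sketch below is a generic version of that idea under our own assumptions, not the paper's exact loss.

```python
import numpy as np

def info_nce(synth: np.ndarray, real: np.ndarray, tau: float = 0.1) -> float:
    """InfoNCE-style loss aligning flow-synthesized observation
    embeddings with embeddings of actual future frames.

    synth, real: (B, D) L2-normalized embedding batches. Row i of
    `synth` is the positive for row i of `real`; all other rows in
    the batch serve as negatives.
    """
    logits = synth @ real.T / tau                  # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # cross-entropy on the diagonal
```

When the synthetic and real embeddings match row-for-row, the loss approaches zero; mismatched pairings drive it up, pushing the encoder to treat the two views as interchangeable.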
The team tested F2F-AP on two distinct platforms: a UR5e fixed-base robotic arm and a Unitree Go2 quadruped robot equipped with a Hexfellow Saber arm.
On the UR5e, F2F-AP achieved full success rates on dynamic interception tasks (grasping objects moving along unpredictable trajectories) where baselines, including VLASH, failed at significant rates. Average execution time on successful trials was also substantially lower, because the system no longer wastes the first H action steps, which go stale before execution begins.
On the quadruped mobile manipulator, the same approach generalized to a platform with whole-body dynamics and additional locomotion latency. The system explicitly models the combined latency budget across sensing, inference, and low-level control, selecting H accordingly. This explicit latency modeling is the practical contribution that most directly translates to other real-world systems.
Measured latency on the UR5e setup: approximately 125ms observation delay, 200ms inference, and under 50ms controller latency, yielding H=4 steps at 100ms per step.
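The arithmetic behind that horizon choice is straightforward; a small sketch (function name ours) reproduces the reported numbers:

```python
import math

def prediction_horizon(obs_delay_ms: float, inference_ms: float,
                       control_ms: float, step_ms: float) -> int:
    """Number of action steps that elapse before a newly planned
    action actually executes, given the end-to-end latency budget."""
    total_latency_ms = obs_delay_ms + inference_ms + control_ms
    return math.ceil(total_latency_ms / step_ms)

# Figures reported for the UR5e setup: ~125 ms observation delay,
# ~200 ms inference, <50 ms controller latency, 100 ms per step.
print(prediction_horizon(125, 200, 50, 100))  # 4
```

Any prediction shorter than this horizon still leaves the policy planning against a stale scene; any longer and the flow predictor has to extrapolate further than necessary.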
F2F-AP is a research result on a well-scoped problem. It demonstrates that predicting and synthesizing future visual context improves dynamic manipulation performance under async inference. The results are real and the ablations are thorough.
What the paper does not show: deployment in unstructured environments outside the five experimental tasks, long-horizon reliability, sim-to-real transfer at scale, or integration with commercial manipulation pipelines. The data was collected using UMI (Universal Manipulation Interface) devices in lab conditions. The flow predictor was trained on specific object categories and task structures.
The authors are transparent about limitations. The flow predictor struggles with heavy occlusion — they handle this by training on gripper-free sequences and compositing gripper masks during data synthesis. The contrastive learning framework requires careful temporal masking to avoid treating adjacent frames as negative samples.
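That temporal masking amounts to excluding near-duplicate frames from the negative set, since frames a few timesteps apart look nearly identical and would punish the encoder for correct behavior. A hypothetical sketch under that assumption:

```python
import numpy as np

def negative_mask(frame_ids: np.ndarray, window: int) -> np.ndarray:
    """Boolean (B, B) mask: True where pair (i, j) may serve as a
    contrastive negative. Pairs of frames closer than `window`
    timesteps are excluded, so the loss never pushes apart
    embeddings of visually near-identical observations."""
    diff = np.abs(frame_ids[:, None] - frame_ids[None, :])
    return diff >= window
```

In a batched loss, this mask would simply zero out (or set to negative infinity) the logits of excluded pairs before the softmax.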
This is a well-executed academic paper solving a real sub-problem in real-world robot control. Whether the specific architectural choices generalize to other domains — warehouse picking, surgical assistance, field robotics — remains to be tested.
The reason this matters beyond the lab is that the async inference paradigm is everywhere in deployed robotics. Physical robot control at any meaningful speed requires it. Latency is physics, not a software bug.
If the flow-based future-synthesis approach scales, it changes the calculus for any dynamic manipulation task where objects move independently of the robot: collaborative human-robot handover, mobile manipulation in cluttered spaces, any grasping task where the object arrives on a conveyor rather than sitting still.
Tsinghua's Jiwen Lu group has published a credible path forward. The question now is whether the approach holds up outside the specific hardware and task configurations tested — and whether anyone picks it up for real-world deployment.
Primary source: Wei et al., "F2F-AP: Flow-to-Future Asynchronous Policy for Real-time Dynamic Manipulation," arXiv:2604.02408, April 2026.