Tsinghua Lab Solves the Latency Problem Every Robot Builder Knows
A robot that can see is useful. A robot that can see ahead is something else entirely.
Researchers at Tsinghua University's Department of Automation have built a system that gives manipulation policies the ability to anticipate. Their paper, published April 2, 2026 on arXiv (cs.RO), describes a framework called Flow-to-Future Asynchronous Policy — F2F-AP — that predicts where moving objects will be when the robot's planned action actually executes, rather than where they are when the policy starts thinking.
It's a deceptively simple idea wrapped in dense math. But the underlying problem is one every robotics engineer running real-world systems has felt: by the time a model finishes inferring what to do, the world has moved on.
The Async Inference Problem
Modern robotic manipulation increasingly relies on asynchronous inference pipelines. Rather than stopping to compute each action, the robot starts executing one action while already computing the next. This keeps arm movements smooth and throughput high.
But there is a cost. Total system latency, the combined delay from sensor acquisition, model inference, and controller execution, can run 300 to 400 milliseconds even on a well-tuned system; the paper's own UR5e setup measures roughly 375ms. During that window, the world keeps moving. The policy, meanwhile, is planning against stale information.
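To put a rough number on "the world keeps moving," the sketch below multiplies an object's speed by the latency window. The 0.2 m/s speed is a hypothetical conveyor figure chosen for illustration, not a value from the paper, and the function name is likewise made up.

```python
def stale_offset_m(speed_m_per_s: float, latency_ms: float) -> float:
    """Distance an object travels while the policy is still working from an old frame."""
    return speed_m_per_s * (latency_ms / 1000.0)


# Hypothetical object speed of 0.2 m/s (an assumption, not a figure from the paper).
for latency_ms in (300, 400):
    offset_cm = stale_offset_m(0.2, latency_ms) * 100
    print(f"{latency_ms} ms latency: object has moved {offset_cm:.1f} cm")
```

At those assumed speeds the object is several centimeters away from where the policy last saw it, which is comparable to the opening of a small gripper.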
Existing approaches have tried to work around this. VLASH, a recent method, conditions the policy on anticipated future robot proprioceptive states — essentially telling the model where the robot's own joints will be at execution time. This fixes part of the problem: the actions are aligned to the right moment. But the visual input is still lagging. The policy sees the world as it was, not as it will be.
F2F-AP goes further. It predicts what the visual scene will actually look like when execution begins — not just where the robot will be, but where the objects being tracked will have moved to.
Predicting Object Flow
The core of the system is a flow predictor that tracks how objects move between frames. Rather than running a heavy video generation model to synthesize future images — too slow for real-time control — the team uses a heatmap-based predictor to estimate the displacement of key object points over the prediction horizon.
The architecture draws on SAM (Segment Anything Model) segmentation to identify stable keypoints on target objects, then predicts where those points will be after H timesteps, where H is calculated from the measured system latency. The predicted flow is rendered back onto the current observation frame, producing a synthetic future view that is temporally aligned with the action execution window.
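The keypoint step can be pictured with a toy example. The sketch below assumes each keypoint gets a small heatmap scoring candidate pixel displacements, with the center cell meaning "no motion"; the peak is read off and added to the current position. The data layout and function names are illustrative assumptions, and the paper's actual predictor and rendering step are certainly more involved.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # scores over candidate displacements; rows are dy, columns are dx


def displacement_from_heatmap(heatmap: Heatmap) -> Tuple[int, int]:
    """Return the (dx, dy) offset at the heatmap peak, measured from the center cell."""
    rows, cols = len(heatmap), len(heatmap[0])
    best_r, best_c = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    return best_c - cols // 2, best_r - rows // 2


def predict_future_keypoints(
    keypoints: List[Tuple[int, int]], heatmaps: List[Heatmap]
) -> List[Tuple[int, int]]:
    """Shift each current keypoint by the displacement its heatmap predicts."""
    future = []
    for (x, y), heatmap in zip(keypoints, heatmaps):
        dx, dy = displacement_from_heatmap(heatmap)
        future.append((x + dx, y + dy))
    return future


# One keypoint at pixel (10, 20); the heatmap peaks one cell to the right of center.
keypoints = [(10, 20)]
heatmaps = [[[0.0, 0.1, 0.0],
             [0.1, 0.2, 0.9],
             [0.0, 0.1, 0.0]]]
print(predict_future_keypoints(keypoints, heatmaps))  # [(11, 20)]
```

Predicting a handful of point displacements is far cheaper than generating a full future image, which is the trade-off the article describes.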
A contrastive learning objective then trains the policy encoder to treat these flow-synthesized observations as equivalent to actual future observations, which are available in recorded training data. The policy learns to trust the prediction, so it does not need ground-truth future frames at inference time, when none exist yet.
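The article does not give the paper's exact loss, so the sketch below uses a standard InfoNCE objective as a stand-in. It pairs each synthesized embedding with the real future embedding at the same index and, following the temporal-masking caveat noted among the limitations, ignores negatives that are too close in time. Embeddings are plain lists, and `min_gap` and `temperature` are hypothetical parameters.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def masked_info_nce(
    synth: List[Sequence[float]],
    real: List[Sequence[float]],
    timestamps: Sequence[int],
    min_gap: int,
    temperature: float = 0.1,
) -> float:
    """Mean InfoNCE loss where synth[i] should match real[i].

    Candidates closer than min_gap timesteps to the anchor are dropped from the
    negatives, so near-identical adjacent frames are not pushed apart.
    """
    losses = []
    for i, anchor in enumerate(synth):
        positive = dot(anchor, real[i]) / temperature
        scores = [positive]
        for j, candidate in enumerate(real):
            if j != i and abs(timestamps[i] - timestamps[j]) >= min_gap:
                scores.append(dot(anchor, candidate) / temperature)
        losses.append(math.log(sum(math.exp(s) for s in scores)) - positive)
    return sum(losses) / len(losses)


# Two well-separated frames: matching pairs give a much lower loss than swapped pairs.
frames = [[1.0, 0.0], [0.0, 1.0]]
print(masked_info_nce(frames, frames, timestamps=[0, 10], min_gap=2))
print(masked_info_nce(frames, frames[::-1], timestamps=[0, 10], min_gap=2))
```

A production version would use batched tensors and a numerically stable log-sum-exp; the loops here are only meant to make the masking rule easy to read.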
Benchmarks on Real Hardware
The team tested F2F-AP on two distinct platforms: a UR5e fixed-base robotic arm and a Unitree Go2 quadruped robot equipped with a Hexfellow Saber arm.
On the fixed-base UR5e arm, F2F-AP achieved full task success on dynamic interception tasks (grasping objects moving on unpredictable trajectories), where baseline methods including VLASH showed significant failure rates. The average execution time for successful trials was also substantially lower than the baselines', because the system no longer wastes the first H action steps, which have gone stale before execution begins.
On the quadruped mobile manipulator, the same approach generalized to a platform with whole-body dynamics and additional locomotion latency. The system explicitly models the combined latency budget across sensing, inference, and low-level control, selecting H accordingly. This explicit latency modeling is the practical contribution that most directly translates to other real-world systems.
Measured latency on the UR5e setup: approximately 125ms observation delay, 200ms inference, and under 50ms controller latency, for a total of roughly 375ms. At 100ms per control step, the paper maps that to H=4 steps.
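The arithmetic behind H can be checked in a few lines. The sketch below sums the three reported components, taking the controller figure at its 50ms upper bound, and rounds up to whole control steps. Rounding up is an assumption about how the horizon is chosen; it reproduces H=4 here, but the article does not state the paper's exact rule.

```python
import math


def horizon_steps(obs_ms: float, infer_ms: float, ctrl_ms: float, step_ms: float) -> int:
    """Number of whole control steps needed to cover the total latency budget."""
    total_ms = obs_ms + infer_ms + ctrl_ms
    return math.ceil(total_ms / step_ms)


# UR5e figures from the article: 125 + 200 + 50 = 375 ms, at 100 ms per step.
print(horizon_steps(125, 200, 50, 100))  # 4
```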
What This Is and Is Not
F2F-AP is a research result on a well-scoped problem. It demonstrates that predicting and synthesizing future visual context improves dynamic manipulation performance under async inference. The results are real and the ablations are thorough.
What the paper does not show: deployment in unstructured environments outside the five experimental tasks, long-horizon reliability, sim-to-real transfer at scale, or integration with commercial manipulation pipelines. The data was collected using UMI (Universal Manipulation Interface) devices in lab conditions. The flow predictor was trained on specific object categories and task structures.
The authors are transparent about limitations. The flow predictor struggles with heavy occlusion — they handle this by training on gripper-free sequences and compositing gripper masks during data synthesis. The contrastive learning framework requires careful temporal masking to avoid treating adjacent frames as negative samples.
This is a well-executed academic paper solving a real sub-problem in real-world robot control. Whether the specific architectural choices generalize to other domains — warehouse picking, surgical assistance, field robotics — remains to be tested.
The Deployment Question
The reason this matters beyond the lab is that the async inference paradigm is everywhere in deployed robotics. Physical robot control at any meaningful speed requires it. Latency is physics, not a software bug.
If the flow-based future-synthesis approach scales, it changes the calculus for any dynamic manipulation task where objects move independently of the robot: collaborative human-robot handover, mobile manipulation in cluttered spaces, any grasping task where the object arrives on a conveyor rather than sitting still.
Tsinghua's Jiwen Lu group has published a credible path forward. The question now is whether the approach holds up outside the specific hardware and task configurations tested — and whether anyone picks it up for real-world deployment.
Primary source: Wei et al., "F2F-AP: Flow-to-Future Asynchronous Policy for Real-time Dynamic Manipulation," arXiv:2604.02408, April 2026.