CeRLP: A Cross-embodiment Robot Local Planning Framework for Visual Navigation
The problem with robot navigation is that it's never really portable. Train a policy on one robot — specific camera, specific chassis height, specific wheel spacing — and that knowledge doesn't transfer cleanly to the next robot, even one with nearly identical specs. Swap in a different camera, change the mounting height by a few centimeters, and the agent that was gracefully threading a warehouse corridor suddenly misjudges distances and starts scraping walls. Deploying a fleet with mixed hardware has meant either collecting expensive multi-robot training data or accepting that you'll spend weeks fine-tuning each new addition.
A team of eight researchers at Shanghai Jiao Tong University and the Eastern Institute of Technology, Ningbo (EIT Ningbo), a relatively new Chinese research university, argues there's a simpler path. Their framework, which they call CeRLP (Cross-embodiment Robot Local Planning), was posted as a preprint to arXiv on March 20, 2026. The core idea is elegant in a way that feels obvious in retrospect: if navigation policies fail to generalize because they see camera-specific representations, the fix is to stop showing them camera images at all.
Here's how it works. A depth image encodes, at each pixel, how far away something is. Depth estimated from a single camera has a known flaw: it is scale-ambiguous, so it can't tell you whether that shelf is one meter away or three. CeRLP's first contribution is a calibration step, done once per robot configuration, that solves for two parameters using the geometry of the ground plane. Those parameters convert relative depth estimates into real metric distances. The step is robot-specific, but you only run it once.
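To make the calibration idea concrete, here is a minimal sketch of one way a two-parameter fit against the ground plane could work. It is not the authors' implementation. It assumes the two parameters are a scale and a shift (metric = scale * relative + shift), a pinhole camera whose optical axis is parallel to flat ground, and a known mounting height. The function names, intrinsics, and sample values are all hypothetical.

```python
def expected_ground_depth(row, cy, fy, cam_height):
    """Metric forward depth of flat ground seen at image row `row`.

    Assumes a level pinhole camera: a ray through a row below the
    principal point drops (row - cy) / fy meters per meter of depth,
    so it meets the floor at depth cam_height * fy / (row - cy).
    """
    dv = row - cy
    if dv <= 0:
        raise ValueError("row must lie below the principal point to see the ground")
    return cam_height * fy / dv


def fit_scale_shift(relative, metric):
    """Least-squares fit of metric ~= scale * relative + shift."""
    n = len(relative)
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("need ground samples at more than one relative depth")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


# Hypothetical setup: camera 0.5 m above the floor, fy = 500 px, cy = 240 px.
ground_rows = [300, 340, 400, 460]
metric = [expected_ground_depth(v, 240.0, 500.0, 0.5) for v in ground_rows]

# Synthetic "relative" depths standing in for a monocular estimator's output,
# generated with a known scale of 2.0 and shift of 0.2 so the fit can be checked.
relative = [(m - 0.2) / 2.0 for m in metric]

scale, shift = fit_scale_shift(relative, metric)
```

Because the floor's true distance at each ground pixel follows from the mounting geometry alone, no external range sensor is needed for the fit, which is what makes a one-time, per-robot calibration plausible.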
From there, the depth image — now in actual meters — gets projected into a virtual laser scan, adapted to the height of the camera mounting. The key detail is that this scan looks the same regardless of which camera produced the depth image. A wheeled robot running a fisheye lens and a quadruped with a standard RGB-D sensor, both facing the same corridor, would produce the same virtual scan. The navigation policy never sees the camera. It sees geometry.
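The projection described above can be sketched roughly as follows. This is an illustrative approximation, not the paper's algorithm. It assumes a level pinhole camera, treats the depth value as forward distance along the optical axis, and keeps only points inside a height band above the floor. The function name, the height band, and the beam layout are hypothetical choices.

```python
import math


def depth_to_virtual_scan(depth, fx, fy, cx, cy, cam_height,
                          num_beams=9, fov=math.radians(90),
                          min_h=0.05, max_h=1.0, max_range=10.0):
    """Collapse a metric depth image into a 1-D scan of nearest obstacle ranges.

    depth: list of rows, each a list of forward depths in meters.
    Points below min_h (floor) or above max_h (overhead) are ignored.
    """
    scan = [max_range] * num_beams
    half = fov / 2.0
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if not math.isfinite(z) or z <= 0.0:
                continue  # skip invalid depth readings
            x = (u - cx) / fx * z      # meters to the right of the camera
            y = (v - cy) / fy * z      # meters below the camera
            height = cam_height - y    # height of the point above the floor
            if height < min_h or height > max_h:
                continue
            bearing = math.atan2(x, z)
            if abs(bearing) >= half:
                continue
            beam = min(int((bearing + half) / fov * num_beams), num_beams - 1)
            rng = math.hypot(x, z)     # planar range, as a 2-D laser would report
            if rng < scan[beam]:
                scan[beam] = rng
    return scan


# Hypothetical 3x3 depth image of a flat wall 2 m ahead.
wall = [[2.0, 2.0, 2.0] for _ in range(3)]
scan = depth_to_virtual_scan(wall, fx=2.0, fy=2.0, cx=1.0, cy=1.0,
                             cam_height=0.5, num_beams=3)
```

The camera intrinsics and mounting height are consumed inside the projection, so what comes out is just a list of ranges by bearing. That is the sense in which a downstream policy would see geometry rather than a particular camera.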
The third piece is inherited from prior IEEE Robotics and Automation Letters work from the same lab, a LiDAR-based planner called DRL-DCLP that treated robot body dimensions as explicit inputs to the policy. Wheeled robot, quadruped, wide chassis, narrow chassis — the policy takes the virtual scan plus the dimensions [front length, rear length, width] and navigates accordingly. Different robot means different numbers fed as inputs, not a different model. The policy itself is trained in simulation only and applied to the real world without fine-tuning.
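As a rough illustration of "different numbers, not a different model," the sketch below builds a policy input from a scan plus the three body dimensions. The type and function names are hypothetical, and the real policy's input layout and any additional inputs are not described in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RobotDims:
    """Body dimensions in meters, in the order the article lists them."""
    front_length: float
    rear_length: float
    width: float


def build_observation(scan, dims):
    """Concatenate the virtual scan with the robot's dimensions (assumed layout)."""
    return list(scan) + [dims.front_length, dims.rear_length, dims.width]


# Two hypothetical robots looking at the same scene.
scan = [2.2, 2.0, 2.2]
wheeled = RobotDims(front_length=0.30, rear_length=0.20, width=0.40)
quadruped = RobotDims(front_length=0.45, rear_length=0.35, width=0.30)

obs_wheeled = build_observation(scan, wheeled)
obs_quadruped = build_observation(scan, quadruped)
```

Both observations have the same length and the same scan prefix, so a single trained network can consume either one. Only the last three values change between robots.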
That combination of offline calibration, virtual scan abstraction, and dimension-configurable RL is what the authors say enables zero-shot transfer across robot types. In real-world experiments, they tested point-to-point navigation and vision-language navigation tasks across multiple robot platforms. In simulation, the paper claims, CeRLP outperforms baseline methods on success rate and obstacle-avoidance metrics.
The team also addressed something that doesn't get enough attention in robot navigation research: the gap between high-level language commands and physical obstacle awareness. Most vision-language navigation systems, as the paper notes, treat robots as point masses and ignore collision risk at the local level. CeRLP is designed to plug in beneath any VLN planner as the physical safety layer — the part that actually keeps the robot from walking into things while the language model is thinking about where it wants to go.
Context matters for evaluating the "first" framing in the paper. The corresponding author, Wei Zhang, is an assistant professor at EIT Ningbo who received his PhD from the National University of Singapore in 2021, and he and his collaborators are working in territory that has gotten crowded fast. The General Navigation Models (GNM) project out of Stanford and UC Berkeley took the opposite philosophical approach: curate 60-plus hours of navigation data across six robots and train a general policy directly on the diversity. GNM works, but it can't reliably account for unrepresented robot footprints in tight spaces, since the policy has no explicit geometry model.
More directly comparable: a team at Alibaba's AMAP computer vision lab published CE-Nav in October 2025, five months before CeRLP, which also targets cross-embodiment local navigation without large-scale multi-robot data. CE-Nav uses a two-stage approach combining imitation learning and reinforcement learning with normalizing flow, and has released its code publicly. CE-Nav tests on quadrupeds, bipeds, and quadrotors. The claim that CeRLP is the "first visual-based local planning framework" for cross-embodiment deployment without large datasets deserves some scrutiny given that overlap, even if the two approaches are architecturally distinct.
What CeRLP doesn't have yet: a public code repository, deployment partnerships, or third-party validation. The calibration step, while lightweight, still requires per-robot setup — the framework isn't zero-configuration so much as one-time-configuration. And the experiments, while real-world, were conducted under controlled conditions; how the system handles warehouse-scale clutter, dynamic obstacles, and variable lighting remains an open question.
The deployment pressure is real regardless. A January 2026 analysis by Logistics Viewpoints noted that autonomous mobile robots in warehouses are not plug-and-play — orchestration and per-robot tuning remain the operational differentiator. A navigation approach that genuinely reduces that tuning burden per robot addition has industrial value that doesn't require waiting for humanoids to arrive.
Three research groups now have published approaches to the same problem in roughly 18 months. That's not a coincidence — it's a signal about where the real friction in robot deployment sits.