A warehouse floor is a demanding environment. Human workers can identify a clear plastic tote or a brushed-metal bracket without breaking stride. Robot arms, for the most part, still cannot — not because their manipulators are weak, but because their eyes lie to them.
Depth sensors are the industry standard for bin-picking robots. LiDAR, structured light, stereo cameras: all of them recover distance from light returning off a surface, whether by timing a reflected pulse, reading the distortion of a projected pattern, or matching features between two views. That works on matte cardboard and rough plastic. It fails on anything that is smooth, clear, or shiny. Light does not come back from a glass container the way it comes back from a cardboard box. It passes through, bends, or glances off in a single direction. The sensor reports a hole where there is a surface, or a surface where there is air. The robot reaches for something that is not there.
HEAPGrasp, a system developed by researchers at Tokyo University of Science's Department of Mechanical and Aerospace Engineering, takes a different approach. Rather than trying to make depth sensors work on objects that defeat them, the system uses only a standard RGB camera — the kind found on any industrial robot arm — and infers three-dimensional shape from multiple viewpoints. The research was published in IEEE Robotics and Automation Letters in January 2026.
The method has three stages. First, a convolutional neural network, DeepLabv3+ with a ResNet-50 backbone, performs semantic segmentation on each image, separating each object in the scene so the system knows what it is looking at, not just where something is. Second, Shape from Silhouette reconstruction builds a 3D model from the object outlines seen at multiple camera positions: each outline bounds the space the object can occupy, and intersecting those bounds across views yields an estimate of its volume. Third, an active perception planner chooses where to move the camera next, selecting the viewpoints expected to remove the most uncertainty from the reconstruction. The system is not circling the bin for its own sake; it is hunting for the views that will most improve its shape estimate.
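To make the second stage concrete, here is a minimal sketch of silhouette carving, not the authors' implementation. It assumes a small voxel grid, three axis-aligned orthographic views, and a synthetic sphere standing in for a segmented object; a real system would use calibrated perspective cameras and the masks produced by the segmentation network. All function names are hypothetical.

```python
from itertools import product

N = 16  # voxels per side of the cubic workspace (assumed size)

def make_sphere(n, radius):
    """Ground-truth occupancy: voxels inside a sphere centred in the grid."""
    c = (n - 1) / 2
    return {
        (x, y, z)
        for x, y, z in product(range(n), repeat=3)
        if (x - c) ** 2 + (y - c) ** 2 + (z - c) ** 2 <= radius ** 2
    }

def project(voxel, axis):
    """Orthographic projection: drop the coordinate along the viewing axis."""
    return tuple(v for i, v in enumerate(voxel) if i != axis)

def silhouette(occupied, axis):
    """Binary mask, stored as a set of pixels, seen when looking along `axis`."""
    return {project(v, axis) for v in occupied}

def carve(n, silhouettes):
    """Keep only voxels whose projection falls inside every silhouette."""
    return {
        v
        for v in product(range(n), repeat=3)
        if all(project(v, axis) in mask for axis, mask in silhouettes.items())
    }

truth = make_sphere(N, radius=5)
masks = {axis: silhouette(truth, axis) for axis in (0, 1, 2)}

one_view = carve(N, {0: masks[0]})  # a single view leaves an extruded outline
hull = carve(N, masks)              # three views intersect into a tighter hull

print(len(truth), len(hull), len(one_view))
```

The carved result always contains the true object and shrinks as views are added, which is why the choice of the next viewpoint matters: a view that cuts away a lot of leftover volume is worth more than one that repeats what earlier outlines already showed.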
The approach was evaluated across 20 scenes with 5 objects each, spanning transparent-only, opaque-only, specular-only, and mixed configurations. HEAPGrasp achieved a 96% grasp success rate on transparent, specular, and opaque objects. The active perception planner also reduced the hand-eye camera trajectory length by 52% compared to a baseline that simply circles the scene for full coverage, and reduced execution time by 19% against the same baseline.
The 52% figure is not incidental. The baseline circling method gathers redundant information at many viewpoints while missing the specific angles that resolve shape ambiguity. HEAPGrasp's planner treats the camera path as part of the inference problem: it chooses viewpoints that maximize information gain about object shape, which means less total movement and faster execution.
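The idea of picking views by information gain can be sketched with a toy model. This is an illustration under loud assumptions, not the planner described in the paper: the scene is reduced to six named cells with an occupancy probability each, every candidate view is assumed to fully resolve the cells it sees, and the visibility table is invented for the example.

```python
import math

def entropy(p):
    """Binary entropy, in bits, of an occupancy probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def expected_gain(view, belief, visibility):
    """Uncertainty this view would remove (toy model: seen cells become known)."""
    return sum(entropy(belief[cell]) for cell in visibility[view])

def plan_greedy(belief, visibility, threshold=1e-9):
    """Repeatedly visit the view with the largest expected gain."""
    belief = dict(belief)  # work on a copy
    path = []
    while True:
        best = max(visibility, key=lambda v: expected_gain(v, belief, visibility))
        if expected_gain(best, belief, visibility) <= threshold:
            return path
        path.append(best)
        for cell in visibility[best]:
            # Placeholder for a real observation: snap the cell to a known state.
            belief[cell] = 1.0 if belief[cell] >= 0.5 else 0.0

# Hypothetical scene: probability that each cell is occupied.
belief = {"a": 0.5, "b": 0.5, "c": 0.9, "d": 0.5, "e": 1.0, "f": 0.0}

# Hypothetical ring of six camera poses and the cells each one can see.
visibility = {
    "v0": {"a", "e"},
    "v1": {"a", "b"},
    "v2": {"e", "f"},
    "v3": {"c", "d"},
    "v4": {"d", "f"},
    "v5": {"b", "c"},
}

path = plan_greedy(belief, visibility)
print(path)  # ['v1', 'v3']
```

In this toy scene a full circle would stop at all six poses, while the greedy plan resolves every uncertain cell in two. The numbers are not comparable to the paper's results; the point is only that ranking views by expected uncertainty reduction skips poses that add nothing new.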
The paper positions HEAPGrasp directly against existing approaches. ClearGrasp, published in 2020, uses RGB-D data and known surface geometry to reconstruct depth on transparent objects, but it depends on the depth channel being recoverable, which breaks down when specular reflection corrupts the measurement. GraspNeRF, from ICRA 2023, and ASGrasp, presented at ICRA 2024, both achieve strong results (ASGrasp exceeds 90% success on transparent objects) but require depth information: GraspNeRF uses multi-view depth fusion, and ASGrasp requires an RGB-D active stereo camera. HEAPGrasp's choice to abandon depth entirely sidesteps the failure mode that limits all of these approaches on specular surfaces.
The paper frames HEAPGrasp as something fundamentally different: rather than measuring depth, the system infers it, reconstructing a three-dimensional model from two-dimensional silhouettes by tracking object edges across viewpoints. The approach draws on Shape from Silhouette methods used in computer graphics and photogrammetry, adapted here for real-time grasp planning on a robot arm.
The authors are Ginga Kennis, a recent M.S. graduate from Tokyo University of Science, and Associate Professor Shogo Arai, whose prior work includes visual servoing and bin-picking systems. HEAPGrasp will be presented at ICRA 2026 in Vienna.
The paper does not claim to have solved warehouse-scale mixed-SKU picking. The experiments were conducted in a lab environment with a single robot arm — controlled lighting, structured scenes, objects placed for unambiguous evaluation. Scaling to the variability of a real distribution center — unpredictable lighting, cluttered bins, objects in partial occlusion, speed requirements — is a separate challenge the authors acknowledge remains open. Lab validation and production deployment are different territories, and the gap between them is where most robotics research quietly dies.
What the paper does claim is that the sensing modality itself — RGB-only, with active viewpoint planning — is a viable and more robust alternative to depth-based sensing for this specific problem. The 96% success rate across object categories, the reduction in camera movement, and the explicit comparison against depth-dependent approaches all point in the same direction: for the objects that break depth sensors, RGB with smart motion planning may be the answer.
The question is whether it scales. Real warehouses are not 20-scene evaluations. They are chaotic, fast, and full of edge cases that did not appear in the lab. ICRA 2026 will give the research community a chance to scrutinize the method in detail. The Tokyo team will have to show it holds up under pressure. For warehouses that have been waiting for a robot that can actually see through the mess, this is the most concrete proposal in recent memory.