Two New Papers Tackle Robot Grasping's Hardest Problem: When You Can't See What You're Grabbing
Cross-view fusion and energy-based models offer different solutions to occlusion, but both papers reveal how far we still are from solved grasping.
·Yesterday·9 min read
Robot grasping is not a solved problem. I know this claim might seem obvious to anyone who has watched a robot arm fumble with a coffee mug, but it bears repeating because the field has a habit of declaring victory prematurely. Two recent papers on arXiv, both addressing the specific challenge of grasping under occlusion, remind us that even seemingly basic manipulation tasks remain genuinely difficult when you move beyond carefully staged laboratory conditions.
The papers take different approaches to the same fundamental issue: what happens when a robot cannot see the object it needs to grasp? One proposes a cross-view fusion framework that combines information from multiple camera angles. The other uses an energy-based model to guide active view selection. Both are interesting contributions, though neither is the breakthrough that press releases might suggest. To be precise, they represent solid incremental progress on a well-defined subproblem.
Before diving into the technical details, it is worth understanding why occlusion matters so much for robotic grasping. When a robot arm reaches toward an object, its own gripper often blocks the camera's view of the grasp point. In cluttered environments (think: a dishwasher full of plates, or a warehouse bin packed with products), other objects compound this problem. The robot needs to estimate where to place its fingers on a surface it cannot directly observe.
Humans solve this problem through a combination of tactile feedback, spatial memory, and the ability to mentally rotate objects. We have spent decades learning how things feel and how shapes continue around corners we cannot see. Robots, for the most part, are working with single RGB-D camera views and whatever geometric priors their training data provided.
The standard benchmark for this task is GraspNet-1Billion, which contains over a billion grasp annotations across 88 objects. Both papers evaluate on this benchmark, which is good for comparability, but benchmark performance does not always translate to real-world robustness. The sample sizes in real-robot experiments tend to be small (I will get to this later), and the objects are typically rigid, convex, and cooperative.
The first paper, "A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation," whose authors have released their code on GitHub, takes what I would call the geometric intuition approach: if one camera angle is occluded, add another. Simple enough in principle.
The interesting contribution here is not the multi-view idea itself (this has been explored extensively) but how they fuse the information. Previous multi-view approaches often relied on explicit 3D reconstruction, which is computationally expensive and introduces its own errors. This paper proposes a "post-fusion" strategy that avoids full reconstruction.
The key technical innovation is a self-supervised contrastive learning scheme for point cloud features. The idea is elegant: if two points from different camera views correspond to the same physical 3D location, their learned features should be similar. If they represent different grasp directions, their features should be distinct. This regularization encourages the network to learn spatially consistent representations without requiring explicit correspondence labels.
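To make the contrastive idea concrete, here is a minimal sketch using a generic InfoNCE-style loss. This is a stand-in, not the paper's actual objective: the function name `info_nce`, the temperature value, and the assumption that `feats_a[i]` and `feats_b[i]` describe the same 3D point are all mine, chosen for illustration.

```python
import math

def normalize(v):
    """Scale a feature vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(feats_a, feats_b, temperature=0.1):
    """Mean InfoNCE loss over cross-view point features.

    Assumption: feats_a[i] and feats_b[i] come from the same physical
    3D point seen in two views (the positive pair); every other entry
    of feats_b serves as a negative for anchor i.
    """
    a = [normalize(f) for f in feats_a]
    b = [normalize(f) for f in feats_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, other) / temperature for other in b]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # -log softmax of the positive
    return total / len(a)

# Hypothetical 2-D features for three points seen from two views.
view1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
view2 = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]

aligned = info_nce(view1, view2)           # correct correspondences
mismatched = info_nce(view1, view2[::-1])  # scrambled correspondences
```

With correct correspondences the loss is small, and scrambling them drives it up. That gap is the signal pushing features of the same location together across views.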
They also propose what they call a "cross-view-aligned cylinder integration module." I know I am being picky here, but the name is a mouthful that obscures a reasonable idea. The module aligns point features across views based on similarity, then projects everything into a cylindrical coordinate frame. The cylindrical representation emphasizes rotation-symmetric geometry, which makes sense for grasping cylindrical or rotationally symmetric objects. Whether this helps for, say, a crumpled piece of paper remains unclear.
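For readers who want the coordinate change spelled out, this is a small sketch of a Cartesian-to-cylindrical conversion. It assumes the cylinder axis is the local z axis through a chosen origin. The paper's actual axis choice and feature layout are not described here, and `to_cylindrical` is a hypothetical helper.

```python
import math

def to_cylindrical(point, origin=(0.0, 0.0, 0.0)):
    """Map (x, y, z) to (rho, phi, z) about a z-aligned axis through origin."""
    x, y, z = (p - o for p, o in zip(point, origin))
    rho = math.hypot(x, y)  # distance from the axis
    phi = math.atan2(y, x)  # angle around the axis, in (-pi, pi]
    return rho, phi, z

# Points on a ring around the axis: same rho and z, different phi.
ring = [to_cylindrical((math.cos(t), math.sin(t), 0.5)) for t in (0.0, 1.0, 2.0)]
```

Points on a ring around the axis share `rho` and `z` and differ only in `phi`, which is why this frame suits rotationally symmetric shapes and offers little for irregular ones.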
The results on GraspNet-1Billion show improvements over prior methods, particularly in what they call "corner views" where occlusion is severe. The real-robot experiments are less extensively reported, and the paper does not specify how many grasps were attempted or what the failure modes looked like.
The second paper, ActiveGrasp, takes a different tack: instead of adding a fixed second camera, it uses an energy-based model to decide where to look next. This is the active perception framing, and it has a long history in robotics. The core idea is that robots should not passively accept whatever sensor data arrives; they should actively seek information that reduces uncertainty about the task at hand. In the context of grasping, this means moving the camera to see occluded regions before committing to a grasp.
The technical contribution here is twofold. First, they use an energy-based model (EBM) to represent the distribution of valid grasp poses. Energy-based models are having something of a moment in robotics, and for good reason: they can represent multi-modal distributions without the mode-collapse issues that plague some generative models. When there are multiple valid ways to grasp an object, the EBM can capture all of them rather than averaging to a single (possibly invalid) grasp.
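A toy example shows why averaging is dangerous. The sketch below uses a one-dimensional double-well energy as a stand-in for a grasp parameter with two valid values. The energy function, grid, and temperature are illustrative assumptions and have nothing to do with the paper's learned model.

```python
import math

def energy(x):
    """Toy double-well energy: minima (valid grasps) at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2

def boltzmann(grid, temperature=0.1):
    """Normalized probabilities p(x) proportional to exp(-E(x) / T) on a grid."""
    weights = [math.exp(-energy(x) / temperature) for x in grid]
    z = sum(weights)
    return [w / z for w in weights]

grid = [i / 100.0 for i in range(-200, 201)]
probs = boltzmann(grid)

# A regressor trained on both modes tends toward the mean of the distribution.
mean_x = sum(p * x for p, x in zip(probs, grid))
# An EBM can instead report a mode, which is a low-energy point.
best_x = max(zip(probs, grid))[1]
```

Here `mean_x` lands near zero, on the energy barrier between the two wells, while `best_x` sits in one of the wells. The average of two good grasps is a bad grasp.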
Second, and this is the part I find more interesting, they calibrate the energy levels to correspond to actual grasp success rates. This means the model's confidence is, in theory, meaningful. A low-energy grasp is not just "likely under the training distribution" but actually "likely to succeed." This calibration allows them to estimate information gain in a principled way: the next best view is the one that most reduces uncertainty about which grasps will work.
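The information-gain idea can be sketched in a heavily simplified setting: a single grasp with a calibrated success probability, and candidate views that each yield a binary observation. The view names, likelihood values, and `expected_info_gain` helper are hypothetical. The paper works with distributions over full grasp poses, not one binary variable.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_info_gain(prior, p_obs_given_success, p_obs_given_failure):
    """Expected entropy reduction (bits) about grasp success from one view.

    prior: calibrated probability that the grasp succeeds.
    p_obs_given_*: chance the view returns a positive observation
    when the grasp would succeed / fail (assumed known here).
    """
    gain = binary_entropy(prior)
    for positive in (True, False):
        ls = p_obs_given_success if positive else 1 - p_obs_given_success
        lf = p_obs_given_failure if positive else 1 - p_obs_given_failure
        p_obs = prior * ls + (1 - prior) * lf
        if p_obs > 0:
            posterior = prior * ls / p_obs  # Bayes' rule
            gain -= p_obs * binary_entropy(posterior)
    return gain

# Hypothetical candidate views: (P(obs | success), P(obs | failure)).
views = {"front": (0.5, 0.5), "side": (0.9, 0.2), "top": (0.7, 0.4)}
prior = 0.5
gains = {name: expected_info_gain(prior, *lik) for name, lik in views.items()}
next_best_view = max(gains, key=gains.get)
```

The "front" view is equally likely to return either observation whether or not the grasp works, so it carries no information. The "side" view separates success from failure most sharply and is selected. None of this is meaningful unless the prior is calibrated, which is why that step matters.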
The experiments show improvements over baseline active grasping methods, and the authors claim the method works under "limited view budgets," which is practically important. Every camera movement takes time, and in many applications speed matters. The simulated environment they developed could be useful for future research, though I would want to see more details about the sim-to-real gap.
Let me be direct about what these papers contribute. Neither is a paradigm shift (and I am using that phrase deliberately to mock its overuse). Both are solid engineering contributions that improve on prior work in measurable ways.
The cross-view fusion paper's contrastive learning scheme for point cloud regularization is, I think, genuinely clever. It sidesteps the need for explicit correspondence supervision while still encouraging geometric consistency. The cylindrical coordinate representation is less novel; similar ideas have appeared in prior work on rotationally symmetric objects.
The ActiveGrasp paper's calibrated energy-based model is the more interesting theoretical contribution. The idea that energy levels should correspond to success probabilities seems obvious in retrospect, but actually achieving this calibration requires careful design choices. The information-theoretic view selection builds on a substantial literature in active perception, but applying it specifically to grasp distributions on SE(3) (the manifold of 3D rotations and translations) is a meaningful extension.
Neither paper addresses what I consider the harder open problems in grasping: deformable objects, transparent or reflective surfaces, novel object categories with no training examples, or grasping in truly dynamic environments where objects move during the grasp attempt. Both evaluate on rigid objects from known categories. This is not a criticism exactly (you have to scope your work somehow), but it is worth noting that the "grasping problem" these papers solve is a particular, well-defined version of a much larger challenge.
I should note some limitations that the papers themselves acknowledge to varying degrees.
For the cross-view fusion work, the auxiliary view is assumed to be available and correctly calibrated. In practice, adding a second camera to a robot setup introduces its own challenges: calibration drift, synchronization, and cost. The paper does not discuss how performance degrades if the auxiliary view is poorly positioned or if the camera calibration is slightly off.
For ActiveGrasp, the energy-based model's calibration depends on having access to ground-truth grasp success labels during training. This is straightforward in simulation but harder in real-world data collection. The paper mentions that the source code will be made public, but it has not been released yet as of this writing, so I cannot verify the implementation details.
Both papers report real-robot experiments, but the details are sparse. How many grasp attempts? What objects? What was the failure rate, and what caused failures? The sample sizes appear to be small (this is standard in the field, but it limits what conclusions we can draw). Real-robot grasping experiments are expensive and time-consuming, which is precisely why offline benchmarks like GraspNet-1Billion exist, but offline results do not always transfer to a physical robot.
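To put a number on how much small samples limit conclusions, here is a quick Wilson score interval calculation. The trial counts are hypothetical, picked only to illustrate the effect, and are not taken from either paper.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a binomial success rate."""
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    return center - half, center + half

# Same 85% observed success rate, very different evidence.
small = wilson_interval(17, 20)      # hypothetical real-robot trial count
large = wilson_interval(850, 1000)   # hypothetical benchmark-scale count
```

With 17 successes in 20 attempts, the interval runs from roughly 0.64 to 0.95. A reported "85% success rate" at that sample size is compatible with a mediocre system and an excellent one.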
If I were reviewing follow-up work in this area, here is what would excite me:
First, systematic studies of failure modes. Both papers report success rates, but understanding why grasps fail is often more informative than knowing how often they succeed. Is it perception errors? Planning errors? Execution errors? Control errors during contact?
Second, evaluation on more diverse object categories. The GraspNet-1Billion benchmark is useful but limited. Household objects, deformable packaging, cluttered real-world scenes with unknown objects: these are the conditions where current methods struggle most.
Third, integration with tactile sensing. Both papers focus entirely on vision-based grasp estimation. Humans rely heavily on touch to adjust grasps in real-time. Some recent work has started combining visual and tactile information, but it remains underexplored.
Fourth, computational efficiency analysis. How fast do these methods run? Can they operate in real-time on a robot with limited onboard compute? The papers do not provide timing information, which makes it hard to assess practical deployability.
The broader question these papers raise, without fully answering, is whether the path to robust grasping runs through better perception, better planning, or better reactive control. The cross-view fusion approach bets on perception: if you can see the object well enough, grasp planning becomes easier. The ActiveGrasp approach bets on intelligent information gathering: if you can efficiently reduce uncertainty, you can grasp with less total sensing.
Both assumptions have limits. Some objects cannot be fully perceived no matter how many views you take (think: a bag of chips, which deforms unpredictably). Some environments do not allow the luxury of multiple views (think: grasping an object as it passes on a conveyor belt).
It remains unclear, and I mean this genuinely, whether the grasping problem will be solved through continued refinement of these perception-heavy approaches or whether it requires fundamentally different architectures that integrate perception, planning, and control more tightly. The field has been making steady progress on benchmarks, but benchmark performance has not translated to the kind of robust, general-purpose grasping that would enable, say, a robot that can unload any dishwasher.
These two papers are good examples of the current state of the art: technically sophisticated, carefully evaluated on standard benchmarks, incrementally better than prior work. They are not the end of the story, but they are useful chapters in it.