Apparently yes, and it turns out the answer is more nuanced than either camp would prefer.
Two papers caught my attention this week, both addressing object pose estimation for robotic manipulation but from notably different angles. The first, "Object Pose and Shape Estimation for Grasping: Does it Work?" from arXiv, asks a question that sounds almost heretical in the current deep learning zeitgeist: maybe we should go back to modular pipelines? The second, "ComPose: A Unified Completion-Pose Framework" from the same repository, takes the opposite tack, arguing for tighter integration between shape completion and pose estimation.
What makes these papers worth discussing together is that they're both responding to the same underlying problem (incomplete point clouds make pose estimation hard) but reaching conclusions that pull in different directions. I think there's something genuinely useful to extract from reading them side by side.
To be precise, the question is this: when a robot needs to grasp an object, should it use a single neural network that takes in sensor data and outputs grasp poses directly (end-to-end), or should it first estimate the object's shape and pose, then use classical antipodal sampling to generate grasps (modular)?
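To make the "classical antipodal sampling" half of that question concrete, here is a minimal sketch of the geometric test such a sampler applies to a reconstructed surface. It is not taken from either paper: the function names, the friction coefficient, and the 8 cm maximum opening are assumptions chosen for illustration, and the normals are assumed to be unit-length and pointing outward.

```python
import math
from itertools import combinations


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def is_antipodal(p1, n1, p2, n2, friction_coeff=0.5, max_width=0.08):
    """Check whether two contacts form an antipodal parallel-jaw grasp.

    p1, p2: contact points (metres). n1, n2: unit outward surface normals.
    friction_coeff and max_width are illustrative assumptions.
    """
    axis = _sub(p2, p1)
    width = math.sqrt(_dot(axis, axis))
    if width == 0.0 or width > max_width:
        return False  # degenerate pair, or wider than the gripper opens
    axis = tuple(x / width for x in axis)
    # The grasp axis must lie inside the friction cone at both contacts.
    cos_cone = math.cos(math.atan(friction_coeff))
    return _dot(axis, n1) <= -cos_cone and _dot(axis, n2) >= cos_cone


def sample_antipodal_grasps(points, normals, **kwargs):
    """Brute-force all point pairs; returns index pairs that pass the test."""
    return [
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if is_antipodal(points[i], normals[i], points[j], normals[j], **kwargs)
    ]
```

The point of the sketch is that this stage has no learned parameters at all. Everything it knows about the object comes from the estimated geometry, so grasp quality is bounded by the quality of the reconstruction feeding it.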
The end-to-end approach has dominated recent research. The logic is straightforward: fewer hand-designed components means fewer places for errors to compound, and the network can learn whatever intermediate representations are actually useful rather than what we assume should be useful.
But the arXiv paper on grasp synthesis (I'll call it the "Does it Work" paper for brevity) challenges this assumption directly. The authors implemented a state-of-the-art end-to-end grasp synthesis method and compared it against three modular approaches that first reconstruct object geometry, then sample grasps from the reconstruction.
Their finding: the modular methods outperformed end-to-end in all experiments.
Now, I know I'm being picky here, but "all experiments" requires some unpacking. The study was scoped specifically to parallel jaw grippers, 7-DoF grasps, and single-view RGB-D input. That's a reasonable scope, but it's not the full space of manipulation problems. To its credit, the paper is careful about this: the authors explicitly note their experimental boundaries.
The paper reports that modular methods were particularly strong for small objects, where end-to-end methods frequently failed to synthesize any valid grasps at all. This is an interesting failure mode that deserves more attention than it typically gets in the literature.
The authors tested three modular approaches using two different paradigms for pose and shape estimation: encoder-decoder models (like SAM3D, LRM, and CRISP) and diffusion-based models (like InstantMesh and Zero123). Both paradigms showed strong performance relative to end-to-end baselines.
However (and this is important), the effectiveness of modular methods degraded in cluttered scenes. The paper attributes this to limitations in current pose and shape estimation methods when objects are partially occluded by other objects. It's worth noting that this is precisely the scenario where end-to-end methods were supposed to shine, learning implicit representations that handle occlusion gracefully.
So we have a situation where modular methods win on isolated objects and small objects, but the advantage narrows or potentially reverses in clutter. The paper doesn't provide enough data points for me to say exactly where the crossover happens, and I'd want to see this replicated before drawing strong conclusions.
The ComPose paper from the same week takes a different approach to the same underlying problem. Rather than asking "modular or end-to-end," it asks "can we make pose estimation itself more robust by integrating shape completion?"
The core insight is that observed point clouds are always incomplete (sensors can only see the visible surface), and this incompleteness fundamentally limits pose estimation accuracy. Previous approaches have treated shape completion as a preprocessing step: first complete the point cloud, then estimate pose from the completed shape.
The ComPose authors argue this is suboptimal because errors in completion propagate to pose estimation. Instead, they propose a unified framework where completion and pose estimation share representations and are trained jointly.
Actually, the research shows something more specific than just "joint training helps." The key contribution is what they call a "keypoint-based progressive completion module." Rather than trying to predict a complete dense point cloud in one shot, the system first predicts a sparse set of keypoints that capture the object's overall geometry, then fills in dense points around each keypoint.
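The data flow of that two-stage idea can be sketched in a few lines. This is only an illustration of the sparse-then-dense structure, not the authors' architecture: `predict_keypoints` and `predict_offsets` are hypothetical stand-ins for what would be learned modules in the paper, and the function name is my own.

```python
def progressive_complete(partial_cloud, predict_keypoints, predict_offsets):
    """Coarse-to-fine completion sketch.

    partial_cloud: list of (x, y, z) observed points.
    predict_keypoints: callable returning a sparse list of (x, y, z) keypoints
        (hypothetical stand-in for a learned stage).
    predict_offsets: callable returning local (dx, dy, dz) offsets for one
        keypoint (hypothetical stand-in for a learned stage).
    """
    # Stage 1: a sparse skeleton that captures the overall geometry.
    keypoints = predict_keypoints(partial_cloud)

    # Stage 2: fill in dense points in the neighbourhood of each keypoint.
    dense = []
    for kp in keypoints:
        for offset in predict_offsets(partial_cloud, kp):
            dense.append(tuple(k + o for k, o in zip(kp, offset)))
    return keypoints, dense
```

One appeal of this structure is that errors in the second stage stay local: a bad offset moves one dense point near its keypoint, rather than distorting the global shape.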
This is genuinely new, at least in this specific formulation. The progressive, keypoint-first approach is reminiscent of some earlier work on coarse-to-fine prediction, but the integration with pose estimation and the specific geometric relation consistency loss they introduce appear novel.
ComPose reports state-of-the-art results on standard category-level pose estimation benchmarks without relying on category-level shape priors. That last part matters because many previous methods assume access to a canonical 3D model for each object category, which limits real-world applicability.
The benchmarks used are standard ones in the field, which is good for comparability but also means we should be cautious about overfitting to benchmark quirks. I haven't seen independent replication yet, though the paper only just appeared.
One methodological concern: the paper compares against "state-of-the-art approaches," but the specific baselines and their implementation details will matter a lot. It's common in this literature for different papers to report different numbers for the same baseline methods depending on hyperparameter choices and evaluation protocols. I'd want to see the code and run it myself before fully trusting the magnitude of improvements claimed.
Here's where it gets interesting. The "Does it Work" paper found that modular approaches work well when pose and shape estimation is accurate, but degrade when estimation fails. The ComPose paper proposes a method to make pose and shape estimation more robust.
If ComPose's improvements hold up, it would strengthen the case for modular grasp planning. Better upstream estimation means better downstream grasps. But this also reveals a dependency that makes me slightly uncomfortable: the modular approach's success is contingent on continued progress in pose estimation research.
The end-to-end approach, whatever its current limitations, at least has a clear path to improvement: more data, bigger models, better architectures. The modular approach requires coordinated progress across multiple components.
The "Does it Work" paper includes a section that feels almost like a bonus finding. The authors demonstrate that their modular pose and shape estimation pipeline can be augmented with vision-language models to enable language-conditioned grasping from single-view RGB-D input.
They compare against LERF-TOGO, which is currently the standard baseline for this task, and report "comparable performance." It's worth noting that "comparable" is doing some work here. The paper doesn't claim to beat LERF-TOGO, just to match it while using a simpler pipeline.
This is potentially significant because it suggests that explicit geometric reasoning (pose and shape estimation) can serve as a foundation for higher-level semantic tasks. The robot doesn't need to learn everything end-to-end; it can build on geometric primitives.
But I remain unclear on exactly how the language grounding works in their system. The paper mentions using vision-language models but doesn't provide extensive details on the integration. This feels like it deserves its own paper rather than being a late-section addition.
Several things I'd want to see addressed in follow-up work:
First, the cluttered scene degradation in the modular approaches needs more investigation. How bad does it get? Is there a principled way to predict when modular methods will fail? Could you build a system that switches between modular and end-to-end based on scene complexity?
Second, the ComPose paper's improvements need replication. The method sounds promising, but single-paper results in this field have a mixed track record of holding up.
Third, neither paper addresses computational constraints seriously. The "Does it Work" paper mentions runtime analysis but doesn't make it central to the evaluation. For real-time manipulation, inference speed matters as much as accuracy. Some of the diffusion-based methods they test are notoriously slow.
Fourth, both papers focus on parallel jaw grippers. Dexterous hands present different challenges, and it's not obvious that the same conclusions would hold. The relationship between shape estimation accuracy and manipulation success might be different when you have more degrees of freedom in the gripper.
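On the first point, the crudest version of a switching system is just a rule over scene statistics. The sketch below is hypothetical throughout: the `SceneStats` fields, the thresholds, and the idea that occlusion can be estimated cheaply up front are all my assumptions, and neither paper proposes or evaluates anything like this.

```python
from dataclasses import dataclass


@dataclass
class SceneStats:
    num_objects: int        # objects detected in the scene
    occlusion_ratio: float  # estimated hidden fraction of the target, 0.0 to 1.0


def choose_planner(stats, max_objects=3, max_occlusion=0.3):
    """Pick a grasp planner from scene complexity.

    The thresholds are placeholders; finding where the crossover actually
    happens is exactly the open question.
    """
    cluttered = (
        stats.num_objects > max_objects
        or stats.occlusion_ratio > max_occlusion
    )
    # Fall back to end-to-end where pose and shape estimation is expected
    # to degrade; otherwise prefer the modular pipeline.
    return "end_to_end" if cluttered else "modular"
```

Even a rule this simple would need the crossover data the paper doesn't provide, which is part of why I'd like to see the degradation measured more carefully.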
Honestly, I'd want to see a more systematic comparison that controls for computational budget. If you give both approaches the same inference time budget, which wins? The current comparisons don't address this.
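One way to frame such a comparison is to count a trial as a success only if the grasp both works and is produced within the budget. Here is a minimal sketch of that metric. The function name and the trial format are my own assumptions, and the numbers in the usage note are invented purely to show the mechanics, not results from either paper.

```python
def success_under_budget(trials, budget_s):
    """Fraction of trials that succeeded AND finished within budget_s seconds.

    trials: list of (latency_seconds, succeeded) tuples, one per grasp attempt.
    """
    if not trials:
        return 0.0
    hits = sum(1 for latency, ok in trials if ok and latency <= budget_s)
    return hits / len(trials)
```

With made-up logs such as `slow = [(0.8, True), (1.5, True), (0.9, False), (2.5, True)]` and `fast = [(0.1, True), (0.1, False), (0.12, True), (0.1, False)]`, the fast method leads at a 1-second budget (0.5 versus 0.25) and the slow one leads at 3 seconds (0.75 versus 0.5). The ranking can flip with the budget, which is why reporting accuracy without it tells only half the story.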
I'd also like to see these methods tested on a wider range of objects, particularly deformable objects and transparent objects. Both papers focus on rigid, opaque items where geometric reasoning is most applicable. The real world is messier.
And (this is perhaps too much to ask) I'd appreciate more honesty about failure modes. Both papers emphasize where their methods succeed. A detailed analysis of where they fail would be more useful for practitioners trying to decide which approach to use.
The broader implication of this week's papers is that the modular versus end-to-end debate isn't settled, and probably shouldn't be. Different approaches have different strengths, and the right choice depends on the specific application constraints. That's a less exciting conclusion than "X is the future," but it has the advantage of being true.
(For what it's worth, my prior going into this was that end-to-end methods would dominate within a few years. These papers have made me update toward modular approaches having more staying power than I expected, though I'm not ready to reverse my position entirely.)