The Grasping Problem: Why Your Robot Still Can't Pick Up a Coffee Mug
Two new papers tackle robotic grasping from opposite directions, and honestly, both approaches reveal how far we still have to go.
Yesterday · 4 min read
86.4% grasp stability sounds pretty good until you remember what it means: in roughly one in seven attempts, your robot drops whatever it's holding.
I've been thinking about this a lot lately. We've got humanoids that can do backflips, AI models that pass the bar exam, and yet the simple act of picking something up remains genuinely hard for robots. Two papers dropped this week that approach the problem from completely different angles, and I think they're worth examining together because they reveal something interesting about where the field is stuck.
The first paper, GraspFoM from a team working with 3D foundation models, takes what I'd call the "understand the object first" approach. The core insight is that robots often fail at grasping because they're working with incomplete information. You see half a mug, you guess where the handle is, you miss.
GraspFoM uses something called SAM3D (a 3D foundation prior) to build what the researchers call a "shared 3D object latent." Basically, the robot reconstructs a full 3D model of the object from partial observations, then uses that reconstruction to predict grasp poses. The clever bit is that these two tasks (reconstruction and grasp prediction) share the same underlying representation, so improvements in one help the other.
I initially thought this was just adding complexity for complexity's sake. But after reading through the ablation studies, I'm less sure. The reconstruction-aware scorer they introduce does seem to provide grounded geometric cues that improve grasp success. Though I should note, the paper doesn't provide real-world success rates, only simulation benchmarks. That gap always makes me nervous.
The second paper, SynManDex, takes a fundamentally different approach. Instead of trying to understand objects better, it asks: what if we just copied how humans grasp things?
This sounds obvious, but there's a catch. Human hands and robot hands are different. You can't just motion-capture someone picking up a teacup and replay that on a Shadow Hand. The morphology is wrong, the contact points don't transfer, the reachability constraints are different.
SynManDex's solution is to generate synthetic human "pre-grasps" (the approach phase before contact), use those as proposals, then optimize the final contacts for the specific robot hand. It's a pipeline approach: sample human-like poses, retarget to robot, optimize for force-closure, validate the trajectory.
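To make the "optimize for force-closure" step a bit more concrete, here is a minimal sketch of the classic planar two-finger test: with point contacts and Coulomb friction, a grasp is force-closure when the line joining the two contacts lies strictly inside both friction cones. This is a textbook simplification, not SynManDex's optimizer, which has to handle multi-fingered hands in 3D. The function name, contact coordinates, and friction values are my own placeholders.

```python
import math

def two_finger_force_closure(p1, n1, p2, n2, mu):
    """Planar two-contact force-closure test (point contacts with friction).

    p1, p2: contact points (x, y); n1, n2: inward-pointing surface normals;
    mu: Coulomb friction coefficient. All names here are illustrative.
    """
    half_angle = math.atan(mu)  # half-width of each friction cone

    def angle_between(u, v):
        norm = math.hypot(*u) * math.hypot(*v)
        if norm == 0:
            raise ValueError("contacts must be distinct and normals non-zero")
        dot = u[0] * v[0] + u[1] * v[1]
        return math.acos(max(-1.0, min(1.0, dot / norm)))

    line = (p2[0] - p1[0], p2[1] - p1[1])  # direction from contact 1 to 2
    back = (-line[0], -line[1])            # direction from contact 2 to 1
    return (angle_between(line, n1) < half_angle
            and angle_between(back, n2) < half_angle)

# Fingers squarely opposed across an object: closure holds.
opposed = two_finger_force_closure((-1, 0), (1, 0), (1, 0), (-1, 0), mu=0.5)
# Second finger slid far off-axis: closure is lost at the same friction.
skewed = two_finger_force_closure((-1, 0), (1, 0), (1, 2), (-1, 0), mu=0.5)
print(opposed, skewed)
```

The same skewed contact pair passes if you raise the friction coefficient enough, which is one reason contact placement and fingertip material trade off against each other.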
The results are interesting: 86.4% grasp stability in simulation and an 83.3% success rate on a real 36-DOF bimanual platform (that's 25 out of 30 attempts). And here's the part that surprised me: human evaluators rated the grasps 4.67 out of 5 for "human-likeness," which works out to 93.4% of the maximum score.
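These headline numbers are easy to sanity-check. A quick sketch using only the figures quoted in this article (the helper name is mine):

```python
def one_in_n(success_rate):
    """Convert a success rate into 'one failure in every N attempts'."""
    return 1.0 / (1.0 - success_rate)

sim_stability = 0.864      # simulated grasp stability
real_success = 25 / 30     # real-world trials
human_likeness = 4.67 / 5  # mean rating as a fraction of the maximum

print(f"simulation: one failure in {one_in_n(sim_stability):.1f} attempts")
print(f"real world: {real_success:.1%} success, "
      f"one failure in {one_in_n(real_success):.1f} attempts")
print(f"human-likeness: {human_likeness:.1%} of the top score")
```

That comes out to roughly one failure in seven in simulation and one in six on the real hardware.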
Why does human-likeness matter?
You might be wondering why we care if a robot grasp looks human. I think there are two reasons, one practical and one speculative.
The practical reason: human-like grasps often encode functional intent. When you pick up a hammer, you don't grab it randomly. You grab it in a way that lets you use it. If robots can learn these affordance-aware grasps, they're not just picking things up, they're picking them up in ways that enable downstream tasks.
The speculative reason: if robots are going to work around humans, there might be value in their movements being predictable. Tbh, I'm not sure how much this matters in practice, but it's worth considering.
Here's what I keep coming back to: both papers are working in relatively controlled conditions. GraspFoM's benchmarks are simulation-only (as far as I can tell from the abstract). SynManDex was tested on a high-end research platform with 30 real-world trials.
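Thirty trials is a small sample, and it's worth seeing how much uncertainty that leaves. Here's a minimal sketch using the Wilson score interval, a standard confidence interval for a binomial proportion. This is my own back-of-the-envelope check, not an analysis from either paper, and it assumes the 30 trials are independent.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

low, high = wilson_interval(25, 30)
print(f"25/30 successes: 95% interval from {low:.1%} to {high:.1%}")
```

Under those assumptions the interval runs from the mid-60s to the low 90s in percentage terms, so 25 out of 30 is compatible with a system that is noticeably worse, or somewhat better, than the 83.3% point estimate.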
Neither paper addresses what happens when the lighting is bad, the object is wet, the surface is cluttered, or the robot has to grasp while also maintaining balance. These aren't edge cases. They're the actual conditions robots will face in homes and warehouses.
I also couldn't find information on computational requirements for either approach. GraspFoM mentions "a small number of additional trainable parameters," which is encouraging but vague. SynManDex uses VLM agents for task design, which suggests significant compute overhead.
It remains unclear whether the path to reliable grasping runs through better 3D understanding, better human imitation, or something else entirely. Maybe it's end-to-end learning with massive datasets. Maybe it's compliant hardware that makes precise planning less necessary. Maybe it's some combination we haven't figured out yet.
What I do think is that 83-86% success rates, while impressive for research, aren't good enough for deployment. If your warehouse robot drops one in six packages, that's a problem. If your home robot drops one in six dishes, that's a lot of broken ceramics.
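The per-grasp number also understates the problem for multi-step jobs, because failures compound. A rough sketch, assuming each grasp succeeds independently at the quoted rate (a simplification, since real failures are often correlated):

```python
def clean_run_probability(success_rate, grasps):
    """Chance of completing `grasps` consecutive grasps with no drop,
    assuming independent attempts."""
    return success_rate ** grasps

for rate in (25 / 30, 0.864):
    p = clean_run_probability(rate, 10)
    print(f"{rate:.1%} per grasp -> {p:.0%} chance of 10 clean grasps in a row")
```

At either rate, the odds of getting through ten grasps without a single drop are below one in four.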
Both papers represent real progress. The reconstruction-guided approach in GraspFoM and the human-prior approach in SynManDex are pushing the field forward. But honestly, we're still far from solved. The gap between "works in the lab" and "works in your kitchen" remains stubbornly wide.