Teaching Robots to Listen Is Harder Than It Sounds. Two New Papers Are Taking It Seriously.
A pair of recent research papers tackle language-guided robot manipulation from different angles. One uses vision-language models to parse verbal instructions. The other adds hand gestures into the mix.
Think about how you'd train a new guy on the warehouse floor. You don't hand him a 400-page manual. You point at something and say "grab that, the one on the top shelf, not the other one." Maybe you wave your hand a bit. He figures it out. That's basically the interaction model two new robotics papers are trying to replicate in software. I'll be honest: reading through them reminded me how far we still have to go, and how much closer we're getting at the same time.
The first paper, out of what appears to be an academic group working with tabletop manipulation setups, introduces a framework called GRASP, which stands for Grounded Reasoning and Symbolic Planning. The core idea is that you give the robot a natural-language instruction, something like "pick up the red mug on the top shelf," and a pretrained vision-language model translates it into what the researchers call a neuro-symbolic goal state. Rather than relying on hard-coded colour lists or fixed coordinate grids (and yes, I've worked with systems that did exactly that; painful doesn't cover it), GRASP uses a bounding-box detection pipeline to ground the instruction in whatever the camera is actually seeing.
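To make the grounding step concrete, here is a rough sketch of what "match a symbolic goal against bounding boxes" can look like. To be clear, this is my own illustration and not GRASP's code: the names `Detection`, `GoalState`, and `ground`, the exact-match label test, the 0.5 score cutoff, and the "highest box in the image wins" rule for "top shelf" are all assumptions I made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Detection:
    """One detector output (hypothetical structure, not from the paper)."""
    label: str                               # open-vocabulary label, e.g. "red mug"
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels
    score: float                             # detector confidence in [0, 1]


@dataclass(frozen=True)
class GoalState:
    """Toy symbolic goal, standing in for what a VLM might emit."""
    predicate: str                # e.g. "holding"
    target: str                   # object description, e.g. "red mug"
    prefer: Optional[str] = None  # optional spatial hint; only "top" is handled here


def ground(goal: GoalState, detections: List[Detection],
           min_score: float = 0.5) -> Optional[Detection]:
    """Pick the detection that best satisfies the goal, or None if nothing fits."""
    candidates = [d for d in detections
                  if d.label == goal.target and d.score >= min_score]
    if not candidates:
        return None
    if goal.prefer == "top":
        # Image y grows downward, so the smallest y_min is the highest object.
        return min(candidates, key=lambda d: d.box[1])
    # No spatial hint: fall back to the most confident match.
    return max(candidates, key=lambda d: d.score)


detections = [
    Detection("red mug", (100.0, 300.0, 160.0, 380.0), 0.90),   # lower shelf
    Detection("red mug", (120.0, 40.0, 180.0, 120.0), 0.80),    # top shelf
    Detection("blue mug", (10.0, 20.0, 60.0, 90.0), 0.95),
]
goal = GoalState(predicate="holding", target="red mug", prefer="top")
chosen = ground(goal, detections)
```

Here `chosen` ends up being the second red mug, the one higher in the frame, even though the detector was more confident about the lower one. A real system would need fuzzier label matching and actual 3D reasoning about shelves, but the point stands: the instruction gets resolved against what the camera sees rather than against a lookup table somebody typed in ahead of time.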
They ran 90 real-robot trials across three difficulty levels and hit 73.3% overall success, which works out to 66 of the 90 trials. No task-specific training required. The arXiv paper is light on some implementation details I'd want to know, like cycle time and what happens when the scene is cluttered or the lighting is poor, but 73.3% with zero fine-tuning is a number worth paying attention to. When I was at Kuka, we spent months tuning vision systems for tasks far simpler than open-vocabulary grasping. The idea that you could skip that whole process is remarkable, if it holds up outside a lab.