VLAs Can't See in 3D, and Two New Papers Finally Quantify the Problem
Researchers have put numbers on what roboticists suspected: vision-language-action models have a serious geometry problem.
Photo credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Vision-language-action (VLA) models don't understand 3D space nearly as well as we need them to. Two new papers on arXiv make this painfully clear, and honestly, it's about time someone did the homework.
The numbers
The first paper, from researchers working with NVIDIA's GR00T-N1.5, does something I wish more academic work would do: it actually measures the problem instead of just gesturing at it. Using linear probing (a standard technique: fit a simple linear model on a network's frozen internal features to see what information they already encode), they quantified what they call the "geometric gap" between VLAs and dedicated geometric foundation models like VGGT.
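If linear probing is new to you, here is roughly what the idea looks like in code. This is a minimal sketch, not the paper's setup: the feature arrays are synthetic stand-ins I made up, `linear_probe_r2` is a hypothetical helper name, and the target is a toy "depth" value rather than anything the authors measured. The only point is the mechanic: freeze the features, fit a linear map, and score it on held-out data.

```python
import numpy as np

def linear_probe_r2(features, targets, ridge=1e-3, train_frac=0.8, seed=0):
    """Fit a ridge-regularised linear map on frozen features; return held-out R^2."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    order = rng.permutation(n)
    split = int(train_frac * n)
    train, test = order[:split], order[split:]

    # Append a bias column. The model that produced `features` is never updated.
    X = np.hstack([features, np.ones((n, 1))])
    X_train, X_test = X[train], X[test]
    y_train, y_test = targets[train], targets[test]

    # Closed-form ridge solution: W = (X^T X + lambda * I)^-1 X^T y
    gram = X_train.T @ X_train + ridge * np.eye(X.shape[1])
    W = np.linalg.solve(gram, X_train.T @ y_train)

    residual = y_test - X_test @ W
    ss_res = float(np.sum(residual ** 2))
    ss_tot = float(np.sum((y_test - y_test.mean(axis=0)) ** 2))
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins (assumptions for illustration only).
rng = np.random.default_rng(42)
depth = rng.uniform(0.5, 3.0, size=(2000, 1))  # toy geometric target, in metres

# Features that encode depth linearly, plus a little noise.
geometry_aware = depth @ rng.normal(size=(1, 16)) + 0.05 * rng.normal(size=(2000, 16))
# Features that carry no depth information at all.
geometry_blind = rng.normal(size=(2000, 16))

r2_geo = linear_probe_r2(geometry_aware, depth)
r2_weak = linear_probe_r2(geometry_blind, depth)
print(f"probe R^2, geometry-aware features: {r2_geo:.3f}")
print(f"probe R^2, geometry-blind features: {r2_weak:.3f}")
```

The difference between the two scores is the toy version of a "gap": a probe can only read out what the frozen features already contain.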
I'll be honest, when I was at Kuka we spent years on spatial calibration for industrial arms, and the idea that these new models might just, sort of, figure out geometry from language and images always seemed optimistic to me. Now we have data showing the gap is real and measurable.
The second paper, from a separate team, identifies three specific failures: VLAs can't enforce multi-view consistency (meaning they don't understand that two camera angles show the same object), they struggle with instance-level understanding (knowing that this box is different from that box), and they fall apart when things get occluded. Anyone who's watched a robot arm knock something over while reaching for something behind it won't be surprised.
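For a sense of what multi-view consistency means in classical terms, here is a small sketch of the check a calibrated two-camera rig can run. It is not taken from either paper. The intrinsics, the 20 cm baseline, the test point, and all function names are assumptions chosen for illustration. Two pixel observations are triangulated into one 3D point, and if they really show the same thing, that point should reproject onto both pixels.

```python
import numpy as np

def projection_matrix(K, R, t):
    """Build the 3x4 pinhole projection matrix P = K [R | t]."""
    return K @ np.hstack([R, t.reshape(3, 1)])

def project(P, X):
    """Project a 3D point X into pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one point from two pixel observations."""
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    Xh = Vt[-1]  # homogeneous solution with the smallest singular value
    return Xh[:3] / Xh[3]

def reprojection_error(P1, P2, uv1, uv2):
    """Worst-case pixel error after triangulating and reprojecting."""
    X = triangulate(P1, P2, uv1, uv2)
    e1 = np.linalg.norm(project(P1, X) - uv1)
    e2 = np.linalg.norm(project(P2, X) - uv2)
    return max(e1, e2)

# Assumed rig: identical intrinsics, second camera shifted 0.2 m along x.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = projection_matrix(K, np.eye(3), np.zeros(3))
P2 = projection_matrix(K, np.eye(3), np.array([-0.2, 0.0, 0.0]))

point = np.array([0.1, -0.05, 1.5])  # assumed object position, in metres
uv1, uv2 = project(P1, point), project(P2, point)

# Same object seen in both views: the error is essentially zero.
consistent = reprojection_error(P1, P2, uv1, uv2)

# A wrong match: 40 px off vertically, which this sideways baseline cannot explain.
inconsistent = reprojection_error(P1, P2, uv1, uv2 + np.array([0.0, 40.0]))

print(f"consistent pair:   {consistent:.2e} px")
print(f"inconsistent pair: {inconsistent:.2f} px")
```

A geometric pipeline gets this constraint for free from calibration. A model working from RGB alone has to learn it implicitly, which is exactly where the paper says VLAs come up short.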
So what?
Look, here's the thing. We've had mature 3D perception methods for years: structured light, time-of-flight, stereo vision with proper calibration. The Kuka LBR iiwa I worked on in 2016 could do sub-millimetre positioning because we didn't ask it to hallucinate geometry from RGB images.



