VLAs Can't See in 3D, and Two New Papers Finally Quantify the Problem

Researchers have put numbers on what roboticists suspected: vision-language-action models have a serious geometry problem.

29 May 2026 · 3 min read

Vision-language-action models don't understand 3D space nearly as well as we need them to. Two new arXiv papers make this painfully clear, and honestly, it's about time someone did the homework.

The numbers

The first paper, from researchers working with NVIDIA's GR00T-N1.5, does something I wish more academic work would do: it measures the problem instead of just gesturing at it. Using linear probing (a standard technique that trains a simple linear model on a network's frozen internal features to test what information they encode), they quantified what they call the "geometric gap" between VLAs and dedicated geometric foundation models like VGGT.
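If linear probing is new to you, here is a minimal sketch of the idea, not the paper's actual setup. Everything in it is an assumption made for illustration: the features are synthetic stand-ins rather than real GR00T-N1.5 or VGGT activations, the target is a made-up depth value, and the names `fit_linear_probe`, `probe_r2`, `geometry_rich` and `geometry_poor` are hypothetical. The point is only to show how a "gap" falls out of comparing probe scores on two frozen feature sets.

```python
import numpy as np

def fit_linear_probe(features, targets, l2=1e-3):
    """Closed-form ridge regression from frozen features to targets."""
    # Append a bias column so the probe can learn an offset.
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    # Solve (X^T X + l2 * I) W = X^T y for the probe weights W.
    gram = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ targets)

def probe_r2(weights, features, targets):
    """Coefficient of determination of the probe on the given data."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    residual = np.sum((targets - X @ weights) ** 2)
    total = np.sum((targets - targets.mean(axis=0)) ** 2)
    return 1.0 - residual / total

rng = np.random.default_rng(0)
n, d = 2000, 32

# Synthetic geometric target (an assumed per-sample depth in metres).
depth = rng.uniform(0.5, 3.0, size=(n, 1))

# Hypothetical stand-ins for frozen features from two models:
# one encodes depth almost linearly, the other only faintly.
direction = rng.normal(size=(1, d))
geometry_rich = depth @ direction + 0.1 * rng.normal(size=(n, d))
geometry_poor = 0.05 * (depth @ direction) + rng.normal(size=(n, d))

# Fit each probe on a training split, score it on held-out samples.
train, test = slice(0, 1500), slice(1500, n)
w_rich = fit_linear_probe(geometry_rich[train], depth[train])
w_poor = fit_linear_probe(geometry_poor[train], depth[train])
r2_rich = probe_r2(w_rich, geometry_rich[test], depth[test])
r2_poor = probe_r2(w_poor, geometry_poor[test], depth[test])

# The difference in held-out probe accuracy is the "gap" in this toy setup.
gap = r2_rich - r2_poor
print(f"rich R^2={r2_rich:.3f}  poor R^2={r2_poor:.3f}  gap={gap:.3f}")
```

The probe is deliberately weak. Because it can only read out what is already linearly available in the features, a low score suggests the model never encoded the geometry in an accessible form, rather than that the probe failed to learn it.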

I'll be honest: when I was at Kuka we spent years on spatial calibration for industrial arms, and the idea that these new models might just, sort of, figure out geometry from language and images always seemed optimistic to me. Now we have data showing the gap is real and measurable.

The second paper, from a separate team, identifies three specific failures: VLAs can't enforce multi-view consistency (meaning they don't understand that two camera angles show the same object), they struggle with instance-level understanding (knowing that this box is different from that box), and they fall apart when things get occluded. Anyone who's watched a robot arm knock something over while reaching for something behind it won't be surprised.

So what

Look, here's the thing. We've had mature 3D perception methods for years. Structured light, time-of-flight, stereo vision with proper calibration. The Kuka LBR iiwa I worked on in 2016 could do sub-millimetre positioning because we didn't ask it to hallucinate geometry from RGB images.
