Why VLMs Keep Fumbling Spatial Reasoning (And What's Actually Working)

A wave of new research tackles the gap between what vision-language models can see and what they can actually do with that information.

By Sarah Williams

1 hour ago · 7 min read

Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).

Why do robots still struggle with tasks a toddler finds trivial?

I've been thinking about this a lot lately. You can show a vision-language model a cluttered kitchen counter and it'll describe every object with impressive accuracy. Ask it to tell a robot how to stack those objects without knocking anything over, and suddenly things fall apart. Sometimes literally.

A batch of recent papers from robotics researchers is converging on the same uncomfortable truth: our best VLMs are great at seeing and talking, but the spatial reasoning that connects perception to physical action remains, honestly, kind of a mess. The good news? People are finding workarounds. The bad news? Those workarounds reveal just how far we still have to go.

The Gap Between Seeing and Doing

Let me start with what I think is the most revealing study of the bunch. Researchers tested VLMs on a collaborative structure-building task, basically a robot version of describing how to rebuild a Lego tower to someone who can't see it. According to their paper published on arXiv, multi-turn dialogue between AI agents improved performance on spatial reasoning. But here's the kicker: only barely.

The finding that stuck with me was this: detailed text descriptions of a target structure actually worked better than showing the model images of it. Think about that for a second. You'd assume a vision model would prefer, you know, vision. But when it comes to spatial reasoning, words apparently beat pictures. That's weird, right?

This suggests that VLMs can process visual information but struggle to extract the precise spatial relationships that manipulation requires. They see a cup on a table but can't reliably tell you that it's 15 centimeters from the edge, or that rotating it 30 degrees would clear the obstacle behind it.
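To make that concrete, here's a minimal sketch in Python of the kind of geometric questions a manipulation task needs answered. Everything in it is my own illustration, not something from the papers: the table dimensions, the cup and handle positions, the circular obstacle, and the helper names `distance_to_nearest_edge`, `rotate_about`, and `clears` are all hypothetical.

```python
import math

def distance_to_nearest_edge(x, y, width, depth):
    """Distance from (x, y) to the closest edge of a rectangular table
    spanning (0, 0) to (width, depth). All units are centimeters."""
    if not (0 <= x <= width and 0 <= y <= depth):
        raise ValueError("point is not on the table")
    return min(x, width - x, y, depth - y)

def rotate_about(point, pivot, degrees):
    """Rotate `point` counterclockwise around `pivot` by `degrees`."""
    theta = math.radians(degrees)
    dx, dy = point[0] - pivot[0], point[1] - pivot[1]
    return (
        pivot[0] + dx * math.cos(theta) - dy * math.sin(theta),
        pivot[1] + dx * math.sin(theta) + dy * math.cos(theta),
    )

def clears(point, obstacle_center, obstacle_radius):
    """True if `point` lies strictly outside a circular obstacle."""
    return math.dist(point, obstacle_center) > obstacle_radius

# Hypothetical scene: a 120 x 60 cm table, a cup centered at (15, 40),
# its handle tip 6 cm to the right, and a small obstacle next to it.
cup = (15, 40)
handle_tip = (21, 40)
obstacle_center, obstacle_radius = (22, 40), 2

edge_distance = distance_to_nearest_edge(cup[0], cup[1], 120, 60)  # 15
blocked_before = not clears(handle_tip, obstacle_center, obstacle_radius)
rotated_tip = rotate_about(handle_tip, cup, 30)
clear_after = clears(rotated_tip, obstacle_center, obstacle_radius)

print(edge_distance, blocked_before, clear_after)
```

None of this is hard once you have coordinates. The difficult part, and the part the research suggests VLMs struggle with, is getting reliable coordinates and relations out of an image in the first place.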
