Two Papers Tackle the Same Problem in VLM-Driven Autonomy: Spatial Reasoning Still Isn't Solved

New research from separate teams identifies why vision-language models struggle with 3D space, but their solutions reveal how far we still have to go.

27 May 2026 · 7 min read

Most coverage of vision-language models in autonomous driving focuses on the impressive demos. A car that can respond to natural language commands! A system that "understands" traffic scenes! What tends to get buried, or ignored entirely, is the persistent failure mode that makes these systems dangerous: they cannot reliably reason about three-dimensional space.

Two papers released this week, from independent research teams, both zero in on this problem. And while their approaches differ substantially, reading them together reveals something important about the current state of VLM-based driving systems. We are not close to solving spatial reasoning. We are, to be precise, still diagnosing what is actually going wrong.

The problem nobody wants to talk about

The core issue is what one of the papers calls "spatial hallucinations." When you flatten continuous 3D spatial information into discrete tokens (the standard approach for feeding visual data into language models), you lose geometric structure. The model can describe a scene in words, but it cannot accurately predict where objects will be in three seconds. For a system meant to control a vehicle, that is not a minor limitation.
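To make the flattening problem concrete, here is a minimal sketch of what happens when a continuous distance is forced into a fixed vocabulary of discrete tokens. It is an illustration only: the uniform binning scheme, the 0 to 100 m range, the 64-token vocabulary, and the function names `to_token`, `from_token`, and `worst_case_error` are all assumptions made for this example and are not taken from either paper.

```python
# Hypothetical illustration: uniform binning of a distance into token ids.
# Range, bin count, and function names are assumptions, not from the papers.

def to_token(value_m: float, lo: float = 0.0, hi: float = 100.0, bins: int = 64) -> int:
    """Map a continuous distance in metres to one of `bins` discrete token ids."""
    clamped = min(max(value_m, lo), hi)
    idx = int((clamped - lo) / (hi - lo) * bins)
    return min(idx, bins - 1)  # the upper edge falls into the last bin


def from_token(token: int, lo: float = 0.0, hi: float = 100.0, bins: int = 64) -> float:
    """Decode a token id back to the centre of its bin."""
    width = (hi - lo) / bins
    return lo + (token + 0.5) * width


def worst_case_error(lo: float = 0.0, hi: float = 100.0, bins: int = 64) -> float:
    """Largest possible gap between a true value and its decoded bin centre."""
    return (hi - lo) / bins / 2


# Two objects more than a metre apart collapse onto the same token.
near, far = 31.5, 32.7
print(to_token(near), to_token(far))
print(from_token(to_token(near)))
print(worst_case_error())
```

Under these assumed settings, the two objects become indistinguishable once tokenized, and the decoded position can be off by up to half a bin width. Real systems use different encodings, but the same trade-off applies whenever a continuous quantity is squeezed into a finite vocabulary.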

The TPS-Drive paper, titled "Task-Guided Representation Purification for VLM-based Autonomous Driving," identifies a second failure mode that its authors term "representation interference." Dense visual methods that preserve spatial topology (the alternative to text-flattening) overwhelm tokenizers with irrelevant background information. The model gets distracted by textures, shadows, and road markings that don't matter, while missing the pedestrian stepping off the curb.
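A rough back-of-the-envelope sketch shows why dense visual tokens can drown out the object that matters. It assumes a regular grid of square patches, as in common vision transformer setups; the image size, patch size, pedestrian bounding box, and the helper names `patches_covered` and `relevant_fraction` are hypothetical and do not describe TPS-Drive's actual tokenizer or its purification method.

```python
# Hypothetical illustration: how few patch tokens a small object occupies.
# Image size, patch size, and the bounding box are made-up example values.
import math


def patches_covered(box, patch: int = 16) -> int:
    """Count grid patches touched by a pixel box (x0, y0, x1, y1), with x1/y1 exclusive."""
    x0, y0, x1, y1 = box
    cols = math.ceil(x1 / patch) - x0 // patch
    rows = math.ceil(y1 / patch) - y0 // patch
    return cols * rows


def relevant_fraction(image_w: int, image_h: int, box, patch: int = 16) -> float:
    """Share of all patch tokens that overlap the object of interest."""
    total = (image_w // patch) * (image_h // patch)
    return patches_covered(box, patch) / total


# A distant pedestrian, 30 by 70 pixels, in a 448 by 448 frame.
pedestrian = (200, 180, 230, 250)
print(patches_covered(pedestrian))
print(relevant_fraction(448, 448, pedestrian))
```

With these example numbers the pedestrian touches only a handful of the frame's several hundred patches, so the overwhelming majority of tokens describe background. That imbalance is one way to picture the interference the paper describes, though the paper's own analysis may frame it differently.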

