The Spatial Hallucination Problem Nobody Wants to Talk About

Two new papers tackle the same fundamental issue: vision-language models for autonomous driving can't actually see the world the way they need to.

By Robert "Bob" Macintosh

3 hours ago · 5 min read

Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).

Why do self-driving systems still get confused by parked cars?

I've been asking this question for years. When I was at Kuka, we had a running joke about the warehouse AGVs that would occasionally try to drive through a pallet jack someone left in the aisle. The sensors saw it fine. The system just... didn't know what to do with the information.

That was fifteen years ago. You'd think we'd have figured this out by now.

Two papers dropped on arXiv this week that suggest we haven't, not really. But they're taking interesting swings at the problem, and I think they deserve more attention than they'll probably get, buried as they are in the usual flood of academic work.

The core issue: seeing versus understanding

Look, here's the thing. Modern autonomous driving systems using vision-language models have a fundamental problem. They're either good at understanding what they're looking at ("that's a pedestrian crossing the street") or good at knowing exactly where things are in 3D space. Doing both at once? That's where it falls apart.

The TPS-Drive paper on arXiv puts it bluntly. Text-aligned methods, which convert visual information into language tokens, suffer from what the authors call "spatial hallucinations." The system knows there's a car, but it's basically guessing where that car actually is in three-dimensional space. Meanwhile, dense visual methods, which preserve all the spatial information, create "representation interference," which is a fancy way of saying the system gets overwhelmed by irrelevant background noise.
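
To make "representation interference" a bit more concrete, here's a toy sketch of my own. It is not the TPS-Drive method, and the function name and the scores are invented for illustration. It assumes a single softmax attention step over one relevant token (say, the parked car) plus a pile of background tokens, and shows how the weight on the relevant token shrinks as the background grows.

```python
import math


def attention_weight_on_target(target_logit, background_logit, num_background):
    """Softmax weight on one relevant token among identical background tokens.

    Hypothetical helper for illustration only; real models have learned,
    non-uniform scores.
    """
    target = math.exp(target_logit)
    background = num_background * math.exp(background_logit)
    return target / (target + background)


# Assumed scores: the relevant token gets 2.0, each background token gets 0.0.
for n in (10, 100, 1000):
    weight = attention_weight_on_target(2.0, 0.0, n)
    print(f"{n:>5} background tokens -> weight on target: {weight:.4f}")
```

The relevant token's score never changes here; the weight drops only because there's more background competing for it. That's roughly the intuition behind wanting to filter or "purify" dense visual features before they reach the language model, though how any given paper actually does that is a separate question.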

I called my old colleague at Siemens last week about something unrelated, and we ended up talking about this exact problem. He's been consulting for an AV startup (wouldn't say which one), and his take was that most production systems are basically papering over these issues with redundant sensor fusion and very conservative behaviour. Works fine until it doesn't.
