I've been asking this question for years. When I was at Kuka, we had a running joke about the warehouse AGVs that would occasionally try to drive through a pallet jack someone left in the aisle. The sensors saw it fine. The system just... didn't know what to do with the information.
That was fifteen years ago. You'd think we'd have figured this out by now.
Two papers dropped on arXiv this week that suggest we haven't, not really. But they're taking interesting swings at the problem, and I think they deserve more attention than they'll probably get, buried as they are in the usual flood of academic work.
Look, here's the thing. Modern autonomous driving systems using vision-language models have a fundamental problem. They're either good at understanding what they're looking at ("that's a pedestrian crossing the street") or good at knowing exactly where things are in 3D space. Doing both at once? That's where it falls apart.
The TPS-Drive paper from arXiv puts it bluntly. Text-aligned methods that convert visual information into language tokens suffer from what they call "spatial hallucinations." The system knows there's a car, but it's basically guessing where that car actually is in three-dimensional space. Meanwhile, dense visual methods that preserve all the spatial information create "representation interference," which is a fancy way of saying the system gets overwhelmed by irrelevant background noise.
I called my old colleague at Siemens last week about something unrelated, and we ended up talking about this exact problem. He's been consulting for an AV startup (wouldn't say which one), and his take was that most production systems are basically papering over these issues with redundant sensor fusion and very conservative behaviour. Works fine until it doesn't.
The first paper, AnyScene, takes a different approach to the problem. Instead of trying to fix how vehicles perceive real-world data, it focuses on generating synthetic training scenarios.
The pitch is straightforward. You can't train a self-driving system on rare edge cases if you don't have footage of rare edge cases. A kid chasing a ball into traffic. A ladder falling off a truck. A deer. These things happen, but not often enough to build good datasets.
AnyScene uses what they call a Spatial-Temporal Occupancy Diffusion Transformer (and yes, that's a mouthful) to generate driving scenes from bird's-eye-view layouts. You sketch out where you want objects to be, and it generates realistic multi-view video of that scenario.
What caught my attention is the "reference-free" generation. Most existing systems need a real video clip to work from, which limits what you can create. This one, at least in theory, can generate scenarios from arbitrary layouts. Want to test how your system handles three cyclists, a double-parked delivery truck, and a jaywalking pedestrian all at once? Draw it up and generate it.
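To make "draw it up" a bit more concrete, here's a rough Python sketch of what a layout input could look like before any generation happens. To be clear, this is not AnyScene's interface. The `Agent` class, the class ids, the grid size, and the `rasterize` helper are all made up for illustration, and the boxes are axis-aligned to keep it short. It only shows the idea of describing a scene as boxes on a bird's-eye-view grid.

```python
import math
from dataclasses import dataclass


@dataclass
class Agent:
    """One object in a hypothetical BEV layout (metres, ego vehicle at origin)."""
    kind: str
    x: float       # forward distance of the box centre
    y: float       # lateral offset of the box centre
    length: float  # extent along x
    width: float   # extent along y


# Hypothetical class ids; 0 means free space.
CLASS_IDS = {"cyclist": 1, "truck": 2, "pedestrian": 3}


def to_cells(centre, size_m, extent, resolution, n_cells):
    """Convert a 1D box span in metres to a clipped [lo, hi) range of cell indices."""
    lo = math.floor((centre - size_m / 2 + extent) / resolution)
    hi = math.ceil((centre + size_m / 2 + extent) / resolution)
    return max(lo, 0), min(hi, n_cells)


def rasterize(agents, extent=40.0, resolution=0.5):
    """Paint axis-aligned agent boxes into a square grid of class ids.

    The grid covers [-extent, extent) metres on both axes.
    Later agents overwrite earlier ones where boxes overlap.
    """
    n_cells = int(round(2 * extent / resolution))
    grid = [[0] * n_cells for _ in range(n_cells)]
    for a in agents:
        x0, x1 = to_cells(a.x, a.length, extent, resolution, n_cells)
        y0, y1 = to_cells(a.y, a.width, extent, resolution, n_cells)
        for i in range(x0, x1):
            for j in range(y0, y1):
                grid[i][j] = CLASS_IDS[a.kind]
    return grid


# Three cyclists, a double-parked truck, and a jaywalking pedestrian.
layout = [
    Agent("cyclist", 12.0, -3.0, 1.8, 0.6),
    Agent("cyclist", 15.0, -3.0, 1.8, 0.6),
    Agent("cyclist", 18.0, -3.0, 1.8, 0.6),
    Agent("truck", 25.0, 3.5, 8.0, 2.5),
    Agent("pedestrian", 20.0, 1.0, 0.5, 0.5),
]
bev = rasterize(layout)
```

A grid like `bev` is the kind of thing a layout-conditioned generator would take as its starting point. The hard part, turning it into believable multi-view video, is exactly what the paper is about and is not attempted here.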
I'll be honest, I'm skeptical about how well this translates to actual training improvements. Synthetic data has always had a gap with real-world performance. The paper claims "measurable benefits for downstream tasks," but the benchmarks they use aren't exactly the wild streets of Boston in February.
The second paper tackles the perception problem more directly. TPS-Drive introduces what they call "Task-Guided Representation Purification," which is essentially a way to filter out the noise before the vision-language model has to deal with it.
Their Agent-Centric Tokenizer (another mouthful, these academics) takes the limited capacity of the system's codebook (think of it as a vocabulary for visual concepts) and deliberately allocates more of it to dynamic objects like other vehicles and pedestrians. Static backgrounds get compressed. The system pays attention to what matters for driving.
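The budgeting idea is easy to show with a toy example. The Python below is not the paper's tokenizer, which learns its codebook. It's a sketch with assumptions throughout: the `allocate_codebook` function, the category names, the weights, and the 1024-entry budget are all invented here. All it does is split a fixed number of codebook entries across categories in proportion to how much you care about each one.

```python
def allocate_codebook(total_entries, weights):
    """Split a fixed codebook budget across categories in proportion to weights.

    Uses largest-remainder rounding so the shares sum exactly to total_entries.
    """
    total_weight = sum(weights.values())
    exact = {k: total_entries * w / total_weight for k, w in weights.items()}
    shares = {k: int(v) for k, v in exact.items()}
    leftover = total_entries - sum(shares.values())
    # Hand any remaining entries to the categories with the largest fractions.
    by_fraction = sorted(exact, key=lambda k: exact[k] - shares[k], reverse=True)
    for k in by_fraction[:leftover]:
        shares[k] += 1
    return shares


# Hypothetical weights: dynamic agents get far more capacity than background.
weights = {"vehicle": 4.0, "pedestrian": 4.0, "cyclist": 3.0, "background": 1.0}
shares = allocate_codebook(1024, weights)
```

With these made-up weights, background ends up with roughly a twelfth of the vocabulary and the moving things get the rest. A real system would presumably learn this split rather than have someone type it in, but the trade-off is the same: every entry spent on tree bark is one not spent on a cyclist.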
This reminds me of something we struggled with on industrial pick-and-place systems back in the day. The vision system would occasionally get confused by reflections on the conveyor belt, seeing phantom parts that weren't there. The solution was basically the same: teach the system what to ignore.
The results look promising. They're claiming reduced collision rates on nuScenes benchmarks and "new safety records" on the NAVSIM closed-loop tests. Though I'd note that benchmark performance and real-world safety aren't the same thing, and anyone who tells you otherwise is selling something.
Neither of these papers is going to revolutionise autonomous driving tomorrow. That's not how this works. But they're addressing a real problem that the industry has been sort of dancing around.
The spatial hallucination issue is, I think, more fundamental than most people realise. When your autonomous vehicle "sees" a pedestrian but places them three metres to the left of where they actually are, all the semantic understanding in the world doesn't help. And when your system is so busy processing tree textures and building facades that it misses the motorcycle lane-splitting behind you, that's not a software bug. That's an architecture problem.
I don't know if occupancy-centric approaches like AnyScene or purification methods like TPS-Drive are the answer. We've seen a lot of promising approaches in this space come and go. But at least they're asking the right questions.
The real test will be whether any of this makes it into production systems. Academic benchmarks are one thing. Keeping a two-ton vehicle from hitting people on public roads is another. I've seen too many "state-of-the-art" results that fell apart the moment they left the lab.
Still, progress is progress. And after fifteen years of watching this field, I'll take incremental improvements over hype any day.