Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Most coverage of vision-language models in autonomous driving focuses on the impressive demos. A car that can respond to natural language commands! A system that "understands" traffic scenes! What tends to get buried, or ignored entirely, is the persistent failure mode that makes these systems dangerous: they cannot reliably reason about three-dimensional space.
Two papers released this week, from independent research teams, both zero in on this problem. And while their approaches differ substantially, reading them together reveals something important about the current state of VLM-based driving systems. We are not close to solving spatial reasoning. We are, to be precise, still diagnosing what is actually going wrong.
The core issue is what one of the papers calls "spatial hallucinations." When you flatten continuous 3D spatial information into discrete tokens (the standard approach for feeding visual data into language models), you lose geometric structure. The model can describe a scene in words, but it cannot accurately predict where objects will be in three seconds. For a system meant to control a vehicle, that is not a minor limitation.
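To make the loss of geometric structure concrete, here is a toy sketch of my own, not taken from either paper. It snaps a continuous distance onto a coarse grid index, which is a crude stand-in for tokenization. The names `to_token`, `from_token`, and the 2-meter cell size are assumptions chosen for illustration only.

```python
# Toy illustration (not from either paper): snapping continuous positions
# onto a coarse token grid discards sub-cell geometry.

def to_token(x_m: float, cell_m: float) -> int:
    """Map a continuous coordinate in meters to a discrete grid index."""
    return int(x_m // cell_m)

def from_token(token: int, cell_m: float) -> float:
    """Return the cell center, the best guess available after tokenization."""
    return (token + 0.5) * cell_m

CELL_M = 2.0  # hypothetical grid resolution in meters

# Two pedestrians almost two meters apart end up with the same token.
near_m, far_m = 4.1, 5.9
assert to_token(near_m, CELL_M) == to_token(far_m, CELL_M)

for x_m in (near_m, far_m):
    token = to_token(x_m, CELL_M)
    recovered_m = from_token(token, CELL_M)
    print(f"true={x_m:.1f} m  token={token}  "
          f"recovered={recovered_m:.1f} m  error={abs(x_m - recovered_m):.1f} m")
```

Real tokenizers are learned and far richer than a fixed grid, but the basic trade is the same: once two distinct positions share a token, nothing downstream can tell them apart.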
The TPS-Drive paper, titled "Task-Guided Representation Purification for VLM-based Autonomous Driving," identifies a second failure mode they term "representation interference." Dense visual methods that preserve spatial topology (the alternative to text-flattening) overwhelm tokenizers with irrelevant background information. The model gets distracted by textures, shadows, road markings that don't matter, while missing the pedestrian stepping off the curb.
This may sound picky, but spatial hallucinations and representation interference are genuinely distinct failure modes that require different solutions. Most industry discussion conflates them. The research community, at least, is starting to be more precise.
The TPS-Drive team proposes what they call an "Agent-Centric Tokenizer." The idea is to use a frozen 3D detection head to supervise vector quantization, essentially forcing the codebook to allocate capacity to dynamic agents (cars, pedestrians, cyclists) rather than static backgrounds. They then run a decoupled reasoning pipeline: scene understanding, future forecasting, action generation, in sequence.
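A rough sketch of the underlying idea, under heavy assumptions: plain nearest-neighbor vector quantization, plus an error measure that counts agent features more heavily. The weighting scheme, the `agent_weight` value, and the toy features are hypothetical stand-ins of mine. The paper's actual supervision signal and loss may look quite different.

```python
# Minimal sketch of vector quantization with an agent-weighted error.
# The weighting is a hypothetical stand-in, not TPS-Drive's actual loss.

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(feature, codebook):
    """Return (index, squared error) of the nearest codebook entry."""
    errors = [sq_dist(feature, code) for code in codebook]
    idx = min(range(len(codebook)), key=errors.__getitem__)
    return idx, errors[idx]

def weighted_vq_error(features, agent_mask, codebook, agent_weight=5.0):
    """Weighted mean quantization error; agent features count more."""
    total, norm = 0.0, 0.0
    for feature, is_agent in zip(features, agent_mask):
        weight = agent_weight if is_agent else 1.0
        _, err = quantize(feature, codebook)
        total += weight * err
        norm += weight
    return total / norm

# Toy 2-D "patch features": two background patches and one agent patch.
features = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0)]
agent_mask = [False, False, True]  # pretend output of a frozen detector

background_codebook = [(0.0, 0.0), (0.1, 0.0)]  # all capacity on background
balanced_codebook = [(0.05, 0.0), (3.0, 3.0)]   # one code kept for the agent

print("background-heavy:", weighted_vq_error(features, agent_mask, background_codebook))
print("agent-aware:     ", weighted_vq_error(features, agent_mask, balanced_codebook))
```

With the same codebook size, the agent-aware codebook gives up a little precision on the background and represents the agent exactly. A weighted objective like this one pushes a learned codebook in that direction.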
The results look promising. They report reduced collision rates in open-loop nuScenes evaluations and claim "new safety records" on the NAVSIMv1 and NAVSIMv2 benchmarks. But, and this matters, the paper relies heavily on reward-driven refinement in its final training stage. This is where things get murky. Reward shaping in driving is notoriously tricky. What exactly are they rewarding? The paper describes it as surpassing "pure imitation learning," but the specifics of the reward function aren't fully detailed in the abstract.
The AnyScene paper takes a different angle entirely. Rather than fixing how VLMs process real sensor data, they focus on generating synthetic training data. Their Spatial-Temporal Occupancy Diffusion Transformer creates semantic occupancy sequences from BEV (bird's eye view) layouts, which can then be rendered into multi-view driving videos.
This addresses a related but distinct problem: the long tail of rare scenarios. You cannot collect enough real-world data of near-crashes, unusual road configurations, or edge-case weather conditions. So you generate them. The AnyScene framework claims state-of-the-art performance in both occupancy and video generation, with strong generalization to "unseen and customized layouts."
The reason I find these papers worth discussing together is that they represent two halves of a research agenda that hasn't been unified yet. TPS-Drive tries to fix how models reason about space. AnyScene tries to give them more diverse spatial scenarios to learn from. Both are necessary. Neither is sufficient.
Consider the TPS-Drive claim about reducing collision rates. In open-loop evaluation, the model sees sensor data and predicts what action it would take, but it doesn't actually execute that action and see the consequences. Closed-loop evaluation (which they also report on) is more rigorous, but NAVSIM benchmarks, while useful, are still simulated environments with known physics and predictable agent behaviors. How much of that benchmark performance survives real-world deployment remains unclear.
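The difference between the two evaluation styles is easy to see in a toy setting. The sketch below is my own illustration with made-up numbers, not a model of nuScenes or NAVSIM. It has a one-dimensional lateral offset, an expert that always recenters, and a hypothetical learned policy that only ever saw centered states and carries a small constant bias (`BIAS_M`).

```python
# Toy contrast between open-loop and closed-loop evaluation.
# All quantities are hypothetical; this does not model any real benchmark.

BIAS_M = 0.02  # assumed per-step lateral bias of the learned policy, in meters

def expert_action(offset_m: float) -> float:
    """Expert steers straight back to lane center (offset 0)."""
    return -offset_m

def learned_action(offset_m: float) -> float:
    """Toy policy trained only on centered states: ignores offset, adds bias."""
    return BIAS_M

def open_loop_error(logged_offsets):
    """Mean action error on logged states; actions are never executed."""
    errors = [abs(learned_action(o) - expert_action(o)) for o in logged_offsets]
    return sum(errors) / len(errors)

def closed_loop_offset(policy, steps, start_m=0.0):
    """Execute the policy's own actions and return the final lateral offset."""
    offset_m = start_m
    for _ in range(steps):
        offset_m += policy(offset_m)
    return offset_m

logged = [0.0] * 50  # the expert log never leaves lane center
print(f"open-loop mean action error: {open_loop_error(logged):.2f} m")
print(f"closed-loop offset after 50 steps: {closed_loop_offset(learned_action, 50):.2f} m")
```

The per-step error looks negligible when scored against the log, yet the same policy ends up about a meter off center once its own actions feed back into the state. That is the effect open-loop numbers cannot show.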
AnyScene's contribution is valuable for a different reason. If you can generate controllable synthetic data with accurate 3D geometry, you can potentially train models to handle scenarios they would never encounter in logged driving data. But the paper's claim that this "provides measurable benefits for downstream tasks such as sparse-view 3D reconstruction" is not the same as demonstrating improved driving performance. It's one step removed.
There is a subtler point here, though. AnyScene's Geometry-Grounded View Expansion module synthesizes videos in a "reference-free and autoregressive fashion," which means it doesn't need a specific camera rig configuration at inference time. This flexibility is genuinely new rather than an incremental improvement over prior occupancy-to-video methods that required fixed camera setups. But whether the generated data is realistic enough to transfer to real-world driving remains unclear.
Several things I'd want to see before getting excited about either approach:
First, on TPS-Drive: what happens when the 3D detection head (which is frozen during training) makes errors? The entire tokenization scheme depends on accurate agent detection. If the detector fails in an unusual scenario, does the whole pipeline degrade gracefully or catastrophically? The paper doesn't address this, at least not in the abstract.
Second, on AnyScene: the claim of "cross-dataset" generalization is interesting but needs scrutiny. They trained on one or more driving datasets and tested on layouts from others. But driving datasets share significant overlap in terms of road types, vehicle classes, and geographic regions. True generalization would mean handling scenarios that are structurally different from anything in the training distribution. This hasn't been demonstrated yet, as far as I can tell.
Third, and this applies to both: neither paper engages seriously with the question of what happens when spatial reasoning fails silently. A VLM that produces confident but wrong predictions about where a vehicle will be in two seconds is arguably more dangerous than one that refuses to predict at all. The failure modes of these systems are not well characterized.
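One way to start characterizing silent failures is to measure them directly: among predictions the model is confident about, how often is the predicted position badly wrong? The sketch below is a hypothetical metric of my own, not one reported in either paper. The thresholds and the sample records are made-up values for illustration.

```python
# Hypothetical "silent failure rate": confident predictions that are badly wrong.
# Thresholds and sample data are invented for illustration only.

def silent_failure_rate(records, conf_threshold=0.9, error_tolerance_m=1.0):
    """Fraction of confident predictions whose position error exceeds tolerance.

    records: iterable of (confidence, position_error_m) pairs.
    Returns 0.0 when no prediction reaches the confidence threshold.
    """
    confident = [(c, e) for c, e in records if c >= conf_threshold]
    if not confident:
        return 0.0
    wrong = sum(1 for _, err_m in confident if err_m > error_tolerance_m)
    return wrong / len(confident)

# (confidence, error in meters) for four made-up two-second forecasts.
records = [(0.95, 0.2), (0.97, 3.5), (0.60, 4.0), (0.92, 0.4)]
print(f"silent failure rate: {silent_failure_rate(records):.2f}")
```

The low-confidence miss (0.60, 4.0) is deliberately excluded. A model that signals its own uncertainty can be handed off to a fallback, while the confident 3.5-meter error cannot.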
The most productive direction, in my view, would be combining these approaches. Use AnyScene-style generation to create diverse training scenarios that specifically target the failure modes TPS-Drive identifies. If "spatial hallucinations" occur when geometric structure is lost during tokenization, generate synthetic data that stress-tests exactly those situations. If "representation interference" happens with cluttered backgrounds, generate scenarios with varying levels of visual complexity and measure where the model breaks.
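As a sketch of what "measure where the model breaks" could look like in practice, the harness below sweeps a clutter level and reports the first level at which a miss rate exceeds a tolerance. Everything here is assumed: `toy_miss_rate` is a stand-in for generating scenes at a given complexity and evaluating a real model on them, and the 10% tolerance is arbitrary.

```python
# Hypothetical stress-test harness: find the clutter level where a model breaks.
from typing import Callable, Iterable, Optional

def first_breaking_level(
    levels: Iterable[int],
    miss_rate: Callable[[int], float],
    tolerance: float,
) -> Optional[int]:
    """Return the first level whose miss rate exceeds tolerance, else None."""
    for level in levels:
        if miss_rate(level) > tolerance:
            return level
    return None

def toy_miss_rate(clutter_level: int) -> float:
    """Stand-in for a real evaluation: miss rate grows with visual clutter."""
    return min(1.0, 0.01 * clutter_level ** 2)

breaking = first_breaking_level(range(0, 11), toy_miss_rate, tolerance=0.10)
print("first clutter level above tolerance:", breaking)
```

In a real study, `miss_rate` would wrap scene generation plus model evaluation, and the breaking level would be the number to track across architectures.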
This kind of targeted, failure-mode-driven data generation is not what either paper does. AnyScene focuses on general controllability and fidelity. TPS-Drive focuses on architectural improvements. The synthesis would require a research program that treats safety-critical failures as the primary optimization target, rather than benchmark metrics.
(I realize this is asking for a lot. Benchmark metrics exist because they're measurable. "Doesn't fail catastrophically in novel situations" is much harder to quantify.)
One more thing worth noting: both papers are from academic teams, not industry labs. This is somewhat surprising given how much money is flowing into autonomous driving. It suggests that the fundamental research on spatial reasoning is still happening in universities, while industry focuses on scaling and deployment. Whether that's a healthy division of labor or a warning sign, I'm not sure.
Vision-language models are being deployed in autonomous driving systems today. They are not ready. The spatial reasoning problem is not solved, and these two papers, while valuable, demonstrate that we are still in the diagnostic phase. We are identifying failure modes, proposing partial fixes, and hoping the benchmarks translate to reality.
They might not. The evaluations in both papers are limited to standard academic datasets. The evaluation metrics, while rigorous by research standards, do not capture the full complexity of real-world driving. And the fundamental question of whether VLMs can ever reliably reason about 3D space, or whether we need entirely different architectures, remains open.
I'm not saying this research isn't important. It is. But the gap between "achieves state-of-the-art performance on NAVSIM" and "safe to deploy on public roads" is larger than most coverage suggests. These papers are honest about their limitations, which is refreshing. The industry discourse around VLM-based driving, less so.