Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Forty-two point five percent. That's the number that's been stuck in my head since I read a new study on autonomous driving AI this week. It's the "reasoning fidelity" rate for a vision-language-action model called Alpamayo-R1-10B, and what it means is this: when the AI explains why it's making a driving decision, its explanation matches reality less than half the time.
Let me say that differently, because I think it's important. These systems can now generate detailed, confident-sounding explanations for their behavior. They'll tell you they stopped because they saw a pedestrian, or they changed lanes because of an obstacle. And according to researchers who tested 300 inferences across 100 driving scenarios, those explanations hold up slightly less often than a coin flip.
I initially thought this was maybe a niche concern, something for AI safety researchers to worry about in academic papers. But then I started thinking about where we're headed. Autonomous vehicles are increasingly being pitched as trustworthy because they can explain themselves. Regulators are asking for interpretability. Insurance companies want to know why a car did what it did. And now we have evidence that the explanations might be, well, fabricated.
The study, published on arXiv, introduces what the authors call "Chain-of-Causation" analysis. They check whether the AI's stated reasoning actually corresponds to what's happening in the scene. The results are honestly kind of alarming. In one-third of the scenarios involving pedestrians, the model's reasoning left pedestrians out, 94 of them in total. It just didn't register them at all, even when they were clearly present.
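To make the idea of a missed pedestrian concrete, here is a minimal sketch of an entity check. This is my own simplification, not the paper's verification criteria: the function names, the entity labels, and the set-based comparison are all assumptions for illustration.

```python
def missed_entities(ground_truth: set[str], mentioned: set[str]) -> set[str]:
    """Entities present in the scene that the explanation never refers to."""
    return ground_truth - mentioned


def entity_recall(ground_truth: set[str], mentioned: set[str]) -> float:
    """Fraction of scene entities that show up in the explanation."""
    if not ground_truth:
        return 1.0  # nothing to miss in an empty scene
    return len(ground_truth & mentioned) / len(ground_truth)


# Hypothetical scene annotation and the entities named in an explanation.
scene = {"pedestrian_1", "pedestrian_2", "car_1"}
explanation = {"car_1", "pedestrian_1"}

missed = missed_entities(scene, explanation)
recall = entity_recall(scene, explanation)
```

In this toy case the explanation covers two of the three annotated entities and silently drops one pedestrian, which is the kind of omission the study counts.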
You might be wondering: okay, but does it still drive safely even if its explanations are wrong? That's where it gets worse. The researchers found only 48.3% mean reasoning-action consistency. More than half of inferences showed low consistency between what the model said it would do and what it actually did. Here's the stat that really got me: in 37.9% of cases where the model claimed it was stopping, it actually continued driving.
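Here is a rough sketch of how a "said stop, kept driving" rate could be tallied. It assumes each inference has already been reduced to one stated action and one executed action. That is a simplification of mine: the paper's consistency score may well be graded rather than binary, and the `Inference` type and action labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Inference:
    stated_action: str    # action named in the model's explanation
    executed_action: str  # action implied by the planned trajectory


def consistency_rate(inferences: list[Inference]) -> float:
    """Share of inferences where stated and executed actions agree."""
    if not inferences:
        return 0.0
    agree = sum(i.stated_action == i.executed_action for i in inferences)
    return agree / len(inferences)


def stop_contradiction_rate(inferences: list[Inference]) -> float:
    """Among inferences that claim a stop, the share that did not stop."""
    claimed = [i for i in inferences if i.stated_action == "stop"]
    if not claimed:
        return 0.0
    contradicted = sum(i.executed_action != "stop" for i in claimed)
    return contradicted / len(claimed)


# Made-up examples, not data from the study.
sample = [
    Inference("stop", "stop"),
    Inference("stop", "continue"),
    Inference("lane_change", "lane_change"),
    Inference("stop", "stop"),
]
```

The two numbers answer different questions: the first is overall agreement, the second is conditioned on the model claiming it will stop, which is the framing behind the 37.9% figure.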
I should be clear about what this is and isn't. This is one model, tested in simulation (PhysicalAI-AV scenarios), not on real roads. The researchers are upfront that this is a systematic study of faithfulness, not a comprehensive safety audit. But it's the first study of its kind, and the methodology seems solid. They formalize faithfulness information-theoretically, which, tbh, is beyond my technical depth, but the verification criteria for entity and action fidelity seem reasonable.
What makes this particularly interesting is that it's landing at the same moment as another paper proposing a solution to exactly this problem. Researchers from PKU Smart City published a framework called Reason-Imagine-Act, or RIA, that tries to close the gap between what an AI says it's doing and what's physically safe to do.
The RIA paper acknowledges the core problem directly: large language models are promising for autonomous driving, but "semantics-only decision policies can yield physically unsafe behavior." In other words, an AI can have a perfectly reasonable-sounding plan that would get you killed in reality.
Their solution is to add a verification layer. The LLM proposes an action, a world model simulates what would happen, and a safety scorer picks the option least likely to result in catastrophe. It's basically a reality check for AI reasoning. In their testing (1000 episodes in CARLA simulation), they achieved an 80.05% route completion rate, 51.10% arrival rate, and, crucially, only 0.20% collision rate.
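The propose, simulate, score loop can be sketched in a few lines. To be clear, this is not the RIA code: `select_action`, the toy world model, and the toy risk function are stand-ins I made up to show the shape of the idea.

```python
from typing import Callable, Sequence, TypeVar

Outcome = TypeVar("Outcome")


def select_action(
    candidates: Sequence[str],
    simulate: Callable[[str], Outcome],
    risk_score: Callable[[Outcome], float],
) -> str:
    """Return the candidate whose simulated outcome has the lowest risk."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    return min(candidates, key=lambda action: risk_score(simulate(action)))


def toy_simulate(action: str) -> float:
    """Hypothetical world model: predicted minimum gap (m) to an obstacle."""
    return {"continue": 0.5, "brake": 6.0, "swerve_left": 2.0}[action]


def toy_risk(min_gap_m: float) -> float:
    """Hypothetical scorer: smaller gaps mean higher risk."""
    return 1.0 / (min_gap_m + 1e-6)


# Candidates would come from the LLM; here they are hard-coded.
chosen = select_action(["continue", "brake", "swerve_left"], toy_simulate, toy_risk)
```

The point of the structure is that the language model's preference never decides the outcome on its own. Whatever it proposes, the option with the safest simulated result wins, which here is braking.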
Those numbers sound good, but I want to be careful here. Simulation results don't always translate to real-world performance. The 51.10% arrival rate means the car failed to reach its destination about half the time. And we don't have faithfulness metrics for RIA comparable to what the first study measured. Does the car's explanation match its behavior? Does the world model actually catch the problems? The paper doesn't say, and I couldn't find follow-up work addressing this.
What I find genuinely interesting is the philosophical tension between these two papers. The faithfulness study suggests that current AI driving systems are fundamentally unreliable narrators. They confabulate. They make up reasons. They're like a friend who always has an explanation for why they're late, and the explanation is never "I just didn't feel like leaving on time."
The RIA paper tries to solve this by adding external verification, but it's still relying on the LLM to propose actions in the first place. If the base model is prone to missing pedestrians or generating inconsistent reasoning, does adding a world model fix that, or just catch some of the errors?
I don't have a good answer. Honestly, I'm not sure anyone does yet.
Here's what I keep coming back to. The autonomous vehicle industry has spent years telling us that AI drivers will be safer than humans because they don't get tired, don't get distracted, don't drive drunk. All true. But humans have something these systems apparently lack: our explanations for our behavior, while imperfect, are generally connected to what we actually perceived and intended. When I say I stopped for a pedestrian, I probably saw a pedestrian.
The 97.7% trajectory fragility stat from the faithfulness study deserves mention here too. Under "mild visual perturbations," the model's planned trajectory changed almost every time. That's not necessarily bad (maybe it's being appropriately cautious), but combined with the low reasoning fidelity, it suggests these systems are operating on shaky foundations.
So where does this leave us? I think we're at an awkward moment in autonomous driving development. The technology is advanced enough to generate sophisticated explanations that sound trustworthy. It's not advanced enough for those explanations to actually be trustworthy. And the gap between appearance and reality is exactly the kind of thing that could erode public trust if it's not addressed.
The RIA approach of adding verification layers seems promising, but it's also an admission that we can't just trust the AI's reasoning. We need to check its work. That's probably fine for autonomous vehicles, where we can add redundant safety systems. It's more concerning for other domains where verification is harder.
I should note that both of these papers are recent (the faithfulness study was just updated, and the RIA paper is from this month), and neither has been through peer review in a traditional journal. The code for RIA is available on GitHub, which is good for reproducibility, but I haven't seen independent replication yet.
What I'd want to see next: someone running the faithfulness analysis on a system using the RIA architecture. Does the verification layer actually improve reasoning fidelity, or does it just make the outputs safer without making them more honest? Those are different things, and I think the distinction matters.
For now, I'm left with that 42.5% number. Less than half. When an autonomous driving AI tells you why it's doing something, it's wrong more often than it's right. That's the state of the art in 2025, and I think we should be honest about it.