65% of Trajectories Had Safety Threats. The Planner Still Scored Well.
Two new papers expose a quiet problem in autonomous driving AI: the metrics we use to judge these systems may not actually tell us if they're safe.
65%. That's the share of trajectories from a state-of-the-art autonomous driving planner that a new evaluation pipeline flagged as introducing additional safety threats, even while the planner was posting respectable scores on standard benchmarks.
I had to read that twice.
The paper, from researchers behind a new framework called FluidTest, isn't saying these planners are obviously broken. It's saying something more unsettling: our standard ways of measuring whether an autonomous driving AI is safe might not actually be measuring that.
The metric problem is real, and it's been hiding in plain sight.
Here's the basic issue. The field has largely converged on two headline metrics for evaluating driving planners: Average Displacement Error (ADE), which measures how far a predicted trajectory strays from an expert reference, and collision rate. Both are easy to compute. Both are intuitive. And according to a growing body of research, both can be gamed: not deliberately, but structurally, because a trajectory can stay close to the expert's path and avoid contact while still behaving unsafely.
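For readers who want to see how little these two metrics actually look at, here is a minimal sketch in Python. It assumes ADE is the mean Euclidean distance between matched waypoints and that collision rate is simply the fraction of runs flagged as collisions. The function names and toy trajectories are illustrative, not taken from any benchmark's code.

```python
import math


def average_displacement_error(predicted, expert):
    """Mean Euclidean distance between matched waypoints of two trajectories.

    Both arguments are equal-length sequences of (x, y) points.
    """
    if len(predicted) != len(expert) or not expert:
        raise ValueError("trajectories must be non-empty and the same length")
    total = sum(math.dist(p, e) for p, e in zip(predicted, expert))
    return total / len(expert)


def collision_rate(collided_flags):
    """Fraction of evaluated runs in which a collision was recorded."""
    flags = list(collided_flags)
    if not flags:
        raise ValueError("need at least one run")
    return sum(1 for flag in flags if flag) / len(flags)


# Toy data (hypothetical): the planner drives parallel to the expert,
# offset sideways by 0.3 units at every waypoint.
expert = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
predicted = [(0.0, 0.3), (1.0, 0.3), (2.0, 0.3), (3.0, 0.3)]

ade = average_displacement_error(predicted, expert)
rate = collision_rate([False, False, True, False])
print(f"ADE: {ade:.2f}, collision rate: {rate:.2f}")
```

Neither function knows anything about lanes, gaps, or right of way. A trajectory is judged only by its distance from the reference and by whether contact occurred.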
The FluidTest paper on arXiv makes this concrete. The researchers tested planners from the Waymo Open Dataset E2E benchmark and found that even models with high Rater Feedback Scores and low ADE exhibited what they call "additional threats," meaning the planner's trajectory introduced unsafe behaviors that weren't present in the expert reference. 51% of RAP planner trajectories had them. So did 65% of Poutine trajectories. These aren't cherry-picked edge cases. They're aggregate numbers across the test set.
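The "additional threats" idea can be pictured as a set difference: whatever threats the planner's trajectory shows, minus whatever the expert's trajectory already showed. The sketch below is not FluidTest's pipeline, which relies on human annotation and AI verification. It assumes each trajectory has already been labeled with a set of threat names, and the labels and sample data are hypothetical.

```python
def additional_threats(planner_threats, expert_threats):
    """Threats present in the planner's trajectory but absent from the expert's."""
    return set(planner_threats) - set(expert_threats)


def share_with_additional_threats(labeled_pairs):
    """Fraction of (planner, expert) pairs where the planner adds a threat."""
    pairs = list(labeled_pairs)
    if not pairs:
        raise ValueError("need at least one labeled pair")
    flagged = sum(
        1 for planner, expert in pairs if additional_threats(planner, expert)
    )
    return flagged / len(pairs)


# Hypothetical labels for four scenarios: (planner threats, expert threats).
pairs = [
    ({"unsafe_lane_change"}, set()),                 # planner adds a threat
    ({"failure_to_yield"}, {"failure_to_yield"}),    # same threat as expert
    (set(), set()),                                  # clean on both sides
    ({"inadequate_gap_acceptance", "failure_to_yield"}, {"failure_to_yield"}),
]

share = share_with_additional_threats(pairs)
print(f"{share:.0%} of trajectories introduce additional threats")
```

Note that the second scenario does not count. The planner is only charged for unsafe behavior the expert did not also exhibit, which is what makes the 51% and 65% figures a comparison against the reference rather than a raw tally.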
FluidTest's approach is more structured than what most pipelines do. The researchers built a taxonomy of 32 semantic threat types, such as unsafe lane changes, inadequate gap acceptance, and failure to yield. They pair that with a human annotation protocol and a three-agent AI verification system that cross-checks decisions for consistency. The goal is evaluations that are, in their framing, human-aligned, safety-aware, verifiable, and explainable all at once. Most current pipelines, they argue, hit maybe one or two of those.