65% of Trajectories Had Safety Threats. The Planner Still Scored Well.

Two new papers expose a quiet problem in autonomous driving AI: the metrics we use to judge these systems may not actually tell us if they're safe.

16 June 2026 · 4 min read

65%. That's the share of trajectories from a state-of-the-art autonomous driving planner that a new evaluation pipeline flagged as introducing additional safety threats, even while the planner was posting respectable scores on standard benchmarks.

I had to read that twice.

The paper, from researchers behind a new framework called FluidTest, isn't saying these planners are obviously broken. It's saying something more unsettling: our standard ways of measuring whether an autonomous driving AI is safe might not actually be measuring that.

The metric problem is real, and it's been hiding in plain sight.

Here's the basic issue. The field has largely converged on two headline metrics for evaluating driving planners: Average Displacement Error (ADE), which measures how far a predicted trajectory strays from an expert reference, and collision rate. Both are easy to compute. Both are intuitive. And according to a growing body of research, both can be gamed: not intentionally, but structurally, because a trajectory can score well on them while still behaving unsafely.
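To make the structural point concrete, here is a minimal Python sketch of ADE as it is commonly defined: the mean Euclidean distance between matching waypoints. Everything in it is a toy assumption for illustration (the waypoints, the obstacle position, and the helper `min_clearance` do not come from the FluidTest paper or any benchmark). It shows two trajectories with identical ADE where one passes much closer to an obstacle.

```python
import math

def average_displacement_error(predicted, reference):
    """Mean Euclidean distance between matching (x, y) waypoints."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("trajectories must be non-empty and the same length")
    total = sum(math.dist(p, r) for p, r in zip(predicted, reference))
    return total / len(predicted)

def min_clearance(trajectory, obstacle):
    """Hypothetical helper: closest approach of a trajectory to a point obstacle."""
    return min(math.dist(p, obstacle) for p in trajectory)

# Toy data (assumed, not from any dataset): expert drives straight along y = 0.
expert = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
drift_left = [(x, 0.5) for x, _ in expert]    # offset toward the obstacle
drift_right = [(x, -0.5) for x, _ in expert]  # offset away from the obstacle
obstacle = (2.0, 1.0)                         # hypothetical parked object

ade_left = average_displacement_error(drift_left, expert)
ade_right = average_displacement_error(drift_right, expert)
clear_left = min_clearance(drift_left, obstacle)
clear_right = min_clearance(drift_right, obstacle)

print(f"ADE left={ade_left:.2f}, right={ade_right:.2f}")
print(f"clearance left={clear_left:.2f}, right={clear_right:.2f}")
```

Both drifts score the same ADE, yet one leaves a third of the clearance the other does. ADE alone cannot tell them apart, which is the kind of blind spot the article is describing.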

The FluidTest paper on arXiv makes this concrete. The researchers tested planners from the Waymo Open Dataset E2E benchmark and found that even models with high Rater Feedback Scores and low ADE were exhibiting what they call "additional threats," meaning the planner's trajectory introduced unsafe behaviors that weren't present in the expert reference. 51% of RAP planner trajectories had them. 65% of Poutine trajectories. These aren't obscure edge cases being cherry-picked. These are aggregate numbers across their test set.
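One way to read "additional threats" is as a set difference: the threats tagged on the planner's trajectory minus those tagged on the expert reference. The Python sketch below works under that assumption. The function names, the label strings, and the four toy cases are hypothetical, and the paper's actual pipeline relies on human annotation and AI verification rather than a simple lookup like this.

```python
def additional_threats(planner_threats, expert_threats):
    """Threat labels present on the planner trajectory but not on the expert one."""
    return set(planner_threats) - set(expert_threats)

def share_with_additional_threats(cases):
    """Fraction of (planner, expert) label pairs with at least one additional threat."""
    if not cases:
        raise ValueError("need at least one case")
    flagged = sum(1 for planner, expert in cases if additional_threats(planner, expert))
    return flagged / len(cases)

# Toy cases (assumed labels, loosely named after the examples in this article).
cases = [
    ({"unsafe_lane_change"}, set()),                        # new threat: flagged
    ({"failure_to_yield"}, {"failure_to_yield"}),           # same as expert: not flagged
    (set(), set()),                                         # clean: not flagged
    ({"inadequate_gap", "failure_to_yield"}, {"failure_to_yield"}),  # one new: flagged
]

share = share_with_additional_threats(cases)
print(f"{share:.0%} of toy trajectories introduce additional threats")
```

Note that a threat shared with the expert does not count. Under this reading, the headline percentages describe how often the planner makes the scene more dangerous than the human reference did, not how often the scene was dangerous at all.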

FluidTest's approach is more structured than what most pipelines do. They built a taxonomy of 32 semantic threat types, things like unsafe lane changes, inadequate gap acceptance, failure to yield. They pair that with a human annotation protocol and a three-agent AI verification system that cross-checks decisions for consistency. The goal is evaluations that are, in their framing, human-aligned, safety-aware, verifiable, and explainable all at once. Most current pipelines, they argue, hit maybe one or two of those.

Sources