Two New Papers Tackle the Same Self-Driving Problem. One Uses Brute Force, the Other Uses Brains.
Researchers are finally admitting that training autonomous vehicles on human driving data creates mushy, indecisive systems. The fixes are clever, but I've seen this movie before.
Here's the thing about autonomous vehicle research that nobody wants to admit: we've been training these systems wrong for years. Two papers dropped on arXiv this week that make the same point: train an end-to-end driving model on lots of human demonstrations and you get a system that drives like the average of all those humans. Not the best human. Not even a competent human. The average. And the average, it turns out, is pretty mushy.
The researchers call this the "style-averaging" dilemma, which is a polite way of saying the car can't commit to a decision. Should it merge aggressively or wait? The training data contains both behaviors, so the model splits the difference and does something weird that neither an aggressive nor a cautious driver would do. Sometimes that weird thing is kinematically unsafe, which is a very academic way of saying the car plans a maneuver that a real vehicle can't execute safely, or at all.
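If the merge-or-wait example feels abstract, here's a toy version with numbers. This is my own sketch, not anything from either paper: the two "demonstrations", the speeds, and the other car are all made up, and I'm assuming a plain regression setup where the model lands on the per-timestep mean of its targets.

```python
import numpy as np

# Toy merge scenario. Every number here is invented for illustration.
# Each "demonstration" is the ego car's position along the road over 4 seconds.
t = np.linspace(0.0, 4.0, 9)           # timestamps in seconds

aggressive = 10.0 * t + 1.0 * t**2     # speeds up and merges ahead of the other car
cautious = 10.0 * t - 1.0 * t**2       # slows down and merges behind it

# Assumption: a model trained with a mean-squared-error loss on both styles
# drifts toward the per-timestep mean of the two behaviors.
averaged = (aggressive + cautious) / 2.0

# The other car cruises at a constant 10 m/s in the target lane.
other_car = 10.0 * t

# Gap to the other car at the end of the maneuver, in meters.
gap_aggressive = aggressive[-1] - other_car[-1]
gap_cautious = cautious[-1] - other_car[-1]
gap_averaged = averaged[-1] - other_car[-1]

print(gap_aggressive, gap_cautious, gap_averaged)   # 16.0 -16.0 0.0
```

Both human styles end up 16 meters clear of the other car, one in front and one behind. The averaged plan ends up right next to it, which is the one place nobody in the training data chose to be.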
The first paper, D³-MoE from arXiv, takes what I'd call the brute force approach. Instead of generating one trajectory and hoping it's right, the system generates multiple trajectories in parallel, each representing a different driving "style." Then a downstream module picks the best one based on whatever criteria you care about. Want aggressive? Pick that trajectory. Want grandma-safe? Pick that one instead.
The clever bit is how they handle the physics. They've decoupled longitudinal motion (speeding up, slowing down) from lateral motion (steering left, steering right) and trained separate expert networks for each. These experts don't need manual labels because they learn from the ground truth kinematics themselves, which is elegant if you think about it. The whole thing uses Diffusion Transformers, which are the hot architecture of the moment, and achieves what they claim is state-of-the-art performance on the NAVSIM benchmark: 88.2 PDMS (NAVSIM's composite driving score, higher is better) by default, or 91.3 if you let it generate three options and pick the best.
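To make the "no manual labels" part concrete, here's roughly how you could pull separate longitudinal and lateral signals out of a logged trajectory using nothing but the waypoints. This is a sketch under my own assumptions, not the paper's recipe: the function name `split_kinematics`, the finite-difference approach, and the sample trajectories are all hypothetical.

```python
import numpy as np

def split_kinematics(xy, dt):
    """Split an (N, 2) array of logged positions into two signals:
    longitudinal acceleration and yaw rate (how hard the car is turning)."""
    xy = np.asarray(xy, dtype=float)
    vel = np.diff(xy, axis=0) / dt                          # velocity vectors between waypoints
    speed = np.linalg.norm(vel, axis=1)                     # how fast, ignoring direction
    heading = np.unwrap(np.arctan2(vel[:, 1], vel[:, 0]))   # direction of travel in radians
    accel = np.diff(speed) / dt                             # longitudinal: speeding up or slowing down
    yaw_rate = np.diff(heading) / dt                        # lateral: change of heading per second
    return accel, yaw_rate

# Hypothetical log 1: dead straight, gaining 1 m/s every second.
straight = [(0, 0), (1, 0), (3, 0), (6, 0)]
straight_accel, straight_yaw = split_kinematics(straight, dt=1.0)

# Hypothetical log 2: constant speed around a 10 m radius arc.
angles = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
arc = np.stack([10.0 * np.cos(angles), 10.0 * np.sin(angles)], axis=1)
arc_accel, arc_yaw = split_kinematics(arc, dt=1.0)

print(straight_accel, straight_yaw)   # pure longitudinal activity, no turning
print(arc_accel, arc_yaw)             # pure lateral activity, no speed change
```

The straight-line log shows up only in the acceleration signal and the arc shows up only in the yaw rate, which is the sense in which the two axes can be treated somewhat independently.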
The second paper, CLEAR from arXiv, is more interesting to me because it actually tries to think about the problem rather than just throwing compute at it. Their core insight is that diffusion models are too slow for real-world deployment. The iterative denoising process that makes them good at generating diverse outputs also makes them unacceptably slow when you need to make split-second decisions about, you know, not hitting things.
So CLEAR replaces the multi-step denoising with a single-step conditional drift in a VAE latent space (I had to read that sentence three times, and I'm still not sure I fully get it, but the upshot is: faster). They also fine-tune a small language model, Qwen 3.5 0.8B, on driving question-answer pairs to extract what they call "scene-aware hidden states." These states guide an Adaptive Scheduler that picks the right parameters for each situation.
The result? 93.7 PDMS on the same benchmark, which beats the first paper handily.
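For anyone else who had to read the single-step sentence three times, here's the part I do get: the cost of a sampler is mostly the number of times it has to run the network. The sketch below is a toy of my own, not CLEAR's method. `CountingNet` is a hypothetical stand-in for an expensive model, and the "drift" is a trivial linear one, so the one-step version is exact by construction. A real one-step model has to be trained to pull that off.

```python
import numpy as np

class CountingNet:
    """Hypothetical stand-in for an expensive network; counts forward passes."""
    def __init__(self, target):
        self.target = np.asarray(target, dtype=float)
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self.target - x          # toy drift: points from x straight at the target


def multi_step_sample(net, x, steps):
    # Iterative refinement: one network call per small move.
    for k in range(steps):
        x = x + net(x) / (steps - k)    # cover an equal share of the remaining distance
    return x


def single_step_sample(net, x):
    # One network call, one jump.
    return x + net(x)


target = np.array([2.0, -1.0])          # made-up "clean" latent
start = np.zeros(2)                     # made-up starting point

slow_net = CountingNet(target)
x_slow = multi_step_sample(slow_net, start, steps=50)

fast_net = CountingNet(target)
x_fast = single_step_sample(fast_net, start)

print(slow_net.calls, fast_net.calls)   # 50 1
```

Both samplers land on the same point here, but one paid for fifty forward passes and the other paid for one. If each pass takes a few milliseconds, that difference is the whole ballgame for a planner that has to run many times a second.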
Call me old-fashioned, but I've seen this pattern before. Back in the early 2010s, everyone was convinced that deep learning would solve autonomous driving within five years. Then we hit the long tail problem: the systems worked great 95% of the time and failed catastrophically the other 5%. The response was to collect more data, train bigger models, add more sensors. Brute force.
What I'm seeing in these papers is the field finally admitting that brute force has limits. You can't just average your way to good driving. You need systems that can reason about context, that can adapt their behavior to the situation, that can, in a way, understand what they're doing rather than just pattern-matching against training data.
The CLEAR paper is particularly interesting because it's using a language model not to chat with passengers (please, no) but to extract semantic understanding of the driving scene. That's a meaningful shift. Whether it actually works in the real world, with all its messy edge cases and weird situations that don't appear in benchmarks, remains unclear.
Here's where I get skeptical. Both papers tout their NAVSIM scores like they're definitive proof of superiority. But NAVSIM is a simulation benchmark, and simulation benchmarks have a long history of not predicting real-world performance. I've watched companies ace simulation tests and then struggle with basic scenarios on actual roads.
The D³-MoE paper admits that its "Best-of-Three ensemble strategy" is what gets it to 91.3 PDMS. That means generating three trajectories and picking the best one, which is fine for a benchmark where you have time to compute, but what about when a kid runs into the street? The candidates may come out in parallel, but something still has to pick one in the moment, and on the road you only get to drive one of them. You don't get three tries.
CLEAR's single-step approach is more promising for real-time deployment, but they don't provide actual latency numbers in the abstract. How fast is "ultra-fast"? 10 milliseconds? 100? It matters!
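Here's the back-of-envelope version of why it matters. The 50 km/h figure is my own assumption (a typical city speed), and `distance_during_latency` is just a helper I made up for the arithmetic. It only counts planner compute time, not sensing or braking.

```python
def distance_during_latency(speed_kmh, latency_ms):
    """Meters the car travels while the planner is still computing."""
    speed_m_per_s = speed_kmh / 3.6
    return speed_m_per_s * latency_ms / 1000.0

for latency_ms in (10, 100):
    meters = distance_during_latency(50, latency_ms)
    print(f"{latency_ms} ms at 50 km/h -> {meters:.2f} m")

# 10 ms at 50 km/h -> 0.14 m
# 100 ms at 50 km/h -> 1.39 m
```

Fourteen centimeters versus nearly a meter and a half, before the car has even started to react. That's the difference the abstract leaves out.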
What these papers represent, and what I think is genuinely encouraging, is a growing recognition that end-to-end autonomous driving needs more than just scale. The kids building these systems are starting to think about architecture in more sophisticated ways, about how to structure the problem so that the model can actually learn useful behaviors rather than just memorizing the average of its training data.
The D³-MoE approach of decoupling longitudinal and lateral control makes physical sense. Cars do operate in two somewhat independent dimensions. The CLEAR approach of using semantic reasoning to guide planning makes cognitive sense. Human drivers don't just react to pixels; we understand what's happening in a scene.
But I've been covering this field long enough to know that clever research papers don't automatically translate to cars that work. The gap between 93.7 PDMS on a benchmark and a system you'd trust with your family in the back seat is enormous, and we don't have great tools for measuring it.
Both papers are worth reading if you're in the field. D³-MoE is a solid engineering contribution that shows how to get style-controllable trajectories out of a diffusion model. CLEAR is more ambitious, trying to combine fast generation with genuine scene understanding, and its results suggest the approach has legs.
But what do I know. I've been watching autonomous vehicle promises for over a decade now, and I'm still not riding in a robotaxi. These papers represent real progress on a real problem, but they're also operating in the comfortable world of benchmarks and simulations. The hard part, the part where you have to handle every weird thing that happens on real roads, is still ahead.
If you want to argue about any of this, my email's on the about page. I actually read those.