The End of 'Generate-Then-Filter': How Diffusion Policies Are Finally Getting Fast Enough for Real Robots

Four new papers tackle the same problem from different angles, and the results suggest we're closer to real-time robot learning than the hype cycles would have you believe.

By Mark Kowalski

6 hours ago · 5 min read

Photo credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).

110 milliseconds. That's how long a new diffusion policy takes to run on a Jetson Orin Nano, which is basically the hardware equivalent of a fancy smartphone strapped to a robot. For context, the previous state of the art needed 273 milliseconds for the same task. If you've been covering robotics long enough (and I have, unfortunately), you know that a roughly 2.5x speedup in inference time is the difference between a robot that works in a lab and one that works in your warehouse.
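For readers who want to check the arithmetic, here is a quick back-of-the-envelope sketch. The two latencies are the ones quoted above; the control-rate conversion is my own simplifying assumption (one inference per control step, no action chunking, no other overhead), so treat the Hz figures as a rough upper bound rather than anything the papers report.

```python
def speedup(old_ms: float, new_ms: float) -> float:
    """Ratio of the old latency to the new latency."""
    return old_ms / new_ms


def control_rate_hz(latency_ms: float) -> float:
    """Upper bound on control rate, ASSUMING one inference per control step."""
    return 1000.0 / latency_ms


old_ms, new_ms = 273.0, 110.0  # latencies quoted in the article
print(f"speedup:  {speedup(old_ms, new_ms):.2f}x")     # 2.48x
print(f"old rate: {control_rate_hz(old_ms):.1f} Hz")   # 3.7 Hz
print(f"new rate: {control_rate_hz(new_ms):.1f} Hz")   # 9.1 Hz
```

Under that assumption, the jump is from under 4 decisions per second to about 9.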

I've seen this movie before. Back in the early 2010s, deep learning for vision was too slow for real-time anything. Then came a wave of optimization papers, architectural tweaks, and hardware improvements, and suddenly your phone could identify your dog in photos. We're watching the same compression happen right now with diffusion-based robot policies, and four papers that dropped recently tell the story better than any press release could.

The problem nobody wants to talk about

Here's the dirty secret of diffusion policies in robotics: they're slow. Really slow. The iterative denoising process that makes them so good at capturing complex, multi-modal action distributions is also what makes them impractical for anything that needs to react in real time. A robot arm doesn't have the luxury of thinking for half a second about whether to grab the cup or not.
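To see why the iterative part hurts, consider a toy sketch of a denoising loop. Nothing here comes from the papers: `toy_denoiser` is a hypothetical stand-in for a learned network, the target action is made up, and the per-call cost of 5 ms is an illustrative number I picked, not a measurement. The only point is that every denoising step is one more network call, so latency grows roughly linearly with the step count.

```python
import random

TARGET = [0.5, -0.2, 0.1]  # made-up "clean" action, for illustration only


def toy_denoiser(action, step, num_steps):
    """Hypothetical stand-in for a learned network: nudges the action toward TARGET."""
    blend = 1.0 / (num_steps - step)  # reaches 1.0 on the final step
    return [a + blend * (t - a) for a, t in zip(action, TARGET)]


def sample_action(num_steps, seed=0):
    """Start from Gaussian noise and refine it with num_steps denoiser calls."""
    rng = random.Random(seed)
    action = [rng.gauss(0.0, 1.0) for _ in TARGET]
    calls = 0
    for step in range(num_steps):
        action = toy_denoiser(action, step, num_steps)
        calls += 1  # one network forward pass per denoising step
    return action, calls


def estimated_latency_ms(num_steps, per_call_ms):
    """Linear cost model: total latency = steps * cost per network call (assumed)."""
    return num_steps * per_call_ms


action, calls = sample_action(num_steps=100)
print(calls, estimated_latency_ms(calls, per_call_ms=5.0))
```

With those made-up numbers, 100 steps at 5 ms each lands at half a second, which is exactly the kind of pause a robot arm cannot afford.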

The workaround that most labs have been using is what one paper calls the "generate-then-filter" pipeline. You generate a bunch of candidate trajectories, then use some auxiliary selector to pick the best one. It works, sort of, but it's computationally wasteful and adds another component that can fail. Call me old-fashioned, but I've never liked solutions that require more moving parts to fix the problems created by the first set of moving parts.
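In code, the pattern looks roughly like the sketch below. This is my own minimal illustration of the general idea, not any lab's pipeline: `sample_trajectory` stands in for a full diffusion sampling pass, and `selector_score` is a hypothetical auxiliary selector that happens to prefer smooth trajectories.

```python
import random


def sample_trajectory(rng, horizon=4):
    """Stand-in for one full (expensive) diffusion sampling pass."""
    return [rng.uniform(-1.0, 1.0) for _ in range(horizon)]


def selector_score(trajectory):
    """Hypothetical auxiliary selector: higher score for smoother trajectories."""
    return -sum(abs(b - a) for a, b in zip(trajectory, trajectory[1:]))


def generate_then_filter(num_candidates, seed=0):
    """Generate num_candidates trajectories, then keep the one the selector likes best."""
    rng = random.Random(seed)
    candidates = [sample_trajectory(rng) for _ in range(num_candidates)]
    best = max(candidates, key=selector_score)
    return best, candidates


best, candidates = generate_then_filter(num_candidates=8)
print(len(candidates), best)
```

Notice what the robot pays for: eight full sampling passes plus a selector call, all to execute a single trajectory. The other seven are thrown away.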
