The End of 'Generate-Then-Filter': How Diffusion Policies Are Finally Getting Fast Enough for Real Robots
Four new papers tackle the same problem from different angles, and the results suggest we're closer to real-time robot learning than the hype cycles would have you believe.
Photo credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
110 milliseconds. That's how fast a new diffusion policy can run on a Jetson Orin Nano, which is basically the hardware equivalent of a fancy smartphone strapped to a robot. For context, the previous state of the art needed 273 milliseconds for the same task. If you've been covering robotics long enough (and I have, unfortunately), you know that a roughly 2.5x speedup in inference time can be the difference between a robot that works in a lab and one that works in your warehouse.
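If you want to check my arithmetic, here's a quick back-of-the-envelope sketch. It assumes the robot waits for one full inference pass before it can replan, which is a simplification: policies that emit chunks of several actions per pass can drive the motors faster than this. The function name is mine, not from any paper.

```python
def replans_per_second(latency_ms: float) -> float:
    """Upper bound on replanning rate if every replan waits for one inference pass."""
    return 1000.0 / latency_ms


new_ms = 110.0  # latency reported for the new policy on a Jetson Orin Nano
old_ms = 273.0  # latency quoted for the previous state of the art

speedup = old_ms / new_ms
print(f"speedup: {speedup:.2f}x")                                 # speedup: 2.48x
print(f"old replanning rate: {replans_per_second(old_ms):.1f} Hz")  # 3.7 Hz
print(f"new replanning rate: {replans_per_second(new_ms):.1f} Hz")  # 9.1 Hz
```

Going from under 4 replans a second to about 9 is what makes the number interesting to me, more than the ratio itself.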
I've seen this movie before. Back in the early 2010s, deep learning for vision was too slow for real-time anything. Then came a wave of optimization papers, architectural tweaks, and hardware improvements, and suddenly your phone could identify your dog in photos. We're watching the same compression happen right now with diffusion-based robot policies, and four papers that dropped recently tell the story better than any press release could.
Here's the dirty secret of diffusion policies in robotics: they're slow. Really slow. The iterative denoising process that makes them so good at capturing complex, multi-modal action distributions is also what makes them impractical for anything that needs to react in real time. A robot arm doesn't have the luxury of thinking for half a second about whether to grab the cup or not.
The workaround that most labs have been using is what one paper calls the "generate-then-filter" pipeline. You generate a bunch of candidate trajectories, then use some auxiliary selector to pick the best one. It works, sort of, but it's computationally wasteful and adds another component that can fail. Call me old-fashioned, but I've never liked solutions that require more moving parts to fix the problems created by the first set of moving parts.
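To make the waste concrete, here's a toy sketch of a generate-then-filter loop. Everything in it is a stand-in of my own: `sample_trajectory` pretends to be one full diffusion sampling pass, and `selector_score` pretends to be the auxiliary selector. No lab's actual pipeline looks exactly like this.

```python
import random


def sample_trajectory(rng, horizon=8):
    # Stand-in for one full (expensive) diffusion sampling pass.
    return [rng.gauss(0.0, 1.0) for _ in range(horizon)]


def selector_score(traj, goal=0.0):
    # Hypothetical auxiliary selector: prefer trajectories that end near the goal.
    return -abs(traj[-1] - goal)


def generate_then_filter(rng, num_candidates=16):
    # Pay for num_candidates sampling passes, then throw away all but one.
    candidates = [sample_trajectory(rng) for _ in range(num_candidates)]
    best = max(candidates, key=selector_score)
    return best, candidates


rng = random.Random(0)
best, candidates = generate_then_filter(rng)
print(f"ran {len(candidates)} sampling passes to keep 1 trajectory")
```

The cost scales with the number of candidates, and the selector is a second model that has to be right too.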
What's interesting about the current moment is that multiple research groups are attacking this bottleneck simultaneously, from completely different angles. That's usually a sign that a problem is both important and tractable.
The Self-Imitated Diffusion Policy (SIDP) work, which produced that 110ms number I mentioned, takes a clever approach: instead of imitating expert demonstrations directly, the policy learns to imitate its own best outputs. The researchers use reward signals to guide which self-generated trajectories are worth learning from. It's a bit like a student who grades their own homework and only studies from the problems they got right, which sounds like it shouldn't work but apparently does. The arXiv paper shows real-world results across multiple robot platforms, not just simulation.
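Here's the homework analogy as a toy loop. To be clear, this is not the paper's algorithm: it's a one-dimensional Gaussian "policy" with a made-up reward, and the update (refit the mean to the top-scoring samples) is closer to the classic cross-entropy method than to anything diffusion-based. It only illustrates the idea of learning from your own best outputs.

```python
import random


def reward(action, target=2.0):
    # Made-up reward: higher is better, peaks when the action hits the target.
    return -(action - target) ** 2


def self_imitation_step(mean, std, rng, num_samples=64, keep=8):
    # 1. The policy generates its own candidate actions.
    samples = [rng.gauss(mean, std) for _ in range(num_samples)]
    # 2. The reward decides which self-generated samples are worth imitating.
    elite = sorted(samples, key=reward, reverse=True)[:keep]
    # 3. "Imitate" the elite set by refitting the policy mean to it.
    return sum(elite) / len(elite)


rng = random.Random(0)
mean = 0.0  # the policy starts far from the rewarded action at 2.0
for _ in range(20):
    mean = self_imitation_step(mean, 0.5, rng)

print(f"policy mean after self-imitation: {mean:.2f}")
```

No expert demonstrations appear anywhere in the loop, yet the policy drifts toward the rewarded action.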
Then there's Closed-Form Diffusion Policies (CFDP), which might be the most radical of the bunch. The researchers (I couldn't find their institution clearly listed) propose skipping neural network training entirely. Instead, they derive the diffusion score directly from the demonstration dataset in closed form. The result runs on a mobile CPU in real time. In milliseconds! The tradeoff is that you're limited by what's in your demonstration data, but for many practical applications that's fine. The arXiv preprint claims competitive performance against neural baselines that require hours of training.
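If "closed-form score" sounds like magic, here's the textbook version of the idea. Blur a finite set of demonstrated actions with Gaussian noise of scale `sigma` and the score (the gradient of the log density) has an exact formula: a softmax-weighted average of the demos, minus the current point, divided by `sigma` squared. This sketch is my own illustration of that standard result. It ignores conditioning on observations, and I'm not claiming it matches the paper's exact formulation.

```python
import math


def closed_form_score(x, demos, sigma):
    """Score of the Gaussian-smoothed empirical distribution over `demos` at point x."""
    # Softmax weights: demos closer to x count for more.
    logits = [
        -sum((xi - ai) ** 2 for xi, ai in zip(x, a)) / (2.0 * sigma ** 2)
        for a in demos
    ]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted average of the demos is the "denoised" estimate of x.
    denoised = [sum(w * a[d] for w, a in zip(weights, demos)) for d in range(len(x))]
    # The score points from x toward the denoised estimate.
    return [(denoised[d] - x[d]) / sigma ** 2 for d in range(len(x))]


demos = [[0.0, 0.0], [1.0, 1.0]]  # two hypothetical demonstrated actions
print(closed_form_score([0.5, 0.5], demos, sigma=1.0))  # midpoint: score is zero
print(closed_form_score([0.0], [[1.0]], sigma=1.0))     # single demo: [1.0]
```

No training happens anywhere, but every evaluation touches the whole dataset, which is the "limited by your demonstration data" tradeoff in concrete form.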
Implicit Drifting Policy takes yet another angle, focusing on the mathematical structure of the problem. The core insight is that iterative diffusion sampling provides useful action correction during training, but you don't necessarily need to keep that expensive iteration at deployment time. They extract what they call "conditional expert geometry" from local variations in similar expert actions, then use that to constrain a one-step generator. I'll be honest, the math gets dense and I'm not sure I fully follow their geometric intuitions, but the results on manipulation tasks look solid according to their paper.
Finally, there's State-Conditional Adversarial Learning (SCAL), which tackles a related but distinct problem: how do you transfer a policy trained in one visual domain to another? This matters because most policies are trained in simulation, where you can generate unlimited data, but need to run in the real world, where everything looks different. The SCAL paper shows that you can do this transfer with very little target-domain data, as long as you're clever about aligning latent distributions conditioned on system state. They tested on autonomous driving scenarios in CARLA.
If you're a robotics company trying to ship product, this research wave is genuinely good news. The gap between "works in the lab" and "works on your hardware" has been one of the biggest obstacles to deploying learned policies. These papers suggest that gap is closing, and closing fast.
But I want to be careful here because I've been burned before. The self-driving car hype cycle taught me that academic benchmarks and real-world deployment are different beasts entirely. These papers show impressive results on simulation benchmarks and some real-robot experiments, but we don't know yet how they'll perform across the full diversity of environments and edge cases that actual deployment requires. The SIDP paper does show results on multiple physical platforms, which is encouraging, but the sample sizes are still small.
There's also an open question about composability. Can you mix and match these techniques? Could you use closed-form policies as a fast baseline and then fine-tune with self-imitation? The CFDP paper actually hints at this, showing how their approach can be used to edit pre-trained neural policies at inference time. But these combinations haven't been tested thoroughly.
What strikes me about this moment is how quickly the field is moving from "diffusion policies are promising but impractical" to "here are four different ways to make them practical." That's the kind of convergent progress that usually precedes real commercial deployment.
The kids working on this stuff (and yes, I know I'm old enough to call them that) are solving problems that seemed intractable five years ago. The combination of better theoretical understanding, clever algorithmic tricks, and improved hardware is creating a window where learned policies might actually be viable for real-time robot control.
I'm not saying we're there yet. I'm saying we're closer than the hype cycles would have predicted, and the progress is coming from actual technical advances rather than just better marketing. That's worth paying attention to.
If you want to argue about any of this, my email's on the about page. I still check it, unlike some of these kids who think everything should be a Discord thread.