Diffusion policies are having a moment, but I've seen this movie before
A wave of papers promises to make robot learning faster, cheaper, and more robust. Some of it might even be true.
Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
So here's the question everyone in robotics should be asking: are diffusion policies actually the breakthrough we've been waiting for, or is this another case of the field falling in love with a hammer and seeing nails everywhere?
I've been covering tech long enough to remember when neural networks were going to solve everything (they didn't), when deep learning was going to solve everything (closer, but still no), and when transformers were going to solve everything (jury's still out). Now diffusion models, which started life making pretty pictures, have wandered into robotics and everyone's losing their minds. A bunch of recent papers suggest they might actually deserve some of the hype this time. Call me old-fashioned, but I want to see the receipts.
The pitch
The basic idea is elegant enough that even I can explain it. Instead of training a robot to output a single best-guess action, you train it to model the whole distribution of actions a demonstrator might take. Then, at run time, you start from random noise and denoise your way to one concrete action. It's borrowed from image generation, where diffusion models learned to turn static into art. The promise is that robots trained this way can handle ambiguity better, generalize to new situations, and learn from messier demonstrations.
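For the code-curious, here is a toy sketch of that denoising loop in Python. It is not taken from any of the papers. It assumes a made-up set of one-dimensional demonstrations (steer left or steer right around an obstacle), and instead of a trained network it uses the exact denoiser you can write down for a small, fixed dataset. The names `DEMO_ACTIONS`, `denoise`, and `sample_action`, and the noise schedule, are my own illustrative choices.

```python
import math
import random

# Hypothetical demonstrations: half the demonstrators steer left (-1.0),
# half steer right (+1.0) to get around the same obstacle.
DEMO_ACTIONS = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]


def denoise(noisy_action, sigma):
    """Best guess of the clean action behind a noisy one.

    For a fixed demo set and Gaussian noise of scale sigma, this is a
    softmax-weighted average of the demos (closer demos get more weight).
    A real diffusion policy would use a trained network here instead.
    """
    logits = [-((noisy_action - a) ** 2) / (2.0 * sigma ** 2) for a in DEMO_ACTIONS]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    return sum(w * a for w, a in zip(weights, DEMO_ACTIONS)) / sum(weights)


def sample_action(rng, sigmas=(4.0, 2.0, 1.0, 0.5, 0.25, 0.1, 0.0)):
    """Start from pure noise and step down the noise schedule (Euler steps)."""
    x = rng.gauss(0.0, sigmas[0])
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        clean_estimate = denoise(x, sigma)
        # Move x toward the current clean estimate as the noise level drops.
        x = x + (sigma_next - sigma) * (x - clean_estimate) / sigma
    return x


rng = random.Random(0)
samples = [sample_action(rng) for _ in range(200)]
average_of_demos = sum(DEMO_ACTIONS) / len(DEMO_ACTIONS)
```

Averaging the demonstrations gives 0.0, which steers straight into the obstacle. The denoising loop instead lands near -1.0 or +1.0 depending on the starting noise, which is the "handles ambiguity" pitch in miniature.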
A paper from researchers working on something called DIPOLE claims their approach outperforms six baselines by 39.1% on average across 18 simulated and 4 real-world tasks. That's a big number! They're fusing vision and geometry through what they call "modality-wise dropout," which basically means they randomly blind the robot to one input stream during training so it learns to rely on either one. The gains under visual distractors (41.5% improvement) and randomized object placement (15.2%) are the numbers that matter here, because that's where robots actually fail in the real world.
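Here is roughly what "randomly blind the robot to one input stream" could look like in code. This is a minimal sketch under my own assumptions, not DIPOLE's implementation: the function name `modality_dropout`, the 30% drop probability, the choice to zero out the dropped features, and the rule that both streams are never dropped together are all illustrative guesses.

```python
import random


def modality_dropout(vision, geometry, rng, drop_prob=0.3):
    """Blank at most one input stream for a single training sample.

    drop_prob and zero-filling are assumptions made for illustration.
    """
    if rng.random() < drop_prob:
        if rng.random() < 0.5:
            vision = [0.0] * len(vision)      # policy must act on geometry alone
        else:
            geometry = [0.0] * len(geometry)  # policy must act on vision alone
    return vision + geometry  # fused feature vector handed to the policy


rng = random.Random(0)
batch = [modality_dropout([0.2, 0.7], [1.5, -0.3, 0.9], rng) for _ in range(1000)]
```

The design intuition is that a policy which sometimes trains with only one stream cannot lean entirely on the other, so a distracting background or an unreliable geometry reading at test time is less likely to throw it off.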
Then there's SIDP, which tackles a different problem. Standard diffusion policies apparently have this annoying habit of producing trajectories of "inconsistent quality," which means you need a "generate-then-filter" pipeline where you make a bunch of candidates and pick the best one. That's computationally expensive, and the SIDP folks claim they've cut inference time from 273ms to 110ms on a Jetson Orin Nano, which works out to roughly 2.5 times faster. For the non-hardware people, that's the difference between a robot that hesitates awkwardly and one that moves with something approaching fluidity.
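To see why generate-then-filter gets expensive, here is a minimal sketch of the pattern. Everything in it is hypothetical: `sample_trajectory` is a stand-in for one full diffusion sampling pass, `score` is a made-up quality measure, and `GOAL` is an invented one-dimensional target. I'm not claiming this is how SIDP or its baselines are implemented.

```python
import random

GOAL = 1.0  # hypothetical 1-D goal position


def sample_trajectory(rng, steps=5):
    """Stand-in for one full sampling pass: a noisy path toward GOAL."""
    return [GOAL * (i + 1) / steps + rng.gauss(0.0, 0.2) for i in range(steps)]


def score(trajectory):
    """Higher is better: penalize ending far from GOAL."""
    return -abs(trajectory[-1] - GOAL)


def generate_then_filter(rng, num_candidates):
    """Pay for num_candidates sampling passes, keep the best-scoring one."""
    candidates = [sample_trajectory(rng) for _ in range(num_candidates)]
    return max(candidates, key=score)


best = generate_then_filter(random.Random(1), num_candidates=8)
```

The cost is the point. Every candidate is a full sampling pass, so eight candidates means roughly eight times the sampling work unless the hardware can batch them, and a small embedded board does not have much room for that.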
The really interesting stuff
Okay, so faster inference and better generalization are nice. But two other papers caught my attention because they're trying to solve the actual hard problem in robotics, which is that collecting training data is miserable.
Sources
- Learning Generalizable Robot Policy with Human Demonstration Video as a Prompt. arXiv, cs.RO (Robotics)
- Training-Free Imitation Learning with Closed-Form Diffusion Policies. arXiv, cs.RO (Robotics)
- RoboDream: Compositional World Models for Scalable Robot Data Synthesis. arXiv, cs.RO (Robotics)
- DIPOLE: Fusing Vision and Geometry for Robust Visuomotor Generalization. arXiv, cs.RO (Robotics)
- Self-Imitated Diffusion Policy for Efficient and Robust Visual Navigation. arXiv, cs.RO (Robotics)