Single-Step Diffusion for Robot Actions: Finally, Someone Asked the Right Question
Two new papers suggest we've been overcomplicating robot action generation. Turns out the image synthesis playbook doesn't always apply.
Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
I'll be honest: when diffusion models started taking over robot learning a couple of years back, I had my doubts. Not about whether they'd work (they clearly do), but about whether we were importing solutions to problems we didn't actually have.
Turns out I wasn't entirely wrong.
Why were we doing ten denoising steps anyway?
Here's the thing about diffusion models in robotics: they came from image generation. DALL-E, Midjourney, all that. And generating a photorealistic image from noise genuinely requires multiple careful denoising steps. You're reconstructing millions of pixels with complex spatial relationships.
But robot actions? A typical action chunk is maybe 7 dimensions over 16 timesteps. That's 112 numbers. Not millions. When I was at Kuka, we used to joke that the hardest part of trajectory planning wasn't the math, it was getting the damn sensor data clean enough to use. The action space itself was never the bottleneck.
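To put that size gap in numbers, here is a quick back-of-the-envelope sketch. The 7 x 16 action chunk comes from the paragraph above; the 1024 x 1024 RGB image is an assumed example for comparison, not a figure from either paper.

```python
# Action chunk size from the article: 7 dimensions over 16 timesteps.
ACTION_DIMS = 7
HORIZON = 16
action_values = ACTION_DIMS * HORIZON

# Assumed image size for comparison (illustrative, not from the papers).
IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS = 1024, 1024, 3
image_values = IMG_HEIGHT * IMG_WIDTH * IMG_CHANNELS

ratio = image_values / action_values
print(f"action chunk: {action_values} values")
print(f"image:        {image_values:,} values")
print(f"the image is roughly {ratio:,.0f}x larger")
```

Under that assumption the image has about 3.1 million values, roughly 28,000 times as many as the action chunk.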
So when these vision-language-action models started doing 10, 20, even 50 denoising steps to output a simple movement command, I kept thinking: why?
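The cost of those extra steps is easy to see in a toy sampler: every denoising step is one more forward pass through the network. The sketch below is illustrative only. `ToyDenoiser` and `sample_actions` are hypothetical names, the "network" is a stub that always predicts an all-zero action chunk, and the update assumes a simple linear interpolation between clean sample and noise, which is not necessarily the sampler either paper uses.

```python
import random

ACTION_DIMS = 7   # from the article
HORIZON = 16      # from the article


class ToyDenoiser:
    """Hypothetical stand-in for a learned network; counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, t):
        self.calls += 1
        # A real model would predict the clean action chunk from (x, t).
        # This toy version always predicts an all-zero chunk.
        return [0.0] * len(x)


def sample_actions(denoiser, steps, seed=0):
    """Deterministic sampler that walks t from 1 (pure noise) down to 0 (clean)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(ACTION_DIMS * HORIZON)]
    for i in range(steps):
        t = 1.0 - i / steps
        t_next = 1.0 - (i + 1) / steps
        x0_hat = denoiser(x, t)  # one network forward pass per step
        # Move along the straight line between the predicted clean sample and x.
        x = [x0 + (t_next / t) * (xi - x0) for xi, x0 in zip(x, x0_hat)]
    return x


for steps in (1, 10, 50):
    net = ToyDenoiser()
    chunk = sample_actions(net, steps)
    print(f"{steps:>2} steps -> {net.calls} forward passes, {len(chunk)} values")
```

The output is always the same 112 numbers; only the number of network calls changes. With a large vision-language-action backbone, each of those calls is where the latency goes.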
What do these new papers actually show?
Two papers dropped recently that address this head-on, and they're worth reading together.
The first, Flash-WAM, comes from a team working with the Unitree G1 and tackles a genuinely tricky problem. They're generating video predictions and actions simultaneously, which means dealing with two different noise schedules. Their solution (modality-aware distillation, if you want the jargon) cuts inference from 8.1 seconds to 348 milliseconds. That's a 23x speedup. On real hardware, they recovered 60% task success on the humanoid, compared to 24% for naive approaches.
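For anyone who wants to check the arithmetic, here is a small sketch using only the figures quoted above. The "predictions per second" line assumes inference calls run back to back with no other overhead, which is a simplification of mine and not a claim from the paper.

```python
# Figures reported in the article for Flash-WAM.
baseline_latency_s = 8.1
distilled_latency_s = 0.348

speedup = baseline_latency_s / distilled_latency_s

# Assumption: inference calls run back to back with no other overhead.
predictions_per_second = 1.0 / distilled_latency_s

# Task success on the humanoid, in percent, as quoted in the article.
distilled_success_pct = 60
naive_success_pct = 24
success_ratio = distilled_success_pct / naive_success_pct

print(f"speedup: {speedup:.1f}x")
print(f"predictions per second: {predictions_per_second:.2f}")
print(f"success vs naive: {success_ratio:.1f}x")
```

The speedup works out to about 23.3x, which matches the rounded 23x figure, and 348 milliseconds per call means a bit under three predictions per second.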
