Diffusion Models Are Everywhere in Robotics Now. Are They Actually Better?
Three new papers tackle the same problem from different angles, and the results suggest we're still figuring out when diffusion planning actually helps.
Photo credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
If you've been following robotics research over the past eighteen months, you've probably noticed a pattern: diffusion models are showing up everywhere. Path planning, manipulation, locomotion, you name it. The question I keep asking myself is whether this represents genuine progress or whether we're watching the field chase a trendy architecture because it worked well for image generation.
Three recent papers caught my attention this week, not because any single one is revolutionary, but because together they reveal something interesting about where diffusion-based planning actually struggles, and what it takes to fix those struggles.
To be precise, the issue isn't that diffusion models can't generate plausible-looking trajectories. They absolutely can. The problem, as the SAGE paper (arXiv) puts it, is that value-guided selection can favour trajectories that "score well yet are locally inconsistent with the environment dynamics." In plain English: the planner generates a path that looks good on paper but falls apart the moment a robot tries to execute it.
This is a genuinely important observation. If you've ever watched a diffusion planner confidently propose a trajectory that violates basic physics (I have, more times than I'd like to admit), you know exactly what they're describing. The generated plan might have high value according to the learned reward model, but individual transitions within that plan might be dynamically infeasible.
SAGE's solution is to train a Joint-Embedding Predictive Architecture (JEPA) encoder on offline state sequences, then use the latent prediction error as a "feasibility score" at inference time. Notably, this doesn't require any environment rollouts or policy retraining. You just bolt it onto an existing diffusion planning pipeline.
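To make the idea concrete, here's a minimal sketch of feasibility-aware plan selection. It is not the SAGE implementation: the encoder, predictor and value function are toy stand-ins I made up, and the scoring rule (value minus a weighted latent prediction error) is my assumption about how the two signals might be combined. The names `latent_prediction_error`, `select_plan` and `penalty` are hypothetical.

```python
from typing import Callable, List, Sequence

State = Sequence[float]
Trajectory = Sequence[State]


def latent_prediction_error(
    traj: Trajectory,
    encode: Callable[[State], List[float]],
    predict: Callable[[List[float]], List[float]],
) -> float:
    """Mean squared gap between predicted and encoded next-step latents."""
    latents = [encode(s) for s in traj]
    total = 0.0
    for z_t, z_next in zip(latents, latents[1:]):
        z_hat = predict(z_t)
        total += sum((a - b) ** 2 for a, b in zip(z_hat, z_next))
    return total / max(len(latents) - 1, 1)


def select_plan(
    candidates: Sequence[Trajectory],
    value_fn: Callable[[Trajectory], float],
    encode: Callable[[State], List[float]],
    predict: Callable[[List[float]], List[float]],
    penalty: float = 1.0,
) -> Trajectory:
    """Pick the candidate with the best value after a feasibility penalty.

    Assumption: value and feasibility are combined linearly. The paper may
    combine them differently.
    """
    def score(traj: Trajectory) -> float:
        return value_fn(traj) - penalty * latent_prediction_error(traj, encode, predict)

    return max(candidates, key=score)


# Toy stand-ins (assumptions): identity encoder, "every step moves +1" dynamics,
# and a value function that rewards ending further to the right.
def toy_encode(state: State) -> List[float]:
    return list(state)


def toy_predict(latent: List[float]) -> List[float]:
    return [x + 1.0 for x in latent]


def toy_value(traj: Trajectory) -> float:
    return traj[-1][0]


smooth = [[0.0], [1.0], [2.0], [3.0]]  # consistent with the toy dynamics
jumpy = [[0.0], [1.0], [2.0], [6.0]]   # higher value, but the last step teleports

best = select_plan([smooth, jumpy], toy_value, toy_encode, toy_predict, penalty=2.0)
```

With the penalty switched off (`penalty=0.0`), the selector picks the jumpy plan because it scores higher on value alone. That is the failure mode the paper describes. With the penalty on, the dynamically consistent plan wins.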
The results across locomotion, navigation, and manipulation benchmarks show improvements in both performance and robustness. I should note, though, that the paper doesn't provide detailed ablations on when this approach helps most versus when it's unnecessary overhead. That's the kind of analysis I'd want to see before recommending this as a default addition to diffusion planners.
The SPADE paper (arXiv) tackles a related but distinct problem. How do you incorporate human preferences into path planning without either complex reward engineering or expensive hardware setups?
Their approach combines an overhauled ROS 2 annotation tool with diffusion-based augmentation for behavioural cloning models. The numbers they report are striking: 39.1% lower Absolute Pose Error and 33.5% lower Fréchet Inception Distance compared to state-of-the-art methods, while using 93.8% fewer trainable parameters.
Actually, let me be more careful here. Those percentage improvements are impressive, but the comparison baseline matters enormously. The paper claims to outperform "state-of-the-art methods," but the specific baselines and evaluation conditions determine whether those numbers are meaningful. I haven't had time to dig into their experimental setup in detail, so take those figures as promising rather than definitive.
What I find more interesting is the architectural insight. They're getting "diffusion-level generalization" (their term) while maintaining real-time, on-edge inference properties. If that holds up, it suggests you can get some of the benefits of diffusion models without paying the full computational cost at deployment time. The diffusion component is used for data augmentation during training, not for inference.
Here's where things get, well, complicated. Action chunking has become a standard approach in Learning from Demonstration, and for good reason. Modelling multi-step action chunks rather than single-step actions genuinely helps capture the temporal structure of expert behaviour. But it comes with a cost that the field has sort of been dancing around.
The Temporal Action Selection paper (arXiv) states the problem directly: because action chunking makes decisions only after a complete action block has been executed, it "restricts the utilization of real-time observations, impairing reactivity in dynamic or noisy environments."
This is a fundamental tension. You want chunking for better policy modelling. But chunking means you're committed to a sequence of actions even when new observations suggest you should change course. Previous solutions have tried to trade off reactivity against decision consistency, but you really want both.
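A toy example shows the reactivity cost. This is a sketch under my own assumptions, not anything from the paper: a 1D agent chases a target that flips sides mid-chunk, and the hypothetical policy `predict_chunk` plans a fixed number of moves toward wherever the target was when the chunk was predicted.

```python
from typing import List


def predict_chunk(position: float, target: float, horizon: int, step: float = 1.0) -> List[float]:
    """Hypothetical policy: plan `horizon` moves toward the target seen right now."""
    actions = []
    for _ in range(horizon):
        delta = max(-step, min(step, target - position))
        actions.append(delta)
        position += delta
    return actions


def rollout(targets: List[float], horizon: int) -> List[float]:
    """Execute chunks open loop; the target is only observed at chunk boundaries."""
    position = 0.0
    queue: List[float] = []
    errors = []
    for target in targets:
        if not queue:  # previous chunk is exhausted, so observe and re-plan
            queue = predict_chunk(position, target, horizon)
        position += queue.pop(0)
        errors.append(abs(target - position))
    return errors


targets = [4.0] * 2 + [-4.0] * 6  # the target flips sides at step 2

chunked = rollout(targets, horizon=4)   # commits to four actions at a time
reactive = rollout(targets, horizon=1)  # re-plans on every observation

print(sum(chunked), sum(reactive))
```

The chunked agent keeps walking the wrong way for two steps after the target flips, so its accumulated tracking error is higher. What the toy deliberately leaves out is the other half of the trade-off: in a learned policy, shrinking the horizon to 1 also throws away the temporal structure that made chunking attractive in the first place.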
Their proposed solution, TAS, caches predicted action chunks from multiple timesteps and uses a lightweight selector network to dynamically choose the optimal action. The key insight is that you're not throwing away the benefits of chunking; you're just being smarter about which chunk (or which portion of which chunk) to execute at any given moment.
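Here's a rough sketch of the caching-and-selection idea as I read it. To be clear about what's assumed: in the paper the selector is a learned network, while `toy_selector` below is a hand-written stand-in, and the `ChunkCache` class, its method names, and the scalar actions are all hypothetical simplifications of mine.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple


class ChunkCache:
    """Keeps the most recent action chunks along with the step each one starts at."""

    def __init__(self, max_chunks: int) -> None:
        self._chunks: Deque[Tuple[int, Sequence[float]]] = deque(maxlen=max_chunks)

    def add(self, start_step: int, chunk: Sequence[float]) -> None:
        self._chunks.append((start_step, chunk))

    def candidates(self, step: int) -> List[float]:
        """Actions that the cached chunks proposed for the given step."""
        options = []
        for start, chunk in self._chunks:
            offset = step - start
            if 0 <= offset < len(chunk):
                options.append(chunk[offset])
        return options


def select_action(
    cache: ChunkCache,
    step: int,
    observation: float,
    selector: Callable[[float, float], float],
) -> float:
    """Return the cached candidate that the selector scores highest."""
    options = cache.candidates(step)
    if not options:
        raise ValueError("no cached chunk covers this step")
    return max(options, key=lambda action: selector(observation, action))


def toy_selector(observation: float, action: float) -> float:
    # Stand-in for the learned selector (assumption): prefer the action
    # closest to the direction the current observation suggests.
    return -abs(observation - action)


cache = ChunkCache(max_chunks=3)
cache.add(0, [1.0, 1.0, 1.0, 1.0])      # chunk predicted at step 0
cache.add(1, [1.0, 0.5, 0.0, -0.5])     # chunk predicted at step 1
cache.add(2, [-1.0, -1.0, -1.0, -1.0])  # chunk predicted at step 2

chosen = select_action(cache, step=2, observation=-0.8, selector=toy_selector)
```

At step 2 the three cached chunks propose 1.0, 0.5 and -1.0. With an observation that says "go left", the selector takes the action from the newest chunk. With an observation that still agrees with the older plan, it can stay on the older chunk instead, which is how this scheme tries to keep both reactivity and consistency.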
I know I'm being picky here, but the phrase "lightweight selector network" is doing a lot of work in that description. How lightweight? What's the latency overhead? The paper reports improved success rates across multiple tasks and base policy architectures, and they've validated on physical robots, which is good. But the computational overhead analysis is something I'd want to see in more detail.
Let me try to separate genuine novelty from incremental progress, because I think the distinction matters.
SAGE's contribution is primarily methodological. Using JEPA-style encoders to detect dynamically inconsistent plans is a reasonable idea, and making it work as a plug-in module for existing pipelines is useful engineering. But the core insight (that diffusion planners can generate infeasible trajectories) isn't new. The solution is clever but not surprising.
SPADE's contribution is more about the full system than any single component. The ROS 2 annotation tool, the augmentation strategy, the efficiency gains: none of these is individually groundbreaking, but the combination addresses a real pain point in deploying learned planners.
TAS addresses a problem that's been acknowledged but underexplored. The tension between action chunking and reactivity is real, and their solution is genuinely novel in how it resolves that tension. Whether it's the right solution remains unclear; the set of tasks they've tested on is relatively small, and I'd want to see this replicated across a wider range of scenarios.
What strikes me about these three papers together is that they're all, in different ways, trying to patch problems created by the diffusion planning paradigm itself. SAGE fixes the feasibility problem. SPADE tries to get diffusion-like benefits without diffusion-like costs. TAS addresses the reactivity problem that action chunking (often used with diffusion models) creates.
This isn't a criticism, exactly. All paradigms have failure modes, and fixing those failure modes is legitimate research. But it does make me wonder whether we're at the "patching the paradigm" stage of diffusion planning, where the core approach is established and we're now dealing with its limitations, or whether some of these limitations point to more fundamental issues.
It's too early to say. Diffusion models have genuine advantages for multi-modal trajectory generation, and the field hasn't yet exhausted the space of possible improvements. But I'd encourage researchers to ask, for each new problem they encounter, whether the right solution is to fix the diffusion planner or to consider whether a different approach might avoid the problem entirely.
A few open questions that these papers raise but don't fully answer:
First, when does SAGE-style feasibility checking actually matter? There must be environments or tasks where the base diffusion planner is already good enough and the overhead isn't worth it. Characterising those conditions would be valuable.
Second, can SPADE's augmentation strategy transfer to other domains? They've demonstrated it for path planning, but the general idea of using diffusion for training-time augmentation rather than inference-time generation seems broadly applicable. Or maybe there's something specific about path planning that makes it work particularly well there.
Third, how does TAS interact with different base policy architectures? They show results across "multiple tasks with diverse base policy architectures," but I'd want to understand which architectures benefit most and why.
And finally, a question none of these papers address: what's the failure mode distribution? When these methods fail, how do they fail? Understanding the shape of failures is often more informative than aggregate success metrics.
All three papers rely primarily on benchmark evaluations (simulation and, in some cases, physical robots). This is standard practice, and I'm not suggesting there's anything wrong with it. But it's worth remembering that benchmark performance doesn't always translate to real-world deployment, and none of these papers include long-term deployment studies.
This is a limitation of the field, not of these specific papers. We don't have great methodology for evaluating whether a planning improvement that helps on benchmarks will actually help when a robot is operating continuously in an unstructured environment. That remains an open problem.
Diffusion-based planning is clearly here to stay, at least for now. These three papers represent solid incremental progress on known problems. SAGE makes diffusion planners more robust to feasibility issues. SPADE offers an efficiency-conscious approach to incorporating human preferences. TAS addresses the reactivity limitations of action chunking.
None of them are paradigm shifts. But paradigm shifts are rare, and most research progress happens through exactly this kind of careful, incremental improvement. The question I'll be watching is whether these patches accumulate into something that makes diffusion planning genuinely reliable for deployment, or whether they're adding complexity to a paradigm that will eventually be superseded by something simpler.
I genuinely don't know which it will be. And anyone who tells you they do know is probably selling something.