Diffusion Models Are Getting Better at Planning. Here's Why That Matters for Robots
Two new papers tackle one of the messiest problems in robot motion planning: keeping trajectories stable and physically believable over time.
5 hours ago · 6 min read
Picture a self-driving car that's been cruising smoothly for ten minutes, then suddenly twitches. Not a big swerve, just a small, weird correction that wasn't necessary. You'd notice it. It'd feel wrong. That small twitch is actually a symptom of a much deeper problem in how learning-based planners work, and two recent papers from the robotics research community are trying to fix it.
Both papers use diffusion models as their core planning mechanism. I've been following diffusion-based robotics work for a while now, and honestly, the pace of progress here is starting to feel real. Not hype-real. Actually real.
Learning-based motion planners, the kind that use neural networks to figure out where a robot or vehicle should go next, have a consistency problem. Small errors compound. A tiny miscalculation in frame one influences frame two, which influences frame three, and before long you've got a trajectory that wobbles or drifts in ways that are uncomfortable at best and unsafe at worst.
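To put a rough number on "small errors compound," here's a toy sketch. It is not taken from either paper: the function name, the 10 m/s speed, and the 0.1-degree per-step heading error are all made up for illustration. It simply integrates a vehicle whose heading picks up a tiny error at every planning step.

```python
import math

def lateral_drift(steps, heading_error_per_step_deg, speed_mps=10.0, dt=0.1):
    """Sideways offset (m) after `steps` if every step adds a small heading error.

    Hypothetical toy model: constant speed, heading errors simply accumulate.
    """
    heading = 0.0  # radians, relative to the intended straight-line path
    y = 0.0        # lateral offset from the intended path
    for _ in range(steps):
        heading += math.radians(heading_error_per_step_deg)
        y += speed_mps * dt * math.sin(heading)
    return y

short = lateral_drift(10, 0.1)    # one second of planning steps
longer = lateral_drift(100, 0.1)  # ten seconds of planning steps
```

In this toy setup, ten times as many steps produces far more than ten times the drift, because each step's error tilts every step after it.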
The obvious fix is to feed the planner its own history. Tell it what it just did, so it can stay consistent. The problem, as the authors of the first paper (posted on arXiv) point out, is that this backfires. When you give a planner its history as a static conditioning signal, it starts copying patterns instead of actually responding to what's happening in the environment. It's like a driver who keeps making the same turn because that's what they did last time, rather than because it's the right move right now.
The second problem is different but related. Diffusion models are generative. They're really good at producing plausible-looking trajectories. But plausible-looking isn't the same as physically possible. A trajectory can look smooth and reasonable on paper and still violate the actual dynamics of the system it's supposed to control. For a robot dog, that might mean planning a motion its legs literally cannot execute.
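Here's a minimal sketch of the gap between "looks smooth" and "is executable," under some loud assumptions: a 1-D trajectory, a fixed timestep, and a single hand-picked acceleration limit. Real dynamics constraints are multi-dimensional, and in the second paper they are learned rather than hard-coded. The function names are hypothetical.

```python
def max_acceleration(positions, dt):
    """Largest absolute acceleration implied by 1-D positions sampled every dt seconds."""
    accels = [
        (positions[i + 1] - 2 * positions[i] + positions[i - 1]) / (dt * dt)
        for i in range(1, len(positions) - 1)
    ]
    return max(abs(a) for a in accels)

def is_feasible(positions, dt, accel_limit):
    """True if the finite-difference acceleration never exceeds accel_limit."""
    return max_acceleration(positions, dt) <= accel_limit

# A smooth-looking parabola, sampled every 0.1 s.
smooth_looking = [0.0, 0.1, 0.4, 0.9, 1.6]
implied = max_acceleration(smooth_looking, dt=0.1)  # roughly 20 m/s^2
```

The waypoints trace a perfectly tidy curve, but they imply about 20 m/s² of acceleration, so any system limited to, say, 5 m/s² could not follow them.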
The first paper, posted on arXiv, introduces something called the Diffusion Forcing Planner, or DFP. The core idea is clever. Instead of treating history as a fixed input, DFP assigns different noise levels to different parts of the trajectory: history, current state, and future. It then denoises all of these jointly, which forces the model to reason about how the past connects to the future rather than just copying it.
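To make the per-segment noise idea concrete, here's a rough sketch. It is not the paper's implementation. The specific levels (0.3 for history, 0.0 for the current state, 0.9 for the future), the simple variance-preserving noising formula, and every function name are my assumptions. In training, levels like these would typically be sampled rather than fixed.

```python
import math
import random

def per_token_noise_levels(n_history, n_future, history_level, future_level,
                           current_level=0.0):
    """One noise level per timestep: history tokens, the current state, then future tokens."""
    return [history_level] * n_history + [current_level] + [future_level] * n_future

def add_noise(trajectory, levels, rng):
    """Noise each timestep independently: x_noisy = sqrt(1 - s^2) * x + s * eps."""
    noisy = []
    for x, level in zip(trajectory, levels):
        eps = rng.gauss(0.0, 1.0)
        noisy.append(math.sqrt(1.0 - level ** 2) * x + level * eps)
    return noisy

rng = random.Random(0)
clean = [float(i) for i in range(8)]            # 3 history + 1 current + 4 future
levels = per_token_noise_levels(3, 4, 0.3, 0.9)
noisy = add_noise(clean, levels, rng)
```

The point of the sketch is only that noise is a per-timestep quantity here, so a denoiser trained on such inputs sees lightly corrupted history next to heavily corrupted future, all in one sequence.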
At inference time, they use a technique called classifier-free guidance to steer future trajectory generation using what they call "annealed history." The annealing part is key. It means the influence of history gradually decreases as you look further into the future, which is actually how sensible planning should work. What you did ten seconds ago matters a lot for the next second. It matters less for what you should do in thirty seconds.
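Here's a small sketch of that reading of annealed guidance. The combination rule is the standard classifier-free guidance formula: unconditional prediction plus a weight times the difference between conditional and unconditional. The exponential decay schedule, the parameter names, and the idea of one weight per future timestep are my assumptions, not the paper's stated schedule.

```python
def annealed_weights(horizon, base_weight, decay):
    """Guidance weight per future timestep, shrinking geometrically with lookahead."""
    return [base_weight * decay ** t for t in range(horizon)]

def guided_prediction(uncond, cond, weights):
    """Classifier-free guidance applied elementwise: u + w * (c - u)."""
    return [u + w * (c - u) for u, c, w in zip(uncond, cond, weights)]

weights = annealed_weights(horizon=4, base_weight=2.0, decay=0.5)
# Hypothetical model outputs without and with history conditioning.
uncond = [0.0, 0.0, 0.0, 0.0]
cond = [1.0, 1.0, 1.0, 1.0]
guided = guided_prediction(uncond, cond, weights)
```

With these placeholder numbers, the history-conditioned prediction dominates the first future step and contributes progressively less to each step after it.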
They tested this on nuPlan, a large-scale autonomous driving benchmark, and the results show stable, continuous motion plans in complex scenarios. I initially thought this sounded like an incremental tweak, but after reading through the approach more carefully, the joint denoising architecture is doing something genuinely different from prior work.
The second paper, also on arXiv, tackles the dynamics problem. Their system, called MPDiffuser (Model Predictive Diffuser), combines two diffusion models: one for planning, one for dynamics. During the sampling process, these two models take turns correcting each other. The planner proposes trajectories. The dynamics model checks whether they're physically feasible. They go back and forth, progressively nudging the output toward something that's both task-relevant and actually executable.
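The back-and-forth is easier to see in code. The sketch below is a toy stand-in, not MPDiffuser: both "models" are hand-written functions rather than learned diffusion models, the trajectory is 1-D, and every name and constant is hypothetical. It only shows the alternating structure.

```python
def planner_step(traj, goal, rate=0.5):
    """Toy 'planner' correction: pull waypoints toward a straight line to the goal."""
    n = len(traj) - 1
    start = traj[0]
    target = [start + (goal - start) * i / n for i in range(n + 1)]
    return [x + rate * (t - x) for x, t in zip(traj, target)]

def dynamics_step(traj, max_step):
    """Toy 'dynamics' correction: no move between waypoints may exceed max_step."""
    out = [traj[0]]
    for x in traj[1:]:
        delta = max(-max_step, min(max_step, x - out[-1]))
        out.append(out[-1] + delta)
    return out

def alternate(traj, goal, max_step, n_rounds):
    """Interleave the two corrections, ending on the dynamics one."""
    for _ in range(n_rounds):
        traj = planner_step(traj, goal)
        traj = dynamics_step(traj, max_step)
    return traj

rough = [0.0, 3.0, -1.0, 4.0, 2.0]  # an erratic initial proposal
refined = alternate(rough, goal=4.0, max_step=1.5, n_rounds=10)
```

Because the loop ends on the dynamics correction, the output always respects the step limit, while the repeated planner corrections keep dragging it toward the goal.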
There's also a lightweight ranking module at the end that picks the best trajectory from the candidates generated. The whole thing runs on offline data, meaning the robot doesn't need to be actively exploring the environment to learn.
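As a rough picture of what a ranking step does, here's a sketch that scores candidates and keeps the best one. The scoring rule (goal error plus a roughness penalty) is entirely my assumption. The ranking criterion isn't spelled out here, and all names are hypothetical.

```python
def score(traj, goal, smoothness_weight=0.1):
    """Higher is better: penalize missing the goal and jerky waypoints."""
    goal_error = abs(traj[-1] - goal)
    roughness = sum(
        abs(traj[i + 1] - 2 * traj[i] + traj[i - 1])
        for i in range(1, len(traj) - 1)
    )
    return -(goal_error + smoothness_weight * roughness)

def pick_best(candidates, goal):
    """Return the candidate trajectory with the highest score."""
    return max(candidates, key=lambda traj: score(traj, goal))

candidates = [
    [0.0, 1.0, 2.0, 3.0],  # reaches the goal smoothly
    [0.0, 2.0, 1.0, 3.0],  # reaches the goal but zigzags
    [0.0, 1.0, 2.0, 2.5],  # smooth but stops short
]
best = pick_best(candidates, goal=3.0)
```

Generating several candidates and ranking them afterwards is a cheap way to filter out the occasional bad sample from a generative planner.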
Critically, they deployed this on a real quadrupedal robot, not just in simulation. That matters. A lot of diffusion planning papers stay in sim, tbh, so seeing real hardware results is a meaningful step.
You might be wondering why I'm writing about autonomous driving and quadruped robots on a humanoid beat. Fair question.
The honest answer is that the underlying planning challenges are the same. Humanoids need stable, temporally consistent motion plans. They need trajectories that respect their physical constraints. A bipedal robot falling over because its planner generated a technically smooth but dynamically impossible gait sequence is the same category of failure as a self-driving car twitching at highway speed.
The techniques in both papers, history-annealed guidance and compositional dynamics correction, are architecture-level ideas. They're not locked to cars or quadrupeds. Whether they translate cleanly to humanoid control loops with higher degrees of freedom and more complex contact dynamics is still an open question. It's too early to say whether either approach scales to full-body humanoid planning without significant modification.
But the direction is right. The field has been grappling with the gap between "generates plausible motion" and "generates motion a real robot can actually execute" for years. These papers are at least attacking that gap directly.
That said, a few caveats. First, both papers are evaluated on specific benchmarks: nuPlan for DFP, and D4RL/DSRL plus one real-robot test for MPDiffuser. Benchmark performance is useful, but it's a constrained view. Real-world deployment involves distribution shift, sensor noise, and edge cases that benchmarks don't fully capture.
Second, running two diffusion models in an interleaved loop, as MPDiffuser does, carries a non-trivial computational cost. The paper describes the dynamics model as enabling "adaptability" by learning from diverse data independently, which is a real advantage. But there's one thing I couldn't pin down: whether the inference latency is practical for real-time control on resource-constrained hardware. The paper doesn't fully address that, at least not in ways I found satisfying.
Third, and this applies to both papers, the gap between offline decision-making and closed-loop deployment on complex embodied systems is still large. MPDiffuser's quadruped demo is encouraging. It's also one robot, one set of tasks.
Diffusion-based planning is maturing. That's the honest read. The early excitement around diffusion for robotics was real but also a bit unfocused, lots of "look what it can generate" and not enough "look what it can reliably control." These papers are part of a shift toward the latter.
The history-annealing idea in DFP is the kind of thing that, once you read it, seems obvious. Of course you want the influence of history to decay as you plan further ahead. The fact that it took a specific architectural innovation to make that work properly says something about how hard this problem actually is.
I think the next interesting question is whether these approaches hold up when the action space gets more complex, which is to say, when you strap them onto a robot with arms and legs that has to manipulate objects while staying balanced. That's where the real test will be. One open question among several: how does history-annealed guidance behave when contact dynamics change rapidly mid-trajectory?