Can AI-Generated Robot Movements Actually Work in the Real World? Two New Papers Suggest a Path Forward
Diffusion models are getting good at imagining robot movements, but 'imaginable' and 'physically possible' aren't the same thing. Researchers are starting to close that gap.
·4 hours ago·6 min read
How do you teach a robot to grab something it's never seen before, moving in a way it's never practiced, in a space it's never been in? That's basically the central problem of embodied AI right now, and honestly, it's one I keep coming back to because the gap between what these systems can imagine doing and what they can actually execute is still pretty wide.
Two new papers out of the robotics research community are tackling different slices of this problem, and taken together they paint an interesting picture of where robot manipulation is heading. Neither is a silver bullet. But both are pointing at something real.
Let me start with the one that caught my attention first. A team of researchers has proposed a framework they're calling optimization-guided diffusion, and the core idea is worth unpacking carefully. Diffusion models, if you're not deep in the ML weeds, are the same family of models behind image generators like Stable Diffusion. They're remarkably good at sampling from complex, high-dimensional distributions, which in plain English means they're good at generating plausible-looking outputs from a huge space of possibilities. Applied to robotics, that means generating plausible grasps, waypoints, or movement trajectories.
The problem is that "plausible" and "physically executable" are not the same thing. A diffusion model might generate a grasp that looks totally reasonable in abstract task-space terms but is actually unreachable given the robot's specific arm geometry, or that would cause a collision, or that the robot's controller simply can't track in real time. The researchers describe this as the "embodiment gap," and it's a good name for it. The behavior might transfer fine in theory, but the specific robot body can't pull it off.
Their fix is to treat the diffusion sampling process itself as a constrained optimization problem. Instead of just letting the model generate outputs and then checking whether they're feasible after the fact (and throwing out the bad ones), they inject physical constraints directly into the backward diffusion process. The key move is replacing what's called the "sampling perturbation" with an optimized correction, which lets them impose hard constraints or soft penalties during generation without retraining the model from scratch.
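To make that concrete, here's a toy sketch of the general idea, not the paper's actual algorithm. Everything in it is an assumption of mine for illustration: the 2D "grasp point," the `REACH` limit, the stand-in `toy_denoise_mean` function (a real system would call a trained network), and the choice of a simple projection as the "optimized correction." At each reverse step, the random perturbation is replaced by the closest perturbation that keeps the sample inside the reachable region.

```python
import math
import random

REACH = 1.0  # hypothetical maximum arm reach, in metres


def toy_denoise_mean(x, t, target=(1.4, 0.3)):
    """Stand-in for a trained denoiser (assumption): move part of the way
    toward a 'learned' grasp point that is deliberately out of reach."""
    step = 1.0 / (t + 1)
    return tuple(xi + step * (ti - xi) for xi, ti in zip(x, target))


def project_to_reach(x, reach=REACH):
    """Closest point to x inside the reachable disc around the robot base.
    For this simple constraint the optimization has a closed-form answer."""
    norm = math.hypot(*x)
    if norm <= reach:
        return x
    return tuple(xi * reach / norm for xi in x)


def guided_sample(steps=50, seed=0, constrained=True):
    """Run a toy reverse-diffusion loop, optionally correcting every step."""
    rng = random.Random(seed)
    x = (rng.gauss(0, 1), rng.gauss(0, 1))  # start from pure noise
    for t in reversed(range(steps)):
        mean = toy_denoise_mean(x, t)
        sigma = 0.1 * t / steps  # noise scale shrinks to zero at the last step
        noisy = tuple(m + sigma * rng.gauss(0, 1) for m in mean)
        # Unconstrained: keep the raw perturbation.
        # Constrained: replace it with the nearest feasible point.
        x = project_to_reach(noisy) if constrained else noisy
    return x


if __name__ == "__main__":
    print("unconstrained:", guided_sample(constrained=False))
    print("constrained:  ", guided_sample(constrained=True))
```

In this toy, the unconstrained run lands on the out-of-reach target, while the constrained run ends at the reachable point nearest to it. A real embodiment constraint (joint limits, collisions, controller tracking) would need a numerical solver rather than a one-line projection.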
I initially thought this was just a fancier version of rejection sampling, where you generate a bunch of candidates and keep the ones that pass your constraints. But after reading the paper more carefully, it's meaningfully different. Rejection sampling wastes compute and tends to produce outputs that are technically feasible but have drifted far from what the model actually learned. The optimization-guided approach keeps the outputs close to the learned prior while still satisfying constraints. That distinction matters for grasp quality, not just feasibility.
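For contrast, here's what the plain rejection-sampling baseline looks like in a toy setting. This is my own illustration, not code from the paper: `TARGET`, `SPREAD`, and `REACH` are made-up numbers, and a Gaussian stands in for the learned model. It only demonstrates the wasted-compute part of the argument, since most draws get thrown away when the model's preferred region sits outside what the robot can reach.

```python
import math
import random

REACH = 1.0          # hypothetical maximum arm reach, in metres
TARGET = (1.4, 0.3)  # hypothetical centre of the learned grasp distribution
SPREAD = 0.3         # hypothetical standard deviation of that distribution


def sample_prior(rng):
    """Stand-in for one full run of the generative model (assumption)."""
    return tuple(rng.gauss(m, SPREAD) for m in TARGET)


def rejection_sample(rng, max_draws=10_000):
    """Draw until a sample is reachable; return it and the number of draws."""
    for draws in range(1, max_draws + 1):
        x = sample_prior(rng)
        if math.hypot(*x) <= REACH:
            return x, draws
    raise RuntimeError("no feasible sample found")


if __name__ == "__main__":
    rng = random.Random(0)
    counts = [rejection_sample(rng)[1] for _ in range(200)]
    print("average model runs per accepted grasp:", sum(counts) / len(counts))
```

Every rejected draw here represents a full generation pass that produced nothing usable, which is the cost an in-loop correction is meant to avoid.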
The numbers they report are striking. Task success improved by up to 20 percentage points on dexterous grasping and 23 percentage points on visuomotor manipulation compared to the best existing baseline methods. That's a substantial jump, though these results come from the authors' own evaluation setup, and it remains to be seen how the gains hold up across a wider range of real-world conditions and robot platforms.
The second paper is attacking a related but distinct problem. Grabbing a static object is hard enough. Grabbing a moving one is a whole other level of difficulty, because now the robot has to predict where the object will be by the time its arm actually gets there, not just where it is right now.
The DynaMOMA framework, detailed in a separate arXiv preprint, is designed specifically for mobile manipulation of dynamic objects. "Mobile manipulation" here means the robot isn't fixed in place; it has a base that can move around, plus an arm. Coordinating both while chasing a moving target is genuinely hard. The robot has to think about where to drive, where to reach, and where the object will be, all at the same time.
Their approach couples two things: a trajectory predictor that uses an anchor-based diffusion model to generate short-horizon grasp trajectories from historical observations, and a whole-body reinforcement learning policy that actually controls the robot. The predictor outputs compact features that feed directly into the RL policy, rather than the policy having to reason from raw predictions. And there's a clever reward structure they call an "anticipation-guided reward" that shifts the target from the current observed position toward the predicted future position, nudging the robot to plan ahead rather than just react.
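The reward idea is easy to sketch, with the caveat that the article's sources don't give me the exact formula, so this is an assumed form rather than DynaMOMA's implementation. The negative-distance reward, the `alpha` blending weight, and both function names are hypothetical. The point is just to show what "shifting the target toward the predicted position" means in code.

```python
import math


def blend_target(current, predicted, alpha):
    """Shift the reward target from the observed position toward the
    predicted one. alpha=0 is purely reactive, alpha=1 fully anticipatory.
    (alpha is a hypothetical knob, not a parameter from the paper.)"""
    return tuple(c + alpha * (p - c) for c, p in zip(current, predicted))


def anticipation_reward(gripper, current, predicted, alpha=1.0):
    """Assumed reward shape: negative distance from gripper to the target."""
    target = blend_target(current, predicted, alpha)
    return -math.dist(gripper, target)


if __name__ == "__main__":
    observed = (1.0, 0.0)   # where the object is right now
    predicted = (1.3, 0.0)  # where the predictor says it will be shortly
    gripper = (1.3, 0.0)    # a gripper that has already moved ahead
    print("reactive reward:    ", anticipation_reward(gripper, observed, predicted, alpha=0.0))
    print("anticipatory reward:", anticipation_reward(gripper, observed, predicted, alpha=1.0))
```

Under the reactive setting, a gripper waiting at the predicted position is penalized for being away from the object's current location. Under the anticipatory setting it scores best, which is the behavior the reward is supposed to encourage.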
You might be wondering why they're using diffusion specifically for the prediction side rather than something simpler. The answer seems to be temporal consistency. When you're predicting grasp trajectories for a moving object, you need consecutive predictions to be coherent with each other, not just individually plausible. The anchor-based diffusion setup is designed to enforce that consistency across the short prediction horizon.
They tested this in Isaac Gym simulation and then ran real-world experiments to check generalizability. The real-world results look promising, though the paper is appropriately careful about not overclaiming. Simulation-to-real transfer in manipulation is still a genuine challenge, and it's too early to say how DynaMOMA would perform across a truly diverse set of real environments and object types.
So what connects these two papers? Both are, at root, about the same problem: generative models are powerful but they don't automatically respect the physical world. The optimization-guided diffusion work is about making generated robot behaviors physically executable on specific hardware. The DynaMOMA work is about making predictions temporally coherent and actionable when the world itself is changing. Different angles, same underlying tension between what a model can generate and what a robot can actually do.
Honestly, this is the part of embodied AI research that I find most interesting right now, more interesting than the headline-grabbing demos of humanoids walking around. The demos are impressive, but the unglamorous work of closing the embodiment gap is where the real progress is happening. Teaching a model to generate physically grounded behavior, not just visually plausible behavior, is a genuinely hard problem, and it's not solved yet.
There are open questions here that neither paper fully addresses. How do these approaches scale when the constraints become more complex, say, a cluttered kitchen counter rather than a controlled lab setup? How much does performance degrade when the robot encounters objects or dynamics outside its training distribution? Some argue that inference-time optimization is the right layer to enforce physical constraints. Others counter that you really need the constraints baked into training to get robust behavior. Both positions have merit, and the field hasn't settled this.
What I keep coming back to is the framing of the first paper's core insight: you don't have to retrain the model to make it physically aware. You can intervene at inference time. That's practically significant because retraining large generative models is expensive and slow. If you can adapt a trained policy to a new robot embodiment just by changing the constraints at inference time, that's a much more tractable path to generalization than training a separate model for every robot configuration.
Whether that holds up at scale, honestly, I'm not sure yet. The experiments in both papers are promising but they're still fairly controlled. The jump from "works in our evaluation" to "works reliably in deployment" has humbled a lot of robotics research before. But the direction feels right, and the underlying ideas are sound enough that I'll be watching to see where this goes.