Two New Papers Try to Fix What Imitation Learning Gets Wrong About Robot Planning
FlowMPC and WAM-RL both attack the same core limitation of behavior cloning from different angles. Here's what the research actually shows.
6 hours ago · 9 min read
Picture a robot arm that has watched thousands of demonstrations of a pick-and-place task. It has learned, statistically, what movements tend to follow what observations. And then, at test time, it fails to pick up an object it has seen dozens of times before, because the lighting shifted slightly, or the object landed at an angle just outside the training distribution. This is not a hypothetical. It is the central, persistent problem with imitation learning in robotics, and it is the problem that two new preprints, both posted to arXiv in the last week, are trying to address.
Behavior cloning, at its core, is supervised learning over demonstrations. You collect expert trajectories, train a policy to imitate them, and hope the policy generalizes. It often does not, for reasons that are well understood theoretically: the policy never learns to recover from its own mistakes because it was never trained on the states those mistakes lead to. This compounding-error problem is what DAgger, introduced by Ross, Gordon, and Bagnell back in 2011, was designed to address, and it has never fully gone away.
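To make the distribution problem concrete, here is a minimal sketch of behavior cloning as plain regression. Everything in it is an illustrative assumption: the saturating `expert_action`, the narrow band of demonstration states, and the linear policy are toy stand-ins, not anything taken from either paper.

```python
import numpy as np

def expert_action(state):
    # Hypothetical expert: push the state toward zero, saturating at +/-1.
    return -np.clip(state, -1.0, 1.0)

rng = np.random.default_rng(0)

# Demonstrations only cover a narrow band of states.
demo_states = rng.uniform(-0.5, 0.5, size=200)
demo_actions = expert_action(demo_states)

# Behavior cloning as supervised regression: fit action = w * state + b.
features = np.stack([demo_states, np.ones_like(demo_states)], axis=1)
(w, b), *_ = np.linalg.lstsq(features, demo_actions, rcond=None)

def cloned_action(state):
    return w * state + b

# Compare the clone to the expert inside and outside the demonstrated band.
in_dist_error = abs(cloned_action(0.3) - expert_action(0.3))
out_dist_error = abs(cloned_action(3.0) - expert_action(3.0))
print(f"in-distribution error:     {in_dist_error:.3f}")
print(f"out-of-distribution error: {out_dist_error:.3f}")
```

Inside the demonstrated band the clone matches the expert almost exactly. At a state it never saw, it extrapolates the linear trend while the expert saturates, and nothing in the training signal could have told it otherwise.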
Flow Matching (FM) is a more recent generative approach to imitation that handles multimodal action distributions better than policies trained with a mean squared error loss. Rather than averaging the modes into a single prediction, FM learns a vector field that transports a simple noise distribution into the action distribution. It is genuinely useful for tasks where multiple valid actions exist for a given state. Jiang et al. (2025) demonstrated its effectiveness in multimodal action spaces, and it has attracted real attention in the manipulation community.
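The contrast with mean squared error can be seen in a one-dimensional toy problem, sketched below. The assumptions are mine and purely illustrative: the "expert" picks action -1 or +1 with equal probability, the noise is standard normal, and the path between noise and action is a straight line. Under those assumptions the ideal velocity field has a closed form, so no network is trained here; a real FM policy would learn an approximation of this field from data and condition it on observations.

```python
import numpy as np

def marginal_velocity(x, t):
    """Closed-form flow-matching velocity for the toy problem.

    Assumes x0 ~ N(0, 1), x1 in {-1, +1} with equal probability, and the
    straight-line path x_t = (1 - t) * x0 + t * x1. Then
    E[x1 | x_t = x] = tanh(t * x / (1 - t)**2), and the velocity is
    E[x1 - x0 | x_t = x] = (E[x1 | x_t = x] - x) / (1 - t).
    """
    expected_action = np.tanh(t * x / (1.0 - t) ** 2)
    return (expected_action - x) / (1.0 - t)

def sample_actions(num_samples, num_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(num_samples)   # start from noise
    dt = 1.0 / num_steps
    for k in range(num_steps):             # Euler steps at t = 0, dt, ..., 1 - dt
        t = k * dt
        x = x + dt * marginal_velocity(x, t)
    return x

fm_actions = sample_actions(2000)
mse_action = 0.5 * (-1.0) + 0.5 * (+1.0)   # an MSE-optimal constant is the mean

print("fraction of FM samples near +1:", np.mean(np.abs(fm_actions - 1.0) < 0.05))
print("fraction of FM samples near -1:", np.mean(np.abs(fm_actions + 1.0) < 0.05))
print("MSE-optimal action:", mse_action)
```

The flow sends roughly half the noise samples to each valid action, while the squared-error solution sits at zero, an action the expert never takes.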
But, to be precise, FM is still trained to imitate, not to maximize return. The policy has no explicit mechanism for evaluating whether a proposed action sequence is actually going to work. It produces plausible actions given what it has seen. That is different from producing good actions given the current state.
WAM-RL addresses a related but distinct limitation. World-Action (WA) models, which couple a world model with an action model in a unified architecture, have shown strong generalization and data efficiency. The catch is that they rely on expert trajectories for training. They cannot, by design, improve beyond the demonstration distribution. WAM-RL's central claim is that this does not have to be the case.
FlowMPC, authored by a single researcher, introduces a framework that pairs an imitation-learned FM policy with a learned world model for test-time planning. The planning mechanism is Model Predictive Path Integral (MPPI) control, a sampling-based approach that evaluates candidate action sequences by rolling them out through the world model and then combines them in a weighted average that favors the sequences with the highest predicted return. The world model component builds on TD-MPC2 (Hansen et al., 2024), which is a strong prior-work baseline in latent-space model-based RL.
The key design choice here is that the FM policy acts as a proposal distribution for MPPI. Instead of sampling action sequences randomly, which is inefficient in high-dimensional spaces, you sample from the FM policy and then filter those samples through the world model. The FM policy narrows the search space; the world model evaluates quality. It is a sensible combination, and I would say it is incremental over TD-MPC2 rather than a wholly new idea, but the specific integration with flow matching as the proposal mechanism is a meaningful contribution.
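The sample-then-score loop described above can be sketched in a few lines. This is a single-iteration simplification under stated assumptions, not the paper's implementation: `world_model_step` is a hand-written one-dimensional stand-in for a learned latent model, `sample_from_policy` is a noisy heuristic standing in for an FM policy, and the sample count, horizon, and temperature are arbitrary. In standard MPPI the candidates are weighted by exponentiated return and averaged; as the temperature shrinks, that approaches simply picking the best candidate.

```python
import numpy as np

GOAL = 1.0

def world_model_step(state, action):
    # Hypothetical stand-in for a learned dynamics and reward model.
    next_state = state + action
    reward = -(next_state - GOAL) ** 2
    return next_state, reward

def sample_from_policy(state, num_samples, horizon, rng):
    # Hypothetical stand-in for an FM policy: noisy sequences biased toward the goal.
    mean_action = 0.3 * (GOAL - state)
    return mean_action + 0.2 * rng.standard_normal((num_samples, horizon))

def rollout_return(state, actions):
    # Sum of predicted rewards for one action sequence.
    total = 0.0
    for action in actions:
        state, reward = world_model_step(state, action)
        total += reward
    return total

def mppi_plan(state, num_samples=256, horizon=5, temperature=0.1, seed=0):
    rng = np.random.default_rng(seed)
    candidates = sample_from_policy(state, num_samples, horizon, rng)

    # Roll every candidate sequence through the world model in parallel.
    states = np.full(num_samples, state, dtype=float)
    returns = np.zeros(num_samples)
    for t in range(horizon):
        states, rewards = world_model_step(states, candidates[:, t])
        returns += rewards

    # MPPI weighting: softmax over returns, then a weighted average of sequences.
    weights = np.exp((returns - returns.max()) / temperature)
    weights /= weights.sum()
    plan = weights @ candidates
    return plan, returns

plan, returns = mppi_plan(state=0.0)
print("first planned action:", plan[0])
print("plan return:", rollout_return(0.0, plan))
print("mean proposal return:", returns.mean())
```

In a receding-horizon controller only `plan[0]` would be executed before replanning from the next observation. The policy decides where to look; the model decides which of those proposals to trust.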
Results are reported on ManiSkill manipulation tasks (Tao et al., 2025), specifically PickCube and PickSingleYCB. Adding the world model improved performance over the FM policy alone, with the paper noting especially clear gains at end-of-episode success. End-of-episode success is the right metric here, by the way. Task completion rate is what matters in manipulation, not intermediate reward.
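As an illustration of why end-of-episode success is the stricter measure, the sketch below compares it with a more lenient "succeeded at any step" count. The success flags are made-up toy data, and the variable names are my own labels rather than anything taken from the paper.

```python
import numpy as np

# Hypothetical per-step success flags: rows are episodes, columns are time steps.
success_flags = np.array([
    [0, 0, 1, 1],   # succeeds and holds until the end
    [0, 1, 1, 0],   # reaches success, then loses it
    [0, 0, 0, 0],   # never succeeds
    [0, 0, 0, 1],   # succeeds on the final step
], dtype=bool)

success_at_end = success_flags[:, -1].mean()      # success on the final step
success_once = success_flags.any(axis=1).mean()   # success at any step

print(f"success at end of episode: {success_at_end:.2f}")
print(f"success at any step:       {success_once:.2f}")
```

The second episode counts under the lenient measure but not under the strict one, which is exactly the kind of grasp-then-drop behavior a manipulation benchmark should penalize.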
WAM-RL is more ambitious in scope. The paper introduces a reinforcement learning framework for jointly optimizing both the world model and the action model through online interaction. The architecture consists of a WA model with a world model and an actor, trained with a hierarchical optimization scheme. The reconstruction rewards and online video supervised fine-tuning (SFT) components are designed to keep the world model grounded while the actor explores.
The most interesting empirical finding in WAM-RL is the ablation result: optimizing only the actor yields improvements on short-horizon tasks but fails to generalize to long-horizon tasks. Joint optimization of both the world model and the actor is what drives gains in long-horizon settings. This is a genuinely non-obvious result. You might expect that just improving the actor would be sufficient, since the world model is already pre-trained on demonstrations. The fact that it is not sufficient, especially for long-horizon tasks, suggests that the world model's internal representations need to update as the policy distribution shifts during RL training. That is a useful insight.
The paper claims to be the first to introduce reinforcement learning into the World-Action paradigm. I have no reason to dispute that framing based on what I found in the literature, though it is worth noting that the broader category of combining world models with RL over imitation-learned priors has a longer history.
I want to be honest about what we can and cannot conclude from these results.
FlowMPC reports improvements on two tasks: PickCube and PickSingleYCB. Both are tabletop manipulation benchmarks within ManiSkill. The paper does not, as far as I can tell, report results across a wider suite of tasks or environments, and two tasks are a limited basis for broad claims about the method's generality. Whether these gains hold for tasks requiring longer horizons, more complex contact dynamics, or substantially different object geometries remains unclear.
WAM-RL's experimental setup is broader in the sense that it explicitly tests short-horizon versus long-horizon task performance, which gives the ablation results more interpretive value. But the specific tasks and success metrics are not described in the abstract in enough detail for me to assess how representative they are of real-world manipulation difficulty. The full paper would need to be read carefully to evaluate this.
Both papers are preprints. Neither has gone through peer review. I am not saying that to dismiss them. Preprints are how the field moves quickly, and both of these appear to be serious technical contributions. But it is too early to say whether the results will replicate across different implementations, hardware setups, or task distributions.
One methodological point worth raising about FlowMPC specifically: MPPI planning at test time adds computational overhead. The paper does not, in the abstract, address inference latency or whether the planning loop is fast enough for real-time robot control. For simulation benchmarks this may not matter. For deployment on physical hardware it matters a great deal. I would want to see timing results before drawing conclusions about practical applicability.
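A back-of-the-envelope sketch shows why the overhead question matters. The numbers below are hypothetical placeholders chosen for illustration, not figures from the paper, and `planning_budget` is a helper I made up for the arithmetic.

```python
def planning_budget(num_samples, horizon, iterations, control_hz):
    """Rough per-step cost and time budget for a sampling-based planner."""
    model_calls = num_samples * horizon * iterations  # world-model evaluations per control step
    step_budget_ms = 1000.0 / control_hz              # wall-clock time available per control step
    return model_calls, step_budget_ms

# Hypothetical planner settings and control rate.
calls, budget_ms = planning_budget(num_samples=512, horizon=3, iterations=6, control_hz=20)
print(f"{calls} world-model evaluations must fit in {budget_ms:.0f} ms per control step")
```

Batching on a GPU collapses many of those evaluations into a handful of forward passes, but the horizon and iteration loops remain sequential, and sampling from the FM policy adds its own integration steps on top.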
Taken together, these two papers are pushing on the same conceptual lever: the idea that a policy trained purely on demonstrations is leaving performance on the table, and that world models are a promising mechanism for recovering some of that gap.
This is not a new idea. Model-based RL has been arguing this for years. What is shifting is the integration point. Earlier work tended to treat the world model and the policy as separate components that were trained independently and then combined. What FlowMPC and WAM-RL are both doing, in different ways, is making the interaction between the two components tighter and more principled. FlowMPC uses the FM policy to guide world-model-based search at test time. WAM-RL uses RL to co-evolve the world model and the actor during online training.
I know I am being picky here, but the framing of these papers as addressing the same problem is my interpretation, not something either paper claims explicitly. FlowMPC is focused on test-time improvement without modifying training. WAM-RL is focused on online training improvement beyond the demonstration distribution. These are related but distinct interventions, and conflating them would be imprecise.
What I find genuinely interesting is the WAM-RL finding about long-horizon tasks. Long-horizon manipulation is where most methods fall apart, because errors compound over many steps and the policy never encountered its own error distribution during training. The result that joint world-model and actor optimization is necessary for long-horizon gains, while actor-only optimization helps only on short-horizon tasks, is the kind of empirical finding that could inform architecture decisions going forward. It suggests that the world model's representations are not static scaffolding but need to adapt as the policy evolves.
FlowMPC's contribution is more narrowly scoped but also more immediately practical in a specific sense. If you already have a trained FM policy and a world model, you can potentially improve test-time performance without retraining anything. That is a low-cost intervention, assuming the planning overhead is manageable.
For FlowMPC, the obvious next step is a broader task suite and, critically, real-robot experiments. Simulation-to-real transfer is where many manipulation methods quietly fail. The ManiSkill benchmark is useful but it is not a substitute for physical hardware results. I would also want to see an analysis of how sensitive the performance gains are to world model quality. If the world model has significant prediction errors, does MPPI planning still help, or does it hurt by exploiting model inaccuracies?
For WAM-RL, the hierarchical optimization scheme needs more transparency. The abstract describes it as coordinating improvement between the world model and actor, but the specifics of the training procedure matter a great deal for reproducibility: how often each component is updated, what the reconstruction rewards look like in practice, and how the online video SFT component interacts with the RL objective. I would also want to see the ablation results quantified more precisely, specifically the magnitude of the gap between actor-only and joint optimization on long-horizon tasks.
More broadly, both papers are operating in simulation. The field's ability to close the sim-to-real gap for manipulation has improved substantially over the past few years, but it remains an open problem. World models trained on simulated dynamics may not transfer cleanly to the contact-rich, partially observable conditions of real robot manipulation. That is not a reason to dismiss simulation research, but it is a reason to be measured about what the results imply for deployed systems.
The core question, which neither paper fully answers, is whether world-model-based planning and RL fine-tuning can make imitation learning robust enough for the kind of open-ended, unstructured manipulation that would actually be useful outside a lab. The results here are encouraging in bounded settings. Whether they scale is, well, the question the field has been asking for a while.