World Models Are Finally Learning to Replace Simulators. Here's What That Actually Means.
New research from multiple labs suggests we might be approaching a genuine inflection point in how robots learn from experience, though the caveats are significant.
The most interesting paper I read this week argues that we can train robot policies entirely inside learned world models, bypassing physics simulators altogether. If you have been following this space, you know that is a genuinely significant claim.
The paper, "Coupled Local and Global World Models for Efficient First Order RL" (arXiv), introduces what the authors call a decoupled first-order gradient method. To be precise, they use a full-scale diffusion model to generate accurate forward trajectories while a lightweight latent-space surrogate handles gradient computation. The result is a system that can train reinforcement learning policies directly from real robot interactions, no hand-crafted physics simulator required.
This is not the only recent work pushing in this direction. A cluster of papers from the past few weeks suggests the field is converging on some important ideas about how robots should learn from experience. I want to walk through what is actually new here, what remains incremental, and where the open questions lie.
Physics simulators have been the backbone of robot learning for years. You build a mathematical model of your robot and its environment, run millions of simulated trials, and transfer the learned policy to the real world. This approach has produced impressive results in locomotion tasks, where the physics are relatively well understood.
Manipulation is harder. Contacts are discontinuous. Deformable objects behave in ways that are difficult to model analytically. Friction is notoriously tricky. The sim-to-real gap, that frustrating delta between what works in simulation and what works on actual hardware, tends to be larger for manipulation than for locomotion.
World models offer an alternative. Instead of hand-crafting physics equations, you learn a predictive model directly from data. The robot interacts with the real world, observes what happens, and builds an internal model of how its actions affect its environment. In principle, this should capture dynamics that are hard to simulate, including contacts, non-rigidity, and complex sensory information like visual perception.
The catch has always been computational cost. World models, particularly the diffusion-based variety that produce high-fidelity predictions, are expensive to evaluate. Popular RL approaches need to query the model many times during training, and if each query takes too long, the whole enterprise becomes impractical.
The arXiv paper addresses this directly by splitting the problem in two. A global world model (the full diffusion model) generates complete forward trajectories. A local surrogate model (a lightweight network operating in latent space) approximates the dynamics for gradient computation. The global model ensures high-fidelity unrolling; the local model ensures tractable differentiation.
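To make the split concrete, here is a minimal toy sketch of the idea, not the authors' implementation. Everything in it is an assumption made for illustration: `global_step` is a simple analytic function standing in for the expensive diffusion model, `surrogate_jacobians` is a crude constant approximation standing in for the learned latent surrogate, and the task (drive a scalar state to a target) is made up. The point is only the division of labour: states come from the accurate model, gradients come from the cheap one.

```python
import numpy as np

def global_step(state, action):
    """Hypothetical stand-in for the accurate but expensive global model."""
    return state + 0.1 * action + 0.05 * np.sin(state)

def surrogate_jacobians(state, action):
    """Hypothetical stand-in for the cheap local surrogate.

    Returns approximate (d next_state / d state, d next_state / d action).
    It deliberately ignores the sin term, so the gradient is only approximate.
    """
    return 1.0, 0.1

def rollout(s0, actions):
    """Forward pass: unroll the trajectory with the global model only."""
    states = [s0]
    for a in actions:
        states.append(global_step(states[-1], a))
    return np.array(states)

def surrogate_gradient(states, actions, target):
    """Backward pass: chain rule along the rollout using surrogate Jacobians."""
    grad = np.zeros_like(actions)
    adjoint = 2.0 * (states[-1] - target)  # d cost / d final state
    for t in reversed(range(len(actions))):
        d_state, d_action = surrogate_jacobians(states[t], actions[t])
        grad[t] = adjoint * d_action
        adjoint = adjoint * d_state
    return grad

def optimize(s0, target, horizon=10, steps=200, lr=0.5):
    """Gradient descent on an open-loop action sequence."""
    actions = np.zeros(horizon)
    costs = []
    for _ in range(steps):
        states = rollout(s0, actions)
        costs.append(float((states[-1] - target) ** 2))
        actions -= lr * surrogate_gradient(states, actions, target)
    return actions, costs

actions, costs = optimize(s0=0.0, target=1.0)
print(costs[0], costs[-1])
```

In this toy the surrogate's gradient is inexact but still points downhill, which is enough for the optimizer to converge. Whether that property holds for a learned latent surrogate on real manipulation data is exactly what the paper's experiments are meant to test.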
The authors demonstrate this on Push-T, a manipulation benchmark where the task is to push a T-shaped block into a target configuration. They report that their method significantly outperforms PPO in sample efficiency. They also evaluate on an ego-centric object manipulation task with a quadruped robot, which is a nice demonstration that the approach generalizes beyond tabletop manipulation.
It is worth noting that the evaluation here is narrow. Push-T is a single task, and while the quadruped experiment adds breadth, we are still looking at two settings. The claim that this approach is "a promising pathway for solving hard-to-model RL tasks" is reasonable, but it has not been replicated yet. I would want to see this evaluated on a much wider task distribution before drawing strong conclusions.
A separate paper from this week, "DLO-Lab: Benchmarking Deformable Linear Object Manipulations with Differentiable Physics" (arXiv), takes a different but related approach. The authors introduce a differentiable simulator designed specifically for deformable linear objects: ropes, cables, rubber bands, and similar materials.
This is interesting because it represents a middle ground between traditional simulators and pure world models. The simulator is still physics-based, but it is differentiable, meaning you can backpropagate gradients through the physics to train policies more efficiently. And it is designed to handle the specific challenges of deformable objects, including extensibility, elasticity, bending plasticity, and complex interactions with other objects.
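For readers who have not worked with differentiable simulators, a tiny sketch may help. This is not DLO-Lab and has nothing to do with deformable objects; it is a hypothetical point mass with linear drag, chosen because the derivative can be carried through the integrator by hand. The function names, parameters, and the task (find the constant force that lands the mass at a target position) are all assumptions for illustration.

```python
def simulate(force, steps=50, dt=0.02, drag=0.5):
    """Unit point mass with linear drag, integrated with semi-implicit Euler.

    Returns the final position and its derivative with respect to `force`.
    The derivative is propagated through every integration step, which is
    what "differentiating through the physics" means in the simplest case.
    """
    x, v = 0.0, 0.0
    dx, dv = 0.0, 0.0  # sensitivities of x and v with respect to force
    for _ in range(steps):
        accel = force - drag * v
        d_accel = 1.0 - drag * dv
        v, dv = v + dt * accel, dv + dt * d_accel
        x, dx = x + dt * v, dx + dt * dv
    return x, dx

def fit_force(target, iters=20):
    """Use the simulator's own gradient to solve for the force that hits `target`."""
    force = 0.0
    for _ in range(iters):
        x, dx = simulate(force)
        force -= (x - target) / dx  # Newton step on the scalar residual
    return force

force = fit_force(target=0.5)
print(force, simulate(force)[0])
```

A non-differentiable simulator would have to estimate that derivative by perturbing the force and re-running, which becomes expensive as the number of parameters grows. That is the efficiency argument for differentiable physics in a nutshell.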
The paper also introduces a benchmark suite and a specialized agent architecture that explicitly manages the topological complexity and grasp sensitivity inherent to deformable linear objects. The agent proposes strategic grasping points and decomposes long-horizon tasks to maximize control authority.
I know I am being picky here, but the phrase "maximize control authority" is doing a lot of work in that sentence. What the authors actually mean is that the agent tries to grasp the object in ways that give it more influence over the object's configuration. A grasp near the middle of a rope gives you less control than a grasp near the end, for certain tasks. The agent learns to reason about this.
The sim-to-real transfer experiments are preliminary but encouraging. The authors demonstrate that policies trained in their simulator can transfer to real robots manipulating real deformable objects. The gap is not zero, but it is small enough to be useful.
A third paper, "L-SDPPO: Policy Optimization of Spiking Diffusion Policy for Intra-vehicular Robotic Manipulation" (arXiv), addresses a constraint that does not get enough attention in the academic literature: energy consumption.
The setting is intra-vehicular robots in spacecraft. These robots help reduce astronaut workload, but they operate under severe power budgets. Diffusion policies, which have become popular for their ability to model complex multimodal action distributions, are computationally expensive. Their iterative sampling process consumes too much energy for spacecraft applications.
The authors propose a spiking diffusion policy optimized with reinforcement learning. Spiking neural networks are inspired by biological neurons; they communicate through discrete spikes rather than continuous activations, which can be more energy-efficient on neuromorphic hardware. The paper also introduces what the authors call "state-dependent latency injection," which mimics biological neural delays to dynamically regulate the timing of input information.
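As a rough illustration of what "discrete spikes rather than continuous activations" means, here is a minimal leaky integrate-and-fire neuron, the textbook building block of spiking networks. It is a sketch under my own assumptions, not the neuron model from the paper: the `threshold` and `leak` values are arbitrary, and it does not attempt to reproduce state-dependent latency injection.

```python
def lif_spikes(input_current, threshold=1.0, leak=0.9):
    """Leaky integrate-and-fire neuron over a sequence of input values.

    The membrane potential decays by `leak` each step, accumulates the
    input, and emits a spike (1) and resets whenever it reaches `threshold`.
    """
    potential = 0.0
    spikes = []
    for current in input_current:
        potential = leak * potential + current
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # reset after firing
        else:
            spikes.append(0)
    return spikes

print(lif_spikes([0.6, 0.6, 0.0, 0.6, 0.6]))  # [0, 1, 0, 0, 1]
```

The output is a sparse binary train: most steps produce no spike at all. On neuromorphic hardware, which is designed around that sparsity, silent steps can cost very little energy, and that is the source of the efficiency argument.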
The evaluation covers five representative intra-vehicular tasks, including hatch opening and precision container capping. The authors report higher success rates and lower energy consumption compared to state-of-the-art methods.
I should note that "state-of-the-art" is a fuzzy term here. The comparison baselines are reasonable but not exhaustive, and the tasks are specific to the spacecraft domain. Whether these results generalize to terrestrial applications with different energy constraints remains unclear.
The fourth paper I want to discuss, "VOLT: Vision and Language Trajectory Segmentation for Faster-than-Demonstration Policies" (arXiv), tackles a different problem entirely: learning to execute tasks faster than the human demonstrations used for training.
This is actually a significant practical concern. When a human demonstrates a task for a robot, they often move slower than necessary, whether for safety, for clarity, or simply because they are being careful. If the robot learns to replicate the demonstration at the same pace, it will be slower than it needs to be.
The naive solution is to uniformly downsample the demonstration trajectory. If the human took 10 seconds, train on a version that takes 5 seconds. But this is problematic because some parts of a task genuinely require slow, precise motion (object interactions, fine manipulation) while others can be safely accelerated (unconstrained motion through free space).
VOLT uses vision and language to segment trajectories and identify which parts can be sped up and which parts require careful precision. The method reasons over video demonstrations and leverages contextual cues to make these decisions. The resulting reformatted trajectories can then be used with standard imitation learning approaches.
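The difference between uniform and segment-aware speed-up is easy to see in a toy example. The sketch below is not VOLT: it assumes the hard part (deciding which steps are precision-critical) has already been done and simply takes a hypothetical per-step `precise` flag as input. The function name and the integer waypoints are made up for illustration.

```python
def retime(waypoints, precise, speedup=2):
    """Keep every waypoint in precise spans; keep every `speedup`-th elsewhere.

    `precise` is a per-waypoint boolean label. In this sketch it is given;
    producing it reliably is the actual problem VOLT addresses.
    """
    kept = []
    skipped = 0
    for point, is_precise in zip(waypoints, precise):
        if is_precise:
            kept.append(point)
            skipped = 0  # restart the skip counter after a precise span
        else:
            if skipped == 0:
                kept.append(point)
            skipped = (skipped + 1) % speedup
    return kept

waypoints = list(range(10))
# Steps 4 to 6 stand in for a contact phase that must not be accelerated.
precise = [False] * 4 + [True] * 3 + [False] * 3

print(waypoints[::2])               # uniform:       [0, 2, 4, 6, 8]
print(retime(waypoints, precise))   # segment-aware: [0, 2, 4, 5, 6, 7, 9]
```

Uniform downsampling drops waypoint 5 from the middle of the contact phase, while the segment-aware version keeps that span intact and only thins out the free-space motion. A real pipeline would also need to handle end conditions, such as always keeping the final waypoint, which this sketch does not guarantee.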
The authors emphasize that segmentation quality is critical. Baseline methods often misidentify when acceleration is possible, leading to overly cautious or unreliable policies. This is, in a way, an obvious point, but it is worth stating explicitly: the bottleneck in this approach is not the learning algorithm, it is the ability to correctly identify which parts of the task are time-critical.
At first glance, these four papers seem to address different problems. World models, deformable objects, energy efficiency, and execution speed are not obviously related. But I think there is a common thread.
All four papers are grappling with the limitations of current simulation-based approaches. The world model paper bypasses simulators entirely. The deformable object paper builds a specialized differentiable simulator because general-purpose simulators do not handle deformable objects well. The spiking network paper addresses the computational cost of diffusion models, which matters when you cannot afford to run expensive simulations. And the VOLT paper deals with the fact that human demonstrations are not always the right speed, which is a form of sim-to-real gap (or, more precisely, demonstration-to-deployment gap).
The field is, I think, gradually recognizing that physics simulators are not a universal solution. They work well for some problems (locomotion, rigid body manipulation) and poorly for others (deformable objects, contact-rich manipulation, tasks where the physics are hard to model). The research community is exploring multiple paths forward: learning world models from data, building specialized differentiable simulators, reducing computational costs, and rethinking how we use demonstrations.
Several things remain unclear to me after reading this cluster of papers.
First, how do these approaches scale? The world model paper demonstrates results on Push-T and a single quadruped task. The deformable object paper introduces a benchmark, but the tasks are relatively constrained. We do not yet know whether these methods will work on the kind of diverse, open-ended manipulation tasks that robots will encounter in real deployments.
Second, how do the computational costs compare in practice? The world model paper claims efficient gradient computation, but "efficient" is relative. The spiking network paper explicitly targets low-energy settings, but neuromorphic hardware is not widely available. The actual wall-clock time and energy consumption for training and deployment across these methods have not been systematically compared.
Third, what happens when these approaches are combined? Could you train a spiking diffusion policy inside a learned world model? Could you use VOLT-style trajectory segmentation to improve sample efficiency in world model training? The papers do not address these combinations, and it is too early to say whether they would be complementary or redundant.
If I were advising a research group working in this area, I would push for three things.
First, systematic benchmarking across methods. The field needs a shared evaluation framework that covers a diverse range of manipulation tasks, from rigid to deformable, from tabletop to mobile, from slow to fast. The DLO-Lab benchmark is a step in this direction, but it is specific to deformable linear objects. We need something broader.
Second, explicit attention to failure modes. These papers report success rates, but they do not always analyze why the methods fail when they fail. Understanding failure modes is often more informative than understanding successes, and it is essential for knowing when these methods are safe to deploy.
Third, more real-world evaluation. Sim-to-real transfer experiments are present in some of these papers, but they are often brief. The world model paper is notable for training directly from real robot interactions, which is the right direction. But the evaluation is still limited. I would want to see extended deployments on real hardware, with careful analysis of how performance degrades over time and across environmental variations.
This cluster of papers represents genuine progress on some hard problems in robot learning. The world model work is particularly interesting because it demonstrates that training inside learned models is becoming practical, not just theoretically appealing. The deformable object work fills an important gap in simulation capabilities. The spiking network work addresses energy constraints that will become increasingly important as robots move into power-limited settings. And the VOLT work tackles a practical problem that has been somewhat neglected in the academic literature.
None of these papers is revolutionary. That word gets overused in robotics, and it usually signals that someone is overselling their results. What we have here is incremental progress on multiple fronts, which is exactly what the field needs. The hard work of building reliable, capable robot manipulation systems requires this kind of steady accumulation of techniques and insights.
I am cautiously optimistic that we are approaching a point where robots can learn manipulation skills from real-world experience without relying on hand-crafted simulators. But we are not there yet, and the caveats I have outlined (limited task diversity, unclear scaling, insufficient real-world evaluation) are significant. The next few years will tell us whether these approaches can deliver on their promise.