Photo credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
I've been watching robotics researchers chase the sim-to-real dream for the better part of a decade now, and every few years someone announces they've finally cracked it. Usually they haven't. But something different is happening in the labs right now, and I'll admit it's got my attention.
The latest papers coming out of MIT, Stanford, and various AI labs are converging on an idea that would have sounded ridiculous five years ago: instead of building elaborate physics simulators to train robots, just use video generation models. The same technology that makes those weird AI videos of Will Smith eating spaghetti? Turns out it might actually be useful for something.
The core insight is deceptively simple. Video models have absorbed millions of hours of footage showing how objects move, fall, stack, and interact. They've learned intuitive physics not from equations but from watching the world. So why not use that knowledge to train robots?
A system called GE-Sim 2.0 is pushing this idea hard. It's a "closed-loop video world simulator" for robotic manipulation, which basically means it generates plausible video of what would happen if a robot took a particular action, then uses that generated video to train policies. The team retrained their model on thousands of hours of real robot footage (teleoperation, contact-rich interaction, actual policy deployment) and claims it now tops the public WorldArena leaderboard at only 2 billion parameters. That's notably smaller than many general video generators, and the researchers say policies trained against its simulated rollouts actually translate into real-world gains.
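To make "closed-loop" concrete, here is a toy sketch, not GE-Sim's actual code. The `toy_world_model` function is a hypothetical stand-in for a learned video generator, the "observation" is just a position on a line, and the "policy" is a single gain picked by brute-force search. The only point is the loop: every imagined observation gets fed back in as the next input, and the policy is chosen entirely from imagined rollouts.

```python
def toy_world_model(state, action):
    """Hypothetical stand-in for a learned video model.
    Predicts the next observation (here, a gripper position on a line)."""
    return state + 0.5 * action  # assumed dynamics, for illustration only

def rollout(policy_gain, start, goal, steps=20):
    """Closed loop: each predicted observation becomes the next input."""
    state = start
    for _ in range(steps):
        action = policy_gain * (goal - state)   # simple proportional policy
        state = toy_world_model(state, action)  # imagined, not real, transition
    return abs(goal - state)                    # final distance to the goal

def train_in_simulation(candidates, start=0.0, goal=1.0):
    """Pick the policy parameter with the lowest error in imagined rollouts."""
    return min(candidates, key=lambda gain: rollout(gain, start, goal))

best_gain = train_in_simulation([0.1, 0.5, 1.0, 2.0, 5.0])
```

Whether the chosen policy works on a real robot depends entirely on how well the model's imagined transitions match reality, which is exactly the sim-to-real question.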
Call me old-fashioned, but I remain skeptical of leaderboard claims. We've seen this movie before with self-driving cars, where simulation results looked fantastic right up until the moment they didn't. But the approach itself is interesting.
The real question is whether video models can serve as reliable reward signals. This is where a project called SOLE-R1 comes in, and it's attempting something genuinely ambitious. The researchers built a video-language reasoning model designed to be the sole reward signal for online reinforcement learning. Give it raw video and a natural language goal, and it performs what they call "per-timestep spatiotemporal chain-of-thought reasoning" to estimate task progress.
The results they're reporting are, frankly, hard to believe. They claim SOLE-R1 enables zero-shot online RL from random initialization, meaning robots learn manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning. They tested it on 24 unseen tasks and say it substantially outperforms GPT-5, Gemini-3-Pro, and several purpose-built robot rewarders.
I should note that claims of outperforming GPT-5 and Gemini-3-Pro should be taken with appropriate salt. These comparisons are always tricky because you're comparing a specialized model against general-purpose systems being asked to do something they weren't optimized for. It's a bit like saying your custom racing drone beats a commercial passenger jet in a slalom course. True, perhaps, but what does it really tell us?
What's more interesting to me is their claim about "reward hacking." This is the perennial problem where RL agents find ways to game the reward signal rather than actually solving the task. SOLE-R1 apparently shows "markedly greater robustness" to this, which, if true, would be a meaningful advance. It remains unclear how well this holds up outside their test environments.
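For readers wondering how a per-timestep progress estimate could become an RL reward at all, here is a minimal sketch. It assumes the reward is simply the change in estimated progress from one frame to the next. That rule is my assumption, not something the SOLE-R1 authors describe, and the numbers in `estimates` are made up.

```python
def progress_to_rewards(progress):
    """Turn per-timestep progress estimates (0.0 to 1.0) into step rewards.
    Assumption: reward is the change in estimated progress between frames."""
    rewards = []
    previous = 0.0
    for p in progress:
        p = min(max(p, 0.0), 1.0)  # clamp noisy estimates into range
        rewards.append(p - previous)
        previous = p
    return rewards

# Hypothetical estimates from a video-language model over five frames
estimates = [0.0, 0.2, 0.5, 0.4, 1.0]
rewards = progress_to_rewards(estimates)
```

One property of this particular rule: the rewards add up to the final progress estimate, so an agent gains nothing by moving forward and backward repeatedly. It can still be fooled if the estimator itself is wrong, which is where the robustness claim would have to do its work.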
Another approach sidesteps the reward problem entirely. A project called VERA (Video-to-Embodied Robot Action Model) takes a different tack: leave the video planner completely untouched and just train an inverse dynamics model to translate video predictions into robot actions. The video model stays "embodiment-agnostic," meaning the same video planner can theoretically work across different robot bodies by swapping out the translation layer.
This decoupling is clever for a practical reason the researchers highlight: you can train the inverse dynamics model with self-play data, which is way easier to collect than expert demonstrations. The robot just flails around and records what happens, then learns to map observations to actions. They've demonstrated this on a Panda arm and a 16-degree-of-freedom Allegro hand doing cube manipulation, which is legitimately difficult!
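Here is the inverse dynamics idea boiled down to a toy, under heavy assumptions: a one-dimensional "robot" whose dynamics (`TRUE_GAIN`) are hidden from the learner, a linear model fit by least squares, and a hand-written list of planned states standing in for the video planner's output. None of these names come from VERA; a real system maps images to joint commands with a neural network.

```python
import random

TRUE_GAIN = 0.5  # hidden toy dynamics: next = state + TRUE_GAIN * action

def step(state, action):
    return state + TRUE_GAIN * action

def collect_self_play(n, seed=0):
    """Random 'flailing': record (state, action, next_state) triples."""
    rng = random.Random(seed)
    data, state = [], 0.0
    for _ in range(n):
        action = rng.uniform(-1.0, 1.0)
        nxt = step(state, action)
        data.append((state, action, nxt))
        state = nxt
    return data

def fit_inverse_dynamics(data):
    """Least-squares fit of: action ~= w * (next_state - state)."""
    num = sum((nxt - s) * a for s, a, nxt in data)
    den = sum((nxt - s) ** 2 for s, a, nxt in data)
    return num / den

def plan_to_actions(w, planned_states):
    """Translate consecutive planned observations into actions."""
    return [w * (b - a) for a, b in zip(planned_states, planned_states[1:])]

w = fit_inverse_dynamics(collect_self_play(200))
actions = plan_to_actions(w, [0.0, 0.25, 0.5, 1.0])
```

Notice that the planner never sees `TRUE_GAIN`. Swap in a robot with different dynamics, refit `w` from fresh self-play data, and the same plan still translates, which is the embodiment-agnostic argument in miniature.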
But what do I know. These kids are building systems I couldn't have imagined when I started covering tech, and I'm still not entirely sure whether to be impressed or terrified.
The efficiency angle is where things get commercially interesting. A smaller model called ProgVLA is explicitly designed for "tight compute and memory budgets," which is code for "robots that don't have a data center attached." At 0.1 billion parameters, it's tiny by modern standards, yet the researchers claim it reaches success rates competitive with much larger models and actually exceeds them on long-horizon tasks.
The trick is a two-stage compression scheme that squashes visual, language, and proprioceptive data into a fixed set of tokens, plus auxiliary "progress heads" trained with offline RL to estimate how far along a task the robot is. Basically, the model learns to ask itself "am I making progress?" which turns out to help quite a bit on tasks that require multiple sequential steps.
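The "fixed set of tokens" part is easier to picture with a toy. The sketch below swaps in plain chunked averaging for whatever learned compressor ProgVLA actually uses, and the way I split the two stages (pool each modality first, then pool the combined result) is a guess for illustration. Function names and token counts are hypothetical.

```python
def pool_to_fixed(tokens, k):
    """Compress a list of equal-length feature vectors into exactly k vectors
    by averaging contiguous chunks (a stand-in for a learned compressor)."""
    n = len(tokens)
    if n < k:
        raise ValueError("need at least k tokens")
    out = []
    for i in range(k):
        start, end = i * n // k, (i + 1) * n // k
        chunk = tokens[start:end]
        dim = len(chunk[0])
        out.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return out

def compress_observation(visual, language, proprio, k_per_modality=2, k_final=4):
    """Stage 1: pool each modality separately. Stage 2: pool the concatenation."""
    stage1 = []
    for stream in (visual, language, proprio):
        stage1.extend(pool_to_fixed(stream, k_per_modality))
    return pool_to_fixed(stage1, k_final)

# Hypothetical inputs: 8 visual, 3 language, and 2 proprioceptive tokens of width 2
visual = [[float(i), 1.0] for i in range(8)]
language = [[0.0, float(i)] for i in range(3)]
proprio = [[0.5, 0.5], [1.5, 1.5]]
compressed = compress_observation(visual, language, proprio)
```

However many tokens go in, exactly `k_final` come out, so the downstream model's compute and memory cost stays constant. That is the property that matters on a robot without a data center attached.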
I find the focus on long-horizon tasks particularly notable. Most robot learning demos show single-step manipulation (pick this up, put it there) because that's what current systems do well. Multi-step tasks are where things fall apart, and any approach that specifically targets this weakness is worth watching.
There's also work on using generated futures as exploration guides. A system with the unwieldy name LLM-Guided Future Hypotheses for Horizon-Aware Exploration conditions robot policies on short predicted videos of what should happen next. An LLM reasons about the task, a simulation generates the intended object motion, and a video diffusion model synthesizes what that would look like with a robot in frame.
The researchers tested this with correct futures, generated futures, and deliberately wrong futures. As you'd expect, correct futures help most, generated futures help less but still improve over no future conditioning, and wrong futures make everything worse. This last finding is actually important: it suggests the model is genuinely using the future predictions rather than ignoring them.
The 3D spatial reasoning gap is another active area. Standard vision-language-action models often lack explicit 3D understanding, which matters when you're trying to manipulate physical objects in three-dimensional space. GaussianDream addresses this by adding what they call "learnable GaussianDream Queries" that capture 3D structure and short-horizon future evolution.
The claimed results are impressive: 98.4% on LIBERO, 54.8% on RoboCasa Human-50, and 50.0% on real-robot tasks. Though I'll note that "real-robot tasks" in this case means toy-kitchen environments, which is a step toward real deployment but not quite the same as an actual kitchen with actual mess and actual unexpected situations.
So where does this leave us? I've seen enough hype cycles to know that impressive papers don't automatically translate into working products. The gap between "works in the lab" and "works in your house" has swallowed many promising technologies whole.
But the convergence here is striking. Multiple independent research groups, using different approaches, are all betting on video models as the path forward for robot learning. That kind of convergence usually means something real is happening, even if the timeline to practical deployment remains, well, unclear.
The optimistic read is that we're finally finding ways to transfer the massive investment in foundation models into physical robotics. The pessimistic read is that we're about to repeat the self-driving car trajectory: stunning demos followed by years of grinding through edge cases.
My guess? Probably both. These approaches will work for some tasks much sooner than skeptics expect, and fail at others for much longer than optimists hope. The interesting question is which tasks fall into which bucket.
If you want to argue about it, my email's on the about page.