Are Video Models the Future of Robot Control? Six New Papers Suggest We're Getting Closer
A wave of research is converging on vision-language-action models, but the field still can't agree on the best way to turn pixels into robot movements.
What if your robot could learn to manipulate objects just by watching videos of humans doing the same thing?
This question has animated robotics research for years, but a cluster of recent papers suggests we might finally be approaching something like an answer. The catch, as always, is that the answer is complicated, contested, and comes with enough caveats to fill a dissertation. Six new preprints on arXiv are all wrestling with variations of the same fundamental problem: how do you turn a model that understands video into a model that can actually control a robot?
I've spent the past week reading through these papers, and what strikes me is not just the technical progress but the emerging fault lines in how researchers are approaching the problem. There are genuine disagreements here about architecture, training methodology, and what "generalization" even means in this context. Let me walk through what's actually new, what's incremental, and what remains frustratingly unclear.
To be precise, the challenge here is what researchers call the "modality gap." Video models are trained to predict pixels. Robot policies need to output joint angles, gripper commands, or end-effector positions. These are fundamentally different things, and bridging them is harder than it sounds.
The dominant approach right now involves Vision-Language-Action (VLA) models, which build on pretrained vision-language models (think: models that can look at an image and describe it) and add an action prediction head. The idea is that if a model understands what's happening in a scene and can follow language instructions, maybe it can also figure out what actions to take.
But there's a problem. Actually, several problems.
First, most VLA models require extensive fine-tuning with action-labeled data, which is expensive to collect and often specific to particular robot embodiments. Second, these models tend to lack genuine 3D understanding; they're working from 2D images and hoping that's enough. Third, they struggle to generalize to tasks they haven't seen before, which rather defeats the purpose of building a "foundation model."
The six papers I'm examining each take a different angle on these challenges. Some are genuinely novel. Others are, and I know I'm being picky here, incremental improvements dressed up in foundation-model language.
The most architecturally interesting paper comes from MIT CSAIL. VERA (Video-to-Embodied Robot Action Model) takes a counterintuitive approach: instead of training a single model that jointly predicts video and actions, it keeps the video model completely frozen and trains a separate inverse dynamics model (IDM) to translate video predictions into robot commands.
This decoupling offers some genuine advantages. The video planner remains embodiment-agnostic, meaning you can theoretically use the same video model for a Panda arm, an Allegro hand, or any other robot. You just train a different IDM for each embodiment. The IDM itself can be trained with self-play data, which is much cheaper to collect than human demonstrations.
What's genuinely new here is the IDM design, which incorporates the robot's Jacobian (the mathematical relationship between joint velocities and end-effector velocities). This makes the IDM more data-efficient and allows it to scale to high-dimensional action spaces. The authors demonstrate this on a 16-DoF Allegro hand doing cube reorientation, which is a legitimately difficult task.
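To make the Jacobian idea concrete, here is a minimal sketch of how a Jacobian relates end-effector velocity to joint velocity. This is not VERA's IDM, whose internals I'm not reproducing here. It assumes a toy two-link planar arm, and the function names `planar_jacobian` and `joint_velocities`, along with the damped least-squares solve, are my own illustrative choices.

```python
import numpy as np

def planar_jacobian(q, link_lengths):
    """Jacobian of a planar serial arm's end-effector (x, y) w.r.t. joint angles."""
    q = np.asarray(q, dtype=float)
    lengths = np.asarray(link_lengths, dtype=float)
    cum = np.cumsum(q)  # absolute angle of each link
    n = len(q)
    J = np.zeros((2, n))
    for i in range(n):
        # Rotating joint i moves every link from i onward.
        J[0, i] = -np.sum(lengths[i:] * np.sin(cum[i:]))
        J[1, i] = np.sum(lengths[i:] * np.cos(cum[i:]))
    return J

def joint_velocities(J, ee_velocity, damping=1e-3):
    """Damped least-squares solution of J @ dq = ee_velocity."""
    JJt = J @ J.T
    reg = damping ** 2 * np.eye(J.shape[0])
    return J.T @ np.linalg.solve(JJt + reg, ee_velocity)

# Toy example (hypothetical values): two unit-length links.
q = np.array([0.3, 0.5])
v = np.array([0.1, 0.0])          # desired end-effector velocity
J = planar_jacobian(q, [1.0, 1.0])
dq = joint_velocities(J, v)       # joint velocities that realize v
```

The point of the sketch is that the Jacobian already encodes the arm's geometry, so a learned model built around it does not have to rediscover that structure from data. That is one plausible reading of why the authors report better data efficiency.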
The results look strong: zero-shot transfer to Panda arm manipulation and successful dexterous manipulation. But I want to flag some limitations. The paper doesn't report failure modes in detail, and it's unclear how the approach handles situations where the video model generates physically implausible predictions. The sample size for real-world experiments appears to be modest, though the exact numbers are in the supplementary material.
One of the more significant contributions comes from 3DVLA, which directly addresses what the authors call a "critical limitation" of existing VLA models: their lack of 3D scene understanding.
It's worth noting that this isn't a new complaint. Roboticists have been pointing out for years that 2D vision models struggle with depth perception, occlusion, and spatial reasoning. What 3DVLA does is provide a plug-and-play framework for injecting 3D reasoning into pretrained VLAs without requiring expensive instance-level annotations.
The technical approach involves three components: multi-view consistency constraints across all modalities, an instance estimation module with high-level tokens for 3D instance awareness, and a masked self-supervised branch for handling occlusions. The authors integrate this with multiple VLA baselines and test on LIBERO-Plus and RoboTwin 2.0.
The results show "consistent and significant gains," though I'd want to see more detailed ablations before getting too excited. The paper doesn't fully address how much of the improvement comes from each component, and the computational overhead of the 3D encoding isn't clearly quantified.
CogVLA tackles a different problem: the computational cost of training and running VLA models. The authors argue, correctly, that extensive post-training requirements limit scalability and deployment.
Their solution draws inspiration from human multimodal coordination (their framing, not mine) and introduces instruction-driven routing and sparsification. In plain English: they use the task instruction to selectively compress visual information and prune irrelevant tokens, reducing the computational load.
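As a rough illustration of what instruction-driven pruning can look like, the sketch below scores each visual token by cosine similarity to an instruction embedding and keeps the top fraction. This is an assumption-laden toy, not CogVLA's routing mechanism: the function name `prune_visual_tokens`, the `keep_ratio` parameter, and the use of cosine similarity are all hypothetical.

```python
import numpy as np

def prune_visual_tokens(visual_tokens, instruction_embedding, keep_ratio=0.25):
    """Keep the visual tokens most similar to the instruction embedding.

    visual_tokens: array of shape (num_tokens, dim)
    instruction_embedding: array of shape (dim,)
    Returns (kept_tokens, kept_indices), with indices in original order.
    """
    tokens = np.asarray(visual_tokens, dtype=float)
    instr = np.asarray(instruction_embedding, dtype=float)
    eps = 1e-8  # avoids division by zero for all-zero tokens
    norms = np.linalg.norm(tokens, axis=1) * np.linalg.norm(instr) + eps
    scores = tokens @ instr / norms
    k = max(1, int(round(keep_ratio * len(tokens))))
    top = np.argsort(scores)[-k:]   # indices of the k highest scores
    kept = np.sort(top)             # preserve original token order
    return tokens[kept], kept

# Toy example (hypothetical values): four 2-D "tokens", keep half.
tokens = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
instruction = np.array([1.0, 0.0])
kept_tokens, kept_idx = prune_visual_tokens(tokens, instruction, keep_ratio=0.5)
```

Fewer tokens means less attention compute downstream, which is where latency savings of the kind the paper reports would come from.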
The numbers are impressive on paper: a 2.5-fold reduction in training costs and a 2.8-fold decrease in inference latency compared to OpenVLA, while achieving a 97.4% success rate on LIBERO and 70.0% on real-world tasks.
But I have questions. The LIBERO benchmark, while useful, has been criticized for being somewhat narrow. And the real-world evaluation, while showing strong results, doesn't specify the task diversity or environmental variations. Success rate alone doesn't tell us much about robustness.
Two papers focus specifically on generalization to unseen tasks, which remains the holy grail of robot learning.
VLA-Pro introduces what the authors call "procedural memory transfer." The idea is to store task-specific LoRA adapters (lightweight fine-tuning modules) during training and retrieve relevant ones at inference time based on the current context. It's basically a memory bank of learned skills that the model can draw on for new tasks.
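The memory-bank idea can be sketched as a simple nearest-neighbor lookup. I'm assuming here that each adapter is stored alongside a context embedding and retrieved by cosine similarity. The class name `AdapterMemory`, the adapter identifiers, and the similarity measure are hypothetical and are not taken from the paper.

```python
import numpy as np

class AdapterMemory:
    """Toy memory bank mapping context embeddings to adapter identifiers."""

    def __init__(self):
        self.keys = []       # unit-normalized context embeddings
        self.adapters = []   # identifiers of stored adapters

    def store(self, key, adapter_id):
        key = np.asarray(key, dtype=float)
        self.keys.append(key / np.linalg.norm(key))
        self.adapters.append(adapter_id)

    def retrieve(self, query, top_k=1):
        """Return the top_k (adapter_id, similarity) pairs for a query embedding."""
        query = np.asarray(query, dtype=float)
        query = query / np.linalg.norm(query)
        sims = np.stack(self.keys) @ query      # cosine similarities
        order = np.argsort(-sims)[:top_k]       # most similar first
        return [(self.adapters[i], float(sims[i])) for i in order]

# Toy example (hypothetical skills and embeddings).
memory = AdapterMemory()
memory.store([1.0, 0.0], "pick_up_block")
memory.store([0.0, 1.0], "open_drawer")
best = memory.retrieve([0.9, 0.2], top_k=1)
```

In a real system the retrieved identifier would select LoRA weights to load into the model. How similarity is defined determines which "new" tasks can benefit, which is exactly why the definition of task similarity matters so much.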
The results are striking: up to 207% relative improvement in simulation and an increase in real-world success rate from 5.8% to 65.0%. That's a massive jump. But the 5.8% baseline is so low that I wonder about the experimental setup. What exactly is being measured here? The paper describes it as "cross-task generalization," but the definition of task similarity matters enormously.
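To see why a low baseline deserves scrutiny, it helps to work the arithmetic on the real-world numbers. The helper name `relative_improvement` is my own. Only the 5.8% and 65.0% figures come from the paper as reported above.

```python
def relative_improvement(baseline, new):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

baseline, improved = 5.8, 65.0            # real-world success rates (%)
absolute_gain = improved - baseline       # gain in percentage points
relative_gain = relative_improvement(baseline, improved)
ratio = improved / baseline               # how many times higher
```

That works out to a gain of 59.2 percentage points, which is a relative improvement of roughly 1,021%, or about 11 times the baseline. Relative figures inflate quickly when the denominator is small, so the absolute numbers and the choice of baseline are what need examining.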
MVP-LAM takes a different approach, focusing on learning better latent actions from multi-view videos. The key insight is that latent actions learned from single-view videos can become overly reliant on viewpoint-specific cues rather than capturing the underlying action. By training with a cross-viewpoint reconstruction objective, the model is forced to learn more action-centric representations.
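A toy version of a cross-viewpoint reconstruction objective shows the intuition. The latent action is inferred from one camera's frame pair and must predict the next frame of a different camera. Everything here is an illustrative assumption, including the vector "frames", the fixed per-camera offsets, and the function names. MVP-LAM's actual encoder, decoder, and loss are not reproduced.

```python
import numpy as np

def cross_view_loss(encode, decode, view_a, view_b):
    """Mean squared error of predicting view B's next frame from view A's latent action.

    view_a, view_b: (current_frame, next_frame) pairs showing the same
    transition from two different cameras.
    """
    cur_a, next_a = view_a
    cur_b, next_b = view_b
    z = encode(cur_a, next_a)        # latent action inferred from view A
    pred_next_b = decode(cur_b, z)   # applied to view B's current frame
    return float(np.mean((pred_next_b - next_b) ** 2))

# Toy world: each camera observes the state plus a fixed, camera-specific offset.
state = np.array([0.2, 0.4])
action = np.array([0.1, -0.05])
offset_a, offset_b = np.array([1.0, 0.0]), np.array([0.0, 2.0])
view_a = (state + offset_a, state + action + offset_a)
view_b = (state + offset_b, state + action + offset_b)

# An action-centric latent (frame difference) cancels the camera offset.
action_centric = lambda cur, nxt: nxt - cur
# A viewpoint-dependent latent (raw next frame) leaks the camera offset.
view_dependent = lambda cur, nxt: nxt
decode = lambda cur, z: cur + z

good_loss = cross_view_loss(action_centric, decode, view_a, view_b)
bad_loss = cross_view_loss(view_dependent, decode, view_a, view_b)
```

In this toy, the action-centric latent transfers across cameras with near-zero error, while the viewpoint-dependent one is penalized. That is the pressure the cross-view objective is meant to apply.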
This is incremental over prior work on latent action models, but the multi-view training objective is a sensible addition. The evaluation on Bridge V2 shows improved mutual information with ground-truth actions, which is a more rigorous metric than task success alone.
DUST (Dual-Stream Diffusion) represents the world-model-augmented approach. The idea is that if your model can predict what the world will look like after an action, it can use that prediction to plan better actions.
The challenge, as the authors note, is jointly predicting states and actions despite the modality gap. Their solution is a multimodal diffusion transformer with separate streams for each modality and cross-modal knowledge sharing. They also introduce an asynchronous sampling method that improves performance through inference-time scaling.
The results show 6% gains over baselines on RoboCasa and GR-1, with inference-time scaling providing an additional 2-5% improvement. In real-world tasks with a Franka Research 3, DUST outperforms baselines by 10% in success rate.
I find the inference-time scaling result particularly interesting. It suggests that these models have untapped capacity that can be accessed with more compute at test time, similar to what we've seen in language models. But 2-5% improvement isn't transformative, and the computational cost of inference-time scaling isn't fully characterized.
After reading these six papers, I'm left with more questions than answers. That's not a criticism; it's the nature of an active research area. But here's what would help clarify the state of the field:
Standardized benchmarks with clearer metrics. Success rate on LIBERO tells us something, but it doesn't tell us enough. We need metrics that capture robustness, sample efficiency, computational cost, and generalization to out-of-distribution scenarios. The field is converging on certain benchmarks, but the evaluation protocols vary enough to make cross-paper comparisons difficult.
More detailed failure analysis. Every paper reports success rates, but almost none report systematic failure modes. When does VERA's inverse dynamics model break down? What kinds of occlusions defeat 3DVLA's approach? Without this information, it's hard to know which approach to use for which application.
Real-world evaluation at scale. The real-world experiments in these papers typically involve a single robot arm doing a handful of tasks in a controlled environment. That's a reasonable starting point, but it doesn't tell us much about deployment in actual homes, warehouses, or factories. The gap between lab success and real-world robustness remains, well, unclear.
Honest assessment of what's incremental versus novel. I appreciate that academic incentives push researchers toward claiming novelty, but some of these contributions are more incremental than the abstracts suggest. That's fine; incremental progress is how science works. But it makes the papers harder to evaluate when everything is framed as a breakthrough.
Stepping back, what do these papers tell us about the state of robot learning?
First, the field has largely converged on vision-language-action models as the dominant paradigm. The debates are now about architecture details, not fundamental approach. This is probably healthy; it allows for systematic comparison and incremental improvement.
Second, the gap between video understanding and robot control is narrowing but not closed. VERA's decoupled approach and 3DVLA's 3D injection both represent genuine progress, but neither solves the fundamental problem of turning rich visual understanding into precise physical actions.
Third, efficiency matters. CogVLA's focus on training cost and inference latency reflects a growing recognition that academic benchmarks are not the same as real-world deployment. If your model requires 8 GPUs and 3 seconds per action, it's not going to run on a mobile robot.
Fourth, and this is the most important point, generalization remains the hard problem. VLA-Pro's procedural memory approach and MVP-LAM's multi-view training both show improvements, but we're still far from robots that can handle truly novel tasks in truly novel environments.
I remain cautiously optimistic. The pace of progress in this area is genuinely impressive, and the architectural innovations are becoming more sophisticated. But I've been in this field long enough to know that impressive benchmark results don't always translate to real-world capability.
The question I started with, whether robots can learn from watching videos, is getting a more positive answer than it would have received five years ago. But the full answer is still: sort of, sometimes, under certain conditions, with significant caveats. That's progress. It's just not the revolution that some of the paper titles might suggest.