The VLA arms race is heating up, and I'm trying to keep track
Six new vision-language-action papers dropped this week. Here's what actually matters for humanoid robots.
Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
You might be wondering why your inbox is suddenly full of papers with acronyms like VLA, VLM, and DiT. I've been asking myself the same thing.
This week alone, I counted six significant vision-language-action model papers hitting arXiv, all promising to make robots better at understanding what we want and actually doing it. That's not normal. A year ago, this field was a curiosity. Now it feels like everyone with access to a robot arm and some GPUs is racing to crack the code on embodied AI.
So I spent the last few days reading through them. Honestly, some of this is over my head (the math around Perceiver resampling schemes, specifically), but the broad strokes are clear enough. And they're worth paying attention to.
The big picture: robots that see, understand, and act
Vision-language-action models are exactly what they sound like. Take a vision-language model (the kind that can look at an image and describe it), then bolt on the ability to generate actual robot movements. The dream is a robot that can hear "pick up the red cup, not the blue one" and just... do it.
The reality has been messier. These models are computationally expensive, slow to run, and often fail when the real world doesn't match their training data. But the papers I'm seeing suggest we're hitting an inflection point.
Take Qwen-VLA, which comes from the Alibaba research ecosystem. It's attempting something ambitious: one model that works across manipulation tasks, navigation, and entirely different robot bodies. The reported numbers are striking: 97.9% success on the LIBERO benchmark, 73.7% on Simpler-WidowX, and (this is the part that caught my eye) 76.9% out-of-distribution success in real-world ALOHA experiments.
That last number matters because out-of-distribution means "stuff the robot hasn't seen before." Variations in lighting, object positions, backgrounds. The messy reality of actual environments.
The efficiency problem
But here's the thing. These models are massive, and massive means slow. If your robot takes 500 milliseconds to decide what to do next, that's a problem when you're trying to catch a falling object or respond to a human's changing intent.
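To put a rough number on that, here's a back-of-the-envelope sketch. It assumes the robot makes exactly one decision per model call (no action chunking or lower-level controller filling the gaps) and ignores air resistance. The function names are mine, not from any of the papers.

```python
G = 9.81  # gravitational acceleration in m/s^2


def control_rate_hz(latency_ms: float) -> float:
    """Decisions per second if every decision waits on one model call."""
    return 1000.0 / latency_ms


def free_fall_distance_m(latency_ms: float) -> float:
    """Distance an object dropped from rest falls while the model is thinking."""
    t = latency_ms / 1000.0
    return 0.5 * G * t * t


# Illustrative latencies only; 500 ms is the example from the text above.
for latency_ms in (500.0, 100.0, 20.0):
    rate = control_rate_hz(latency_ms)
    drop = free_fall_distance_m(latency_ms)
    print(f"{latency_ms:6.0f} ms -> {rate:5.1f} Hz, object falls {drop:.3f} m")
```

At 500 milliseconds, that works out to two decisions per second, and a dropped object has fallen about 1.2 meters before the robot has even chosen its first response.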
CogVLA tackles this head-on. The researchers claim they've reduced training costs by 2.5x and inference latency by 2.8x compared to OpenVLA, while actually improving performance. They do this through something called "instruction-driven routing and sparsification," which (as far as I can tell) means the model learns to ignore irrelevant visual information based on what it's been asked to do.
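To make the "ignore irrelevant visual information" idea concrete, here's a toy sketch of instruction-conditioned token pruning. To be clear, this is not CogVLA's actual method, which I haven't reproduced. It's the simplest stand-in I could think of: score each visual token by cosine similarity to an instruction embedding and keep only the top fraction. The function names, the `keep_ratio` parameter, and the two-dimensional vectors are all made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_visual_tokens(visual_tokens, instruction_embedding, keep_ratio=0.5):
    """Return indices of the visual tokens most similar to the instruction.

    Hypothetical helper: keeps ceil(len * keep_ratio) tokens (at least one)
    and returns their indices in original order.
    """
    k = max(1, math.ceil(len(visual_tokens) * keep_ratio))
    ranked = sorted(
        range(len(visual_tokens)),
        key=lambda i: cosine(visual_tokens[i], instruction_embedding),
        reverse=True,
    )
    return sorted(ranked[:k])


# Toy data: four "visual tokens" and one "instruction" vector, all invented.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
instruction = [1.0, 0.0]

kept = prune_visual_tokens(tokens, instruction, keep_ratio=0.5)
print(kept)  # [0, 1]
```

Only the kept tokens would be passed on to the expensive layers of the model, which is where the savings would come from. In a real system the scoring would be learned rather than a fixed similarity measure.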
Sources
- ProgVLA: Progress-Aware Robot Manipulation Skill Learning · arXiv, cs.RO (Robotics)
- CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification · arXiv, cs.RO (Robotics)
- BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models · arXiv, cs.RO (Robotics)
- Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments · arXiv, cs.RO (Robotics)
- Gaze2Act: Gaze-Conditioned Vision-Language-Action Policies for Interactive Robot Manipulation · arXiv, cs.RO (Robotics)
- Contrastive Representation Regularization for Vision-Language-Action Models · arXiv, cs.RO (Robotics)