The VLA arms race is heating up, and I'm trying to keep track

Six new vision-language-action papers dropped this week. Here's what actually matters for humanoid robots.

2 hours ago · 6 min read

Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).

You might be wondering why your inbox is suddenly full of papers with acronyms like VLA, VLM, and DiT. I've been asking myself the same thing.

This week alone, I counted six significant vision-language-action model papers hitting arXiv, all promising to make robots better at understanding what we want and actually doing it. That's not normal. A year ago, this field was a curiosity. Now it feels like everyone with access to a robot arm and some GPUs is racing to crack the code on embodied AI.

So I spent the last few days reading through them. Honestly, some of this is over my head (the math around Perceiver resampling schemes, specifically), but the broad strokes are clear enough. And they're worth paying attention to.

The big picture: robots that see, understand, and act

Vision-language-action models are exactly what they sound like. Take a vision-language model (the kind that can look at an image and describe it), then bolt on the ability to generate actual robot movements. The dream is a robot that can hear "pick up the red cup, not the blue one" and just... do it.

The reality has been messier. These models are computationally expensive, slow to run, and often fail when the real world doesn't match their training data. But the papers I'm seeing suggest we're hitting an inflection point.

Take Qwen-VLA, which comes from the Alibaba research ecosystem. It's attempting something ambitious: one model that works across manipulation tasks, navigation, and entirely different robot bodies. The reported numbers are striking: 97.9% success on the LIBERO benchmark, 73.7% on Simpler-WidowX, and (this is the part that caught my eye) 76.9% out-of-distribution success in real-world ALOHA experiments.

That last number matters because out-of-distribution means "stuff the robot hasn't seen before." Variations in lighting, object positions, backgrounds. The messy reality of actual environments.

The efficiency problem

But here's the thing. These models are massive, and massive means slow. If your robot takes 500 milliseconds to decide what to do next, that's a problem when you're trying to catch a falling object or respond to a human's changing intent.
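To put that 500-millisecond figure in perspective, here's a quick back-of-the-envelope sketch. It assumes an object dropped from rest with no air resistance and one decision per inference pass; the function names are mine, made up for illustration.

```python
G = 9.81  # gravitational acceleration in m/s^2


def control_rate_hz(latency_s: float) -> float:
    """Decisions per second if every decision takes latency_s seconds."""
    return 1.0 / latency_s


def free_fall_distance_m(latency_s: float) -> float:
    """Distance an object dropped from rest falls in latency_s (no air resistance)."""
    return 0.5 * G * latency_s ** 2


latency = 0.5  # the 500 ms example from the paragraph above
print(f"control rate: {control_rate_hz(latency):.1f} Hz")          # 2.0 Hz
print(f"free-fall distance: {free_fall_distance_m(latency):.2f} m")  # 1.23 m
```

In other words, a robot that thinks for half a second gets two decisions per second, and a dropped cup has already fallen more than a meter before the first one lands.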

CogVLA tackles this head-on. The researchers claim they've reduced training costs by 2.5x and inference latency by 2.8x compared to OpenVLA, while actually improving performance. They do this through something called "instruction-driven routing and sparsification," which (as far as I can tell) means the model learns to ignore irrelevant visual information based on what it's been asked to do.
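To make "ignore irrelevant visual information" a bit more concrete, here's a toy sketch of instruction-conditioned token pruning. To be clear, this is not CogVLA's actual mechanism. It's a generic stand-in I'm assuming for illustration: score each visual token by cosine similarity to an instruction embedding and keep only the top fraction. The function names, the `keep_ratio` parameter, and the 2-D embeddings are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prune_visual_tokens(visual_tokens, instruction_vec, keep_ratio=0.5):
    """Return indices of the visual tokens most similar to the instruction.

    Indices come back in their original order so spatial layout is preserved.
    """
    k = max(1, math.ceil(len(visual_tokens) * keep_ratio))
    scores = [cosine(tok, instruction_vec) for tok in visual_tokens]
    ranked = sorted(range(len(visual_tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy 2-D embeddings (hypothetical values, not from any real model).
tokens = [
    [1.0, 0.0],   # 0: aligned with the instruction
    [0.0, 1.0],   # 1: unrelated
    [0.9, 0.1],   # 2: mostly aligned
    [-1.0, 0.0],  # 3: opposite direction
]
instruction = [1.0, 0.0]

kept = prune_visual_tokens(tokens, instruction, keep_ratio=0.5)
print(kept)  # [0, 2]
```

The intuition is that fewer tokens flowing through the expensive layers means less compute per decision. How the paper actually decides what to drop is more involved than a single similarity score.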

Sources

ProgVLA: Progress-Aware Robot Manipulation Skill Learning · arXiv, cs.RO (Robotics)
CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification · arXiv, cs.RO (Robotics)
BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models · arXiv, cs.RO (Robotics)
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments · arXiv, cs.RO (Robotics)
Gaze2Act: Gaze-Conditioned Vision-Language-Action Policies for Interactive Robot Manipulation · arXiv, cs.RO (Robotics)
Contrastive Representation Regularization for Vision-Language-Action Models · arXiv, cs.RO (Robotics)

