The VLA Gold Rush Is Here, and I've Seen This Movie Before

A wave of new Vision-Language-Action research promises robots that actually generalize. Color me cautiously optimistic, but let's talk about what's really happening.

By Mark Kowalski

17 hours ago · 5 min read

Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).

If you've been covering tech as long as I have (and I've been doing this since modems made that horrible screeching sound), you start to recognize patterns. The breathless announcements. The benchmark improvements. The promises that this time, this time, the robots will actually work in your kitchen and not just in a carefully controlled lab. We're in another one of those moments right now, and it's centered on something called Vision-Language-Action models, or VLAs.

I've seen this movie before. I watched the self-driving car hype cycle promise fully autonomous vehicles by 2020. I watched chatbots get declared "solved" three separate times before GPT actually made them useful. So when I see six separate research papers drop in the same week, all claiming major advances in robot generalization, my first instinct is skepticism. But here's the thing: call me old-fashioned, but I actually read all of them. And something interesting is happening.

The actual problem these papers are trying to solve

Let me be blunt about what's broken with current robot learning. You train a robot to fold a towel, and it learns to fold that specific towel, in that specific room, with the lighting exactly as it was during training. Move the lamp, and the robot gets confused. Give it a slightly different towel, and it fails. This isn't a minor inconvenience; it's the fundamental reason we don't have useful household robots despite decades of promises.

The new wave of VLA research is attacking this from multiple angles simultaneously. A team from (I'm guessing) a major Chinese university released DeMaVLA, which tackles what they call "deformable manipulation," basically teaching robots to handle floppy things like clothes. They pre-trained on approximately 5,000 hours of real-world dual-arm demonstrations, which is a staggering amount of data. The key insight here is that they're not training separate policies for shirts versus pants versus towels. They're trying to build one model that handles the whole mess.

