Six New Papers Show VLA Models Still Can't Handle Long Tasks Without Help

A wave of research tackles the same problem: vision-language-action models break down on extended manipulation sequences, and everyone's proposing different band-aids.

By James Chen

1 hour ago · 5 min read

Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).

Ninety-nine percent. That's the success rate one research team claims to have hit on the LIBERO benchmark using a technique called Mixture of Horizons. It's an impressive number, until you realize it took six separate research groups publishing within weeks of each other to address variations of the same fundamental problem: VLA models still can't reliably complete tasks that take more than a few seconds.

I've been tracking the arXiv preprint server for robotics submissions, and this past month has been unusual. Six papers, all focused on making vision-language-action models more robust, all tackling the failure modes that emerge when you ask a robot to do something that requires memory, planning, or recovery from mistakes. The convergence isn't coincidental. It reflects a field that's hit a wall.

Let me be precise about what VLA models are supposed to do. They take camera images and natural language instructions, then output motor commands. The promise is that language understanding from large models transfers to physical manipulation. The reality, based on these papers, is messier. The models work reasonably well on short, single-step tasks. String together multiple steps, introduce any variation from training conditions, or ask the robot to notice when something goes wrong, and performance craters.

The Mixture of Horizons paper on arXiv, from a multi-institution team, identifies a specific technical tradeoff. VLA models predict sequences of future actions, called "action chunks." Longer chunks give the model better foresight but worse precision on fine movements. Shorter chunks sharpen immediate control but lose track of longer-term goals. The team's solution is to process multiple chunk lengths simultaneously and fuse the outputs. The 99% figure comes from a specific benchmark configuration after 30,000 training iterations, and I'd want to see how it holds up across different task distributions before getting too excited.
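To make the "multiple chunk lengths, fused" idea concrete, here is a toy sketch in Python. It is not the paper's implementation. I don't know how the authors actually combine horizons, so a fixed weighted average stands in for whatever fusion they use, and every name here (`fuse_horizons`, `chunks`, `weights`) is hypothetical. The only thing it illustrates is the shape of the idea: near-term steps are covered by every horizon, far-off steps only by the long ones.

```python
from typing import Dict, List

Action = List[float]  # one motor command, e.g. joint deltas (assumed format)


def fuse_horizons(chunks: Dict[int, List[Action]],
                  weights: Dict[int, float]) -> List[Action]:
    """Fuse action chunks predicted at several horizons into one chunk.

    chunks maps a horizon length h to a list of h predicted actions.
    At step t, only horizons with h > t have a prediction, so the fused
    action is a weighted average over those, with weights renormalised.
    This averaging rule is an illustrative assumption, not the paper's method.
    """
    for h, actions in chunks.items():
        if len(actions) != h:
            raise ValueError(f"chunk for horizon {h} has {len(actions)} actions")

    max_h = max(chunks)
    dim = len(chunks[max_h][0])
    fused: List[Action] = []
    for t in range(max_h):
        active = [h for h in chunks if h > t]      # horizons that reach step t
        total = sum(weights[h] for h in active)    # renormalise over them
        fused.append([
            sum(weights[h] * chunks[h][t][d] for h in active) / total
            for d in range(dim)
        ])
    return fused


# Toy usage: a short horizon that says "hold still" and a long one that says "move".
short = [[0.0], [0.0]]
long_ = [[1.0], [1.0], [1.0], [1.0]]
plan = fuse_horizons({2: short, 4: long_}, {2: 1.0, 4: 1.0})
print(plan)
```

In the toy usage, the first two steps blend both predictions and the last two fall back to the long horizon alone, which is the foresight-versus-precision split in miniature.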
