Six Papers, One Message: VLA Models Still Can't Handle Long Tasks Without Help

A wave of new research reveals that vision-language-action models need external scaffolding to work reliably, and that's actually fine.

By James Chen

1 hour ago · 4 min read

Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).

A 99% success rate sounds impressive until you read the fine print.

That's the headline number from a new paper on "mixture of horizons" training for vision-language-action models, and yes, it's real. But it only applies to the LIBERO benchmark under specific conditions, after 30,000 training iterations, using a particular policy architecture. The broader picture from this week's robotics research is less triumphant: VLA models remain fundamentally brittle on long-horizon tasks, and researchers are scrambling to bolt on fixes.

I've been tracking six recent papers that all circle the same problem. The pattern is striking.

What the numbers actually say

Let's start with the good news. The mixture of horizons approach (the paper doesn't clearly state its authors' institutional affiliation) achieves that 99% average success rate on LIBERO by processing action chunks at multiple time scales simultaneously. The insight is sound: longer action horizons give you better planning, shorter ones give you precision, so why not use both? The method adds minimal overhead and works as a plug-and-play modification.
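To make the idea concrete, here is a minimal sketch of one way predictions made at several horizons could be blended. It is my own toy illustration, not the paper's implementation: the function name `fuse_horizons`, the uniform default weights, and the simple per-step averaging are all assumptions.

```python
import numpy as np

def fuse_horizons(chunks, weights=None):
    """Blend action chunks of different lengths into one action sequence.

    chunks:  list of arrays, each of shape (horizon_i, action_dim), all
             predicted from the same starting time step.
    weights: optional per-chunk weights (hypothetical; uniform by default).
    """
    if weights is None:
        weights = [1.0] * len(chunks)
    max_h = max(c.shape[0] for c in chunks)
    action_dim = chunks[0].shape[1]
    total = np.zeros((max_h, action_dim))
    norm = np.zeros((max_h, 1))
    for c, w in zip(chunks, weights):
        h = c.shape[0]
        total[:h] += w * c   # each chunk only contributes to the steps it covers
        norm[:h] += w
    return total / norm      # weighted average per time step

# Toy inputs: a short-horizon chunk (2 steps) and a long-horizon chunk (4 steps).
short_chunk = np.ones((2, 1))
long_chunk = np.zeros((4, 1))
fused = fuse_horizons([short_chunk, long_chunk])
print(fused.ravel())
```

In this toy version the first two steps are an even blend of both predictions, while the last two steps come from the long-horizon chunk alone. A real system would presumably learn how to weight the horizons rather than average them uniformly.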

But LIBERO is a simulation benchmark. As I like to say, the real test is production volume. Here, that means real-world deployment.

Another paper, MPVI, reports a 113% improvement in task progress over baseline VLAs on the BEHAVIOR-1K benchmark. That's a big number. It's also achieved by essentially giving up on pure end-to-end learning: the method interleaves classical motion planning with the neural policy. The authors are explicit: "more data alone may not resolve the problem." From my time building hardware systems, I've seen this pattern before. When you can't make the core system reliable, you wrap it in scaffolding.
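It helps to be clear about what a relative figure like 113% means. The sketch below uses made-up scores (the 0.10 and 0.213 are hypothetical values of mine, not numbers from the paper) to show that a 113% improvement means the new score is 2.13 times the baseline, which by itself says nothing about how high the absolute score is.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Percentage improvement of `new` over `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0

# Hypothetical task-progress scores, chosen only to illustrate the arithmetic.
baseline_progress = 0.10
improved_progress = 0.213
print(round(relative_improvement(baseline_progress, improved_progress), 1))  # prints 113.0
```

With these illustrative numbers, a headline-grabbing 113% gain still leaves task progress near 21%, which is why the baseline matters as much as the percentage.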

