Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
A robot arm hovers over a cluttered table, camera feed streaming into a neural network that must decide, in milliseconds, whether to grasp the red mug or the blue one. The model has seen thousands of mugs in training. But this one has a chip on the rim, the lighting is slightly different, and there's a shadow from the window that wasn't there before. The arm hesitates, then fails.
This scene, or something like it, plays out constantly in robotics labs around the world. Vision-Language-Action models (VLAs) are the current best hope for general-purpose robot control, combining the perceptual and linguistic capabilities of large foundation models with the ability to output continuous actions. The pitch is compelling: train one model that can understand natural language instructions, perceive the world through cameras, and translate both into precise motor commands. The reality, as a cluster of recent papers makes clear, is considerably messier.
To be precise, the problem isn't that VLAs don't work. They do, sometimes impressively. The problem is that they fail in ways that reveal fundamental gaps between what these models understand and what they can reliably do. A new benchmark called Colosseum V2, several architectural innovations, and a growing body of work on memory and efficiency are collectively painting a more honest picture of where we actually stand.
Colosseum V2, built on the ManiSkill simulator, is designed to be unforgiving. The benchmark comprises 28 tasks spanning 13 task categories and two robot morphologies, covering everything from simple pick-and-place to long-horizon manipulation sequences. What makes it useful, and frankly a bit depressing, is that it systematically tests what happens when you change things that shouldn't matter.
The researchers evaluated state-of-the-art methods including Action Chunking with Transformers (ACT) and π0.5, and the results reveal limitations in both base performance and generalization. This isn't entirely surprising to anyone who's worked with these systems, but having standardized metrics helps. The benchmark enables GPU-parallelized evaluation at scale, which matters because running thousands of rollouts on real hardware is prohibitively expensive.
It's worth noting that the authors claim strong correlations between simulation and real-world metrics, which would support the ecological validity of the benchmark. I'd want to see more independent replication before treating this as settled, but if it holds up, Colosseum V2 could become a standard evaluation protocol. The field desperately needs this. Right now, comparing VLA papers is nearly impossible because everyone uses different tasks, different robots, and different success criteria.
The core finding is that VLAs inherit zero-shot perception and language capabilities from their pretrained backbones, but this doesn't translate cleanly into robust behavior under distribution shifts. A model might correctly identify that you're asking it to "pick up the red cup," understand what a cup is, and still fail because the cup is positioned 3 centimeters to the left of where cups appeared in training. The high-level understanding is there. The low-level reliability is not.
One of the more interesting papers in this batch is AttenA+, which attacks a problem that may sound like nitpicking but matters: the assumption that all timesteps in a trajectory are equally important.
The insight is almost obvious once stated. When a robot arm is moving quickly through free space, errors are relatively forgiving. When it's making contact with an object, positioning a gripper, or threading a needle, tiny mistakes are catastrophic. Current training approaches treat these moments identically, applying uniform loss weighting across the entire trajectory. AttenA+ reweights the training objective based on inverse velocity, so the model pays more attention to slow, precise moments.
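To make the idea concrete, here is a minimal sketch of inverse-velocity loss weighting. It is not the AttenA+ implementation: the exact weighting formula, the `eps` smoothing term, the normalization to mean 1, and the function names are all assumptions chosen for illustration.

```python
import math

def inverse_velocity_weights(actions, eps=1e-3):
    """Per-timestep loss weights proportional to 1 / (speed + eps).

    `actions` is a list of equal-length position vectors, one per timestep.
    Weights are normalized to have mean 1 so the overall loss scale is kept.
    (Hypothetical formula; the paper's exact scheme may differ.)
    """
    if len(actions) < 2:
        return [1.0] * len(actions)
    # Approximate speed at step t by the distance to the next waypoint;
    # the final step reuses the previous step's speed.
    speeds = [math.dist(a, b) for a, b in zip(actions, actions[1:])]
    speeds.append(speeds[-1])
    raw = [1.0 / (s + eps) for s in speeds]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]

def weighted_mse(pred, target, weights):
    """Average of per-timestep squared errors, each scaled by its weight."""
    total = 0.0
    for p, t, w in zip(pred, target, weights):
        total += w * sum((pi - ti) ** 2 for pi, ti in zip(p, t))
    return total / len(weights)

# A 1-D trajectory: two fast steps through free space, then slow fine motion.
trajectory = [[0.0], [1.0], [2.0], [2.1], [2.2]]
weights = inverse_velocity_weights(trajectory)
```

With this trajectory, the slow steps near the end receive roughly ten times the weight of the fast steps at the start, so the same prediction error costs more during fine positioning than during transit.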
The results are genuinely impressive. AttenA+ improves OpenVLA-OFT to 98.6% on the LIBERO benchmark (a 1.5 percentage point gain) and pushes FastWAM to 92.4% on RoboTwin 2.0. These are incremental improvements over already-strong baselines, but the approach is architecture-agnostic and requires no additional parameters. It's a plug-and-play enhancement, which means it could potentially be applied to any existing VLA.
The broader point here is that robotics has structural priors that language modeling doesn't. Treating robot trajectories like token sequences ignores physics. The velocity-based reweighting is one way to inject physical knowledge back into the learning process, and the paper suggests this might be more efficient than simply scaling up model size. Whether this generalizes beyond manipulation to, say, locomotion or driving remains unclear.
Long-horizon tasks expose another gap: memory. If you ask a robot to "put three apples in the bowl," it needs to count. If an object becomes temporarily occluded, the robot needs to remember where it was. These seem like trivial requirements, but they're not well-handled by current VLA architectures.
RoboMME is a new benchmark specifically designed to evaluate memory capabilities. The benchmark comprises 16 manipulation tasks organized under a taxonomy covering temporal, spatial, object, and procedural memory. The researchers built 14 memory-augmented variants on the π0.5 backbone to systematically explore different memory representations.
The results are a bit frustrating: the effectiveness of memory representations is highly task-dependent. No single memory architecture dominates across all tasks, and each design offers distinct advantages and limitations. This is useful to know, but it means practitioners will need to think carefully about which memory mechanism suits their specific application.
What I'd want to see next is a better theoretical understanding of why different memory architectures work for different tasks. Right now, the field is in an empirical exploration phase, which is fine, but it makes it hard to predict what will work without running expensive experiments.
Even if VLAs worked perfectly, they'd still have a deployment problem: they're too slow. Running a large vision-language backbone at every control step is computationally expensive, and robots need to react in real time. A robot arm that thinks for 200 milliseconds before every action will never be able to catch a falling object or respond to unexpected contact.
ElegantVLA addresses this with what the authors call "phase-adaptive inference." The core insight, borrowed from human motor control, is that not every moment requires the same amount of cognitive effort. When you're reaching for a coffee mug, you don't consciously plan every millimeter of the trajectory. You engage full attention at the beginning (where is the mug?) and at the end (don't knock it over), but the middle is largely automatic.
ElegantVLA introduces a lightweight scheduler that observes temporal representation similarity, robot-motion cues, and episode progress to dynamically allocate computation. When visual inputs are stable and the robot is in a predictable motion phase, it reuses prior computations. When things change or the task enters a goal-sensitive stage, it runs full inference.
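A scheduler of this kind might look roughly like the sketch below. This is an illustration of the general idea, not ElegantVLA's actual scheduler: the three signals are taken from the description above, but the thresholds, the use of cosine similarity, the `velocity_change` cue, and every name here are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def should_run_full_inference(prev_feat, curr_feat, velocity_change, progress,
                              sim_threshold=0.98, motion_threshold=0.05,
                              goal_phase_start=0.8):
    """Return True to rerun the backbone, False to reuse the cached result.

    prev_feat / curr_feat: cheap visual features from consecutive steps.
    velocity_change: magnitude of change in commanded velocity (motion cue).
    progress: estimated episode progress in [0, 1].
    All thresholds are hypothetical and would need tuning per task.
    """
    if prev_feat is None:
        return True   # nothing cached yet
    if progress >= goal_phase_start:
        return True   # goal-sensitive stage: always recompute
    if cosine_similarity(prev_feat, curr_feat) < sim_threshold:
        return True   # the scene changed noticeably
    if velocity_change > motion_threshold:
        return True   # motion is no longer smooth and predictable
    return False      # stable view, steady motion: reuse prior computation

# Mid-episode, stable view, steady motion: the cached result is reused.
reuse = not should_run_full_inference([1.0, 0.0], [1.0, 0.0], 0.01, 0.4)
```

The design choice worth noting is that every check defaults toward full inference: the scheduler only skips computation when all signals agree that nothing important is happening.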
The speedups are substantial: up to 2.55x on GR00T and 3.77x on CogACT. On six real-world GR00T tasks, ElegantVLA cuts computation by 2.18x while raising control frequency from 13.8 Hz to 26.3 Hz. Importantly, this is a plug-in framework that doesn't require modifying or retraining the base model.
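It helps to translate these frequencies into per-step time budgets. The short calculation below uses only the figures quoted above plus the 200 millisecond example from earlier; the helper name is made up for illustration.

```python
def control_period_ms(frequency_hz):
    """Time available per control step, in milliseconds."""
    return 1000.0 / frequency_hz

baseline_hz = 13.8
accelerated_hz = 26.3

print(round(control_period_ms(baseline_hz), 1))     # 72.5 ms per step
print(round(control_period_ms(accelerated_hz), 1))  # 38.0 ms per step
print(round(accelerated_hz / baseline_hz, 2))       # 1.91x higher control rate
print(round(1000.0 / 200.0, 1))                     # 200 ms per step is 5.0 Hz
```

So the reported improvement takes each control step from about 72 ms down to about 38 ms. The control rate rises by about 1.91x, slightly less than the 2.18x cut in computation, which would be consistent with some fixed per-step cost that the scheduler does not remove, though the source does not break this down.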
ProgVLA takes a different approach to efficiency, focusing on compact model design. At only 0.1 billion parameters, ProgVLA reaches success rates competitive with much larger pretrained baselines, and on long-horizon tasks it actually exceeds them. The key innovation is a two-stage Perceiver resampling scheme that compresses variable-length visual, language, and proprioceptive streams into a fixed set of tokens, plus auxiliary "progress heads" trained with offline reinforcement learning to estimate task completion.
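The core trick of Perceiver-style resampling is that a fixed set of query vectors attends over however many input tokens there are, so the output size never changes. The sketch below shows that mechanism in plain Python. It is heavily simplified and is not ProgVLA's architecture: there are no learned projections, the latents are fixed by hand rather than trained, and the split into a per-stream stage followed by a fusion stage is only one guess at what "two-stage" means.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def resample(latents, tokens):
    """Single-head cross-attention with identity projections.

    Each latent query attends over all input tokens, so the output always
    has len(latents) vectors, regardless of len(tokens).
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in latents:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        attn = softmax(scores)
        out.append([sum(a * tok[d] for a, tok in zip(attn, tokens))
                    for d in range(dim)])
    return out

def two_stage_resample(streams, per_stream_latents, final_latents):
    """Stage 1 compresses each stream separately; stage 2 fuses the results."""
    stage1 = []
    for name, tokens in streams.items():
        stage1.extend(resample(per_stream_latents[name], tokens))
    return resample(final_latents, stage1)

# Hypothetical 2-D tokens: 50 visual, 7 language, 1 proprioceptive.
streams = {
    "vision": [[0.1 * i, 1.0] for i in range(50)],
    "language": [[1.0, 0.0]] * 7,
    "proprio": [[0.5, 0.5]],
}
basis = [[1.0, 0.0], [0.0, 1.0]]
per_stream = {name: basis for name in streams}
fused = two_stage_resample(streams, per_stream, basis + basis)
```

Here `fused` always contains four tokens, whether the camera contributes 50 tokens or 500. That fixed budget is what keeps the downstream policy small.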
The progress estimation is clever. By giving the model an internal sense of how far along it is in a task, the policy can better allocate attention and recover from mistakes. Ablations indicate that the learned context resampler and task-adaptive visual fine-tuning are the largest contributors to performance, with progress-aware training providing consistent additional gains concentrated on long-horizon and multi-object tasks.
Qwen-VLA is the most ambitious of the bunch, attempting to unify manipulation, navigation, and trajectory prediction into a single model. The pitch is that embodied intelligence shouldn't require specialized models for each task type.
The architecture extends Qwen's vision-language stack with a DiT-based action decoder, and the training recipe is impressively diverse: robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, and auxiliary vision-language supervision. To handle multiple robot platforms, they introduce "embodiment-aware prompt conditioning," where robot-specific textual descriptions specify the current embodiment and control convention.
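In practice, prompt conditioning of this kind could be as simple as prepending a structured robot description to the instruction. The sketch below is purely illustrative: the prompt template, the field names, and the example robot specifications are assumptions, not Qwen-VLA's actual format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Embodiment:
    """Hypothetical description of a robot platform and its control interface."""
    name: str
    action_dim: int
    control_convention: str

def build_prompt(embodiment, instruction):
    """Combine a textual embodiment description with the task instruction."""
    return (
        f"Robot: {embodiment.name}. "
        f"Action space: {embodiment.action_dim}-dimensional, "
        f"{embodiment.control_convention}. "
        f"Task: {instruction}"
    )

# Illustrative values only; real platforms may be configured differently.
single_arm = Embodiment("single-arm manipulator", 7,
                        "end-effector delta pose plus gripper")
bimanual = Embodiment("bimanual manipulator", 14,
                      "absolute joint positions for both arms")

instruction = "pick up the red cup"
prompt_a = build_prompt(single_arm, instruction)
prompt_b = build_prompt(bimanual, instruction)
```

The same instruction yields two different prompts, and the model is expected to produce actions in the matching dimensionality and convention. This also makes the generalization concern concrete: a robot whose description falls outside the kinds of text seen in training gives the model nothing familiar to map from.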
The benchmark results are strong across the board: 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, and 76.9% average out-of-distribution success in real-world ALOHA experiments. The navigation results (69.0% OSR on R2R, 59.6% SR on RxR) are more modest but still represent a single model doing both manipulation and navigation.
I'm genuinely excited about this direction, but I want to flag some concerns. The training data diversity is impressive, but we don't know yet how the model handles truly novel embodiments that weren't represented in training. The embodiment-aware prompting is a nice idea, but it's basically asking the model to learn a mapping from text descriptions to control conventions. Whether this scales to arbitrary robots remains unclear.
What's missing from all of this work is a satisfying answer to the question of generalization. The benchmarks show that VLAs struggle with distribution shifts. The architectural innovations help at the margins. But nobody has cracked the fundamental problem of making these systems robust to the kind of variation that real-world deployment requires.
Some specific things I'd want to see:
First, better theoretical understanding of why VLAs fail under distribution shift. Is it a perception problem, an action generation problem, or something in between? The Colosseum V2 results suggest it's not purely perception, since the models can identify objects correctly but still fail to manipulate them. But we don't have a precise characterization of where the breakdown occurs.
Second, more rigorous real-world evaluation. Several of these papers include real-world experiments, which is good, but the sample sizes are small and the environments are controlled. We're still far from understanding how these models perform in genuinely messy, unstructured settings.
Third, better integration of classical robotics knowledge. AttenA+ shows that physics-aware training helps. ProgVLA shows that progress estimation helps. These are steps toward incorporating domain knowledge, but they feel ad hoc. A more principled framework for combining learned and engineered components would be valuable.
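On the second point, it is easy to show how little a small real-world trial count pins down. The sketch below computes a Wilson score interval for a success rate. The trial counts are invented for illustration and do not come from any of the papers discussed.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical results: the same 80% success rate at two sample sizes.
small_low, small_high = wilson_interval(8, 10)
large_low, large_high = wilson_interval(80, 100)
```

With 8 successes in 10 trials, the interval runs from roughly 49% to 94%, which cannot distinguish a coin flip from a near-perfect policy. With 80 in 100 it narrows to roughly 71% to 87%. Reporting intervals like these alongside headline numbers would make cross-paper comparisons considerably more honest.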
The field is making progress, and I don't want to undersell that. A year ago, we didn't have standardized benchmarks for VLA generalization. We didn't have efficient inference frameworks that maintain performance while cutting computation in half. We didn't have unified models that handle both manipulation and navigation. All of that is new.
But the gap between "works in the lab" and "works in the world" remains large. The honest thing to say is that we're getting better at measuring our failures, and we're finding clever patches for specific problems, but the general-purpose robot that can handle arbitrary tasks in arbitrary environments is still a long way off. These papers are useful precisely because they're clear-eyed about limitations. That's how progress actually happens.