VLAs Don't Know What They Don't Know. Two New Papers Are Trying to Fix That.
A pair of robotics papers tackle two of the most practical blockers standing between vision-language-action models and real-world deployment: overconfidence and computational bloat.
14 hours ago · 7 min read
Here's a thing that should unsettle anyone excited about deploying humanoids in the real world: the most capable robot manipulation models currently have no reliable way to tell you when they're about to fail.
That's not a minor footnote. That's a fundamental problem. And it's the one that two new papers from the robotics research community are, in their different ways, trying to solve.
Both papers focus on vision-language-action models, or VLAs. If you've been following the humanoid space at all, you'll have heard the term. VLAs combine vision and language understanding with the ability to output robot actions directly. They've become something of a north star for embodied AI research, and companies like Physical Intelligence, Google DeepMind, and a growing list of startups are betting heavily on them. The empirical results have been genuinely impressive.
But impressive benchmark numbers and safe real-world deployment are two very different things. These papers are about the gap between them.
The first paper, out of TU Munich and posted to arXiv, tackles something called epistemic uncertainty. I'll be honest, when I first encountered this framing I had to sit with it for a bit. Epistemic uncertainty is uncertainty that comes from gaps in a model's knowledge, as opposed to aleatoric uncertainty, the noise that's inherent to the task itself. The distinction matters because epistemic uncertainty is, in principle, reducible. If you know the model doesn't know something, you can do something about it: collect more data, or hand control to a human.
The problem with current VLAs is that they don't flag this. They'll encounter a scenario outside their training distribution and just... keep going. Confidently. Which is not great when the robot is handling something fragile, or operating near a person, or in any situation where a graceful failure is vastly preferable to a confident wrong action.
The researchers propose a method they call VFD, for velocity-field disagreement. The core idea is to run a small ensemble of models and measure how much they disagree on the underlying velocity field that drives action generation in flow-matching models. High disagreement means high uncertainty. It's an efficient approach, and the paper claims it produces well-calibrated uncertainty estimates that actually predict downstream performance.
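To make the idea concrete, here is a minimal sketch of ensemble disagreement on a velocity field. It is not the paper's implementation: the toy linear "members", the `make_toy_member` and `score` helpers, and the choice of mean per-dimension variance as the disagreement measure are all assumptions made for illustration.

```python
from statistics import pvariance


def velocity_disagreement(velocities):
    """Mean per-dimension variance across ensemble members.

    `velocities` holds one flattened velocity prediction per member,
    all evaluated at the same observation, noisy action, and flow time.
    """
    if len(velocities) < 2:
        raise ValueError("need at least two ensemble members")
    dims = len(velocities[0])
    if any(len(v) != dims for v in velocities):
        raise ValueError("members must predict the same number of dimensions")
    per_dim = [pvariance([v[d] for v in velocities]) for d in range(dims)]
    return sum(per_dim) / dims


def make_toy_member(slope):
    """Hypothetical stand-in for a trained velocity network."""
    def member(observation, noisy_action, t):
        # Toy linear field; t is unused but kept to mirror the usual signature.
        return [slope * o - a for o, a in zip(observation, noisy_action)]
    return member


# Three members that agree near zero and drift apart for large inputs.
ensemble = [make_toy_member(s) for s in (0.9, 1.0, 1.1)]


def score(observation, noisy_action, t=0.5):
    """Disagreement of the toy ensemble at one query point."""
    return velocity_disagreement([m(observation, noisy_action, t) for m in ensemble])


familiar = score([0.1, 0.1], [0.0, 0.0])  # close to what the members "saw"
novel = score([5.0, 5.0], [0.0, 0.0])     # far from it
print(familiar < novel)
```

The toy members agree on small inputs and diverge on large ones, which is the behavior you would hope a real ensemble shows on in-distribution versus out-of-distribution scenes.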
They build this into a framework called SAVE (uncertainty-guided active multitask fine-tuning), which uses those uncertainty signals not just for failure detection but to decide which new training examples are actually worth collecting. The results on the LIBERO benchmark show SAVE requires at least 22% fewer expert demonstrations than baseline approaches to adapt to new tasks. That's meaningful, because collecting robot demonstrations is expensive and slow.
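As a rough illustration of uncertainty-guided data collection, the sketch below spends a fixed demonstration budget on the tasks where the model is least sure. The task names, scores, and the simple top-k rule are hypothetical; the paper's actual selection strategy may differ.

```python
def pick_tasks_for_demos(task_uncertainty, budget):
    """Return the `budget` task names with the highest uncertainty score."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # Sort by descending score; break ties by name so the result is deterministic.
    ranked = sorted(task_uncertainty, key=lambda name: (-task_uncertainty[name], name))
    return ranked[:budget]


# Hypothetical per-task uncertainty scores (e.g. averaged over a few rollouts).
scores = {
    "open_drawer": 0.02,
    "stack_bowls": 0.31,
    "turn_on_stove": 0.07,
    "put_mug_in_microwave": 0.24,
}
chosen = pick_tasks_for_demos(scores, budget=2)
print(chosen)  # ['stack_bowls', 'put_mug_in_microwave']
```

The point of the design is that demonstrations for tasks the model already handles confidently are skipped, which is where the savings in expert time would come from.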
I initially thought this was primarily a research contribution with limited near-term practical relevance. After reading more carefully, I think I was wrong about that. The failure detection angle is actually the more immediately deployable piece. A robot that can say "I'm not confident here, please intervene" is categorically safer than one that can't. That's relevant right now, not in five years.
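One simple way to turn an uncertainty score into an "ask for help" signal is a threshold calibrated on runs that went fine. The sketch below is an assumption about how that could look, not something the paper specifies: the quantile rule, the function names, and the numbers are all illustrative.

```python
import math


def calibrate_threshold(nominal_scores, quantile=0.95):
    """Pick a threshold that roughly `quantile` of nominal scores fall at or below."""
    if not nominal_scores:
        raise ValueError("need at least one nominal score")
    if not 0.0 < quantile <= 1.0:
        raise ValueError("quantile must be in (0, 1]")
    ordered = sorted(nominal_scores)
    index = math.ceil(quantile * len(ordered)) - 1
    index = max(0, min(index, len(ordered) - 1))
    return ordered[index]


def should_request_help(score, threshold):
    """True when the current uncertainty exceeds what nominal runs produced."""
    return score > threshold


# Hypothetical uncertainty scores logged during successful episodes.
nominal = [0.01 * i for i in range(1, 21)]
threshold = calibrate_threshold(nominal)

print(should_request_help(0.05, threshold))  # False
print(should_request_help(0.50, threshold))  # True
```

A higher quantile means fewer false alarms but later warnings; where to set it depends on how costly an unnecessary human intervention is compared with a missed failure.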
What remains unclear is how VFD performs outside of simulation benchmarks. LIBERO is a controlled environment. The paper acknowledges that real-world non-stationary environments are the target, but the validation is still largely synthetic. That gap is worth watching.
The second paper addresses a different but equally practical blocker. VLAs are large. Inference is slow. Running them on the edge hardware that lives inside a robot is genuinely difficult, and for a lot of real-world platforms, basically impossible at the moment.
The arXiv paper on RLRC presents a compression pipeline designed to fix this. RLRC stands for Reinforcement Learning-based Recovery for Compressed VLAs, which is a mouthful, but the approach is fairly intuitive once you break it down.
It works in three stages. First, structured pruning: you cut the model down aggressively. Second, performance recovery: you fine-tune the compressed model using both supervised learning and reinforcement learning to get back the capability you lost. Third, quantization: you reduce the numerical precision of the weights to shrink memory further.
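The first and third stages can be sketched on a toy weight matrix. This is a simplified illustration under stated assumptions, not the RLRC pipeline: pruning by row L2 norm, symmetric 8-bit quantization, and all function names are choices made here, and the recovery stage in between is left out entirely.

```python
def prune_rows(weights, keep_ratio):
    """Structured pruning: drop whole rows (output units) with the smallest L2 norm."""
    n_keep = max(1, round(len(weights) * keep_ratio))
    norms = [sum(w * w for w in row) for row in weights]
    strongest = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)[:n_keep]
    # Preserve the original row order among the survivors.
    return [weights[i] for i in sorted(strongest)]


def quantize_int8(weights):
    """Symmetric quantization: map floats to integers in [-127, 127] plus one scale."""
    max_abs = max(abs(w) for row in weights for w in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [[round(w / scale) for w in row] for row in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer representation."""
    return [[q * scale for q in row] for row in quantized]


weights = [
    [0.1, -0.2],
    [2.0, 1.0],
    [0.0, 0.05],
    [-1.5, 0.5],
]
pruned = prune_rows(weights, keep_ratio=0.5)   # stage 1
# Stage 2 (SFT + RL recovery) would fine-tune `pruned` here.
quantized, scale = quantize_int8(pruned)       # stage 3
restored = dequantize(quantized, scale)
print(pruned)
```

Pruning whole rows rather than individual weights keeps the matrix dense, so the smaller model speeds up on ordinary hardware; quantization then shrinks what is left.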
The RL stage is the interesting part. Pruning alone tends to hurt performance badly. Supervised fine-tuning (SFT, in the paper's shorthand) helps, but the researchers found that adding a reinforcement learning stage on top, with some specific stabilization tricks like critic warm-up and behavioral cloning loss regularization, recovers performance more robustly. The claimed results are significant: up to 8x memory reduction and 2.3x inference speedup, while maintaining the original task success rate across multiple VLA backbones.
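For readers unfamiliar with those two stabilization tricks, here is a generic sketch of what they usually mean in actor-critic training. The function names, the `bc_weight` default, and the warm-up rule are assumptions for illustration; the paper's exact formulation isn't given here.

```python
def actor_objective(rl_loss, bc_loss, bc_weight=0.1):
    """RL loss plus a behavioral cloning penalty that keeps the policy near the demos."""
    return rl_loss + bc_weight * bc_loss


def networks_to_update(step, critic_warmup_steps):
    """During warm-up only the critic trains; afterwards both networks do."""
    if step < critic_warmup_steps:
        return ("critic",)
    return ("critic", "actor")


print(networks_to_update(0, 100))    # ('critic',)
print(networks_to_update(100, 100))  # ('critic', 'actor')
```

The intuition is that a freshly initialized critic gives noisy value estimates, so letting it settle first, and anchoring the actor to demonstrated behavior, keeps early RL updates from wrecking the pruned model further.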
You might be wondering whether "maintaining the original task success rate" is doing a lot of work in that sentence. It's a fair question. The experiments are run across multiple backbones, which adds credibility, but the evaluation is still on standard benchmarks rather than deployment in the wild. The authors didn't report figures on how the compressed models perform on tasks significantly outside the training distribution, which is exactly where you'd expect compression to hurt most.
To be honest, the 8x memory reduction number is the one that jumps out. If that holds up in practice, it meaningfully changes the on-device deployment calculus for a lot of robotics teams.
These two papers are solving different problems, but they're both fundamentally about the same thing: making VLAs deployable in the real world rather than just impressive in the lab.
The field has made enormous progress on the "can it do the task" question. The benchmarks keep improving. The demonstrations keep getting more impressive. But the harder questions, the ones that matter for actual deployment, are things like: does it know when it's failing? Can it run on the hardware we actually have? Can we adapt it to new tasks without burning through a fortune in human demonstration time?
Those are the questions these papers are taking seriously.
I think it's worth noting that both approaches are in some sense pragmatic rather than foundational. They're not proposing new architectures or new training paradigms. They're taking existing flow-matching VLAs and asking: how do we make these safer and more efficient? That's less glamorous than a new model architecture, but it might be more important for the near term.
What the humanoid space actually needs right now
Here's my honest take. The humanoid companies that are going to matter in three to five years are the ones solving deployment problems, not just demo problems. A robot that works beautifully in a controlled setting and then fails silently in an unstructured environment is not a product. It's a liability.
Uncertainty quantification, the kind VFD is trying to provide, is a prerequisite for the kind of human-robot collaboration that makes humanoids actually useful. If a robot can't communicate its own confidence, a human working alongside it can't make good decisions about when to trust it and when to intervene. That's not a nice-to-have. It's table stakes for deployment in anything other than a fully automated, fully controlled environment.
The compression work matters for a different reason. A lot of the most interesting robotics applications are on platforms where you can't just throw more compute at the problem. Edge deployment is the reality for most real-world robots, and 8x memory reduction, if it holds, is the difference between possible and not possible for a meaningful chunk of those use cases.
This raises a few open questions, the most obvious being how these techniques interact with each other. Could you run a compressed VLA with uncertainty quantification on edge hardware? VFD relies on a small ensemble, which adds inference cost, while RLRC exists to cut that cost, so the combination isn't obviously free. The two papers don't address this directly, and I haven't found other sources that do either. It seems like a natural next question.
A note on where the research is
Both papers are preprints. Neither has been through peer review yet, as far as I can tell. The RLRC paper is on its second version (v2), which suggests some revision has happened, but that's not the same as formal peer review. I'd treat the specific numbers (at least 22% fewer demonstrations, up to 8x memory reduction) as promising but not definitive until they've been independently replicated.
The VFD paper has a project website at tum-lsy.github.io/uq_vla/ and RLRC has one at rlrc-vla.github.io, which at least suggests the researchers are taking the work seriously enough to build out supporting materials. That's a small signal, but it's something.
The broader point stands regardless of the exact numbers. The field is starting to grapple seriously with the practical limitations of VLAs, not just their capabilities. That's a sign of maturity, and honestly, it's overdue.