Two new papers tackle VLA's embarrassing handoff problem

Vision-language-action models can follow instructions, but they still can't reliably tell when they're done. New research from separate teams offers competing solutions.

By Aisha Patel

5 hours ago · 9 min read

Photo credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).

Why can't robots tell when they've finished a task?

It's a question that sounds almost too simple to be a research problem. You tell a robot to "pick up the cup, then place it on the shelf," and it picks up the cup. Great. But then what? How does the system know the first part is complete and it's time to move on?

This turns out to be genuinely hard. Two papers released this week on arXiv approach the problem from different angles, and each reveals something about why vision-language-action (VLA) models struggle with what humans do effortlessly.

The core issue is what researchers call the "completion" or "handoff" problem. VLA models have gotten remarkably good at executing natural language instructions, but they lack what the authors of one paper call an "operational interface" for deciding when an instruction is actually done. To be precise, the problem isn't that robots can't complete tasks. It's that they can't reliably detect their own completion, which means they can't chain tasks together without cascading failures.
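
To get a feel for why unreliable completion detection compounds across a chain, here is a back-of-the-envelope sketch. It assumes every handoff is detected correctly with the same probability and that handoffs fail independently. Both the assumptions and the numbers are mine, for illustration only, and are not figures from either paper. The function name `chain_success_probability` is hypothetical.

```python
def chain_success_probability(per_handoff_reliability: float, num_handoffs: int) -> float:
    """Chance that every handoff in a chain is detected correctly.

    Illustrative assumption: handoffs are independent and equally reliable.
    """
    if not 0.0 <= per_handoff_reliability <= 1.0:
        raise ValueError("per_handoff_reliability must be between 0 and 1")
    if num_handoffs < 0:
        raise ValueError("num_handoffs must be non-negative")
    return per_handoff_reliability ** num_handoffs


if __name__ == "__main__":
    # Made-up reliabilities, chosen only to show the trend.
    for reliability in (0.99, 0.95, 0.90):
        for handoffs in (1, 5, 10):
            p = chain_success_probability(reliability, handoffs)
            print(f"reliability={reliability:.2f} handoffs={handoffs:2d} -> {p:.3f}")
```

Under these assumptions, a detector that is right 95% of the time gets through ten handoffs only about 60% of the time. Real failures are unlikely to be independent, so treat this as intuition, not a prediction.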

What does "Completion at the Boundary" actually propose?

The first paper, "Completion at the Boundary" (CaB), frames the problem as fundamentally a closed-loop control issue. The authors' argument is subtle but important: switching between subtasks isn't just a classification problem ("am I done yet?"). It's an intervention that changes the instruction context, which in turn affects future actions and observations.
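
The "switching is an intervention" point is easier to see in a toy loop. What follows is a minimal sketch of my own, not the CaB method: a gripper on a one-dimensional track, a hand-written instruction-conditioned policy, and a distance-threshold completion check. Every name here (`State`, `policy`, `step`, `run`, `switch_distance`, and the positions `CUP` and `SHELF`) is hypothetical.

```python
from dataclasses import dataclass

CUP, SHELF = 5, 9  # hypothetical positions on a 1-D track
PLAN = ["pick up the cup", "place it on the shelf"]


@dataclass
class State:
    gripper: int = 0
    holding: bool = False


def policy(state: State, instruction: str) -> int:
    """Toy instruction-conditioned policy: step -1, 0, or +1 toward the target."""
    target = CUP if instruction == PLAN[0] else SHELF
    return (target > state.gripper) - (target < state.gripper)


def step(state: State, instruction: str, action: int) -> State:
    """Toy dynamics: the grasp only happens while the pick-up instruction is active."""
    state.gripper += action
    if instruction == PLAN[0] and state.gripper == CUP:
        state.holding = True
    return state


def run(switch_distance: int, max_steps: int = 30) -> bool:
    """Run the two-step plan; return True if the cup ends up at the shelf."""
    state = State()
    idx = 0
    for _ in range(max_steps):
        # Completion check: declare subtask 0 done once the gripper is close enough.
        if idx == 0 and abs(state.gripper - CUP) <= switch_distance:
            idx = 1  # the intervention: the policy now sees a different instruction
        instruction = PLAN[idx]
        state = step(state, instruction, policy(state, instruction))
    return state.holding and state.gripper == SHELF


if __name__ == "__main__":
    print("strict switch (distance 0):", run(0))
    print("early switch (distance 1):", run(1))
```

In this toy, switching one step early means the grasp never happens, and nothing downstream can recover it: the decision to switch changes what the policy does and therefore what the robot observes afterwards. That is the sense in which a wrong "done" call is more than a misclassified frame.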

The authors work under what they call a "deployable low-calibration regime," which is their way of saying they want a solution that doesn't require retraining or recalibrating for every new instruction. This is a practical constraint that I wish more papers would adopt. A system that needs test-time relearning for each new task isn't really deployable.
