Two new papers tackle VLA's embarrassing handoff problem
Vision-language-action models can follow instructions, but they still can't reliably tell when they're done. New research from separate teams offers competing solutions.
It's a question that sounds almost too simple to be a research problem. You tell a robot to "pick up the cup, then place it on the shelf," and it picks up the cup. Great. But then what? How does the system know the first part is complete and it's time to move on?
This is, it turns out, genuinely hard. And two papers released this week on arXiv approach the problem from different angles, each revealing something about why vision-language-action models struggle with what humans do effortlessly.
The core issue is what researchers call the "completion" or "handoff" problem. VLA models have gotten remarkably good at executing natural language instructions, but they lack what the authors of one paper call an "operational interface" for deciding when an instruction is actually done. To be precise, the problem isn't that robots can't complete tasks. It's that they can't reliably detect their own completion, which means they can't chain tasks together without cascading failures.
The first paper, "Completion at the Boundary (CaB)", frames the problem as fundamentally a closed-loop control issue. The authors' argument is subtle but important: switching between subtasks isn't just a classification problem ("am I done yet?"). It's an intervention that changes the instruction context, which in turn affects future actions and observations.
The authors work under what they call a "deployable low-calibration regime," which is their way of saying they want a solution that doesn't require retraining or recalibrating for every new instruction. This is a practical constraint that I wish more papers would adopt. A system that needs test-time relearning for each new task isn't really deployable.
The CaB authors' key insight is that collapsing boundary evidence into a single scalar (essentially, a confidence score for "am I done?") is brittle. Different tasks have different completion signatures, and a globally calibrated threshold will inevitably fail on some subset of them.
CaB instead predicts what they call Boundary-Phase Tokens: Before, Hit, or After. This retains what they describe as "two-sided boundary evidence." The system isn't just asking "am I done?" but rather "am I approaching completion, at completion, or past completion?" This distinction matters for the cascade problem. If you're already past the boundary when you detect it, you've likely corrupted the setup for the next subtask.
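To make the three-way distinction concrete, here is a minimal Python sketch of a switching rule driven by phase tokens. It is not the paper's implementation: the `Phase` enum, the `find_handoff` function, and the `confirm` debounce parameter are my own assumptions for illustration. The point is only that an After token carries information a single "done" score would lose, namely that the boundary was overshot.

```python
from enum import Enum
from typing import Iterable, Optional, Tuple


class Phase(Enum):
    BEFORE = "Before"
    HIT = "Hit"
    AFTER = "After"


def find_handoff(phases: Iterable[Phase], confirm: int = 2) -> Optional[Tuple[int, bool]]:
    """Return (step, overshot) for the first handoff, or None if none occurs.

    Hypothetical rule, not taken from the paper: hand off after `confirm`
    consecutive Hit tokens, or immediately on After, flagging the overshoot.
    """
    hits_in_a_row = 0
    for step, phase in enumerate(phases):
        if phase is Phase.AFTER:
            # The boundary was missed; the caller may want to treat this differently.
            return step, True
        hits_in_a_row = hits_in_a_row + 1 if phase is Phase.HIT else 0
        if hits_in_a_row >= confirm:
            return step, False
    return None


# A clean handoff: two Hit tokens in a row trigger the switch at step 2.
stream = [Phase.BEFORE, Phase.HIT, Phase.HIT, Phase.AFTER]
print(find_handoff(stream))  # (2, False)
```

A scalar detector would have to express "overshot" as just another high confidence value. With three phases, the caller can at least tell a timely switch from a late one.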
The paper splits this into two components: CaB-When (for deciding when to switch) and CaB-How (for conditioning action generation through handoffs). I know I'm being picky here, but the naming convention is a bit cute for my taste. The substance is sound, though.
They evaluate on a Minecraft VLA benchmark, which is... fine. Minecraft has become a standard testbed for this kind of work because it offers compositional tasks in a controllable environment. But I'd want to see this validated on physical robots before drawing strong conclusions. The paper doesn't claim real-world results, which is honest, but it does limit what we can infer about deployability.
The second paper, "See Less, Specify More" (S2), comes at generalization from a different direction, but it's addressing a related failure mode. Their framing is that VLA models often fail because they're trying to do too much at once: inferring local execution details from coarse instructions while also figuring out which parts of the visual scene are relevant.
S2 proposes splitting these concerns. "Specify More" means augmenting the original instruction with refined trajectory-level and subtask-level language that disambiguates the current execution mode. The original instruction is preserved as a high-level goal, but the executor gets more detailed guidance about what it should be doing right now.
"See Less" is perhaps the more interesting contribution. They impose an explicit visual evidence budget, training the executor to act from "task-sufficient evidence rather than unconstrained visual context." This is done without region or mask annotations, which is notable. The idea is that a robot doesn't need to attend to the entire scene to pick up a cup. It needs to attend to the cup, its gripper, and maybe the immediate surroundings. Everything else is potential distraction.
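One way to picture an evidence budget is as a hard cap on how many image patches the executor gets to see. The Python sketch below assumes a per-patch relevance score and keeps only the top `budget` patches. Both the scores and the top-k rule are my assumptions; I am not describing how S2 actually enforces or trains its budget.

```python
from typing import List, Sequence


def apply_evidence_budget(patch_scores: Sequence[float], budget: int) -> List[bool]:
    """Return a keep-mask with at most `budget` True entries.

    Hypothetical top-k selection: the highest-scoring patches are kept and
    everything else is masked out before the executor sees the image.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # Rank patch indices from most to least relevant.
    ranked = sorted(range(len(patch_scores)), key=lambda i: patch_scores[i], reverse=True)
    keep = set(ranked[:budget])
    return [i in keep for i in range(len(patch_scores))]


# Four patches, budget of two: only the two most relevant survive.
scores = [0.1, 0.9, 0.3, 0.7]
print(apply_evidence_budget(scores, budget=2))  # [False, True, False, True]
```

In this toy version, the cup and the gripper would be the high-scoring patches, and the cluttered background would be masked out.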
The results here are striking. Across eight real-robot tasks on TX-G2 (an AgiBot G2-compatible variant) and HSR, S2 raises mean subtask success from 54.2% to 79.0% over pi0.5, a gain of 24.8 percentage points. That's a substantial improvement, and it's on physical hardware, which gives me more confidence than simulation-only results.
The paper's ablations are worth noting. They find that goal-preserving local guidance outperforms instruction replacement. This is actually a bit counterintuitive. You might think that if you're going to give more detailed instructions, you should just replace the vague ones entirely. But preserving the high-level goal while adding local detail works better. The authors argue this prevents the executor from losing track of the overall objective.
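The difference between the two ablation arms is easy to show as prompt construction. This Python sketch is purely illustrative: the `build_executor_prompt` function, its field labels, and the example hints are hypothetical, and the paper's actual prompt format may look nothing like this.

```python
from typing import Optional


def build_executor_prompt(
    goal: str,
    subtask_hint: str,
    trajectory_hint: Optional[str] = None,
    preserve_goal: bool = True,
) -> str:
    """Compose the text given to the executor (hypothetical format).

    preserve_goal=True keeps the original instruction and adds local guidance;
    preserve_goal=False mimics instruction replacement by dropping the goal.
    """
    parts = []
    if preserve_goal:
        parts.append(f"Goal: {goal}")
    parts.append(f"Current subtask: {subtask_hint}")
    if trajectory_hint:
        parts.append(f"Trajectory: {trajectory_hint}")
    return "\n".join(parts)


goal = "pick up the cup, then place it on the shelf"
print(build_executor_prompt(goal, "grasp the cup by its handle", "approach from the right"))
```

With `preserve_goal=False`, the executor would only ever see "grasp the cup by its handle" and would have no textual reminder that a shelf is involved at all.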
Let me try to be precise about what is actually new here, because it matters for understanding where the field actually is.
The completion detection problem itself isn't new. Researchers have been grappling with task segmentation and subtask boundaries for years. What CaB contributes is a specific formalization (Boundary-Phase Tokens) and a deployment-focused constraint (no test-time recalibration). The Minecraft evaluation is limited, but the conceptual framework is clean.
S2's contribution is more empirical. The idea of constraining visual attention isn't novel (there's a long history of attention mechanisms in robotics), but the specific implementation of an evidence budget without requiring annotations is useful. And the real-robot results give it practical weight that many VLA papers lack.
Neither paper solves the fundamental problem of robust task completion in open-ended environments. Both are working on simplified versions of the real challenge. CaB's Minecraft benchmark, while useful, doesn't capture the visual complexity of physical manipulation. S2's tasks, while real-world, are still relatively structured.
It's worth noting that both papers are essentially working around a limitation of current VLA architectures rather than fixing it. The models themselves don't have good internal representations of task completion. These papers add external mechanisms (boundary tokens, evidence budgets) to compensate. This is pragmatic, but it suggests the underlying architectures may need rethinking.
I should explain why this seemingly narrow issue has attracted serious attention. The short answer is that compositionality is the bottleneck.
Current robots can often execute single, well-defined instructions with reasonable reliability. "Pick up the red block" is achievable. But real-world utility requires chaining: "do A, then B." And the failure modes compound. If completion is detected correctly 90% of the time at each step, and errors are independent, a three-step sequence has only about a 73% chance of full success. A ten-step sequence drops to about 35%.
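The compounding is just repeated multiplication. This Python sketch assumes every step succeeds with the same probability and that failures are independent, which is a simplification (real handoff errors are likely correlated). The `chain_success` helper is my own name for it.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that all `steps` succeed, assuming independent, equal odds."""
    return per_step_accuracy ** steps


for n in (1, 3, 10):
    print(f"{n} steps: {chain_success(0.9, n):.1%}")
# 1 steps: 90.0%
# 3 steps: 72.9%
# 10 steps: 34.9%
```

Under the same assumptions, raising per-step accuracy to 99% would give a ten-step success rate of about 90%, which is why small gains at the handoff matter so much.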
This is why the CaB paper specifically focuses on "short composites." They're not trying to solve long-horizon planning. They're trying to make two-step sequences reliable. That's a modest goal, but it's the right one. You can't build reliable long sequences without reliable short ones.
The S2 paper approaches this differently, by making each individual step more robust to visual distractors. If the executor is less likely to get confused by irrelevant scene elements, it's more likely to complete its current subtask correctly, which sets up better conditions for the next one.
Both approaches are, in a sense, about reducing error accumulation. They just target different error sources.
I have some methodological concerns that I think are worth raising, though they don't invalidate the contributions.
First, neither paper addresses recovery from completion detection errors. CaB can tell you that you're Before, Hit, or After relative to a boundary, but what if that prediction is wrong? What if you switch too early? The paper doesn't discuss rollback or error correction mechanisms. In real deployment, you need these.
Second, the S2 paper's evidence budget is fixed at training time, not adapted online. The system decides what to attend to based on training data. But novel objects and novel environments will have different relevance patterns. How does the budget adapt? The paper doesn't say, and the results haven't yet been replicated in other settings.
Third, both papers work with relatively short instruction horizons. CaB explicitly targets short composites. S2 evaluates on subtask success, not full task completion. This is reasonable for current capabilities, but it leaves open the question of whether these approaches scale.
Finally, and this is a limitation of the field generally, we don't have good standardized benchmarks for completion detection. CaB uses a Minecraft benchmark. S2 uses a custom real-robot setup. It's hard to compare approaches when everyone is measuring different things.
If I were reviewing these papers for a future workshop, I'd push for several things.
For CaB: real-robot validation. The conceptual framework is solid, but Minecraft is not a convincing proxy for physical manipulation. The boundary detection problem is presumably harder when you have noisy sensors and imprecise actuators.
For S2: analysis of failure modes when the evidence budget is miscalibrated. What happens when the system attends to the wrong visual patches? How brittle is the approach to out-of-distribution scenes?
For both: explicit comparison. These papers were released the same week and address related problems. Someone should run them head-to-head on the same tasks. (I know, I know, this is easier said than done when one is simulation-only and one is real-robot. But still.)
More broadly, I'd like to see the field converge on completion detection benchmarks that span simulation and real hardware. The current fragmentation makes it hard to track actual progress.
These two papers represent a healthy trend in VLA research: focusing on specific failure modes rather than claiming general capability improvements. Both teams identified a real problem (completion detection, visual distraction), proposed a targeted solution, and evaluated it under reasonable constraints.
This is how progress actually happens. Not through revolutionary breakthroughs (I'm skeptical of anyone claiming those), but through incremental fixes to specific bottlenecks. The handoff problem is real. These papers make partial progress on it. That's valuable.
Whether these specific approaches will matter in two years, I genuinely don't know. VLA architectures are evolving quickly, and external mechanisms like boundary tokens might become unnecessary if the base models improve. But for now, if you're trying to deploy a VLA system for multi-step tasks, both papers offer practical techniques worth considering.
The completion problem remains unsolved in any general sense. But it's being worked on, by people who understand why it matters. That's something.