Two New Papers Tackle the Hardest Parts of Training Robot Policies in the Real World
A fine-tuning method called HABC and a video-based evaluation framework called SC3-Eval each address long-standing bottlenecks in deploying vision-language-action models on physical robots.
8 hours ago · 10 min read
Two preprints posted this week to arXiv target problems that anyone who has tried to deploy a vision-language-action (VLA) model on a real robot will recognise immediately: how do you train effectively from sparse, binary outcome signals, and how do you evaluate a policy without running hundreds of expensive physical rollouts? The answers proposed, Hierarchical Advantage-Weighted Behavior Cloning (HABC) and SC3-Eval, are technically distinct but complementary, and together they point at a maturing understanding of what actually breaks in the VLA fine-tuning pipeline.
I want to be careful not to oversell either paper. Both are single-lab results, neither has been replicated, and the real-world experiments are small. But the problems they identify are real, and the framing in both cases is sharper than most of what I have read this year.
The standard setup for online reinforcement learning fine-tuning of a pretrained VLA goes roughly like this: you run the robot through a task, observe whether it succeeded or failed, and use that binary outcome to update the policy. The difficulty is that a single binary label has to be distributed across every transition in the episode, which can be dozens or hundreds of individual actions. Most existing approaches collapse the episode outcome to a scalar reward or advantage signal and assign it uniformly, or with some simple discount, across those transitions.
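To make that baseline concrete, here is a minimal sketch of uniform or discounted assignment of one binary outcome across an episode. The function name and the `discount` parameter are my own illustration, not taken from either paper.

```python
from typing import List


def uniform_outcome_credit(num_steps: int, success: bool, discount: float = 1.0) -> List[float]:
    """Spread one binary episode outcome over every transition.

    With discount=1.0 every step receives the same value. With discount<1.0
    steps closer to the end of the episode receive more of the outcome.
    """
    outcome = 1.0 if success else 0.0
    # Step t is (num_steps - 1 - t) steps away from the terminal outcome.
    return [outcome * discount ** (num_steps - 1 - t) for t in range(num_steps)]


# A short success and a long success look identical per step when discount=1.0.
fast = uniform_outcome_credit(3, success=True)
slow = uniform_outcome_credit(6, success=True)
```

Every entry of both `fast` and `slow` is 1.0, so nothing in the per-step signal distinguishes a quick completion from a slow one.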
The authors of arXiv:2606.17043 identify two specific failure modes in this approach. I think they are worth separating clearly, because they are distinct problems that have been conflated in the literature.
The first is what they call the viability-efficiency conflation. Once a policy has learned to succeed at all, a binary success label stops providing any useful gradient signal to distinguish a clean, efficient completion from a slow, barely-adequate one. The label is the same either way. This is a well-known limitation of sparse reward RL in general, but it is particularly acute in VLA fine-tuning because the pretrained model often gets to reasonable success rates quickly, after which training stalls.
The second problem is subtler and more specific to the human-in-the-loop setting that is increasingly common in real-robot experiments. When a human intervenes mid-rollout to correct the robot, the episode contains a mix of autonomous segments and intervention segments. If you assign the final outcome label to the intervention segments, you are crediting (or blaming) the policy for actions it did not take. The authors call this the problem of credit assignment across intervention boundaries, and I think they are right that naively assigning outcomes across those boundaries introduces a systematic bias that has not received enough explicit attention.
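To make the bias concrete, the sketch below contrasts naive labelling with labelling restricted to steps the policy executed itself. The per-step `policy_controlled` flag and the use of `None` for "no outcome label" are assumptions of mine. The preprint may treat intervention steps differently, for example by still using them as demonstrations.

```python
from typing import List, Optional


def naive_outcome_labels(outcome: float, num_steps: int) -> List[float]:
    """Attach the episode outcome to every step, regardless of who acted."""
    return [outcome] * num_steps


def masked_outcome_labels(outcome: float, policy_controlled: List[bool]) -> List[Optional[float]]:
    """Attach the episode outcome only to steps the policy itself executed.

    Steps taken during a human intervention get None, meaning the outcome
    is not credited to (or blamed on) the policy for those steps.
    """
    return [outcome if by_policy else None for by_policy in policy_controlled]


# Hypothetical five-step episode: the human takes over for steps 2 and 3.
controlled = [True, True, False, False, True]
labels = masked_outcome_labels(1.0, controlled)
```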
HABC addresses both problems by training two separate critic heads: one for viability (did the policy make progress toward success?) and one for efficiency (did it do so quickly and cleanly?). A state-adaptive gate, denoted $g_t$ in the paper, merges the one-step advantages from these two heads, weighting viability more heavily when the policy is uncertain and shifting toward efficiency once viability is high. The combined signal becomes per-transition weights on the actor loss, which is a behavior cloning objective rather than a direct policy gradient. Intervention-aware credit assignment further restricts outcome labels to the segments actually executed by the current policy.
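A rough sketch of how a gated two-head advantage could turn into per-transition behavior cloning weights follows. Everything specific in it is my assumption rather than the paper's formulation: the linear gate `1 - viability_value`, the exponentiated-advantage weight (borrowed from advantage-weighted regression), and the `temperature` and `max_weight` parameters.

```python
import math
from typing import List


def gate(viability_value: float) -> float:
    """Hypothetical gate g_t: the weight placed on the viability advantage.

    Close to 1 when the estimated chance of success is low, close to 0 when
    it is high, so emphasis shifts toward efficiency as viability rises.
    """
    return 1.0 - min(max(viability_value, 0.0), 1.0)


def combined_advantage(adv_viability: float, adv_efficiency: float, viability_value: float) -> float:
    """Blend the two one-step advantages with the state-dependent gate."""
    g = gate(viability_value)
    return g * adv_viability + (1.0 - g) * adv_efficiency


def bc_weight(advantage: float, temperature: float = 1.0, max_weight: float = 20.0) -> float:
    """Exponentiated-advantage weight, clipped for stability (assumed form)."""
    return min(math.exp(advantage / temperature), max_weight)


def weighted_bc_loss(neg_log_probs: List[float], weights: List[float]) -> float:
    """Mean of per-transition behavior cloning losses scaled by their weights."""
    return sum(w * nll for w, nll in zip(weights, neg_log_probs)) / len(weights)
```

The point of the design is that the actor never sees a policy gradient. It sees an ordinary imitation loss in which transitions with a higher blended advantage count for more.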
The results on three contact-rich bimanual tasks are striking on their face: HABC raises success rates from supervised fine-tuning baselines of 36%, 44%, and 12% to 92%, 88%, and 38%, respectively. That third task, at 38%, is a useful reminder that not everything works equally well. The authors do not hide this, which I appreciate. Contact-rich bimanual manipulation is a hard benchmark, not a toy domain, so these numbers are meaningful if they hold up.
What I cannot assess from the preprint alone is how sensitive these results are to the hyperparameters governing the gate function and the weighting between the two critic heads. The paper describes a state-adaptive mechanism, but the details of how that adaptivity is tuned in practice will matter a great deal for anyone trying to reproduce this. The sample size is small, three tasks in one lab, and this has not been replicated yet.
This is where I want to be slightly pedantic, because the distinction matters for how much credit to assign.
The idea of decomposing reward signals into multiple objectives with separate critics is not new. Multi-objective RL has a long history, and the specific idea of separating task completion from task efficiency has appeared in various forms. The intervention-aware credit assignment piece is more novel, at least in this specific framing applied to VLA fine-tuning with human corrections. I am not aware of a prior paper that addresses this exact problem in this exact setting, though related ideas appear in the imitation learning literature on learning from human demonstrations with corrections.
What HABC contributes, then, is a specific, practical instantiation of these ideas that is designed to work with the particular structure of VLA fine-tuning: pretrained backbone, sparse binary outcomes, mixed autonomous and intervention segments. That is incremental over prior work in multi-objective RL, but it is a meaningful and well-motivated increment rather than a superficial one. The engineering of the state-adaptive gate to handle the viability-efficiency tradeoff dynamically is the part I find most interesting, and it is genuinely tailored to this setting.
The second paper, arXiv:2606.18610, addresses a different but equally painful problem: evaluation. Running a robot policy in the real world is slow, expensive, and hard to scale. If you want to compare seven policies across multiple tasks, you are looking at hundreds or thousands of physical rollouts. Video world models offer a way to simulate those rollouts, but they come with their own failure modes.
SC3-Eval proposes three consistency constraints that work together to keep generated video rollouts physically plausible over long horizons.
Forward-inverse dynamics consistency trains the model to both predict future frames from actions and recover actions from frames. The intuition is that a model that can only predict forward has no penalty for generating physically implausible frames, as long as they look locally reasonable. Tying the forward and inverse modes together anchors the generated rollout to a manifold of physically plausible action sequences.
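One way to picture the coupling is a combined loss with a forward term and an inverse term. This is a toy sketch under my own assumptions: frames and actions are plain lists of floats, the inverse model is applied to the predicted frame rather than the real one, and `inverse_weight` is a hypothetical knob. The preprint's actual objective may differ on all three points.

```python
from typing import Callable, List

Vector = List[float]  # stand-in for a frame, a latent, or an action


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def forward_inverse_loss(
    frame: Vector,
    action: Vector,
    true_next_frame: Vector,
    forward_model: Callable[[Vector, Vector], Vector],
    inverse_model: Callable[[Vector, Vector], Vector],
    inverse_weight: float = 1.0,
) -> float:
    """Forward prediction error plus error in recovering the action."""
    predicted_next = forward_model(frame, action)
    forward_err = mse(predicted_next, true_next_frame)
    # The inverse model must recover the requested action from the frame pair.
    recovered_action = inverse_model(frame, predicted_next)
    inverse_err = mse(recovered_action, action)
    return forward_err + inverse_weight * inverse_err
```

With this coupling, a predicted frame that looks plausible but does not correspond to the requested action is penalised through the inverse term.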
Cross-view consistency addresses a problem specific to multi-camera robot setups: if you are generating rollouts from, say, a wrist camera and a fixed overhead camera simultaneously, the two views need to remain geometrically consistent with each other over time. SC3-Eval trains the model to inpaint each view from the other, which enforces this consistency without requiring an explicit memory mechanism or 3D scene representation.
Test-time consistency is the most practically interesting of the three. At inference, the inverse dynamics mode is reused as an uncertainty signal: if the frames generated by the forward model are drifting away from the actions that were requested, the inverse model will detect this as a mismatch, and the rollout is terminated early. This is essentially a self-calibrating confidence mechanism that prevents the compounding error problem from accumulating silently.
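The early-termination logic described above can be sketched as a simple loop. The Euclidean mismatch measure, the `max_mismatch` threshold, and the callable signatures are illustrative assumptions; the preprint may define its uncertainty signal differently.

```python
import math
from typing import Callable, List, Sequence, Tuple

Frame = List[float]   # stand-in for an image or latent
Action = List[float]


def rollout_with_consistency_check(
    first_frame: Frame,
    actions: Sequence[Action],
    forward_model: Callable[[Frame, Action], Frame],
    inverse_model: Callable[[Frame, Frame], Action],
    max_mismatch: float,
) -> Tuple[List[Frame], bool]:
    """Generate frames until the actions run out or the rollout drifts.

    Returns the frames accepted so far and a flag that is True when the
    rollout was stopped early.
    """
    frames = [first_frame]
    for action in actions:
        next_frame = forward_model(frames[-1], action)
        recovered = inverse_model(frames[-1], next_frame)
        # Distance between the requested action and the one the inverse
        # model reads back from the generated frames (assumed measure).
        mismatch = math.sqrt(sum((a - r) ** 2 for a, r in zip(action, recovered)))
        if mismatch > max_mismatch:
            return frames, True
        frames.append(next_frame)
    return frames, False
```

A rollout that is cut short this way is flagged as unreliable instead of being scored as if the generated video were trustworthy.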
The evaluation results are reported across seven real-world VLA policies, which is a reasonable scale for this kind of benchmark. SC3-Eval achieves a closed-loop Pearson correlation of 0.929 between simulated and real-world success rates, and an MMRV of 0.119. (MMRV is a ranking metric I was not previously familiar with; it measures how well the evaluator reproduces the relative ranking of policies.) The paper reports that this outperforms three prior video-model-based baselines. The correlation of 0.929 is high enough to be practically useful if it generalises, but it remains unclear how well these numbers will hold on tasks and policies that are further outside the training distribution.
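For readers who want to see what the two metrics compute, here is a small sketch. I am assuming MMRV means Mean Maximum Rank Violation as used in earlier simulated-evaluation work, where lower is better; the preprint's exact definition should be checked. The success rates in the example are made up for illustration and are not the paper's data.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def mmrv(real: Sequence[float], sim: Sequence[float]) -> float:
    """Mean Maximum Rank Violation (assumed definition).

    For each policy i, find the largest real-world performance gap to any
    policy j whose ordering relative to i is flipped by the simulator, then
    average those worst-case gaps over all policies.
    """
    n = len(real)
    worst_per_policy = []
    for i in range(n):
        violations = [
            abs(real[i] - real[j]) if (sim[i] < sim[j]) != (real[i] < real[j]) else 0.0
            for j in range(n)
        ]
        worst_per_policy.append(max(violations))
    return sum(worst_per_policy) / n


# Made-up success rates for three hypothetical policies.
real_rates = [0.2, 0.5, 0.9]
sim_rates = [0.4, 0.1, 0.8]  # the simulator swaps the first two policies
score = mmrv(real_rates, sim_rates)
```

In this toy case the simulator swaps two policies whose real success rates differ by 0.3, so two of the three policies carry a worst-case violation of 0.3 and `score` comes out at roughly 0.2. The two metrics are complementary: correlation rewards getting the magnitudes right, while MMRV only penalises ranking mistakes, weighted by how much they matter in the real world.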
The cross-view consistency piece may be the most underappreciated contribution here. Multi-camera consistency in long-horizon video generation is a hard problem that has received relatively little attention in the robot learning literature specifically, even though most real robot setups use multiple cameras. The approach described, inpainting each view from the other during training, is simple enough to implement and does not require any explicit geometric reasoning, which makes it attractive.
Taken individually, each paper addresses a specific technical problem. Taken together, they suggest something about where the field is in the VLA deployment pipeline.
A year ago, the dominant conversation was about whether pretrained VLAs could do anything useful on physical robots at all. That question has been answered affirmatively, if not definitively. The conversation has shifted to the harder, more practical questions: how do you fine-tune these models efficiently in the real world, and how do you know when you have succeeded?
HABC and SC3-Eval are both attempts to answer those second-order questions. HABC says: the standard approach to fine-tuning from sparse outcomes is losing information it doesn't have to lose, and here is a structured way to recover some of it. SC3-Eval says: physical evaluation is a bottleneck, and a sufficiently constrained video model can substitute for it with high fidelity.
The combination matters because fine-tuning and evaluation are not independent. If SC3-Eval can reliably simulate policy rollouts, it becomes possible to run many more iterations of HABC-style fine-tuning without incurring the cost of physical experiments. That is a sort of multiplier effect, and it is the kind of infrastructure improvement that tends to accelerate progress in a field more than any single algorithmic advance.
I should be honest about the limitations here, though. Both papers are from single labs, both are based on small sets of tasks, and neither has been independently replicated. The Pearson correlation of 0.929 for SC3-Eval is impressive, but it is based on seven policies, which is a limited sample. The HABC results on the third task (12% to 38%) are a useful reminder that these methods are not universally effective.
For HABC, the most important next step is replication on a broader set of tasks and robot platforms. The three bimanual contact-rich tasks used in the paper are a reasonable starting point, but they share a common structure. I would want to see the method tested on tasks with different failure modes, particularly tasks where the viability-efficiency tradeoff looks different. I would also want a clearer ablation of the state-adaptive gate: how much of the gain comes from the two-critic decomposition versus the intervention-aware credit assignment versus the gating mechanism itself? The paper includes some ablations. I may be being picky here, but a more systematic sensitivity analysis would strengthen the claims considerably.
For SC3-Eval, the key question is generalisation. A correlation of 0.929 across seven policies is encouraging, but the evaluator needs to generalise to policies whose behaviors lie outside its training distribution, and the paper acknowledges this challenge. I would want to see experiments where the evaluated policies are deliberately out-of-distribution relative to what the video model was trained on, to understand where the correlation degrades. The test-time consistency mechanism is designed to catch this case, but whether it catches it reliably enough to prevent misleading evaluations remains to be seen.
More broadly, the field needs better shared benchmarks for both fine-tuning methods and evaluation methods. Right now, it is difficult to compare results across labs because the tasks, robot platforms, and evaluation protocols differ. That makes it hard to tell whether the improvements reported in papers like these are task-specific or reflect general progress. That is not a criticism of either paper specifically; it is a structural problem in the field that neither paper can solve on its own.