Two New Papers Tackle the Hardest Parts of Training Robot Policies in the Real World

A fine-tuning method called HABC and a video-based evaluation framework called SC3-Eval each address long-standing bottlenecks in deploying vision-language-action models on physical robots.

18 June 2026 · 10 min read

Two preprints posted this week to arXiv target problems that anyone who has tried to deploy a vision-language-action (VLA) model on a real robot will recognise immediately: how do you train effectively from sparse, binary outcome signals, and how do you evaluate a policy without running hundreds of expensive physical rollouts? The proposed answers, Hierarchical Advantage-Weighted Behavior Cloning (HABC) and SC3-Eval, are technically distinct but complementary. Together they point to a maturing understanding of what actually breaks in the VLA fine-tuning pipeline.

I want to be careful not to oversell either paper. Both are single-lab results, neither has been replicated, and the real-world experiments are small. But the problems they identify are real, and the framing in both cases is sharper than most of what I have read this year.

What problem is HABC actually solving?

The standard setup for online reinforcement learning fine-tuning of a pretrained VLA goes roughly like this: you run the robot through a task, observe whether it succeeded or failed, and use that binary outcome to update the policy. The difficulty is that a single binary label has to be distributed across every transition in the episode, which can be dozens or hundreds of individual actions. Most existing approaches collapse the episode outcome to a scalar reward or advantage signal and assign it uniformly, or with some simple discount, across those transitions.
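To make the baseline concrete, here is a minimal sketch of the two simple credit-assignment schemes described above: spreading the binary outcome uniformly over the episode, and discounting it back from the final step. This illustrates the conventional approach, not HABC itself. The function names, the +1/-1 encoding in the uniform case, and the default discount factor are my own assumptions for illustration and do not come from the paper.

```python
from typing import List


def uniform_credit(success: bool, num_steps: int) -> List[float]:
    """Assign the same signal to every transition in the episode.

    Assumption for illustration: success maps to +1.0 and failure to -1.0.
    """
    signal = 1.0 if success else -1.0
    return [signal] * num_steps


def discounted_credit(success: bool, num_steps: int, gamma: float = 0.9) -> List[float]:
    """Treat the outcome as a terminal reward and discount it backwards.

    The transition at step t receives gamma ** (num_steps - 1 - t) * outcome,
    so later actions get more credit than earlier ones.
    Assumption for illustration: success maps to 1.0 and failure to 0.0.
    """
    outcome = 1.0 if success else 0.0
    return [outcome * gamma ** (num_steps - 1 - t) for t in range(num_steps)]


if __name__ == "__main__":
    # A successful 5-step episode under both schemes.
    print(uniform_credit(True, 5))
    print(discounted_credit(True, 5, gamma=0.9))
```

Under either scheme, every transition's weight is derived from the same single bit, so a good early action in a failed episode is penalised in the uniform case, or simply ignored in the discounted case, along with everything else in that episode.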

The authors of arXiv:2606.17043 identify two specific failure modes in this approach. I think they are worth separating clearly, because they are distinct problems that have been conflated in the literature.
