The Great Humanoid Transfer Problem: Five Papers Point Toward a Post-Training-From-Scratch Future
A cluster of new research suggests we might finally be able to stop retraining humanoid control policies from scratch every time someone builds a new robot. The catch? We're not quite there yet.
Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
If you have ever trained a neural network to control a robot, you know the particular frustration of watching your carefully tuned policy fail completely the moment you swap in a different actuator or change the link lengths by a few centimeters. It is a bit like teaching someone to ride a bicycle, then handing them a unicycle and expecting the same performance. The body is different, the dynamics are different, and all that painstaking training seems to evaporate. This week, a cluster of five papers on arXiv suggests the robotics community is making serious progress on this problem, though (and I want to be precise here) the solutions remain partial and the claims require careful parsing.
The core challenge is what researchers call cross-embodiment transfer. You train a policy on Robot A, and you want it to work on Robot B without starting over. This matters enormously for humanoids specifically because, unlike industrial arms that follow standardized form factors, humanoid platforms are proliferating with wildly different morphologies. Unitree's G1, LimX's Oli and Luna, Figure's robots, Tesla's Optimus: each has different joint configurations, different mass distributions, different actuator characteristics. Training from scratch on each platform is expensive, time-consuming, and fundamentally wasteful if the underlying skills (walking, balancing, manipulating objects) are conceptually similar.
The paper that caught my attention first is Any2Any, which proposes what the authors call a paradigm for transferring whole-body tracking specialists across embodiments. The approach has two stages. First, kinematic alignment maps the input and output spaces between source and target robots so the pretrained policy's outputs are at least geometrically meaningful on the new platform. Second, dynamics adaptation applies parameter-efficient fine-tuning to modules that are particularly sensitive to the physical differences between robots. The headline result is genuinely striking: using only 1% of the compute and data required for full training, the authors claim to successfully transfer Sonic models pretrained on Unitree G1 to both LimX Oli and LimX Luna. That is a substantial efficiency gain if it holds up.
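To make the two stages concrete, here is a minimal Python sketch. It is not the Any2Any implementation: the joint names are hypothetical, the alignment is reduced to reordering joints by name, and the low-rank adapter (LoRA-style) is my own stand-in for the paper's parameter-efficient fine-tuning, whose exact form is not described here.

```python
# Illustrative sketch only; joint names and the adapter form are assumptions.

def align_action(source_action, source_joints, target_joints, default=0.0):
    """Stage 1 (kinematic alignment, simplified): reorder a source-robot
    joint command into the target robot's joint order. Joints the source
    robot lacks receive `default`."""
    by_name = dict(zip(source_joints, source_action))
    return [by_name.get(name, default) for name in target_joints]


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def adapted_linear(W, A, B, x):
    """Stage 2 (dynamics adaptation, simplified): frozen weights W plus a
    trainable low-rank update B @ A. Only A and B would be fine-tuned."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]


# Hypothetical joint lists for a source and a target humanoid.
source_joints = ["hip_pitch", "knee", "ankle_pitch"]
target_joints = ["knee", "hip_pitch", "ankle_pitch", "waist_yaw"]
aligned = align_action([0.1, 0.2, 0.3], source_joints, target_joints)

# With B initialised to zero, the adapted layer reproduces the pretrained one.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.5]]   # rank 1: maps 3 inputs to 1 latent value
B = [[0.0], [0.0]]      # maps 1 latent value to 2 outputs
x = [1.0, 2.0, 3.0]
unchanged = adapted_linear(W, A, B, x)
```

Starting the adapter at zero means fine-tuning begins from the pretrained policy's behaviour and only has to learn the correction for the new robot's dynamics. That is the usual argument for adapters of this kind, and it is a plausible (though here assumed) reason such a transfer can be cheap.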
It is worth noting that the Any2Any paper focuses specifically on whole-body tracking, which is the task of making a humanoid imitate a reference motion trajectory. This is a narrower problem than general-purpose humanoid control, and the transfer is between platforms that share fundamental humanoid structure (bipedal, roughly anthropomorphic proportions). Whether the same approach would work for transferring between, say, a bipedal humanoid and a quadruped remains unclear. The authors are careful not to overclaim, which I appreciate, but the limitations should be explicit: this is transfer within a family of similar embodiments, not arbitrary cross-embodiment generalization.
A second paper, Direct Dynamic Retargeting, attacks a related but distinct problem: how do you translate human motion from video into feasible humanoid motion? The standard pipeline involves geometric retargeting (mapping human joint angles to robot joint angles based on kinematic correspondence) followed by dynamic optimization to make the result physically plausible. The authors argue, and I think they are correct, that this two-stage approach introduces what they call geometric bias. By forcing the solution through an intermediate kinematic representation, you constrain the search space in ways that may exclude dynamically superior solutions.
Their alternative, Direct Dynamic Retargeting or DDR, skips the kinematic intermediate entirely. Instead, they formulate the problem in task space and use sampling-based model predictive control within a physics simulator to generate trajectories directly from expert videos. The key insight is that the geometric projection step, while computationally convenient, is not actually necessary if you have sufficient compute for direct optimization. The authors report a meaningful improvement in demonstration tracking accuracy over baselines that use geometric retargeting. More importantly for practical applications, they find that providing these physically viable references to reinforcement learning agents accelerates training convergence.
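For readers unfamiliar with sampling-based model predictive control, the sketch below shows the simplest variant (random shooting) on a toy problem. Everything about it is an assumption for illustration: the "simulator" is a 1D point mass rather than a physics engine, the reference is a constant target rather than a motion extracted from video, and the sampler is plain Gaussian noise. It is not the DDR algorithm, only the general pattern of optimizing directly against simulated dynamics.

```python
import random

DT = 0.05  # integration step of the toy "simulator" (assumed value)


def rollout(pos, vel, accels):
    """Simulate a 1D point mass and return the visited positions."""
    positions = []
    for a in accels:
        vel += a * DT
        pos += vel * DT
        positions.append(pos)
    return positions


def tracking_cost(positions, reference):
    """Sum of squared task-space errors against the reference motion."""
    return sum((p - r) ** 2 for p, r in zip(positions, reference))


def sampling_mpc_step(pos, vel, reference, n_samples=256, sigma=5.0, rng=None):
    """Random-shooting MPC: sample action sequences, simulate each one,
    and return the first action of the cheapest sequence with its cost."""
    rng = rng or random.Random(0)
    horizon = len(reference)
    # Always include the zero-action sequence so the result is never
    # worse than doing nothing.
    candidates = [[0.0] * horizon]
    candidates += [[rng.gauss(0.0, sigma) for _ in range(horizon)]
                   for _ in range(n_samples)]

    def cost_of(seq):
        return tracking_cost(rollout(pos, vel, seq), reference)

    best = min(candidates, key=cost_of)
    return best[0], cost_of(best)


reference = [1.0] * 10  # stand-in for a reference extracted from video
first_action, cost = sampling_mpc_step(0.0, 0.0, reference, rng=random.Random(0))
```

Note that no kinematic intermediate appears anywhere: candidate motions are scored only by how well the simulated result tracks the reference, which is the property the DDR authors argue for.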
I should note a methodology concern here: the paper promises source code will be made publicly available, but it is not available yet. Until independent replication happens, the results remain preliminary. The sample size of tested motions and the specific choice of baselines also deserve scrutiny. Still, the conceptual contribution (that geometric retargeting is a bottleneck we can potentially eliminate) seems sound.
The third paper in this cluster, ParkourFormer, addresses a different aspect of humanoid control: how do you handle rapidly changing terrain? Parkour is an interesting test case because it requires the robot to anticipate contact transitions and plan through dynamic maneuvers that standard reactive policies struggle with. The authors' argument is that existing reinforcement learning policies are too reactive, mapping observations directly to actions without explicitly modeling future body states.
ParkourFormer uses a Transformer architecture with a twist: a lightweight prediction head forecasts short-horizon future proprioceptive states, and these predictions are fused with temporal features to generate actions. The robot is, in effect, trained to imagine where its body will be in the near future and to condition its current actions on that prediction. The reported results are impressive: 93.85% average traversal success rate across a diverse terrain benchmark, with improvements of up to 42.73% over MLP and vanilla Transformer baselines. The model maintains a single unified policy across all terrain types, which is notable because many prior approaches require terrain-specific policies.
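The data flow is easier to see in code. The sketch below is a deliberately tiny stand-in, not the ParkourFormer architecture: the Transformer encoder is replaced by a given feature vector, both heads are single linear maps, fusion is plain concatenation, and the mean-squared-error loss on the forecast is my assumption about how such a head could be supervised.

```python
# Illustrative sketch only; layer shapes, fusion, and loss are assumptions.

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def policy_step(temporal_features, W_pred, W_act):
    """Forecast a future proprioceptive state, fuse it with the temporal
    features, and map the fused vector to an action."""
    predicted_future = matvec(W_pred, temporal_features)
    fused = temporal_features + predicted_future  # list concatenation
    action = matvec(W_act, fused)
    return action, predicted_future


def prediction_loss(predicted_future, actual_future):
    """Mean squared error between the forecast and the state actually
    observed a few steps later (assumed auxiliary training signal)."""
    n = len(predicted_future)
    return sum((p - a) ** 2 for p, a in zip(predicted_future, actual_future)) / n


features = [1.0, 2.0, 3.0]                    # stand-in for encoder output
W_pred = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # 3 features -> 2 predicted states
W_act = [[1.0, 1.0, 1.0, 1.0, 1.0]]           # 5 fused inputs -> 1 action
action, predicted = policy_step(features, W_pred, W_act)
loss = prediction_loss(predicted, [1.0, 4.0])
```

The point of the structure is that the action head sees the forecast as an input, so the policy can react to where the body is expected to be rather than only to where it is now.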
I know I am being picky here, but the 42.73% improvement figure deserves context. Improvements over baselines depend heavily on how strong the baselines are, and the paper compares against MLPs and vanilla Transformers rather than against the current state-of-the-art in terrain-adaptive locomotion. The real-robot experiments are encouraging but limited. Still, the core idea (that explicit future-state prediction helps with agile locomotion) is intuitive and the architectural choices seem well-motivated.
The fourth paper, X-DiffVLA, tackles cross-embodiment transfer from a different angle. Rather than transferring low-level control policies, this work focuses on Vision-Language-Action models, the increasingly popular approach of using large pretrained vision-language models as the backbone for robot control. The problem the authors identify is that existing VLA models require embodiment-specific fine-tuning, which defeats much of the purpose of pretraining on diverse data.
X-DiffVLA introduces a diffusion-based action head with two key innovations. First, Embodiment Forcing uses classifier-free guidance to steer action generation toward embodiment-specific components without explicit supervision. Second, Morphological Tree Diffusion strengthens behavioral correlations across diverse end-effectors. The experiments span RoboCasa and Isaac Gym environments with embodiments ranging from grippers to dexterous hands, and the authors report improvements of 15.3% and 12.5% respectively over prior methods.
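Classifier-free guidance itself is a standard diffusion technique, and its core is a one-line extrapolation between a conditional and an unconditional prediction. The sketch below shows that generic formula, with the embodiment as the conditioning signal. How X-DiffVLA wires this into Embodiment Forcing is not detailed here, so treat the function name and the example numbers as hypothetical.

```python
# Generic classifier-free guidance; names and values are illustrative.

def guided_prediction(pred_uncond, pred_cond, guidance_scale):
    """Combine an unconditional and an embodiment-conditioned denoiser
    prediction. A scale of 0 ignores the condition, 1 uses the conditional
    prediction as-is, and larger values push further toward it."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(pred_uncond, pred_cond)]


pred_uncond = [0.0, 1.0]  # denoiser output with the embodiment token dropped
pred_cond = [1.0, 1.0]    # denoiser output conditioned on the embodiment
steered = guided_prediction(pred_uncond, pred_cond, guidance_scale=2.0)
```

Because the same network produces both predictions (the condition is simply dropped for one of them), no separate embodiment classifier is needed, which fits the authors' description of steering without explicit supervision.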
This is genuinely new territory. Most prior work on VLA models has focused on single-embodiment settings or has treated cross-embodiment transfer as a fine-tuning problem. The idea of learning a unified action head that can generate appropriate actions for different embodiments based on implicit conditioning is elegant. The real-world evaluations are limited (as they almost always are) but the simulation results suggest the approach has legs.
The fifth paper, SCRIPT, addresses language-driven physics-based humanoid control, which is the task of making a humanoid execute motions described in natural language. The authors argue that existing methods fail to jointly achieve faithful instruction following, high-quality motion, and stable long-horizon control. Their solution is a Joint Action-State-Text Diffusion Transformer (they do love their acronyms) that represents actions, physical states, and text as separate token streams coupled through joint attention.
Two technical contributions stand out. First, a nonlinear history conditioning mechanism preserves dense recent context while sampling increasingly sparse cues from long-term history, which addresses the problem of autoregressive control becoming unstable over long horizons. Second, a post-training stage uses reinforcement learning with hybrid rewards that combine physical feedback with text-based rewards. The scaling studies on the 1200-hour MotionMillion dataset are particularly interesting: the authors demonstrate consistent performance gains with model scaling, which suggests the approach might benefit from the same scaling laws that have transformed language models.
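One way to picture "dense recent, sparse distant" history is as a choice of which past timesteps to feed the model. The sketch below keeps the most recent steps in full and then reaches further back with geometrically growing gaps. The geometric schedule, the function name, and the parameter names are my assumptions; the paper's actual sampling rule may differ.

```python
# Illustrative sketch only; the spacing schedule is an assumption.

def history_indices(t, dense=8, max_sparse=8, base=2):
    """Return the past timesteps to condition on at step t: the last
    `dense` steps in full, then up to `max_sparse` older steps whose
    spacing grows by a factor of `base` each time."""
    recent_start = max(0, t - dense + 1)
    recent = list(range(recent_start, t + 1))

    sparse = []
    gap = base
    cursor = recent_start - gap
    while cursor >= 0 and len(sparse) < max_sparse:
        sparse.append(cursor)
        gap *= base
        cursor -= gap
    return sorted(sparse) + recent


late = history_indices(100, dense=4, max_sparse=3)
early = history_indices(2, dense=4, max_sparse=3)
```

With these settings, step 100 conditions on steps 83, 91, and 95 from the distant past plus steps 97 through 100, so the context length stays bounded however long the episode runs.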
Taken together, these five papers suggest several things about where humanoid control research is heading. First, the field is moving away from training-from-scratch toward transfer and fine-tuning approaches. This mirrors the trajectory of natural language processing and computer vision, where pretrained models and efficient adaptation have become standard. Second, diffusion models are becoming increasingly central to robot control, appearing in at least two of the five papers here (X-DiffVLA and SCRIPT). The generative framing seems to offer advantages for handling multimodal action distributions and for incorporating conditioning signals. Third, there is growing recognition that the intermediate representations we use (kinematic retargeting, geometric projections, embodiment-specific action spaces) may be limiting factors that can be bypassed with sufficient compute and better architectures.
What remains unclear is how these approaches will compose. Can you combine Any2Any's efficient embodiment transfer with ParkourFormer's future-state prediction and SCRIPT's language conditioning? The papers exist in relative isolation, each addressing one piece of the puzzle. The integration problem is, I suspect, where most of the remaining difficulty lies.
There are also questions about real-world deployment that none of these papers fully address. Simulation-to-reality transfer remains challenging, and the real-robot experiments in these papers are limited in scope and duration. The parkour results, for instance, are demonstrated on flat terrain with artificial obstacles rather than the truly unstructured environments where such capabilities would be most valuable. This is not a criticism of the research (you have to start somewhere) but a reminder that the gap between impressive simulation results and robust real-world deployment remains substantial.
I would want to see several things in follow-up work. First, direct comparisons between these approaches on standardized benchmarks. Currently, each paper uses its own evaluation setup, making it difficult to assess relative strengths. Second, longer-horizon evaluations that test stability over hours rather than minutes. Third, transfer experiments across more diverse embodiments, including non-humanoid platforms, to understand the boundaries of these techniques. Fourth, and this is probably the most important, open-source implementations that allow independent replication and extension.
The broader trajectory seems clear. We are moving toward a world where humanoid control policies are pretrained on large datasets, transferred across embodiments with minimal fine-tuning, and conditioned on high-level instructions rather than low-level trajectories. This is basically the recipe that worked for language models, adapted for physical systems. Whether the analogy holds completely remains to be seen (physical systems have constraints that text does not) but the early results are encouraging.
One final observation: all five of these papers come from academic or research lab settings, not from the humanoid startups that have attracted so much attention and capital. The companies building humanoid robots have been relatively quiet about their control approaches, which makes it difficult to assess how much of this research is actually being deployed. It is possible that the production systems at Figure or Tesla use entirely different methods. It is also possible that they are building on exactly this kind of foundational research. We do not know yet, and the companies are not saying.