Three New Papers Point to the Same Conclusion: Humanoid Control Is Becoming a Data Problem
A trio of arXiv papers this week suggests the field is converging on generative approaches trained on large motion datasets, but the real bottleneck might not be algorithms.
Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
What happens when three separate research teams, working on the same fundamental problem, arrive at remarkably similar conclusions within the same week?
That's the question I found myself asking after reviewing a cluster of humanoid locomotion papers that hit arXiv over the past few days. MuGen, SCRIPT, and Direct Dynamic Retargeting (DDR) each tackle the challenge of getting humanoid robots to move like humans. And while their specific approaches differ, the convergence in methodology is striking enough to suggest something important about where this field is heading.
Let me break down the three approaches, because the technical details matter here.
MuGen, described in an arXiv preprint, uses vector-quantized variational autoencoders (VQ-VAEs) trained with model-based reinforcement learning. The key idea is to build a "generative representation of locomotion" from what the authors describe as "hours of heterogeneous human performance data." They employ a teacher-student learning framework with a new policy distillation strategy. The result: a robot that can track and mimic human motions it has never seen before.
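For readers who haven't met vector quantization, here is a minimal NumPy sketch of the core step: snapping a continuous latent vector to its nearest entry in a codebook. This is the generic VQ-VAE lookup, not MuGen's implementation. The codebook size, the latent dimension, and the `quantize` name are all assumptions made for illustration.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (squared Euclidean distance)."""
    # Pairwise differences, shape (num_latents, num_codes, dim)
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]

# Hypothetical codebook: 8 "motion codes", 4 dimensions each, hard-coded for clarity.
codebook = np.arange(8, dtype=float)[:, None] * np.ones((1, 4))

# Pretend encoder outputs: noisy copies of codes 2, 5, and 5.
rng = np.random.default_rng(0)
latents = codebook[[2, 5, 5]] + 0.01 * rng.normal(size=(3, 4))

indices, quantized = quantize(latents, codebook)
print(indices)  # discrete tokens a downstream policy could consume
```

In a trained model the codebook is learned jointly with an encoder and decoder; it is hard-coded here so the lookup is easy to follow.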
SCRIPT, detailed in another arXiv paper, takes a different architectural approach with what the researchers call a Joint Action-State-Text Diffusion Transformer (JAST-DiT). The system represents actions, physical states, and text as separate token streams, then couples them through joint attention. What caught my attention was their training regime: supervised imitation pre-training followed by reinforcement learning with hybrid rewards. They tested on the MotionMillion dataset, which contains 1,200 hours of motion data.
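To make "separate token streams coupled through joint attention" concrete, here is a rough single-head sketch in NumPy. It is my reading of the general idea, not the JAST-DiT architecture: each stream keeps its own projection weights, but attention is computed over the concatenation of all streams. The token counts, model width, random untrained weights, and function names are assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def joint_attention(streams, rng):
    """Single-head attention where every token can attend to every stream."""
    d_model = streams[0].shape[1]
    queries, keys, values = [], [], []
    for tokens in streams:
        # Each stream gets its own (random, untrained) Q/K/V projections.
        w_q, w_k, w_v = rng.normal(scale=d_model ** -0.5, size=(3, d_model, d_model))
        queries.append(tokens @ w_q)
        keys.append(tokens @ w_k)
        values.append(tokens @ w_v)
    q, k, v = (np.concatenate(parts) for parts in (queries, keys, values))
    # One attention matrix spanning all streams couples them together.
    weights = softmax(q @ k.T / np.sqrt(d_model))
    mixed = weights @ v
    # Split the result back into per-stream outputs.
    boundaries = np.cumsum([len(t) for t in streams])[:-1]
    return np.split(mixed, boundaries), weights

rng = np.random.default_rng(0)
actions = rng.normal(size=(5, 16))  # hypothetical action tokens
states = rng.normal(size=(5, 16))   # hypothetical physical-state tokens
text = rng.normal(size=(3, 16))     # hypothetical text tokens

outputs, weights = joint_attention([actions, states, text], rng)
print([o.shape for o in outputs], weights.shape)
```

The point of the design is that an action token's update can depend directly on text and state tokens, while each modality still has its own parameters.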
The third paper, DDR, argues that existing approaches introduce what they call "geometric bias" through intermediate kinematic projections. Their solution is a single-stage framework that generates trajectories directly from expert videos using sampling-based Model Predictive Control within a physics simulator.
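Sampling-based MPC is easier to grasp with a toy. The sketch below uses plain random shooting on a one-dimensional point mass: sample many candidate action sequences, roll each through a simulator, execute the first action of the cheapest one, then replan. It is not DDR's method. The `step` "simulator", the cost terms, the sample count, and the horizon are all stand-ins I chose for illustration.

```python
import numpy as np

def step(state, action, dt=0.05):
    """Toy stand-in for a physics simulator: a 1-D point mass."""
    pos, vel = state
    vel = vel + action * dt
    pos = pos + vel * dt
    return np.array([pos, vel])

def rollout_cost(state, actions, reference):
    """Tracking error plus a small effort penalty along one candidate plan."""
    cost = 0.0
    for action, ref in zip(actions, reference):
        state = step(state, action)
        cost += (state[0] - ref) ** 2 + 1e-3 * action ** 2
    return cost

def sampling_mpc(state, reference, rng, num_samples=256, horizon=10):
    """Random shooting: sample plans, simulate each, return the best first action."""
    candidates = rng.uniform(-5.0, 5.0, size=(num_samples, horizon))
    costs = [rollout_cost(state, plan, reference[:horizon]) for plan in candidates]
    return candidates[int(np.argmin(costs))][0]

rng = np.random.default_rng(0)
state = np.array([0.0, 0.0])
reference = np.ones(100)  # hypothetical reference trajectory: hold position 1.0

errors = []
for t in range(60):
    action = sampling_mpc(state, reference[t:], rng)  # replan at every step
    state = step(state, action)
    errors.append(abs(state[0] - reference[t]))

print(f"closest approach to the reference: {min(errors):.3f}")
```

Even in this toy, every control step costs 256 simulated rollouts, which hints at why running this kind of controller on a full humanoid model is expensive.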
Here's what I find interesting. Despite different architectures and training approaches, all three papers share several core assumptions:
Data scale matters. MuGen uses "hours" of human performance data. SCRIPT explicitly tested scaling on 1,200 hours of motion data. DDR works directly from video demonstrations, which are essentially unlimited in supply.
Diffusion-based or generative approaches are central. MuGen uses VQ-VAEs; SCRIPT uses diffusion transformers. Even DDR, while using MPC, frames the problem as generating trajectories rather than optimizing controllers.
The gap between human and robot morphology is a solvable problem. All three papers treat the kinematic mismatch between human bodies and humanoid robots as something that can be learned around rather than engineered around.
Physics simulation is non-negotiable. Each approach validates in physics simulation before any mention of real hardware. SCRIPT uses closed-loop simulations with physical feedback. DDR explicitly runs MPC within a physics simulator.
Look, I've seen enough spec sheets and research papers to recognize a pattern. When multiple teams converge on similar methodologies, it usually means the algorithmic problems are becoming tractable. The hard part shifts elsewhere.
In this case, I think the bottleneck is becoming data, and specifically, the right kind of data.
MuGen's "hours of heterogeneous human performance data" is notably vague. How many hours? What kinds of motions? Collected how? SCRIPT is more specific (1,200 hours from MotionMillion), but that dataset, while large, is not exactly internet-scale. DDR's approach of working directly from monocular video is explicitly motivated by data availability: "a scalable approach for teaching complex skills."
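A quick back-of-the-envelope calculation puts 1,200 hours in perspective. The frame rate below is my assumption, not a figure from the SCRIPT paper, so treat the frame count as an order-of-magnitude estimate only.

```python
HOURS = 1_200       # dataset size quoted above for MotionMillion
ASSUMED_FPS = 30    # assumption: the frame rate is not stated in this article

days = HOURS / 24                    # continuous motion, in days
frames = HOURS * 3600 * ASSUMED_FPS  # total poses at the assumed frame rate

print(f"{days:.0f} days of continuous motion, about {frames:,} frames")
```

That is 50 days of continuous movement: a lot for motion data, but small next to the corpora used to train language and vision models.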
The question none of these papers fully answers is whether current motion datasets capture the full distribution of movements a humanoid might need to perform. Walking, running, dancing, sure. But what about recovering from a stumble while carrying an asymmetric load? Navigating cluttered environments at speed? The long tail of real-world motion is, in a way, infinitely long.
Several things remain unclear from these papers, and I want to be honest about the limitations of drawing conclusions from arXiv preprints:
Real-world validation is sparse. MuGen mentions "deployment" but the abstract focuses on demonstration "through a diverse set of motions." SCRIPT's evaluation appears to be simulation-based ("closed-loop simulations"). DDR explicitly discusses providing references "to RL agents" for training, which suggests sim-to-real transfer is a downstream problem. From my time building hardware, I can tell you that simulation-to-reality gaps in contact-rich locomotion are substantial.
Computational requirements are underspecified. Training diffusion transformers on 1,200 hours of motion data isn't cheap. Running sampling-based MPC in real time on robot hardware is computationally demanding. None of the abstracts mention inference latency or onboard compute requirements. That's a notable omission if these are meant to run on actual robots.
Comparison baselines vary. Each paper compares against different "state-of-the-art" methods. Without head-to-head comparisons on identical benchmarks, it's hard to say which approach actually performs best. The field could use a standardized evaluation protocol.
Despite these caveats, the convergence is meaningful. Three years ago, humanoid locomotion research was fragmented across trajectory optimization, model-free RL, and various hybrid approaches. The fact that multiple teams are now converging on generative models trained on large motion datasets suggests the field has found a productive research direction.
This has implications for the industry. If humanoid locomotion becomes primarily a data problem, then the companies with the best motion datasets (or the best pipelines for collecting them) will have a structural advantage. That's a different competitive landscape than one where algorithmic innovation is the primary differentiator.
It also suggests that the "foundation model" paradigm from language and vision is coming to robotics control. SCRIPT's scaling studies, which report "consistent performance gains with model scaling," are exactly the kind of result that motivated GPT-3 and its successors. Whether robotics will follow the same scaling laws remains to be seen, but researchers are clearly betting in that direction.
I should note that convergence on a methodology doesn't guarantee that methodology is correct. The field has been wrong before. Remember when everyone was sure that explicit kinematic planning was the path forward? Or when model-free RL was going to solve everything?
These generative approaches might hit fundamental limits that aren't visible in current benchmarks. The "geometric bias" that DDR criticizes in other methods might turn out to be a feature, not a bug, when robots need to operate in constrained environments. The massive data requirements might prove impractical for specialized applications.
But for now, the direction is clear. Humanoid locomotion research is converging on generative models, diffusion-based architectures, and data-hungry training regimes. The real test will be whether any of these approaches can produce robots that move reliably in uncontrolled environments.