Humanoid Robots Still Can't Walk Straight When You Push Them. Two New Papers Are Working On That.
A pair of fresh research efforts tackle one of the most stubborn problems in humanoid locomotion: what happens when the real world shoves back.
By
10 hours ago · 7 min read
Think about cruise control. Not the fancy adaptive stuff your new car has, the old-school version from the 80s that just held a fixed throttle regardless of hills. It worked fine on flat highway and fell apart the moment terrain got complicated. That's roughly where humanoid locomotion has been sitting for a while now: policies trained in simulation, tested in clean lab conditions, deployed into a world that does not care about your training distribution. Two new papers out of arXiv this week are trying to fix that, in different ways, and I think both are worth your attention even if neither one is a finished answer.
The core problem, in plain terms, is that humanoid robots moving through real environments run into forces their training never accounted for. A person bumps into them. They carry an asymmetric load. Someone pushes on their torso while they're mid-stride. The existing toolkit for handling this is, honestly, a bit of a mess. You can randomize your training domain broadly and hope the robot learns something general enough to survive. You can build in task-specific force objectives, which work until you change the task. Or you can train a neural estimator that infers forces from motion history, which tends to fall apart on situations it hasn't seen before. None of these are satisfying. All of them involve tradeoffs that researchers have been arguing about for years.
The first paper, posted on arXiv under the title "ADAPT: Analytical Disturbance-Aware Policy Training for Humanoid Locomotion," takes a different angle. Instead of learning to estimate disturbances from data, ADAPT uses an analytical whole-body disturbance observer, meaning it's grounded in actual physics rather than pattern-matching. The observer estimates residual force and torque in real time using the robot's own dynamics model, and it does this without needing dedicated force or torque sensors on the hardware. That last part matters more than it might sound. Sensors add cost, add failure points, add calibration headaches. If you can get useful force estimates from the dynamics alone, you've removed a significant practical obstacle.
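To make "estimate forces from the dynamics alone" a bit more concrete, here is a minimal sketch of a textbook momentum-based disturbance observer on a one-joint pendulum. This is not ADAPT's observer, which the paper describes as whole-body; the pendulum model, the class name `MomentumObserver`, the gain, and every numeric value are assumptions chosen for illustration. The idea is that any change in momentum the dynamics model cannot explain gets attributed to an external torque.

```python
import math


class MomentumObserver:
    """Textbook momentum-based disturbance observer for a 1-DoF pendulum.

    Illustrative only: names and parameters are hypothetical, not from ADAPT.
    """

    def __init__(self, inertia, damping, mgl, gain, dt):
        self.inertia, self.damping, self.mgl = inertia, damping, mgl
        self.gain, self.dt = gain, dt
        self.integral = 0.0  # running integral of modeled torque plus residual
        self.p0 = None       # generalized momentum at the first sample
        self.residual = 0.0  # current estimate of the external torque

    def update(self, q, qd, tau_cmd):
        p = self.inertia * qd
        if self.p0 is None:
            self.p0 = p
        # Residual: momentum change that the model cannot account for.
        self.residual = self.gain * (p - self.p0 - self.integral)
        modeled = tau_cmd - self.damping * qd - self.mgl * math.sin(q)
        self.integral += (modeled + self.residual) * self.dt
        return self.residual


def simulate_push(push_torque=0.5, push_start=1.0, duration=3.0, dt=0.001):
    """Simulate an unactuated pendulum that gets a constant push partway through.

    Returns the observer's estimate just before the push and at the end.
    """
    inertia, damping, mgl = 0.25, 0.1, 4.905  # assumed toy parameters
    obs = MomentumObserver(inertia, damping, mgl, gain=50.0, dt=dt)
    q, qd = 0.3, 0.0
    before = after = 0.0
    for k in range(int(round(duration / dt))):
        t = k * dt
        tau_ext = push_torque if t >= push_start else 0.0
        est = obs.update(q, qd, tau_cmd=0.0)  # observer never sees tau_ext
        if t < push_start:
            before = est
        after = est
        # Ground-truth dynamics, integrated with a simple Euler step.
        qdd = (tau_ext - damping * qd - mgl * math.sin(q)) / inertia
        qd += qdd * dt
        q += qd * dt
    return before, after


before_push, after_push = simulate_push()
```

In this toy setup the estimate stays near zero until the push starts and then settles toward the applied torque, using only joint position, joint velocity, and the commanded torque. A real humanoid adds a floating base, contacts, and model error, which is where the hard part lives.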
The team tested this on a Unitree G1 humanoid, which is one of the more accessible research platforms right now, and the results show ADAPT outperforming a proprioception-only baseline across torso perturbations, standing pushes, and asymmetric hand payloads. Velocity tracking improved even on out-of-distribution disturbances, which is the part that usually falls apart first. There's also an interesting side benefit: because the system can infer disturbances at lower-body joints, you can penalize those inferred forces during training, which nudges the robot toward lighter, more efficient footfalls. Quieter locomotion as an emergent property of better disturbance awareness. I'll admit I didn't expect that.
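For a rough sense of what penalizing inferred forces during training could look like, here is a sketch of a shaped reward. The quadratic form, the function name `shaped_reward`, and the weight are my assumptions; the article's source does not spell out the exact penalty ADAPT uses.

```python
def shaped_reward(tracking_reward, est_joint_torques, penalty_weight=0.01):
    """Subtract a quadratic penalty on estimated lower-body disturbance torques.

    Hypothetical reward shaping: the penalty form and weight are assumptions.
    """
    penalty = penalty_weight * sum(t * t for t in est_joint_torques)
    return tracking_reward - penalty


# Same tracking reward, but the heavier footfall shows up as larger
# estimated disturbance torques and therefore earns a lower reward.
light_step = shaped_reward(1.0, [0.5, 0.5], penalty_weight=0.1)
heavy_step = shaped_reward(1.0, [2.0, 3.0], penalty_weight=0.1)
```

The design point is that the penalty needs no extra sensor: it reuses the observer's estimate, so a policy that stomps is discouraged with no additional hardware.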
Now, I've seen this movie before, and I want to be clear about what "outperforming a baseline" in a controlled experiment does and doesn't tell us. It tells us the approach works better under the specific conditions tested. It doesn't tell us how it performs across every robot platform, every environment, every disturbance type you'd actually encounter in a warehouse or a hospital corridor or someone's home. The authors are careful about this, to their credit. But it's worth keeping in mind as the hype machine starts doing what hype machines do.
The second paper takes a step back from the locomotion problem specifically and asks a more structural question: how do we even evaluate whether hierarchical control systems for humanoids are working? The paper, "HumanoidArena: Benchmarking Egocentric Hierarchical Whole-body Learning" (also on arXiv), introduces a simulation benchmark designed to expose gaps in how high-level and low-level policies talk to each other.
Here's the setup. Hierarchical control in humanoids typically means a high-level policy that figures out what to do (navigate to this point, pick up this object, squeeze through this gap) and a low-level general motion tracker that translates those decisions into actual joint commands. The high-level policy works in the space of intentions and whole-body targets. The low-level tracker works in the space of physics. The interface between them is where things get interesting, and also where things tend to break.
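The split described above can be sketched as a small interface. Everything here is a toy stand-in: `WholeBodyTarget`, `ProportionalTracker`, and `high_level_policy` are hypothetical names, and a real tracker is a learned policy running against a physics simulator, not a proportional rule on a list of numbers.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class WholeBodyTarget:
    """Hypothetical intermediate representation passed between the two levels."""
    joint_positions: List[float]


class MotionTracker(Protocol):
    """Anything that turns a whole-body target into joint velocity commands."""
    def track(self, target: WholeBodyTarget, joints: List[float]) -> List[float]: ...


class ProportionalTracker:
    """Toy low-level tracker: command velocity proportional to position error."""
    def __init__(self, gain: float) -> None:
        self.gain = gain

    def track(self, target: WholeBodyTarget, joints: List[float]) -> List[float]:
        return [self.gain * (t - j) for t, j in zip(target.joint_positions, joints)]


def high_level_policy(goal: List[float]) -> WholeBodyTarget:
    """Stand-in for a learned policy: it simply requests the goal pose."""
    return WholeBodyTarget(joint_positions=list(goal))


def run_episode(tracker: MotionTracker, goal: List[float],
                steps: int = 200, dt: float = 0.02) -> List[float]:
    """Alternate high-level decisions and low-level tracking for a fixed horizon."""
    joints = [0.0] * len(goal)
    for _ in range(steps):
        target = high_level_policy(goal)        # the "what to do" layer
        velocity = tracker.track(target, joints)  # the "how to do it" layer
        joints = [j + v * dt for j, v in zip(joints, velocity)]
    return joints


final_pose = run_episode(ProportionalTracker(gain=5.0), goal=[0.4, -0.2, 1.0])
```

Because `run_episode` only depends on the `MotionTracker` protocol, swapping in a different tracker is a one-line change in code. Whether the high-level policy still behaves well after that swap is a separate question, and it is the one the benchmark is built to ask.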
HumanoidArena specifically targets what the authors call "leg-critical" tasks, which are interactions where you can't just treat the legs as a wheeled base that moves you from point A to point B. We're talking foot placement decisions, balance maintenance during manipulation, posture adjustment under load, whole-body reorientation. Seven tasks in total, designed so that if your lower body coordination is bad, you fail. Full stop. No getting around it by being clever with your arms.
The benchmark evaluates from two angles. First, perturbation-conditioned generalization: does the policy hold up when the task distribution shifts? Second, and this is the part I find more interesting, GMT-conditioned transfer, where GMT is the low-level general motion tracker: can you swap out the tracker without the high-level policy falling apart? The findings here are sort of sobering. Hierarchical control does enable robots to solve diverse leg-critical tasks, which is progress. But performance is strongly conditioned on which tracker you're using, and cross-GMT transfer is described as "fragile." That suggests the high-level policy is learning to exploit specific quirks of the tracker it was trained with, rather than learning something more general.
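One simple way to put a number on that fragility is to compare success when the policy runs with the tracker it was trained on against success after a swap. The sketch below is my own framing, not the paper's metric: the function name `transfer_gap`, the tracker labels, and the success rates are made-up placeholders, not results from HumanoidArena.

```python
def transfer_gap(success):
    """Mean success with the training tracker minus mean success after a swap.

    `success` maps (train_tracker, eval_tracker) pairs to success rates in [0, 1].
    A hypothetical summary statistic, not the benchmark's official metric.
    """
    matched = [v for (train, evl), v in success.items() if train == evl]
    swapped = [v for (train, evl), v in success.items() if train != evl]
    if not matched or not swapped:
        raise ValueError("need at least one matched and one swapped pairing")
    return sum(matched) / len(matched) - sum(swapped) / len(swapped)


# Placeholder numbers for illustration only; NOT results from the paper.
placeholder_results = {
    ("tracker_a", "tracker_a"): 0.9,
    ("tracker_b", "tracker_b"): 0.8,
    ("tracker_a", "tracker_b"): 0.4,
    ("tracker_b", "tracker_a"): 0.3,
}
gap = transfer_gap(placeholder_results)
```

A gap near zero would mean the high-level policy does not much care which tracker sits underneath it. A large positive gap is the "fragile" pattern: the policy works mainly with the tracker it grew up with.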
This raises questions about... well, multiple things. It raises questions about whether intermediate action representations can ever be truly tracker-agnostic, or whether there's always going to be some coupling between what the high-level policy asks for and how the low-level system responds. It raises questions about how you'd ever deploy a system like this in a commercial context where you might need to update components independently. And it raises the broader question of whether simulation benchmarks, however carefully designed, are capturing the right failure modes.
Call me old-fashioned, but I've been watching benchmark papers come and go since the early days of autonomous driving, and the pattern is familiar. You build a benchmark that exposes a real gap. Researchers optimize for the benchmark. The benchmark improves. The real-world gap turns out to be somewhere slightly different than where you were looking. That's not a criticism of HumanoidArena specifically, it's just the nature of the thing. The authors are explicit that this is a simulation-first benchmark and that real-world validation is the next step. Fair enough.
What connects these two papers is that they're both grappling with the same fundamental tension in humanoid robotics: the gap between what works in a controlled training environment and what works when physics gets complicated and unpredictable. ADAPT attacks this from the locomotion side, trying to give individual policies a better physical grounding. HumanoidArena attacks it from the systems architecture side, trying to build evaluation tools that expose where hierarchical systems are brittle.
Neither paper solves the problem. Neither paper claims to. But they're asking better questions than a lot of what I've been reading lately, which tends to be more focused on impressive demos than on understanding failure modes. The ADAPT work in particular, with its emphasis on out-of-distribution robustness and its sensor-free design, feels like it's pointing at something practically deployable rather than just academically interesting. Whether that holds up outside the Unitree G1 test conditions remains unclear, and I'd want to see results on at least two or three other platforms before getting too excited.
The HumanoidArena benchmark will be useful to the extent that people actually use it and don't just optimize for it. That's always the catch with benchmarks. It's too early to say whether it'll become a standard reference point the way some AV evaluation frameworks did, or whether it'll be cited a few times and then superseded by something else in eighteen months.
Here's my read on the bigger picture. The humanoid locomotion field is maturing in the way that self-driving car research matured around 2017 or 2018, meaning the easy demos are done, the hard problems are becoming clearer, and the work is getting less flashy but more rigorous. That's actually a good sign! It means the researchers still in the room are the ones who want to solve the hard problems rather than the ones who wanted to be in a press release.
The young founders building on top of this research are going to need both of these things: better disturbance handling at the policy level, and better tools for evaluating whether their hierarchical systems actually generalize. Neither paper is the last word on either topic. But both are worth reading if you're following this space closely, and I'm keeping an eye on what comes next from both groups.
If you want to argue about any of this, my email's on the about page.