The Quiet Revolution in Robot Action Representations: Why Geometry Might Finally Solve Cross-Embodiment Transfer
A cluster of recent papers suggests we've been thinking about robot learning wrong. The action space itself, not just the policy, deserves first-class treatment.
Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
A robot arm in a lab picks up a mug. Another arm, built by a different company with different joints and different sensors, needs to learn the same task. In the old paradigm, you'd train both from scratch. In the emerging paradigm, you'd hope some magic of scale would let a foundation model figure it out. But a growing body of research suggests there's a third path, one that treats the geometry of actions as a first-class citizen rather than an afterthought.
I've spent the past week reading through five recent papers that, taken together, paint a picture of where robot learning might be heading. The thesis is straightforward, if a bit pedantic (I know I'm being picky here, but the distinction matters): we've been so focused on learning policies that we've neglected to ask whether our action representations are any good. It's like obsessing over your neural network architecture while feeding it poorly normalized data.
To be precise, most robot learning systems treat actions as vectors of joint angles or end-effector positions. You collect demonstrations, you regress to those vectors, you hope for the best. The problem is that this approach conflates several things that should be separate: the intrinsic geometry of a motion, the speed at which it's executed, and the specific embodiment performing it.
A paper on arXiv called "General Covariant Action Modeling" makes this point forcefully. The authors argue that regressing to absolute coordinates violates what physicists call general covariance: roughly, the principle that your representation shouldn't depend on arbitrary choices of coordinate system. When you train a policy to output specific joint angles at specific times, you're baking in execution details that have nothing to do with the task itself.
Their solution, the Generalized Action Manifold (GAM) framework, enforces invariance along two axes. Temporal invariance means separating the spatial path from how fast you traverse it. Geometric invariance means factoring out pose-specific details to extract what they call "canonical world lines." The terminology is physics-heavy, but the intuition is sound: if you want a policy to generalize, you need to disentangle what's essential from what's accidental.
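To make the temporal half of that concrete, here is a minimal sketch of one standard way to separate a path from its timing: resampling by arc length. This is my own illustration, not the GAM authors' method, and the function name and toy trajectories are assumptions.

```python
import numpy as np

def resample_by_arc_length(path, num_points):
    """Resample a (T, D) path so samples are evenly spaced along its length.

    Illustrative only: a generic arc-length reparameterization, not the
    procedure from the GAM paper.
    """
    path = np.asarray(path, dtype=float)
    # Length of each segment between consecutive waypoints.
    segment_lengths = np.linalg.norm(np.diff(path, axis=0), axis=1)
    # Cumulative distance travelled at each waypoint, starting at 0.
    travelled = np.concatenate([[0.0], np.cumsum(segment_lengths)])
    # Evenly spaced distances at which to read the path back out.
    targets = np.linspace(0.0, travelled[-1], num_points)
    columns = [np.interp(targets, travelled, path[:, d]) for d in range(path.shape[1])]
    return np.stack(columns, axis=1)

def straight_line(t):
    """Toy path: a line from (0, 0) to (1, 2), parameterized by t in [0, 1]."""
    return np.stack([t, 2.0 * t], axis=1)

# The same spatial path executed with two different speed profiles.
t_constant = np.linspace(0.0, 1.0, 50)
t_slow_start = t_constant ** 2
constant_speed = straight_line(t_constant)
slow_start = straight_line(t_slow_start)

canonical_a = resample_by_arc_length(constant_speed, 20)
canonical_b = resample_by_arc_length(slow_start, 20)
print(np.allclose(canonical_a, canonical_b))
```

The raw waypoint arrays differ because the two executions spend their time differently, but after resampling they describe the same geometry. The timing could then be modeled separately, which is the disentangling the paragraph above is pointing at.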
This hasn't been replicated yet, and the sample size in their experiments is... well, it's a typical robotics paper sample size. But the conceptual framing is genuinely new, or at least newly formalized.
A different approach to the same problem comes from a paper called PHASOR, which exploits something obvious in hindsight: a lot of robot motion is periodic. Walking, reaching, and manipulation all involve cyclic patterns that repeat with variations.
PHASOR factorizes motion into a "phase manifold" that captures cyclic structure using FFT-parametric coefficients, plus a pose branch for non-periodic details. The interesting result is that anchoring multiple humanoid robots to a shared human-pretrained manifold yields a unified action embedding space that transfers across platforms. The key insight is that motion semantics, not joint configurations, should be the common language.
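For readers who haven't met FFT-derived phase features before, here is a rough sketch of the general idea: summarizing a periodic signal by the amplitude, frequency, phase, and offset of its dominant component. PHASOR's actual parameterization may well differ; the function name and the synthetic joint signal are assumptions of mine.

```python
import numpy as np

def dominant_phase_features(signal, sample_rate_hz):
    """Return (amplitude, frequency_hz, phase_rad, offset) of the strongest cycle.

    Illustrative only: a generic FFT summary of one channel, not the
    PHASOR architecture.
    """
    signal = np.asarray(signal, dtype=float)
    n = signal.size
    offset = signal.mean()
    spectrum = np.fft.rfft(signal - offset)
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate_hz)
    # Skip bin 0 (the constant term) and pick the strongest remaining bin.
    k = int(np.argmax(np.abs(spectrum[1:])) + 1)
    amplitude = 2.0 * np.abs(spectrum[k]) / n
    phase = np.angle(spectrum[k])
    return amplitude, freqs[k], phase, offset

# Synthetic "joint angle": a 2 Hz oscillation sampled at 100 Hz for 2 seconds.
sample_rate_hz = 100.0
t = np.arange(200) / sample_rate_hz
joint_angle = 0.5 + 1.5 * np.cos(2.0 * np.pi * 2.0 * t + 0.3)

amplitude, frequency_hz, phase_rad, offset = dominant_phase_features(joint_angle, sample_rate_hz)
print(amplitude, frequency_hz, phase_rad, offset)
```

Four numbers now describe the whole cycle, and none of them is a joint configuration. That is the appeal: two robots with different kinematics can share a phase description even when their raw joint trajectories look nothing alike.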
What's genuinely new versus incremental? The idea of using phase to structure action spaces isn't novel (people have done this in character animation for decades). But applying it specifically to cross-embodiment transfer in robotics, with the FFT parameterization and the human-pretraining step, that's a meaningful contribution. The cross-embodiment retrieval results are strong, though I'd want to see this tested on more diverse robot morphologies before getting too excited.
Bimanual manipulation is where unstructured action representations really fall apart. When two arms need to coordinate, the space of valid actions becomes geometrically complex in ways that flat vector representations struggle to capture.
A paper on semantic-geometric task representations tackles this by encoding object identities, inter-object relations, and motion histories in a graph structure. The architecture uses a Message Passing Neural Network encoder operating on temporal scene graphs, with a Transformer decoder that conditions on action context to forecast future actions.
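If message passing is unfamiliar, the sketch below shows a single round on a tiny scene graph for one frame. It is a generic mean-aggregation layer in NumPy, not the paper's encoder: the node names, edge list, and identity weights are all assumptions, and the temporal dimension and Transformer decoder are left out entirely.

```python
import numpy as np

def message_passing_step(node_feats, edges, w_self, w_msg):
    """One round of message passing with mean aggregation and a ReLU.

    node_feats: (N, F) array, one feature row per scene node.
    edges: list of (source, destination) index pairs.
    Illustrative only: a generic layer, not the paper's architecture.
    """
    num_nodes = node_feats.shape[0]
    aggregated = np.zeros((num_nodes, w_msg.shape[1]))
    incoming = np.zeros(num_nodes)
    for src, dst in edges:
        # Each edge carries a transformed copy of the source node's features.
        aggregated[dst] += node_feats[src] @ w_msg
        incoming[dst] += 1
    # Average the incoming messages; nodes with none keep a zero message.
    aggregated /= np.maximum(incoming, 1)[:, None]
    return np.maximum(node_feats @ w_self + aggregated, 0.0)

# Hypothetical frame: two hands both related to a cup.
node_names = ["left_hand", "right_hand", "cup"]
node_feats = np.eye(3)          # one-hot identities as stand-in features
edges = [(0, 2), (1, 2)]        # left_hand -> cup, right_hand -> cup
identity = np.eye(3)            # stand-in for learned weight matrices

updated = message_passing_step(node_feats, edges, identity, identity)
print(dict(zip(node_names, updated.tolist())))
```

After one round the cup's feature vector carries information about both hands, while each hand is unchanged because nothing points at it. Stacking rounds, and repeating across frames, is how relational context would accumulate in a representation like the one described above.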
It's worth noting that the benefit of this structured approach over simpler sequence models grows with task variability. For simple, repetitive tasks, you might not need the overhead. But for tasks where action ordering and object involvement vary across executions, the structure pays off. The authors report full task success on two real-robot bimanual tasks, which is... actually pretty good for bimanual work. Most papers in this space report partial success rates.
The clever bit is the decoupling: the encoder learns task-agnostic representations, which means you can reuse it across embodiments by only finetuning the decoder. This is the kind of architectural choice that looks obvious in retrospect but requires genuine insight to identify.
One of the persistent dreams in robot learning is to leverage the vast corpus of human demonstration videos available online. The problem, of course, is that humans and robots don't look the same, don't move the same, and don't perceive the same.
HARP-VLA proposes a human-robot aligned representation learning framework that addresses this mismatch. The approach uses limited paired human-robot demonstrations as "cross-embodiment bridges" while leveraging abundant unpaired videos for dynamics supervision.
The technical contribution is a robot-adapted visual encoder trained with what they call "source-relative pair-discriminative alignment loss." I'll admit the name is a mouthful, but the idea is to adapt robot representations toward human semantics while preserving the ability to discriminate between different demonstration pairs. The results on CALVIN ABC→D (a standard benchmark) show an average length of 4.481, meaning the average number of consecutive tasks completed in a five-task chain, and there's a 7.1% real-world success rate gain over baselines.
Is this the solution to human video pretraining? It's too early to say. The gains are meaningful but not transformative, and the reliance on paired demonstrations (even if limited) might be a bottleneck. But it's a step in a direction that many researchers believe is necessary.
Diffusion models have become popular for action generation, but they have a fundamental problem for robotics: they require multiple sampling steps, which is prohibitive for high-frequency control. If your robot needs to make decisions at 100 Hz, you can't afford 10 or 50 denoising steps per action.
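The arithmetic behind that claim is worth spelling out. The sketch below assumes each denoising step costs one sequential forward pass and that a fresh action is generated every control tick; the 2 ms per-pass figure is a hypothetical number for illustration, not a measurement from any of the papers.

```python
def max_denoising_steps(control_rate_hz, forward_pass_ms, overhead_ms=0.0):
    """How many sequential network passes fit inside one control period."""
    budget_ms = 1000.0 / control_rate_hz      # time available per action
    usable_ms = budget_ms - overhead_ms       # minus sensing/actuation overhead
    return max(0, int(usable_ms // forward_pass_ms))

def max_control_rate_hz(num_steps, forward_pass_ms):
    """Fastest control rate sustainable with a given number of denoising steps."""
    return 1000.0 / (num_steps * forward_pass_ms)

# Hypothetical: 100 Hz control and 2 ms per forward pass.
print(max_denoising_steps(100, 2.0))    # prints 5
print(max_control_rate_hz(50, 2.0))     # prints 10.0
```

At 100 Hz the whole budget is 10 ms, so under these assumptions five steps fit and fifty do not; fifty steps would cap the loop at 10 Hz.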
Recent work on one-step formulations addresses the latency issue, but at a cost: you lose the iterative correction that makes diffusion models robust. A paper on Implicit Drifting Policy tries to have it both ways.
The core insight is that the intermediate trajectory evolution in diffusion provides action correction: in effect, the model is steering toward the valid action manifold at each step. If you go to one step, you lose this correction. But explicitly estimating the "drifting field" that provides this correction is mathematically ill-posed due to sparse demonstrations.
IDP's solution extracts what they call "conditional expert geometry" from local variation of observation-similar expert actions, comparing it against a global reference geometry. This lets them isolate condition-specific constraints and weight a scalar potential objective accordingly. The approach maintains adherence to valid action manifolds without explicit vector field estimation.
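To give a feel for what "local variation versus a global reference" could mean, here is a deliberately crude sketch. It is emphatically not IDP's objective: I'm assuming a plain k-nearest-neighbor lookup and a per-dimension variance ratio, and the function name and toy dataset are mine.

```python
import numpy as np

def local_to_global_variance_ratio(query_obs, demo_obs, demo_actions, k=10):
    """Per action dimension: variance among the k most observation-similar
    demos, divided by variance over all demos.

    Illustrative only: a stand-in for the idea of comparing local expert
    geometry with a global reference, not the IDP formulation.
    """
    distances = np.linalg.norm(demo_obs - query_obs, axis=1)
    nearest = np.argsort(distances)[:k]
    local_var = demo_actions[nearest].var(axis=0)
    global_var = demo_actions.var(axis=0)
    return local_var / (global_var + 1e-12)

# Toy demonstrations: action dim 0 is pinned down by the observation,
# action dim 1 alternates regardless of the observation.
demo_obs = np.linspace(0.0, 1.0, 200)[:, None]
demo_actions = np.stack([demo_obs[:, 0], np.tile([-1.0, 1.0], 100)], axis=1)

ratio = local_to_global_variance_ratio(np.array([0.5]), demo_obs, demo_actions, k=10)
print(ratio)
```

A ratio near zero flags a dimension the experts agree on given this observation (a condition-specific constraint), while a ratio near one flags a dimension where the observation tells you nothing extra. Weighting an objective by something like this is the rough spirit of the paragraph above, though the paper's actual construction is more involved.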
I'm genuinely uncertain whether this will hold up across diverse manipulation tasks. The theoretical framing is elegant, but robotics has a way of humbling elegant theories. The real-world manipulation results are promising, but the sample of tasks is small.
Taken together, these papers suggest a research direction that I find compelling: treating action representations as first-class design targets, with downstream policy quality emerging from representation quality. This is a shift from the dominant paradigm of the past few years, which has been to scale up data and model size and hope that representation quality follows.
But there are open questions. Big ones.
First, how do these approaches compose? If you want phase-anchored representations that are also geometrically invariant and transfer from human videos, can you combine PHASOR with GAM and HARP-VLA? Or do these frameworks conflict in ways that aren't obvious from reading the individual papers?
Second, what's the right level of structure? The bimanual work suggests that more structure helps when tasks are variable, but structure has costs: it's harder to learn, harder to scale, and might impose assumptions that don't hold for novel tasks. We don't know yet where the sweet spot is.
Third, and this is the question that keeps me up at night, will any of this matter once we have enough robot data? The argument for structured representations is strongest in low-data regimes. But if companies like Tesla and Figure are collecting millions of hours of robot operation data, maybe brute-force learning will win anyway. The history of AI suggests that scale often beats structure. But robotics might be different. We don't have the equivalent of the internet to scrape for robot experience.
The honest answer is that it remains unclear which approach will dominate. What I can say is that the research community is, finally, taking action representations seriously as a design problem rather than an afterthought. That's progress.
There's a philosophical point lurking here that's worth making explicit. For years, the dominant narrative in AI has been that representations should be learned, not designed. Let the neural network figure out what features matter. This has worked spectacularly in vision and language.
But actions are different. Actions have physics. They have geometry. They have causal structure that you can reason about independently of any particular task. The papers I've discussed here are, in various ways, encoding that structure explicitly rather than hoping the network will discover it.
This is, in a way, a return to older ideas in robotics. Before deep learning, people spent decades thinking about motion primitives, task frames, and coordinate-free representations. Some of that work was too rigid, too hand-engineered, too dependent on assumptions that didn't hold. But the underlying intuition, that action spaces have structure worth exploiting, was sound.
What's new is the combination of learned and structured approaches. You can learn a phase manifold while enforcing that it has cyclic structure. You can learn geometric representations while enforcing covariance properties. You can learn from human videos while explicitly modeling the human-robot domain gap.
This hybrid approach feels, to me, like the right direction. Though I've been wrong before about research directions, and I'll probably be wrong again.
The work is early. The sample sizes are small. The benchmarks are limited. But the questions being asked are the right ones. That's more than I can say for a lot of what crosses my desk.