Hierarchical Tokenization Is Having a Moment in Robot Learning, But Let's Be Precise About What That Means
Three recent papers converge on multi-level action quantization for imitation learning. The approach is promising, but the field's enthusiasm may be outpacing the evidence.
Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Hierarchical tokenization for robot action learning is genuinely interesting work, and I want to be clear about that before I start picking it apart. The core idea (breaking continuous robot actions into discrete tokens at multiple levels of abstraction) addresses a real limitation in how we train robots from demonstrations. But as three recent papers converge on variations of this approach, I find myself wanting to slow down and ask what we actually know versus what we're hoping is true.
The most technically complete of the recent work comes from a team presenting HiST-AT, a hierarchical spatiotemporal action tokenizer for in-context imitation learning, published on arXiv. The architecture uses two successive levels of vector quantization: a lower level that assigns input actions to fine-grained subclusters, and a higher level that maps those subclusters to coarser clusters. What distinguishes this from prior tokenization schemes is the incorporation of temporal information. The system simultaneously recovers input actions and their associated timestamps, which matters because robot manipulation is fundamentally about when you do things, not just what you do.
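To make the two-level structure concrete, here is a minimal sketch of nearest-neighbour quantization with a fine and a coarse codebook. It is not the HiST-AT implementation: the function names, the fixed (rather than learned) codebooks, and the choice to treat the timestamp as one extra feature appended to the action vector are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(codebook: Sequence[Vector], x: Vector) -> int:
    """Index of the codebook entry closest to x (squared Euclidean distance)."""
    def sq_dist(code: Vector) -> float:
        return sum((c - v) ** 2 for c, v in zip(code, x))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def two_level_quantize(
    actions: Sequence[Vector],
    timestamps: Sequence[float],
    fine_codebook: Sequence[Vector],
    coarse_codebook: Sequence[Vector],
) -> List[Tuple[int, int]]:
    """Map each (action, timestamp) pair to a (coarse_id, fine_id) token pair."""
    tokens = []
    for action, t in zip(actions, timestamps):
        # Assumption: time is simply one more feature of the input vector.
        x = list(action) + [t]
        # Lower level: pick the fine-grained subcluster.
        fine_id = nearest_code(fine_codebook, x)
        # Higher level: map that subcluster's centroid to a coarser cluster.
        coarse_id = nearest_code(coarse_codebook, fine_codebook[fine_id])
        tokens.append((coarse_id, fine_id))
    return tokens


def reconstruct(
    tokens: Sequence[Tuple[int, int]], fine_codebook: Sequence[Vector]
) -> List[Tuple[List[float], float]]:
    """Recover an approximate (action, timestamp) pair from each token."""
    recovered = []
    for _, fine_id in tokens:
        code = fine_codebook[fine_id]
        recovered.append((list(code[:-1]), code[-1]))
    return recovered
```

In a real tokenizer the codebooks would be trained jointly with an encoder and decoder. The sketch only shows the lookup structure: the fine token carries the detail needed to recover both the action and its timestamp, while the coarse token groups nearby subclusters.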
To be precise, the claim is that this hierarchical approach outperforms non-hierarchical counterparts on multiple simulation and real robotic manipulation benchmarks. The authors report state-of-the-art performance in in-context imitation learning, which is a specific and somewhat narrow claim. In-context imitation learning refers to the ability to learn new tasks from a small number of demonstrations at inference time, without retraining the model. This is valuable, but it's worth noting that "state-of-the-art" in this subfield may not translate to state-of-the-art in robot manipulation more broadly.
A parallel line of work, HiMAQ (hierarchical macro action quantization), takes a different angle on the same structural insight. This paper, also recently posted to arXiv, focuses on making reinforcement learning agents more human-like. The motivation here is interpretability and reliability rather than raw performance. The argument is that most RL agents are reward-driven in ways that produce behaviors humans find confusing or unpredictable, which limits their usefulness in collaborative settings.
The HiMAQ approach encodes human demonstrations into macro actions using the same two-level vector quantization structure: the lower level maps to fine-grained subaction clusters, and the higher level aggregates those into action clusters. The evaluations on D4RL benchmarks show improvements in human-likeness scores while maintaining comparable or better success rates. I know I'm being picky here, but "human-likeness scores" is doing a lot of work in that sentence. The paper uses specific metrics for this, but whether those metrics actually capture what makes behavior feel human-like to human observers remains an open question. Quantifying human-likeness is notoriously difficult, and different metrics often disagree with each other.
What I find genuinely new in the HiMAQ work is the demonstration that the hierarchical approach generalizes across multiple RL algorithms (IQL, SAC, and RLPD). This suggests the benefit isn't specific to a particular learning paradigm but rather reflects something more fundamental about how action spaces should be structured. That's a stronger claim than showing improvement on one algorithm, though the sample size of three algorithms is still small.
The third piece of recent work takes a somewhat different approach, targeting the memory problem in robot learning. Notes-to-Self, published on arXiv, addresses the fact that many manipulation tasks are non-Markovian: the right action depends on what happened earlier in the episode, not only on the current observation. The paper argues that existing vision-language-action models are largely "stateless", which makes them struggle with tasks that require this kind of memory. The solution is a language scratchpad that lets the model write notes to itself about object positions, plans, and progress toward subgoals.
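The scratchpad idea can be sketched as a control loop in which a stateless policy receives its own earlier notes alongside each observation. This is an illustration only, not the Notes-to-Self implementation: `run_episode`, `toy_policy`, the string observations, and the `max_notes` cap are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A policy sees the current observation plus its notes so far,
# and returns (action, new_note). An empty note means "write nothing".
Policy = Callable[[str, List[str]], Tuple[str, str]]


def run_episode(
    policy: Policy, observations: Sequence[str], max_notes: int = 8
) -> Tuple[List[str], List[str]]:
    """Run a stateless policy whose only memory is a text scratchpad."""
    notes: List[str] = []
    actions: List[str] = []
    for obs in observations:
        action, note = policy(obs, notes)
        actions.append(action)
        if note:
            # Keep only the most recent notes so the context stays bounded.
            notes = (notes + [note])[-max_notes:]
    return actions, notes


def toy_policy(obs: str, notes: List[str]) -> Tuple[str, str]:
    """Toy non-Markovian task: remember where the ball was before it is hidden."""
    seen_prefix, note_prefix = "ball under ", "ball is under "
    if obs.startswith(seen_prefix):
        return "wait", note_prefix + obs[len(seen_prefix):]
    if obs == "cups covered":
        for note in reversed(notes):
            if note.startswith(note_prefix):
                return "lift " + note[len(note_prefix):], ""
        return "lift left cup", ""  # no memory available, so this is a guess
    return "wait", ""
```

With the observations `["ball under right cup", "cups covered"]`, the second observation alone does not say where the ball is. The note written at the first step is what lets the policy choose the right cup.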
This is a different kind of hierarchy than the action tokenization approaches, but it shares the insight that flat representations are insufficient for complex manipulation. The scratchpad essentially creates a temporal hierarchy through language, allowing the model to maintain state across long horizons. The evaluations on ClevrSkills, MemoryBench, and a real-world pick-and-place task show significant improvements for both non-recurrent and recurrent models.
It's worth noting that the ClevrSkills environment, while useful for controlled evaluation, involves relatively simple objects and scenes compared to real-world clutter. The real-world pick-and-place task is a stronger validation, but the paper doesn't provide extensive details on the diversity of objects or the failure modes. This isn't a criticism unique to this paper; it's endemic to the field. We evaluate on the environments we have, and those environments are often cleaner than the real world.
So what connects these three papers? The underlying intuition is that robot actions have structure at multiple scales, and learning algorithms that respect that structure outperform those that don't. A reaching motion isn't just a sequence of joint angles; it's a high-level intention (reach toward the cup) instantiated through mid-level motor primitives (extend arm, orient wrist) executed as low-level control signals. Hierarchical tokenization attempts to recover something like this structure from data, without requiring explicit engineering of the hierarchy.
This is, in a way, a return to ideas from classical robotics and motor control. The notion of motor primitives and hierarchical control has been around for decades. What's different now is the claim that these hierarchies can be learned from data rather than hand-designed, and that they can be represented as discrete tokens compatible with modern sequence models. The connection to large language models is not coincidental. If you can tokenize actions the way we tokenize words, you can potentially leverage the same architectures and training approaches that have proven so effective for language.
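As a rough illustration of what "compatible with modern sequence models" can mean in practice, the sketch below flattens pairs of coarse and fine action tokens into a single integer sequence over one shared vocabulary. The interleaved layout and the offset scheme are assumptions chosen for simplicity, not something taken from the papers discussed here.

```python
from typing import List, Sequence, Tuple


def flatten_tokens(tokens: Sequence[Tuple[int, int]], num_coarse: int) -> List[int]:
    """Interleave (coarse_id, fine_id) pairs into one flat token sequence.

    Coarse ids keep the range [0, num_coarse); fine ids are shifted up by
    num_coarse so the two levels never collide in the shared vocabulary.
    """
    sequence: List[int] = []
    for coarse_id, fine_id in tokens:
        sequence.append(coarse_id)
        sequence.append(num_coarse + fine_id)
    return sequence


def unflatten_tokens(sequence: Sequence[int], num_coarse: int) -> List[Tuple[int, int]]:
    """Invert flatten_tokens, recovering the (coarse_id, fine_id) pairs."""
    if len(sequence) % 2 != 0:
        raise ValueError("expected an even number of tokens (coarse, fine pairs)")
    return [
        (sequence[i], sequence[i + 1] - num_coarse)
        for i in range(0, len(sequence), 2)
    ]
```

A sequence model trained on such a stream predicts the coarse token first and then the fine token conditioned on it, much as a language model predicts one word piece after another.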
But I want to pump the brakes a bit on the enthusiasm here. The evaluations in these papers, while rigorous within their scope, are limited in ways that matter. The simulation benchmarks (D4RL, ClevrSkills, various manipulation suites) are useful but not representative of the full complexity of real-world robotics. The real-world evaluations are typically on single tasks or small task sets, with limited variation in objects, lighting, and other conditions. We don't know yet how well these approaches transfer across robots, across task domains, or across the kinds of distribution shifts that occur in deployment.
There's also a question about what the right level of hierarchy is. The papers I've discussed use two levels, but why two? Is this optimal, or just convenient? Some manipulation tasks might benefit from three or four levels; others might be better served by a single level with a larger codebook. The choice of hierarchy depth is often presented as an architectural decision rather than something learned from data, which feels like a missed opportunity.
The temporal aspects remain underexplored. HiST-AT incorporates timestamps, which is good, but the relationship between temporal structure and spatial structure in actions is complex. Human motor control involves intricate timing dependencies that current tokenization schemes may not capture. When you pour water from a pitcher, the timing of the tilt, the duration of the pour, and the speed of the correction are all coupled in ways that a simple timestamp recovery objective might not learn.
I also want to flag a concern about evaluation metrics. "State-of-the-art" claims in robot learning are notoriously difficult to compare across papers because of differences in evaluation protocols, task definitions, and success criteria. The field would benefit from more standardized benchmarks with held-out test sets that aren't available to researchers during development. Right now, it's too easy to tune methods to specific benchmarks in ways that don't generalize.
What I'd want to see next is a systematic comparison of these hierarchical approaches against each other and against simpler baselines on a common benchmark suite. The papers I've discussed use different evaluation setups, making it hard to know whether HiST-AT, HiMAQ, or Notes-to-Self would perform better on any given task. A unified evaluation would also help identify whether the benefits of hierarchy are additive (combining HiST-AT's temporal tokenization with Notes-to-Self's scratchpad, for example) or whether they're solving the same underlying problem in different ways.
There's also the question of computational cost. Hierarchical approaches typically require more parameters and more computation than flat approaches. The papers don't always report inference times or memory requirements in ways that allow direct comparison. For robotics applications where real-time control matters, this could be significant.
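The kind of reporting I have in mind is not elaborate. A minimal sketch, assuming a hypothetical `step_fn` that stands in for one forward pass of a policy, would time each call and compare the tail latency against the control period:

```python
import statistics
import time
from typing import Any, Callable, Dict, Iterable


def measure_latency(
    step_fn: Callable[[Any], Any], inputs: Iterable[Any], control_hz: float
) -> Dict[str, Any]:
    """Time step_fn on each input and compare p95 latency to the control period."""
    budget_s = 1.0 / control_hz  # time available per control step
    samples = []
    for x in inputs:
        start = time.perf_counter()
        step_fn(x)
        samples.append(time.perf_counter() - start)
    if not samples:
        raise ValueError("need at least one input to measure latency")
    samples.sort()
    # Nearest-rank style 95th percentile, clamped to the last sample.
    p95_s = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return {
        "median_s": statistics.median(samples),
        "p95_s": p95_s,
        "budget_s": budget_s,
        "meets_budget": p95_s <= budget_s,
    }
```

Reporting the tail rather than the mean matters for control: a policy that is fast on average but occasionally misses the control period can still cause problems on a real robot.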
Despite these caveats, I think the convergence on hierarchical representations is meaningful. When multiple research groups independently arrive at similar architectural choices, it often indicates that there's something real being captured. The specific implementations differ (two-level vector quantization, language scratchpads, temporal versus spatial emphasis), but the underlying insight is shared: robot actions have structure that flat representations fail to exploit.
The connection to in-context learning is particularly intriguing. If hierarchical tokenization enables robots to learn new tasks from a handful of demonstrations, that addresses one of the key bottlenecks in robot deployment. Collecting large datasets for every task is impractical; being able to teach a robot a new task by showing it a few examples is much closer to how we'd actually want to use these systems.
But we're not there yet. The "few examples" in current in-context learning work is typically 10 to 50 demonstrations, which is better than thousands but still requires careful data collection. The tasks are often relatively simple by human standards. And the generalization to truly novel objects and environments remains limited.
I'll end with a methodological note. All three papers I've discussed are available on arXiv, which means they haven't gone through peer review (and those listed as replacements have been revised since their initial posting). This is increasingly common in machine learning and robotics, and it's not necessarily a problem. But it does mean we should treat the claims with appropriate caution. Peer review isn't perfect, but it does catch some errors and overstatements. The rapid pace of publication in this field sometimes means that ideas get ahead of the evidence supporting them.
Hierarchical action tokenization is a promising direction. The recent papers advance our understanding of how to structure action representations for robot learning. But the field has a tendency toward hype cycles, and I'd rather see steady progress than breathless announcements of breakthroughs. The work is good. Let's just be precise about what it shows and what remains to be demonstrated.