Affordances Are Having a Moment in Robotics Research, and It's About Time
Four recent papers tackle the same fundamental question: how do robots understand what objects are for? The answers are converging in interesting ways.
Photo credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Affordance-based manipulation is not a new idea. J.J. Gibson introduced the concept in 1979, and roboticists have been trying to operationalize it ever since. But something has shifted in the past few months. A cluster of recent papers, all tackling category-level object manipulation, suggests we might finally have the computational tools to make affordances practical for real robots.
To be precise, I'm looking at four papers here: GIFT from a team working on functional maps, AffordGen which leverages 3D generative models, AFUN which bills itself as a step toward an "affordance foundation model," and SpaceTools which takes a different approach via tool-augmented reasoning. They're not all doing the same thing, but they're all circling the same problem: how do you get a robot to manipulate objects it has never seen before?
The answer, increasingly, is affordances. But the devil is in the details.
Let me be careful about novelty claims, because this field has a habit of rediscovering old ideas with new neural network architectures. The core insight that robots should reason about what objects afford (grasping, pouring, opening) rather than memorizing specific object instances goes back decades.
What's genuinely new is the scale and generalization. AFUN, for instance, builds what the authors call a "large-scale standardized data pipeline" that converts heterogeneous data sources (robot demonstrations, human videos, simulations, real-world scans) into a shared affordance schema. This is not trivial. Previous affordance datasets were small, task-specific, and rarely transferred across embodiments. The reported numbers are striking: +23.9 mean gIoU improvement over baselines for affordance segmentation, with gains of 12.7 to 61.3 percent on contact-point prediction hit rates.
I know I'm being picky here, but I want to note that "outperforms all baselines" claims depend heavily on which baselines you choose. The paper evaluates across 8 test sets from 4 benchmarks, which is more thorough than typical, but independent replication would strengthen these results.
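To make the AFUN metrics concrete, here is a minimal sketch of how numbers like these are typically computed. It rests on two assumptions of mine, not on anything stated in the paper: that "gIoU" follows the convention common in segmentation benchmarks (the mean of per-image IoU scores), and that a contact-point "hit" means the predicted point lands inside the ground-truth affordance region. The function names are hypothetical.

```python
def iou(pred, gt):
    """IoU of two binary masks given as equal-shaped nested lists of 0/1."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds, gts):
    """Assumed reading of gIoU: average the per-image IoU scores."""
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(preds)


def hit_rate(points, gts):
    """Fraction of predicted (row, col) points that fall inside the GT mask."""
    hits = sum(1 for (r, c), gt in zip(points, gts) if gt[r][c])
    return hits / len(points)


gt_mask = [[1, 0], [0, 0]]
loose_pred = [[1, 1], [0, 0]]  # covers the target plus one extra pixel
print(mean_iou([loose_pred, gt_mask], [gt_mask, gt_mask]))
print(hit_rate([(0, 0), (1, 1)], [gt_mask, gt_mask]))
```

Because a per-image mean weights every image equally, small objects count as much as large ones, which is one reason comparisons across papers only make sense when the metric definition matches.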
GIFT takes a different approach that I find mathematically elegant. Rather than learning affordances from massive datasets, it derives geometric representations from a single human demonstration using the Functional Maps framework. This allows skill transfer across objects of similar topologies even when shapes differ significantly. The use of screw linear interpolation (ScLERP) for generating smooth robot paths is incremental over prior trajectory generation work, but the combination with functional maps for cross-object transfer appears to be a novel contribution.
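For readers who haven't met screw interpolation before, here is a minimal sketch of the general technique, not GIFT's implementation. It interpolates between two rigid poses by moving along a single screw axis: rotate about the axis and slide along it at a constant rate. The pose format (a unit quaternion `(w, x, y, z)` plus a translation) and all function names are my own choices for illustration.

```python
import math


def qmul(a, b):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)


def qconj(q):
    return (q[0], -q[1], -q[2], -q[3])


def rotate(q, v):
    """Rotate 3-vector v by unit quaternion q."""
    return qmul(qmul(q, (0.0, *v)), qconj(q))[1:]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def sclerp(pose0, pose1, t):
    """Pose at fraction t in [0, 1] along the screw motion from pose0 to pose1."""
    (q0, p0), (q1, p1) = pose0, pose1
    # Relative motion, expressed in the frame of pose0.
    q_rel = qmul(qconj(q0), q1)
    if q_rel[0] < 0:  # q and -q are the same rotation; take the shorter one
        q_rel = tuple(-c for c in q_rel)
    p_rel = rotate(qconj(q0), tuple(b - a for a, b in zip(p0, p1)))
    s = math.sqrt(sum(c * c for c in q_rel[1:]))
    if s < 1e-9:  # no rotation: the screw degenerates to a straight line
        q_t, p_t = (1.0, 0.0, 0.0, 0.0), tuple(t * c for c in p_rel)
    else:
        theta = 2.0 * math.atan2(s, q_rel[0])
        u = tuple(c / s for c in q_rel[1:])             # screw axis direction
        d = sum(a * b for a, b in zip(p_rel, u))        # slide along the axis
        p_perp = tuple(p - d * ui for p, ui in zip(p_rel, u))
        k = 0.5 / math.tan(theta / 2.0)
        # A point on the axis, solving (I - R) c = p_perp.
        c = tuple(0.5 * pp + k * x for pp, x in zip(p_perp, cross(u, p_perp)))
        half = 0.5 * t * theta
        q_t = (math.cos(half), *(math.sin(half) * ui for ui in u))
        rc = rotate(q_t, c)
        p_t = tuple(ci - ri + t * d * ui for ci, ri, ui in zip(c, rc, u))
    # Map the interpolated relative motion back into the world frame.
    return qmul(q0, q_t), tuple(a + b for a, b in zip(p0, rotate(q0, p_t)))


h = math.sqrt(0.5)
start = ((1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0))
goal = ((h, 0.0, 0.0, h), (1.0, 1.0, 0.0))  # 90 degrees about z, shifted in xy
print(sclerp(start, goal, 0.5))
```

Unlike interpolating rotation and translation separately, the halfway pose here lies on an arc around the screw axis, so position and orientation stay coupled the way they would for a tool pivoting about a hinge.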
This is where things get interesting, and where I think the field needs more honest discussion about tradeoffs.
AFUN and AffordGen both assume access to large-scale 3D data and powerful vision foundation models. AffordGen explicitly leverages "powerful 3D generative models and vision foundation models" to generate diverse training trajectories. This is the foundation model playbook: scale up data, scale up compute, and generalization emerges. The results support this (zero-shot generalization to unseen objects in real-world experiments), but the computational requirements remain unclear from the abstracts.
GIFT operates in a fundamentally different regime. It requires only a single human demonstration. No massive datasets, no pre-trained foundation models (at least not for the core transfer mechanism). The tradeoff is that it's limited to objects of "similar topologies or categories." What counts as similar enough? The paper presumably addresses this, but it's a crucial constraint that determines practical applicability.
SpaceTools takes yet another approach, focusing on spatial reasoning augmentation for Vision Language Models. The key insight here is that VLMs are qualitatively good at visual understanding but struggle with "metrically precise spatial reasoning." Rather than building affordance representations from scratch, SpaceTools teaches VLMs to coordinate external tools (depth estimators, segmentation models, pose estimators) through reinforcement learning.
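A small example shows why metric precision needs external tools. Two pixels that look close together in an image can be far apart in the world, and only a depth estimate resolves that. The sketch below uses the standard pinhole camera model to turn pixels plus depth into 3D points; it is not SpaceTools' API, and the intrinsics, pixel coordinates, and depth values are made-up stand-ins for what a depth estimator and a segmentation model might return.

```python
import math


def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole model: pixel (u, v) with metric depth to a 3D camera-frame point."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def distance_between(pixel_a, depth_a, pixel_b, depth_b, intrinsics):
    """Metric distance (in metres) between two pixels with known depths."""
    point_a = backproject(*pixel_a, depth_a, *intrinsics)
    point_b = backproject(*pixel_b, depth_b, *intrinsics)
    return math.dist(point_a, point_b)


# Hypothetical camera intrinsics: fx, fy, cx, cy (in pixels).
intrinsics = (500.0, 500.0, 320.0, 240.0)
mug, kettle = (320, 240), (420, 240)  # object centres, 100 pixels apart

# Same pixels, different depth readings, very different answers.
print(distance_between(mug, 1.0, kettle, 1.0, intrinsics))
print(distance_between(mug, 1.0, kettle, 2.0, intrinsics))
```

A VLM looking only at the image has no reliable way to tell these two cases apart, which is the gap that tool outputs are meant to close.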
The Double Interactive Reinforcement Learning (DIRL) framework is the technical contribution here. It addresses a real limitation: prior work on tool-augmented VLMs relied on handcrafted prompting or fixed tool pipelines. The reported +12% improvement over vanilla supervised fine-tuning and +16% over standard RL on RoboSpatial benchmarks suggests the approach works, though I'd want to see ablations on which components matter most.
None of these papers is a silver bullet, and I appreciate when authors are upfront about constraints.
GIFT's reliance on functional maps means it's fundamentally a shape-matching approach. It transfers skills between objects that share topological structure. A mug and a differently-shaped mug? Probably fine. A mug and a bowl? Unclear. The "similar topologies or categories" constraint is doing a lot of work, and real-world robustness to truly novel object categories remains to be demonstrated.
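To see why topology matters so much, it helps to write down the standard functional maps formulation from the geometry-processing literature. This is the textbook version, which may differ from the exact variant GIFT uses, and the notation is mine rather than the paper's.

```latex
% Assumed notation: shapes M and N, each with a truncated Laplace-Beltrami
% eigenbasis (\Phi_M, \Phi_N) of k functions and eigenvalue matrices
% \Lambda_M, \Lambda_N. A function f on M and its counterpart g on N are
% represented by coefficient vectors a and b.
\[
  f \approx \Phi_M \mathbf{a}, \qquad
  g \approx \Phi_N \mathbf{b}, \qquad
  \mathbf{b} = C\,\mathbf{a}
\]
% The k x k matrix C is estimated from corresponding descriptor functions,
% with coefficients stacked as columns of A (on M) and B (on N). The second
% term encourages C to commute with the Laplacians; \alpha is a weight.
\[
  C^{\star} = \arg\min_{C}\;
  \lVert C A - B \rVert_F^{2}
  + \alpha\,\lVert C \Lambda_M - \Lambda_N C \rVert_F^{2}
\]
```

The map works in the space of low-frequency functions on each surface, so it is most meaningful when the two shapes have comparable eigenbases. A handle changes the spectrum of a surface in a way a bowl has no counterpart for, which is one plausible reading of where the "similar topologies" constraint comes from.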
AffordGen's dependence on 3D generative models introduces its own failure modes. The quality of generated training data depends on the quality of the underlying 3D models and the semantic correspondence of keypoints. If the generative model produces geometrically plausible but functionally incorrect objects, the learned policy could inherit those errors. The paper claims "high success rates" but absolute numbers and failure case analysis would be informative.
AFUN's ambition to be a "foundation model for functionality understanding" is admirable, but it's a step toward that goal rather than the goal itself. The paper acknowledges this in its title. The data pipeline that unifies heterogeneous sources is impressive engineering, but standardization always involves lossy compression. What affordance information is lost in translation from, say, human video demonstrations to the shared schema?
SpaceTools is limited by the capabilities of its constituent tools. If the depth estimator fails, or the segmentation model hallucinates, the VLM's reasoning degrades. The paper demonstrates real-world manipulation with a 7-DOF robot, which is encouraging, but the benchmark performance (RoboSpatial-Home, BLINK, BOP-ASK) may not fully capture the messiness of unstructured environments.
Four papers from different research groups, using different technical approaches, all arrive at affordance-centric representations for manipulation. This is not a coincidence. It reflects a growing consensus that object-centric reasoning (what can I do with this thing?) is more tractable than pixel-level imitation (copy exactly what the human did).
The practical implications are significant. If robots can reason about affordances, they need fewer demonstrations per task. They generalize better to novel objects. They can potentially learn from human videos rather than requiring expensive teleoperation data.
But I want to inject some skepticism here. We've seen affordance-based manipulation papers before, and the field has not yet produced a general-purpose manipulation system that works reliably outside controlled settings. What's different this time?
Partly it's the foundation model era. AFUN and AffordGen explicitly leverage pre-trained vision models that encode semantic knowledge about objects. This is a genuine capability boost that wasn't available five years ago.
Partly it's better benchmarks. The papers evaluate on multiple diverse test sets rather than cherry-picked demonstrations. This makes the results more credible (though still not definitive).
And partly it's the combination of geometric and semantic reasoning. GIFT's functional maps capture geometric structure. SpaceTools' tool augmentation adds metric precision. AFUN's joint prediction of "where to interact" and "how to interact" bridges perception and action. These aren't just neural networks memorizing patterns; they're incorporating structured inductive biases about how objects work.
Several open questions remain, and I think the field would benefit from addressing them directly.
First, cross-paper evaluation. These four approaches use different benchmarks and baselines. A unified evaluation on shared tasks would clarify which methods work best under which conditions. The RoboSpatial benchmark used by SpaceTools seems like a reasonable starting point, but GIFT's functional map approach might require different evaluation criteria.
Second, failure mode analysis. When do these methods fail? Are the failure modes similar (all struggle with deformable objects, say) or complementary (GIFT fails on topology changes while AffordGen fails on texture ambiguity)? Understanding failures is often more informative than celebrating successes.
Third, computational requirements. Foundation model approaches require substantial compute for training and inference. GIFT's single-demonstration transfer is computationally lighter but more constrained. For real-world deployment, these tradeoffs matter. None of the abstracts provide clear compute budgets.
Fourth, long-horizon tasks. All four papers focus on relatively short manipulation primitives. Real-world tasks (making coffee, tidying a room) require chaining many such primitives. It remains unclear whether affordance representations compose well over extended task horizons.
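A back-of-the-envelope calculation shows why chaining is a separate problem from single-primitive success. The sketch assumes each primitive succeeds independently with the same probability, which is a simplification of mine (real failures are often correlated), and the success rates are illustrative, not taken from any of the four papers.

```python
def chain_success(per_step_success, n_steps):
    """Probability that n independent steps all succeed."""
    return per_step_success ** n_steps


# Illustrative per-primitive success rates, not reported results.
for rate in (0.99, 0.95, 0.90):
    print(rate, round(chain_success(rate, 10), 3))
```

Under these assumptions a primitive that works 95 percent of the time completes a ten-step task only about 60 percent of the time, so strong single-skill results do not by themselves imply reliable long-horizon behavior.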
Finally, and this is perhaps most important, we need more real-world validation. GIFT, AffordGen, and AFUN all claim real-world experiments, but the scale and diversity of those experiments are not clear from the abstracts alone. SpaceTools demonstrates manipulation on a 7-DOF robot, which is concrete but limited. The gap between benchmark performance and robust real-world deployment remains substantial in manipulation research.
I started by saying affordances are having a moment, and I stand by that. But moments pass. The question is whether this cluster of papers represents genuine progress toward general-purpose manipulation, or another cycle of incremental advances that don't quite add up to practical systems.
I'm cautiously optimistic, which is unusual for me. The convergence across multiple research groups suggests the community is coalescing around productive representations. The integration of foundation model capabilities with structured geometric reasoning feels like the right combination. And the emphasis on zero-shot generalization to novel objects addresses the fundamental limitation of prior imitation learning work.
But we've been here before. The field has a habit of declaring breakthroughs that don't survive contact with unstructured environments. These papers are promising, but they're not proof. What we need now is sustained effort on the hard problems: robustness, generalization, and real-world deployment at scale.
The affordance framework gives us a language for talking about manipulation that's more abstract than pixels and more concrete than task descriptions. That's valuable. Whether it's sufficient for general-purpose robotics remains to be seen. I suspect we'll know more in two to three years, once these methods have been stress-tested beyond their original benchmarks.
For now, I'd recommend reading AFUN for the most ambitious vision, GIFT for the most elegant mathematics, AffordGen for the generative data augmentation approach, and SpaceTools for the tool-augmented reasoning angle. They're all worth your time, even if (especially if) you're skeptical about the hype.