The Affordance Revolution: How Robots Are Learning Where and How to Grab Things
A cluster of new research papers suggests we're finally cracking the problem of teaching robots to manipulate objects they've never seen before, though the field still has significant hurdles to clear.
Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Think about how you pick up a mug. You don't consciously calculate grip positions or plan a trajectory. You just see the handle, understand it's for grasping, and your hand knows what to do. This seemingly trivial act, which toddlers master by age two, has been one of robotics' most stubborn unsolved problems. A robot trained to pick up a specific blue mug often fails catastrophically when presented with a red one, or a mug with a different handle shape, or (heaven forbid) a teapot. This is the generalisation problem, and it has plagued manipulation research for decades.
In the past few weeks, four papers have landed on arXiv that collectively suggest we may be turning a corner. They all attack the same fundamental question from different angles: how do you teach a robot to understand what parts of an object are for, rather than just what the object looks like? The technical term is "affordance," a concept borrowed from ecological psychology, and it's becoming the central organising principle for a new generation of manipulation systems.
The most ambitious of these efforts is AFUN, which its authors describe as "a step towards an affordance foundation model." The framing is deliberately cautious (I appreciate that), but the results are genuinely impressive. Given a single RGB-D image and a natural language task description, AFUN predicts both where to interact with an object (a segmentation mask) and how to interact with it (a 3D motion curve). The key insight is treating these as a unified prediction problem rather than separate stages. Previous approaches typically handled localisation and motion planning as distinct modules, which created brittle handoff points where errors could compound.
What makes AFUN interesting, beyond the architecture, is the data pipeline. The authors built what they call a "standardised affordance schema" that converts heterogeneous data sources (robot demonstrations, human videos, simulations, real-world scans) into a common format with language labels, masks, and object-centric motion annotations. This is the kind of unsexy infrastructure work that actually enables progress. The benchmark results are striking: improvements of 23.9 and 26.3 points in mean gIoU and cIoU over baselines across eight test sets. More importantly, they demonstrate real-world robot manipulation without fine-tuning for specific embodiments, which suggests the representations are genuinely transferable.
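For readers unfamiliar with the two segmentation metrics, here is a minimal sketch. It assumes AFUN follows the definitions common in language-guided segmentation work: gIoU as the mean of per-image IoUs, and cIoU as cumulative intersection over cumulative union. The function name `giou_ciou`, the toy masks, and the convention for empty masks are my own illustration, not taken from the paper.

```python
import numpy as np

def giou_ciou(pred_masks, gt_masks):
    """Return (gIoU, cIoU) for paired lists of binary masks.

    Assumed definitions (not confirmed against the AFUN paper):
      gIoU = mean of per-image IoU
      cIoU = total intersection / total union over the whole set
    """
    ious = []
    inter_total = 0
    union_total = 0
    for pred, gt in zip(pred_masks, gt_masks):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        inter = int(np.logical_and(pred, gt).sum())
        union = int(np.logical_or(pred, gt).sum())
        # Convention chosen here: two empty masks count as a perfect match.
        ious.append(inter / union if union > 0 else 1.0)
        inter_total += inter
        union_total += union
    giou = float(np.mean(ious))
    ciou = inter_total / union_total if union_total > 0 else 1.0
    return giou, ciou

# Toy example: one small object half-right, one large object exactly right.
gt_small = np.zeros((4, 4)); gt_small[0, 0] = 1        # 1 pixel
pred_small = np.zeros((4, 4)); pred_small[0, 0:2] = 1  # 2 pixels, 1 correct
gt_large = np.zeros((4, 4)); gt_large[:2, :] = 1       # 8 pixels
pred_large = gt_large.copy()                           # perfect prediction

giou, ciou = giou_ciou([pred_small, pred_large], [gt_small, gt_large])
print(giou, ciou)  # gIoU = 0.75, cIoU = 0.9
```

The toy numbers show why papers report both: gIoU weights every image equally, so the small-object miss drags it down, while cIoU is dominated by large masks.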
It's worth noting that AFUN hasn't been replicated yet, and the paper is fresh (v1 on arXiv). The benchmark comparisons are against methods the authors selected, which is standard practice but always warrants some caution. I'd want to see independent evaluations before declaring victory.
A complementary approach comes from AffordGen, which tackles the data diversity problem from a different angle. The core observation is that imitation learning methods are only as good as their training demonstrations, and collecting diverse demonstrations is expensive. AffordGen uses 3D generative models and vision foundation models to synthesise new manipulation trajectories by finding semantic correspondences between objects. If you have a demonstration of opening one drawer, the system can generate plausible demonstrations for opening geometrically different drawers by mapping "meaningful keypoints" across the 3D meshes.
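To make the idea of mapping keypoints across objects concrete, here is a minimal sketch of one plausible mechanism: nearest-neighbour matching in a feature space. This is not AffordGen's actual pipeline. It assumes each surface point already has a feature vector from some vision foundation model; the function `transfer_keypoints` and the toy features are hypothetical.

```python
import numpy as np

def transfer_keypoints(src_feats, tgt_feats, src_keypoint_idx):
    """Map keypoints from a source object to a target object.

    src_feats: (n_src, d) per-point features on the demonstrated object
    tgt_feats: (n_tgt, d) per-point features on the new object
    src_keypoint_idx: indices of the keypoints on the source object
    Returns (target indices, cosine similarity of each match).
    """
    src = np.asarray(src_feats, dtype=float)
    tgt = np.asarray(tgt_feats, dtype=float)
    # Normalise rows so the dot product is cosine similarity.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = src[src_keypoint_idx] @ tgt.T  # (n_keypoints, n_tgt)
    best = sim.argmax(axis=1)
    scores = sim[np.arange(len(best)), best]
    return best, scores

# Toy features: three target points that resemble source points 2, 1 and 0.
src_feats = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
tgt_feats = [[0, 0, 2], [0.1, 3, 0], [5, 0, 0.2]]
matches, scores = transfer_keypoints(src_feats, tgt_feats, [0, 2])
print(matches)  # source keypoints 0 and 2 land on target points 2 and 0
```

Returning the similarity scores alongside the matches matters in practice: a low score is a cheap signal that the correspondence may not be trustworthy for that object.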
I know I'm being picky here, but the phrase "meaningful keypoints" is doing a lot of work in that paper. The method relies on the assumption that vision foundation models have learned robust semantic correspondences, which is probably true for common objects but may break down for unusual geometries or materials. The authors report "high success rates" and "zero-shot generalisation to truly unseen objects," but the specific numbers vary significantly across object categories. Some categories hit 90%+ success, others struggle below 60%. This heterogeneity is actually informative (it tells us where the method's assumptions hold), but it's easy to miss if you only read the abstract.
The third paper in this cluster, GIFT, takes a more mathematically grounded approach using Functional Maps, a framework from computational geometry. The idea is to represent object interactions as functions on surfaces, then use the Functional Maps machinery to transfer these functions between objects with similar topologies. If you've demonstrated how to pour from one pitcher, GIFT can map that skill to a differently-shaped pitcher by finding correspondences in the functional space rather than the geometric space.
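For the curious, the core of the Functional Maps framework fits in a few lines. The sketch below is the textbook formulation, not GIFT's code: express functions on each surface in a small basis, fit a matrix `C` that aligns corresponding descriptor functions, then push any new function (say, a contact heatmap) through `C`. The basis matrices here are random stand-ins for Laplace-Beltrami eigenfunctions, and all names are hypothetical.

```python
import numpy as np

def fit_functional_map(phi_src, phi_tgt, desc_src, desc_tgt):
    """Least-squares functional map C (k x k) from source to target.

    phi_*:  (n_points, k) basis functions sampled on each surface
    desc_*: (n_points, d) descriptor functions assumed to correspond
    """
    A = np.linalg.pinv(phi_src) @ desc_src  # source coefficients (k, d)
    B = np.linalg.pinv(phi_tgt) @ desc_tgt  # target coefficients (k, d)
    # Solve C A = B in the least-squares sense, i.e. A^T C^T = B^T.
    C_T, *_ = np.linalg.lstsq(A.T, B.T, rcond=None)
    return C_T.T

def transfer_function(f_src, phi_src, phi_tgt, C):
    """Map a function on the source surface onto the target surface."""
    coeffs = np.linalg.pinv(phi_src) @ f_src
    return phi_tgt @ (C @ coeffs)

# Toy setup: the target is the source with shuffled vertices and a basis
# whose columns have arbitrary sign flips (as real eigenbases do).
rng = np.random.default_rng(0)
n, k, d = 50, 6, 12
phi_src = np.linalg.qr(rng.standard_normal((n, k)))[0]
perm = rng.permutation(n)
signs = np.array([1, -1, 1, 1, -1, 1])
phi_tgt = phi_src[perm] * signs
desc_src = phi_src @ rng.standard_normal((k, d))
desc_tgt = desc_src[perm]

C = fit_functional_map(phi_src, phi_tgt, desc_src, desc_tgt)
heatmap_src = phi_src @ rng.standard_normal(k)  # e.g. a contact region
heatmap_tgt = transfer_function(heatmap_src, phi_src, phi_tgt, C)
print(np.allclose(heatmap_tgt, heatmap_src[perm]))  # True
```

Note that the map is recovered without ever using the vertex permutation directly: the correspondence lives entirely in the small matrix `C`.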
What is genuinely novel here is narrow but real: previous skill transfer methods typically required objects to be geometrically similar, not just topologically similar. A tall thin pitcher and a short wide pitcher have the same topology (both are containers with spouts) but very different geometry. GIFT handles this by working in a more abstract representation space. The authors also incorporate screw interpolation for generating smooth robot paths, which addresses a practical issue that many papers ignore: transferred skills often produce jerky or physically implausible motions.
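Screw interpolation itself is standard rigid-body kinematics, and a short sketch shows why it yields smooth paths: rotation and translation advance together along a single screw axis instead of being blended separately. The code below is the generic SE(3) log/exp formulation under my own naming, not GIFT's implementation, and it assumes the relative rotation between the two poses is below 180 degrees.

```python
import numpy as np

def _log_se3(T):
    """Twist of a 4x4 pose as (W, v): skew-symmetric rotation part, linear part."""
    R, p = T[:3, :3], T[:3, 3]
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-9:  # pure translation
        return np.zeros((3, 3)), p.copy()
    W = theta / (2.0 * np.sin(theta)) * (R - R.T)
    coeff = (1.0 - theta * np.sin(theta) / (2.0 * (1.0 - np.cos(theta)))) / theta**2
    V_inv = np.eye(3) - 0.5 * W + coeff * (W @ W)
    return W, V_inv @ p

def _exp_se3(W, v):
    """Inverse of _log_se3: build a 4x4 pose from a twist (W, v)."""
    theta = np.sqrt(0.5 * np.sum(W * W))  # rotation angle encoded in W
    T = np.eye(4)
    if theta < 1e-9:
        T[:3, 3] = v
        return T
    a = np.sin(theta) / theta
    b = (1.0 - np.cos(theta)) / theta**2
    c = (theta - np.sin(theta)) / theta**3
    W2 = W @ W
    T[:3, :3] = np.eye(3) + a * W + b * W2
    T[:3, 3] = (np.eye(3) + b * W + c * W2) @ v
    return T

def screw_interpolate(T0, T1, s):
    """Pose at fraction s in [0, 1] along the screw motion from T0 to T1."""
    W, v = _log_se3(np.linalg.inv(T0) @ T1)
    return T0 @ _exp_se3(s * W, s * v)

# Example: quarter turn about z combined with a shift in x and z.
T0 = np.eye(4)
T1 = np.eye(4)
T1[:3, :3] = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
T1[:3, 3] = [1.0, 0.0, 0.5]
halfway = screw_interpolate(T0, T1, 0.5)
print(np.round(halfway, 3))
```

Halfway along this path the gripper has turned 45 degrees and risen a quarter of a unit along the axis, which is the constant-rate behaviour that avoids jerky motion.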
The limitation here is that GIFT requires objects to have "similar topologies or categories," which is a meaningful constraint. It won't help you transfer skills from manipulating rigid objects to deformable ones, or from tabletop manipulation to in-hand manipulation. The paper validates on "diverse real-world environments," but the diversity is within a relatively narrow band of tabletop pick-and-place style tasks.
The fourth paper, SpaceTools, approaches the problem from the vision-language model side rather than the manipulation side. The core insight is that VLMs are good at qualitative visual understanding ("that's a cup on a table") but bad at metrically precise spatial reasoning ("the cup is 23cm from the edge"). SpaceTools addresses this by teaching VLMs to use external tools (depth estimators, segmentation models, pose estimators) through a two-phase reinforcement learning framework they call Double Interactive RL.
The teaching phase combines demonstrations from single-tool specialists with traces from frontier models using all tools. The exploration phase then refines multi-tool coordination through continued RL. This is a clever way to manage the combinatorial explosion of multi-tool reasoning, which has limited previous RL approaches to single-tool settings. The results on spatial understanding benchmarks are strong: +12% over vanilla supervised fine-tuning and +16% over standard RL baselines on RoboSpatial.
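To see why a depth tool closes the gap between "that's a cup on a table" and "the cup is 23cm from the edge," consider a minimal sketch of the geometry involved. It uses the standard pinhole camera model to turn two pixels plus their depth readings into a metric distance. The intrinsics and pixel values are made up for illustration, and nothing here reflects SpaceTools' actual tool interfaces.

```python
import numpy as np

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Pixel (u, v) with depth in metres -> 3D point in the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

def metric_distance(pixel_a, pixel_b, depth_a, depth_b, intrinsics):
    """Euclidean distance in metres between two pixels with known depth."""
    point_a = backproject(*pixel_a, depth_a, **intrinsics)
    point_b = backproject(*pixel_b, depth_b, **intrinsics)
    return float(np.linalg.norm(point_a - point_b))

# Hypothetical camera intrinsics (focal lengths and principal point, pixels).
intrinsics = {"fx": 500.0, "fy": 500.0, "cx": 320.0, "cy": 240.0}

# Cup centre and table edge, both one metre from the camera.
cup_pixel, edge_pixel = (320, 240), (435, 240)
distance = metric_distance(cup_pixel, edge_pixel, 1.0, 1.0, intrinsics)
print(f"{distance:.2f} m")  # 0.23 m
```

The arithmetic is trivial once depth is available, which is rather the point: the hard part is getting a model to decide when to call which tool, and to chain the outputs correctly.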
What's unclear from the SpaceTools paper is how the tool-use patterns generalise beyond the specific tool set used in training. The authors demonstrate real-world manipulation with a 7-DOF robot, which is encouraging, but the manipulation tasks shown are relatively constrained. Whether the learned tool coordination transfers to, say, mobile manipulation or bimanual tasks remains an open question.
Taken together, these four papers suggest a convergent evolution toward affordance-centric representations. AFUN builds a foundation model that predicts affordances directly. AffordGen uses affordances to generate diverse training data. GIFT transfers affordance-like functional representations between objects. SpaceTools augments VLMs with tools that can extract affordance-relevant spatial information. They're all circling the same fundamental insight: robots need to understand what objects are for, not just what they look like.
This is genuinely new compared to, say, five years ago, when the dominant paradigm was end-to-end learning from pixels to actions. That approach worked in narrow settings but failed to generalise. The affordance framing provides a more structured intermediate representation that seems to enable better transfer. Whether this represents a true paradigm shift or just incremental progress over prior work on semantic grasping and task-oriented manipulation, well, it's too early to say definitively. The benchmark improvements are substantial, but benchmarks have a way of overstating real-world performance.
There are also some conspicuous gaps in this research cluster. None of these papers seriously address deformable object manipulation, which has different affordance structures than rigid objects. None of them handle tool use in the sense of using objects as tools (as opposed to using software tools, which SpaceTools does). And none of them grapple with the social affordances that matter for human-robot interaction: understanding that a door handle affords opening, but also that you shouldn't open someone else's door without permission.
The sample sizes in the real-world experiments are also, frankly, small. AFUN shows a handful of manipulation scenarios. GIFT validates on "diverse environments" but doesn't quantify diversity. AffordGen's zero-shot generalisation claims are based on limited object categories. This is standard for robotics papers (real-world experiments are expensive and time-consuming), but it means we should hold our conclusions loosely.
What I'd want to see next is systematic comparison of these methods on shared benchmarks, particularly benchmarks that stress-test generalisation to truly out-of-distribution objects and environments. The field has historically suffered from every paper using its own evaluation setup, making it difficult to assess relative progress. The AFUN paper's use of multiple existing benchmarks is a step in the right direction.
I'd also want to see more analysis of failure modes. When do affordance-based methods fail? Are the failures systematic (suggesting fundamental limitations) or idiosyncratic (suggesting engineering problems to solve)? The papers report success rates but rarely characterise the failures in detail.
The practical implications are significant if these methods mature. Manufacturing robots that can handle part variability without reprogramming. Assistive robots that can manipulate household objects they weren't explicitly trained on. Warehouse robots that can pick novel items from the first encounter. These have been perpetually "five years away" for decades, but the affordance-centric approach feels like it might actually close the gap.
Or it might not. Robotics has a long history of promising paradigms that worked beautifully in controlled settings and fell apart in deployment. The real test will come when these methods encounter the chaos of actual human environments: objects in unexpected poses, partial occlusions, adversarial lighting, surfaces covered in clutter. The benchmark numbers are encouraging, but benchmarks are not the world.
For now, the most honest assessment is that affordance-based manipulation has moved from "interesting research direction" to "probably the right framework." The specific implementations will evolve, the architectures will change, but the core insight (that robots need functional understanding, not just perceptual recognition) seems likely to persist. That's progress, even if we're still years away from robots that can actually clean up after themselves.