The Affordance Revolution: How Robots Are Learning Where and How to Grab Things

A cluster of new research papers suggests we're finally cracking the problem of teaching robots to manipulate objects they've never seen before, though the field still has significant hurdles to clear.

By Aisha Patel

1 hour ago · 8 min read

Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).

Think about how you pick up a mug. You don't consciously calculate grip positions or plan a trajectory. You just see the handle, understand it's for grasping, and your hand knows what to do. This seemingly trivial act, which toddlers master by age two, has been one of robotics' most stubborn unsolved problems. A robot trained to pick up a specific blue mug often fails catastrophically when presented with a red one, or a mug with a different handle shape, or (heaven forbid) a teapot. This is the generalisation problem, and it has plagued manipulation research for decades.

In the past few weeks, four papers have landed on arXiv that collectively suggest we may be turning a corner. All four attack the same fundamental question from different angles: how do you teach a robot to understand what the parts of an object are for, rather than just what the object looks like? The technical term is "affordance," a concept borrowed from ecological psychology, and it's becoming the central organising principle for a new generation of manipulation systems.

The most ambitious of these efforts is AFUN, which its authors describe as "a step towards an affordance foundation model." The framing is deliberately cautious (I appreciate that), but the results are genuinely impressive. Given a single RGB-D image and a natural language task description, AFUN predicts both where to interact with an object (a segmentation mask) and how to interact with it (a 3D motion curve). The key insight is treating these as a unified prediction problem rather than separate stages. Previous approaches typically handled localisation and motion planning as distinct modules, which created brittle handoff points where errors could compound.
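To make the "unified prediction" idea concrete, here is a minimal sketch of what such an input and output might look like as plain data types. Every name here (`AffordanceQuery`, `AffordancePrediction`, `matches_frame`) is hypothetical and invented for illustration; none of it is AFUN's actual interface, and the metre units and 2D-list image layout are assumptions.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[int, int]  # (row, column)


@dataclass
class AffordanceQuery:
    """Hypothetical input: one RGB-D frame plus a task description."""
    rgb: List[List[Tuple[int, int, int]]]  # H x W colour pixels
    depth: List[List[float]]               # H x W depth values (metres assumed)
    task: str                              # e.g. "open the drawer"


@dataclass
class AffordancePrediction:
    """Hypothetical output: the 'where' and the 'how' in a single object."""
    mask: List[List[bool]]       # where: True for pixels to interact with
    motion_curve: List[Point3D]  # how: ordered 3D waypoints

    def contact_pixels(self) -> List[Pixel]:
        # Collect (row, column) for every pixel the mask marks as interactable.
        return [(r, c) for r, row in enumerate(self.mask)
                for c, on in enumerate(row) if on]

    def curve_length(self) -> float:
        # Sum of straight-line distances between consecutive waypoints.
        pts = self.motion_curve
        return sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))


def matches_frame(query: AffordanceQuery, pred: AffordancePrediction) -> bool:
    # The mask should have the same height and width as the depth image.
    same_rows = len(pred.mask) == len(query.depth)
    same_cols = all(len(m) == len(d) for m, d in zip(pred.mask, query.depth))
    return same_rows and same_cols


# Toy 2 x 3 frame and a made-up prediction, purely for illustration.
query = AffordanceQuery(
    rgb=[[(0, 0, 0)] * 3 for _ in range(2)],
    depth=[[0.5] * 3 for _ in range(2)],
    task="open the drawer",
)
pred = AffordancePrediction(
    mask=[[False, True, True], [False, False, False]],
    motion_curve=[(0.0, 0.0, 0.5), (0.0, 0.0, 0.4), (0.0, 0.1, 0.4)],
)
```

The point of bundling the mask and the curve in one return type is that there is no handoff between a "where" module and a "how" module for errors to slip through; whether AFUN structures its outputs this way internally is not something the paper summary here confirms.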

What makes AFUN interesting, beyond the architecture, is the data pipeline. The authors built what they call a "standardised affordance schema" that converts heterogeneous data sources (robot demonstrations, human videos, simulations, real-world scans) into a common format with language labels, masks, and object-centric motion annotations. This is the kind of unsexy infrastructure work that actually enables progress. The benchmark results are striking: improvements of 23.9 and 26.3 points in mean gIoU and cIoU, respectively, over baselines across eight test sets. More importantly, the authors demonstrate real-world robot manipulation without fine-tuning for specific embodiments, which suggests the representations are genuinely transferable.
