The Attention Problem Nobody Wants to Talk About

Two new papers tackle the same bottleneck in vision transformers, and it's a sign that the field's scaling strategy is hitting a wall.

By Mark Kowalski

1 hour ago · 6 min read

Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).

Most of the coverage I've seen on transformer efficiency focuses on the exciting stuff: the flashy demos, the "look what our model can do" videos. But here's what those press releases don't mention: the attention mechanism that makes these models work is also what's strangling them. And two recently released papers suggest the research community knows it, even if the marketing departments don't.

I've seen this movie before. Back in the early 2000s, everyone was throwing compute at problems without asking whether the underlying architecture could scale. Then reality hit. We're approaching that moment again with vision transformers, and the smart money is on efficiency, not just capability.

The quadratic problem

Here's the thing about attention in transformers (and I'm going to oversimplify because this isn't a textbook): every token needs to look at every other token. That's great for understanding context, terrible for scaling. The number of token pairs grows with the square of the input length, so if you double your input, the attention compute roughly quadruples. It's basic math, but somehow it keeps surprising people.
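To put numbers on that, here is a back-of-the-envelope sketch in Python. It only counts query-key pairs, which is the part of full self-attention that grows quadratically; the function name is mine, and real costs also depend on head count, embedding size, and the kernel in use.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Count the query-key pairs that full self-attention has to score.

    Every token attends to every token (itself included), so the count
    is num_tokens squared.
    """
    return num_tokens * num_tokens


if __name__ == "__main__":
    base = attention_pair_count(1_000)
    for n in (1_000, 2_000, 4_000):
        pairs = attention_pair_count(n)
        # Ratio against the 1,000-token baseline shows the quadratic growth.
        print(f"{n} tokens -> {pairs} pairs ({pairs // base}x the baseline)")
```

Each doubling of the token count multiplies the pair count by four, which is the whole problem in one line.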

One team (the paper is on arXiv and doesn't name an institution prominently) just released work they call "Good Token Hunting." The name's a bit cute for my taste, but the idea is solid. Instead of letting every query attend to every key and value token, they restrict the interactions. Fewer tokens, less compute, hopefully similar results.

Their approach uses a two-stage framework. First, inter-frame selection picks which frames matter. They found diversity-based selection works best, which makes intuitive sense: you want coverage of the whole scene, not five slightly different views of the same corner. Second, intra-frame selection throws out redundant tokens within those frames, guided by attention entropy. The results? They claim over 85% acceleration for scenes with 500 images while maintaining or improving baseline performance.
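For readers who think in code, here is a rough sketch of what a two-stage selection like this could look like. It is not the paper's implementation. I'm assuming greedy farthest-point sampling as the diversity criterion, and I'm assuming that tokens with low-entropy (sharply focused) attention are the ones worth keeping; the article's sources don't pin down either detail here, and all function names are hypothetical.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_diverse_frames(frame_features, k):
    """Stage 1 (assumed): greedy farthest-point sampling over frames.

    Expects k >= 1 and a non-empty list. Starts from frame 0 (an arbitrary
    choice) and repeatedly adds the frame farthest from everything chosen
    so far. Returns the sorted indices of the chosen frames.
    """
    chosen = [0]
    min_dist = [squared_distance(f, frame_features[0]) for f in frame_features]
    while len(chosen) < min(k, len(frame_features)):
        candidates = [i for i in range(len(frame_features)) if i not in chosen]
        nxt = max(candidates, key=lambda i: min_dist[i])
        chosen.append(nxt)
        # Refresh each frame's distance to its nearest chosen frame.
        for i, f in enumerate(frame_features):
            min_dist[i] = min(min_dist[i], squared_distance(f, frame_features[nxt]))
    return sorted(chosen)


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def select_tokens_by_entropy(attention_rows, keep):
    """Stage 2 (assumed): keep the `keep` tokens with the lowest entropy."""
    order = sorted(range(len(attention_rows)),
                   key=lambda i: attention_entropy(attention_rows[i]))
    return sorted(order[:keep])


if __name__ == "__main__":
    # Toy 2-D "features": two near-duplicate pairs plus one distinct view.
    frames = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 5.0]]
    print(select_diverse_frames(frames, 3))

    # Toy attention rows, each summing to 1.
    rows = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25],
            [0.70, 0.10, 0.10, 0.10], [0.40, 0.30, 0.20, 0.10]]
    print(select_tokens_by_entropy(rows, 2))
```

In the toy data, the frame selector skips the near-duplicate views and the token selector drops the rows whose attention is spread almost evenly. Whether low or high entropy marks a token as redundant is exactly the kind of design choice you'd want to check against the paper.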

That last part, the "improving baseline performance," is interesting. It suggests that maybe, just maybe, all those tokens weren't helping in the first place. The models were doing a lot of unnecessary work. Call me old-fashioned, but I find that both encouraging and slightly embarrassing for the field.

A different angle on the same wall

Meanwhile, another group tackled the same fundamental problem from a different direction. Their framework, DySta, appears in a separate arXiv paper focused on Vision-Language-Action models, the kind of thing you'd use for robotic control.

Their insight is that not everything in a video changes between frames. The static stuff (background, furniture, whatever isn't moving) doesn't need to be reprocessed constantly. So they disentangle visual inputs into static and dynamic tokens, keep one copy of the static stuff, and only update when necessary through what they call a "recache gate."
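Here is a toy version of that idea, to make the mechanics concrete. Everything in it is my own assumption rather than DySta's actual design: the class name, the scalar "patch" values, the per-patch change threshold, and the rule that the gate fires when more than half the patches have changed.

```python
class StaticDynamicCache:
    """Toy static/dynamic split with a recache gate (hypothetical design).

    Each frame is a list of per-patch scalar values. Patches that moved
    more than `change_threshold` from the cached copy count as dynamic.
    If more than `recache_fraction` of the patches are dynamic, the gate
    fires and the cached copy is replaced with the current frame.
    """

    def __init__(self, change_threshold=0.1, recache_fraction=0.5):
        self.change_threshold = change_threshold
        self.recache_fraction = recache_fraction
        self.cached = None  # the single stored copy of the static content

    def step(self, patches):
        """Return (dynamic_indices, recached) for one incoming frame."""
        if self.cached is None:
            # First frame: nothing cached yet, so process and store it all.
            self.cached = list(patches)
            return list(range(len(patches))), True
        dynamic = [
            i for i, (old, new) in enumerate(zip(self.cached, patches))
            if abs(new - old) > self.change_threshold
        ]
        recached = len(dynamic) > self.recache_fraction * len(patches)
        if recached:
            self.cached = list(patches)
        return dynamic, recached


if __name__ == "__main__":
    cache = StaticDynamicCache()
    print(cache.step([0.2, 0.2, 0.8, 0.8]))  # first frame fills the cache
    print(cache.step([0.2, 0.2, 0.8, 0.3]))  # one patch moved
    print(cache.step([0.9, 0.9, 0.1, 0.3]))  # scene changed, gate fires
```

The payoff is in the second call: only one patch out of four needs fresh processing, and the other three ride on the cached copy until the scene changes enough to trip the gate.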

Sources

Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers · arXiv, cs.RO (Robotics)
Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement · arXiv, cs.RO (Robotics)
