Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Why can't robots see glass?
It's a question I've been asking since the early 2000s, when I watched a very expensive industrial arm repeatedly fail to pick up a beaker in a lab demo. The engineers blamed the lighting. Then the sensor. Then the software. Twenty years later, we've got robots doing backflips and having conversations, but hand one a wine glass and watch the magic disappear.
Three papers dropped this week that suggest we might finally be getting somewhere. And look, I've seen enough "breakthrough" announcements to fill a landfill, but this batch actually tackles the problem in ways that could work outside a controlled lab. Call me old-fashioned, but I care more about whether something ships than whether it publishes.
Transparent objects break depth sensors. That's the short version. The longer version involves physics that would put most readers to sleep, but basically: when light hits glass, it refracts and reflects in ways that make standard depth cameras see nonsense. They'll report a glass as being farther away than it actually is, or not there at all, or somehow floating in mid-air.
This isn't a new problem. It's been a known limitation since depth cameras became cheap enough to put on robots. But the robotics industry, in its infinite wisdom, spent the last decade focusing on more exciting challenges (autonomous vehicles! humanoids! chatbots that can code!) while warehouse robots continued to struggle with anything see-through.
I've seen this movie before. Remember when everyone was chasing self-driving cars while basic industrial automation problems went unsolved? Same energy.
The first of the three papers, Trans2Occ (posted on arXiv), takes what I'd call the "just use a regular camera" approach. Instead of trying to fix depth sensors, the authors skip them entirely. A single RGB image goes in, a voxel-space occupancy prediction comes out. The robot essentially learns to infer 3D shape from 2D appearance, which is sort of how humans do it if you think about it.
The clever bit is their training pipeline. They built a simulation system that generates paired images and occupancy labels across thousands of variations in material and lighting, then showed that the model transfers to real robots without fine-tuning. That last part matters. Sim-to-real transfer has been the graveyard of many promising approaches.
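For readers who haven't met this kind of pipeline, here is a minimal sketch of the randomization step, under loud assumptions. It is not the Trans2Occ authors' code. The parameter names, the ranges, and the two-entry material table are all mine, chosen only to illustrate the idea of sampling many scene variations before rendering. A real pipeline would hand each sample to a renderer and export the occupancy label from the known 3D model.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class SceneParams:
    material: str             # which transparent material to render
    refractive_index: float   # index of refraction for that material
    light_intensity: float    # relative brightness of the main light
    light_azimuth_deg: float  # direction the light comes from
    camera_distance_m: float  # distance from camera to object


# Hypothetical material table; values are typical textbook indices of refraction.
MATERIALS = {"glass": 1.52, "acrylic": 1.49}


def sample_scene(rng: random.Random) -> SceneParams:
    """Draw one random scene configuration (all ranges are illustrative)."""
    material = rng.choice(sorted(MATERIALS))
    return SceneParams(
        material=material,
        refractive_index=MATERIALS[material],
        light_intensity=rng.uniform(0.2, 2.0),
        light_azimuth_deg=rng.uniform(0.0, 360.0),
        camera_distance_m=rng.uniform(0.3, 1.2),
    )


def sample_dataset(n: int, seed: int = 0) -> List[SceneParams]:
    """Draw n scene configurations reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(n)]
```

Calling `sample_dataset(1000)` gives a thousand configurations to render. The fixed seed is a deliberate choice: if the model fails on a particular scene, you want to be able to regenerate that exact scene.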
The second paper, ActMVS, goes after a related but distinct problem: active scene reconstruction using only a monocular camera. This is for robots and drones that need to build maps while navigating, without the weight and cost penalty of depth sensors. They're claiming performance competitive with RGB-D methods on the Replica dataset, which, if it holds up in messier environments, would be genuinely useful for lightweight inspection drones.
The third, AFUN, is more ambitious and therefore more likely to disappoint in practice (but what do I know). They're building what they call an "affordance foundation model" that predicts not just where to interact with an object but how to interact with it. Give it an RGB-D image and a language description like "pour water" and it outputs a functional mask plus a 3D motion curve.
Their numbers look impressive: a 23.9-point improvement in mean gIoU over baselines, and 12.7 to 61.3 percent hit-rate gains on contact-point prediction. They're also claiming real-world deployment without fine-tuning for specific robot embodiments, which is exactly the kind of claim that sounds great in a paper and falls apart in a factory.
Here's the thing about transparent object manipulation: it's not sexy, but it's everywhere. Pharmaceutical packaging. Food and beverage. Laboratory automation. Consumer goods. Basically any industry that puts things in bottles or containers has been working around this limitation for years, usually by adding fiducial markers or redesigning packaging or just accepting higher failure rates.
The Trans2Occ approach is particularly interesting because it's simple. Single camera, no depth sensor, rule-based grasping on top of the occupancy prediction. That's the kind of solution that actually gets deployed because it doesn't require rebuilding your entire perception stack.
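To make "rule-based grasping on top of the occupancy prediction" concrete, here is a rough sketch of what such rules could look like. The paper's actual rules aren't described here, so everything in this block is an assumption of mine: the three rules, the `plan_top_grasp` name, and the representation of occupancy as a list of integer voxel indices.

```python
from typing import List, NamedTuple, Tuple

Voxel = Tuple[int, int, int]  # (x, y, z) index into the occupancy grid


class Grasp(NamedTuple):
    center_xy_m: Tuple[float, float]  # where to place the gripper, in metres
    approach_height_m: float          # top surface of the object
    closing_axis: str                 # "x" or "y": direction the fingers close
    required_opening_m: float         # object width along the closing axis
    feasible: bool                    # does the object fit in the gripper?


def plan_top_grasp(occupied: List[Voxel], voxel_size_m: float,
                   max_opening_m: float) -> Grasp:
    """Pick a top-down grasp from occupied voxels using three simple rules."""
    if not occupied:
        raise ValueError("no occupied voxels to grasp")
    xs = [v[0] for v in occupied]
    ys = [v[1] for v in occupied]
    zs = [v[2] for v in occupied]

    # Rule 1: centre the gripper on the horizontal centroid of the object.
    # The +0.5 moves from a voxel's index to the middle of that voxel.
    cx = (sum(xs) / len(xs) + 0.5) * voxel_size_m
    cy = (sum(ys) / len(ys) + 0.5) * voxel_size_m

    # Rule 2: approach from above, down to the top of the highest occupied voxel.
    top_z = (max(zs) + 1) * voxel_size_m

    # Rule 3: close the fingers along the narrower horizontal extent.
    width_x = (max(xs) - min(xs) + 1) * voxel_size_m
    width_y = (max(ys) - min(ys) + 1) * voxel_size_m
    axis, width = ("x", width_x) if width_x <= width_y else ("y", width_y)

    return Grasp((cx, cy), top_z, axis, width, width <= max_opening_m)
```

The point of the sketch is how little there is to it. Once the occupancy grid is trustworthy, the grasp logic can stay boring, and boring is what gets deployed.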
I talked to a warehouse automation engineer last year (over email, naturally, because I'm not a Slack person) who estimated that transparent and reflective objects account for roughly 15 to 20 percent of pick failures in mixed-SKU environments. That's not a small number when you're processing millions of items.
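To put that estimate in rough numbers, here is a back-of-the-envelope calculation. Only the 15 to 20 percent share comes from the engineer. The pick volume and the overall 2 percent failure rate are hypothetical figures I picked to make the arithmetic easy to follow, not data from any real warehouse.

```python
def transparent_failures(total_picks: int, failure_rate: float,
                         share_low: float = 0.15, share_high: float = 0.20):
    """Range of pick failures attributable to transparent/reflective items."""
    failures = total_picks * failure_rate
    return failures * share_low, failures * share_high


# Hypothetical inputs: one million picks, 2% of them failing for any reason.
low, high = transparent_failures(total_picks=1_000_000, failure_rate=0.02)
print(f"roughly {low:,.0f} to {high:,.0f} failed picks")
```

Under those assumptions, a million picks produce about 20,000 failures, of which somewhere around 3,000 to 4,000 trace back to see-through or shiny items. Each one is a human intervention or a re-pick.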
None of these papers solve the full problem, and to their credit, they don't claim to. Trans2Occ focuses on grasping but doesn't address manipulation sequences. ActMVS is about mapping, not manipulation. AFUN tries to bridge perception and action but relies on RGB-D input, which brings back the depth sensor dependency.
There's also the question of speed. Real-time performance claims in papers often use a different definition of "real-time" than production systems require. The ActMVS paper mentions frame rates suitable for UAV navigation, but the abstract doesn't make the specific numbers clear.
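One way to see the gap between the two definitions: a paper can report an average frame rate, while a control loop cares about the slow frames. The sketch below checks measured per-frame latencies against a frame budget at a chosen percentile. The function names, the nearest-rank percentile, and the sample latencies are my own illustrative choices, not anything taken from the papers.

```python
import math
from typing import Sequence


def frame_budget_ms(target_hz: float) -> float:
    """Time available per frame, in milliseconds, at a target update rate."""
    return 1000.0 / target_hz


def meets_realtime(latencies_ms: Sequence[float], target_hz: float,
                   percentile: int = 99) -> bool:
    """True if the given percentile of frame latency fits within the budget."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Nearest-rank percentile: the smallest sample with at least
    # `percentile` percent of all samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1] <= frame_budget_ms(target_hz)


# Made-up measurements: mostly fast frames plus two slow ones.
samples = [20.0] * 98 + [45.0, 60.0]
mean_ms = sum(samples) / len(samples)
print(mean_ms <= frame_budget_ms(30.0))    # average latency vs. a 30 Hz budget
print(meets_realtime(samples, 30.0))       # 99th-percentile latency vs. same budget
```

With these made-up samples the average fits comfortably inside a 30 Hz budget, but the 99th-percentile frame does not. That is the difference between "real-time" in an abstract and "real-time" on a drone.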
And honestly, we don't know yet how these approaches handle edge cases. Wet glass. Dirty glass. Glass next to mirrors. Glass in direct sunlight. The real world has a way of finding failure modes that simulation doesn't anticipate.
What I find encouraging about this batch of papers is that they're attacking the problem from first principles rather than trying to patch existing approaches. The shift from "how do we make depth sensors work on transparent objects" to "how do we perceive transparent objects without depth sensors" is the kind of reframing that sometimes leads to actual progress.
It reminds me of the early days of computer vision, when people stopped trying to hand-engineer features and started letting neural networks learn them. That paradigm shift took years to play out, but it eventually changed everything.
We're probably 3 to 5 years from seeing this stuff in production systems, assuming the usual timeline of academic paper to startup to pilot program to deployment. The kids building these systems (and yes, I'm calling them kids, they're probably younger than my email habits) will need to figure out all the boring integration challenges that papers don't cover.
But for the first time in a while, I'm actually optimistic that the transparent object problem might get solved. Not because any single paper cracked it, but because multiple groups are finally taking it seriously.
If you want to argue about this, my email's on the about page. Just don't expect a quick response. I'm still working through messages from last month.