Two New SLAM Papers Want Robots to See the World More Like We Do
A pair of arXiv papers tackle one of robotics' oldest headaches: getting robots to build accurate maps of the world, even when the lighting is terrible or the geometry is tricky.
Think about the last time you walked into a dark parking garage. Your eyes adjusted, maybe slowly, but you didn't lose track of where you were. You didn't suddenly forget the shape of the pillars or the slope of the ramp. You just... kept going.
Robots can't do that. Not reliably, anyway. And that gap, between what humans take for granted and what robots can actually manage, is basically the whole problem that SLAM research is trying to close.
SLAM stands for Simultaneous Localization and Mapping. It's the process by which a robot figures out where it is while also building a map of its surroundings. It's been a core challenge in robotics for decades, and honestly, I think a lot of people outside the field assume it's a solved problem by now. It's not. Two new papers posted to arXiv this week suggest researchers are still finding meaningful ways to push it forward, and both of them are leaning on a technique called 3D Gaussian Splatting to do it.
If you've been following robotics or computer vision for the past couple of years, you've probably seen the term "3D Gaussian Splatting" (3DGS) a lot. I should know this better than I do, but here's my working understanding: instead of representing a scene as a mesh or a point cloud, 3DGS represents it as a collection of overlapping 3D blobs, each with its own position, size, orientation, and color. The result is a representation that's fast to render and surprisingly good at capturing fine visual detail.
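For the code-minded, here's a minimal sketch of what one of those blobs might look like as data. Some assumptions on my part: the names `Gaussian3D`, `rotation_matrix`, and `covariance` are mine and not from either paper, orientation is stored as a unit quaternion, and color is plain RGB (real 3DGS implementations use a richer, view-dependent color model and also track opacity). The one piece of real math here is how size and orientation combine into a covariance matrix, Sigma = R * diag(scale^2) * R^T, which is what defines the blob's shape.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian3D:
    """One blob in the scene (illustrative field names, not from the papers)."""
    position: tuple  # (x, y, z) center of the blob
    scale: tuple     # per-axis standard deviations (sx, sy, sz)
    rotation: tuple  # orientation as a quaternion (w, x, y, z)
    color: tuple     # simplified to RGB in [0, 1]


def rotation_matrix(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n  # normalize defensively
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance(g):
    """Shape of the blob: Sigma = R * diag(scale^2) * R^T."""
    R = rotation_matrix(g.rotation)
    s2 = [s * s for s in g.scale]
    return [
        [sum(R[i][k] * s2[k] * R[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


# A flat, pancake-like blob lying in the x-y plane, e.g. a patch of floor.
floor_patch = Gaussian3D(
    position=(0.0, 0.0, 0.0),
    scale=(0.5, 0.5, 0.01),
    rotation=(1.0, 0.0, 0.0, 0.0),  # identity: no rotation
    color=(0.6, 0.6, 0.6),
)
sigma = covariance(floor_patch)
```

A scene is then just a very long list of these, and rendering amounts to projecting each blob into the camera and blending the overlapping results.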
The technique took off in computer graphics, where people used it to generate photorealistic novel views of scenes from a handful of photos. Then robotics researchers started asking: what if we used this for mapping? What if a robot could build one of these Gaussian representations in real time, as it moves through the world?
That's the promise. The problem is that current 3DGS-based SLAM systems have some real weaknesses, and the two papers posted this week each attack a different one.
The first paper introduces something called MMD-SLAM, which stands for Multi-Meta Gaussian Distribution SLAM. The core complaint it's responding to is that most existing 3DGS SLAM systems don't really think about the structure of a scene. They treat every surface more or less the same, which works fine in some environments but leads to blurry, inconsistent maps in others.
MMD-SLAM tries to fix this by baking in what the authors call the "Atlanta World" assumption. I initially thought this was some quirky name for a new dataset, but it's actually a geometric prior: the idea that most indoor environments are built around a small number of dominant directions. In the usual formulation, that means one shared vertical axis plus a handful of horizontal directions, which (unlike in the stricter "Manhattan World" model) don't have to sit at right angles to each other. Walls are vertical. Floors are horizontal. Ceilings are parallel to floors. Most rooms, in other words, have a kind of underlying regularity that a smart system should be able to exploit.
MMD-SLAM exploits that regularity in a few ways. It uses a "point-line fusion" strategy for tracking, meaning it doesn't just look at individual points in the scene but also at 3D line segments, like the edges of walls or door frames. Those lines give the system extra geometric anchors to work with, which makes tracking more robust when the visual texture is sparse or repetitive.
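Here's a rough sketch of what fusing points and lines in a tracking cost can look like. I haven't seen the paper's actual formulation, so treat this as a generic illustration: point matches contribute a point-to-point distance, line matches contribute the perpendicular distance from an observed point to a mapped 3D line, and a weight balances the two. The function names and the `line_weight` parameter are hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def _norm(v):
    return math.sqrt(sum(c * c for c in v))


def point_residual(observed, predicted):
    """Distance between an observed 3D point and where the map predicts it."""
    return _norm(_sub(observed, predicted))


def line_residual(point, line_a, line_b):
    """Perpendicular distance from a point to the line through line_a, line_b."""
    direction = _sub(line_b, line_a)
    return _norm(_cross(_sub(point, line_a), direction)) / _norm(direction)


def fused_cost(point_pairs, line_observations, line_weight=1.0):
    """Sum of squared point residuals plus weighted squared line residuals.

    point_pairs:       list of (observed, predicted) 3D points
    line_observations: list of (point, line_a, line_b) tuples
    """
    cost = sum(point_residual(o, p) ** 2 for o, p in point_pairs)
    cost += line_weight * sum(
        line_residual(p, a, b) ** 2 for p, a, b in line_observations
    )
    return cost


# One point match that is off by 5 units, and one point sitting 1 unit
# away from a mapped edge running along the x-axis.
cost = fused_cost(
    point_pairs=[((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))],
    line_observations=[((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (2.0, 0.0, 0.0))],
    line_weight=2.0,
)
```

A tracker would then search for the camera pose that minimizes a cost like this. The appeal of the line term is that a blank wall still has edges, so it can contribute useful residuals even when there are few distinctive points to match.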
MMD-SLAM also introduces a "Gaussian evolution strategy" that adapts the shape and placement of Gaussian blobs based on the underlying geometry of the scene. Rather than scattering blobs uniformly, the system tries to align them with dominant structural directions.
The results, at least on standard benchmarks, are impressive. The paper reports a 48.56% reduction in tracking error (measured as ATE RMSE) on the ScanNet dataset compared to MonoGS, which is one of the more widely used baselines in this space. It also shows a 5.71% improvement in rendering quality on the Replica dataset. Those are meaningful numbers, though it's worth noting that benchmark performance and real-world performance don't always line up cleanly.
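If ATE RMSE is new to you: it's the root-mean-square distance between where the system thinks the camera was and where it actually was, taken over the whole trajectory. Here's a minimal sketch of the metric and of how a "percent reduction" figure is derived. One big simplification: standard evaluations first align the estimated trajectory to the ground truth, and this sketch assumes that has already been done. The trajectories and numbers below are made up for illustration and are not from the paper.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of position error between two already-aligned trajectories.

    Each trajectory is a list of (x, y, z) positions, matched index by index.
    """
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and the same length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def percent_reduction(baseline_error, new_error):
    """How much smaller new_error is than baseline_error, in percent."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Made-up example: a two-pose trajectory where the second pose is 2 units off.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
estimated = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
error = ate_rmse(estimated, ground_truth)

# Made-up errors: halving a baseline's error is a 50% reduction.
reduction = percent_reduction(baseline_error=2.0, new_error=1.0)
```

Because the errors are squared before averaging, a few badly wrong poses can dominate the score, which is part of why a large drop in ATE RMSE usually means fewer tracking blowups rather than uniformly tighter estimates.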
The second paper, LIT-GS, is tackling a different problem entirely: what happens when your visual sensors can't see properly?
Most SLAM systems rely heavily on RGB cameras. They track features in the image, match them across frames, and use that to estimate how the camera has moved. This works great in well-lit, visually rich environments. It falls apart in the dark, in fog, in scenes with lots of reflective surfaces, or anywhere that visual texture is sparse.
The LIT-GS team's answer is to swap out (or supplement) the RGB camera with a thermal camera. Thermal imaging doesn't depend on visible light. It captures heat signatures instead, which means it works in complete darkness, through smoke, and in other conditions where standard cameras struggle.
The catch is that thermal images are often low-contrast and lack the sharp edges and textures that visual SLAM systems depend on. So the LIT-GS system adds two more sensors to the mix: a LiDAR (which measures distances using laser pulses) and an inertial measurement unit (which tracks acceleration and rotation). The result is a LiDAR-inertial-thermal system, which is where the "LIT" in LIT-GS comes from.
The clever part is how the system fuses these modalities. LiDAR gives you precise geometric information about surfaces, specifically the planes that make up walls, floors, and other flat structures. LIT-GS uses those plane measurements as constraints during the optimization process, essentially telling the Gaussian representation: "these blobs need to align with these actual surfaces we measured with the laser."
This matters because thermal images alone would let the system drift. The geometry from LiDAR acts as a corrective anchor. The paper reports consistent improvements in geometric accuracy and rendering quality over existing LiDAR-inertial-visual baselines, particularly in challenging lighting conditions.
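The plane-constraint idea can be sketched in a few lines. This is a generic illustration under my own assumptions, not LIT-GS's actual loss. A plane is written as normal · x + offset = 0, the penalty is the mean squared distance from blob centers to that plane, and the function names are hypothetical.

```python
import math


def point_to_plane(point, normal, offset):
    """Signed distance from a point to the plane normal . x + offset = 0."""
    length = math.sqrt(sum(c * c for c in normal))
    return (sum(p * n for p, n in zip(point, normal)) + offset) / length


def plane_loss(centers, normal, offset):
    """Mean squared distance of blob centers from a LiDAR-measured plane.

    Added to a rendering loss, a term like this pulls blobs toward the
    surface the laser actually measured.
    """
    return sum(point_to_plane(c, normal, offset) ** 2 for c in centers) / len(centers)


def project_onto_plane(point, normal, offset):
    """Where a blob center would land if the constraint were fully enforced."""
    length = math.sqrt(sum(c * c for c in normal))
    dist = point_to_plane(point, normal, offset)
    return tuple(p - dist * n / length for p, n in zip(point, normal))


# A floor plane at z = 0, and two blob centers that have drifted off it.
floor_normal, floor_offset = (0.0, 0.0, 1.0), 0.0
centers = [(0.0, 0.0, 0.1), (1.0, 2.0, -0.1)]

loss = plane_loss(centers, floor_normal, floor_offset)
corrected = project_onto_plane((1.0, 2.0, 0.3), floor_normal, floor_offset)
```

In practice you'd expect a soft penalty like `plane_loss` rather than a hard projection, so the optimizer can trade geometric agreement against how well the rendered image matches what the thermal camera saw.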
You might be wondering why this matters beyond the lab. Fair question.
Here's the thing: any robot that needs to navigate the real world needs SLAM, or something very much like it. That includes warehouse robots, autonomous vehicles, surgical robots, and yes, humanoids. The whole vision of a humanoid robot that can walk into your home and help with tasks depends on that robot being able to build an accurate, stable model of its environment in real time, under all kinds of conditions.
The lighting problem LIT-GS addresses is particularly relevant. Homes aren't uniformly lit. Kitchens have bright overhead lights; living rooms have lamps in corners; hallways are often dim. A robot that can only navigate in good lighting isn't going to be very useful.
The structural awareness that MMD-SLAM introduces matters too. Most human environments are, in fact, built around dominant directions. If a robot can exploit that regularity, it can build better maps faster and with less compute.
Honestly, I think the more interesting question is whether these approaches can be combined. LIT-GS handles multi-modal sensing under difficult lighting. MMD-SLAM handles structural reasoning in visually rich environments. A system that does both would be more robust than either alone. It remains unclear whether the authors have any plans to collaborate, or whether others in the field will attempt a synthesis.
Both papers are evaluated on standard benchmarks, which is the norm in this field but also a real limitation. Benchmarks like ScanNet and Replica are recorded in controlled conditions with known properties. Real deployment environments are messier, more variable, and full of edge cases that benchmarks don't capture.
The LIT-GS paper also mentions evaluation on "proprietary sequences," which means some of the data isn't publicly available. That makes independent verification harder. To be clear, everything I've summarized here is based on what's reported in the papers; I haven't had the chance to dig into the code or run any experiments myself.
There's also the question of compute. 3DGS-based SLAM is more computationally demanding than traditional approaches. Both papers appear to target systems with meaningful GPU resources. How these methods perform on the kind of embedded hardware you'd actually put in a mobile robot is a question that neither paper fully answers.
And then there's the real-world gap. Even state-of-the-art SLAM systems can fail in ways that are hard to predict. Glass walls. Mirrors. Highly repetitive patterns like tiled floors. Dynamic objects moving through the scene. These are the things that tend to expose the limits of any mapping system, and it's too early to say how MMD-SLAM or LIT-GS would handle them at scale.
I've been covering embodied AI long enough to have a sort of complicated relationship with SLAM papers. There are a lot of them. The benchmarks keep going up. And yet robots still struggle with mapping in ways that feel like they shouldn't, given how much research has gone into this.
But I think these two papers represent something genuinely useful: not just incremental benchmark improvements, but new architectural ideas. Using structural priors to guide Gaussian representations. Fusing thermal and LiDAR data to handle illumination failures. These are conceptual moves, not just parameter tweaks, and those tend to have longer legs.
The path from a compelling arXiv paper to a robot that reliably navigates your home is long and full of unpleasant surprises. But someone has to take the early steps, and papers like these are part of how that happens.