So here's a question I've been mulling over: why does your fancy H100 GPU only use 27 percent of its memory bandwidth when running the kind of AI inference that robots actually need?
I've been covering tech long enough to recognize when an industry is building cathedrals on sand. The self-driving car hype cycle taught me that. The dot-com bubble before that. And now I'm watching the AI hardware market make what looks like the same fundamental mistake, just with better marketing and bigger numbers.
A new paper posted to arXiv lays out the problem with uncomfortable clarity. When you're running a robot, an autonomous vehicle, or any physical AI system that needs to respond in real time, you're not doing the same kind of inference that OpenAI runs in its data centers. You're doing what the authors call "batch-1 autoregressive decode," which is a fancy way of saying: one robot, one camera feed, one user, waiting on the next token. No batching. No parallelism to hide the inefficiencies.
And here's where it gets interesting (and by interesting I mean concerning for anyone who's bought into the GPU arms race).
The researchers tested batch-1 decode across four Nvidia GPUs: the H100 SXM5, A100-80GB, L40S, and the humble L4. They ran three different 7 to 8 billion parameter models at various context lengths. What they found should make hardware planners nervous.
The L4, Nvidia's cheapest option in the test, achieved roughly 81 percent of its theoretical memory bandwidth floor. The H100, their flagship monster, hit only 27 percent. Let me say that again: the most expensive GPU in the lineup was the least efficient at the actual workload physical AI systems need.
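For a rough sense of what that "floor" means, here's a back-of-the-envelope sketch. It assumes every generated token has to stream all of the model's weights out of GPU memory once, and it ignores KV-cache traffic. The bandwidth figures are nominal datasheet numbers I've plugged in myself, not values from the paper, and reading the 81 and 27 percent figures as "fraction of ideal speed achieved" is my interpretation rather than the authors' stated definition.

```python
# Back-of-the-envelope bandwidth floor for batch-1 decode.
# Assumption: each token reads every model weight from GPU memory exactly once;
# KV-cache reads and all compute time are ignored.

# Nominal peak memory bandwidth in GB/s (public datasheet figures, not from the paper).
PEAK_BANDWIDTH_GBPS = {
    "H100 SXM5": 3350,
    "A100-80GB": 2039,
    "L40S": 864,
    "L4": 300,
}


def floor_ms_per_token(params_billion, bytes_per_param, bandwidth_gbps):
    """Best-case per-token latency if memory bandwidth were the only limit."""
    model_gb = params_billion * bytes_per_param  # billions of params * bytes each = GB
    return model_gb / bandwidth_gbps * 1000.0


def implied_ms_per_token(floor_ms, fraction_of_floor):
    """Per-token latency implied by hitting only a fraction of the ideal speed."""
    return floor_ms / fraction_of_floor


# A 7B-parameter model in 16-bit weights (2 bytes per parameter) is about 14 GB.
for gpu, bandwidth in PEAK_BANDWIDTH_GBPS.items():
    floor = floor_ms_per_token(7, 2, bandwidth)
    print(f"{gpu:10s} floor: {floor:6.2f} ms/token")

# Apply the efficiency figures quoted above to the two extremes.
h100 = implied_ms_per_token(floor_ms_per_token(7, 2, 3350), 0.27)
l4 = implied_ms_per_token(floor_ms_per_token(7, 2, 300), 0.81)
print(f"H100 at 27%: {h100:.1f} ms/token, L4 at 81%: {l4:.1f} ms/token")
```

Under those assumptions the H100's floor works out to about 4.2 ms per token against about 46.7 ms for the L4. The H100 would still be faster in absolute terms, but by far less than the roughly 11x gap in raw bandwidth would suggest.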
This isn't a bug in the benchmark. It's a feature of how these chips are designed. The faster the memory, the more visible a different bottleneck becomes: launch-side overhead, the fixed cost the host pays to queue up each GPU kernel. The researchers isolated this with a CUDA Graphs experiment, which replays a recorded sequence of kernels with far fewer launches. It improved H100 decode latency by 1.26x while barely touching the L4's performance. The slow GPU was already close to its floor. The fast GPU had headroom it couldn't actually use.
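Why would trimming launch overhead help one GPU and not the other? A toy model makes the intuition concrete. It treats per-token latency as a memory-bound floor plus a fixed overhead that doesn't shrink when the memory gets faster. Every number here is an illustrative assumption of mine, not a measurement from the paper, and the function names are hypothetical.

```python
# Toy model: per-token latency = memory-bound floor + fixed launch-side overhead.
# All values are illustrative assumptions, not measurements from the paper.


def token_latency_ms(floor_ms, overhead_ms):
    """Total per-token latency under the additive toy model."""
    return floor_ms + overhead_ms


def speedup_from_trimming(floor_ms, overhead_ms, fraction_removed):
    """Speedup from removing a fraction of the overhead (e.g. via graph replay)."""
    before = token_latency_ms(floor_ms, overhead_ms)
    after = token_latency_ms(floor_ms, overhead_ms * (1.0 - fraction_removed))
    return before / after


# Hypothetical floors for a fast-memory and a slow-memory GPU, in ms per token.
FAST_FLOOR_MS = 4.2
SLOW_FLOOR_MS = 46.7
OVERHEAD_MS = 2.0  # assumed identical on both, since it is paid on the host side

fast = speedup_from_trimming(FAST_FLOOR_MS, OVERHEAD_MS, 0.5)
slow = speedup_from_trimming(SLOW_FLOOR_MS, OVERHEAD_MS, 0.5)
print(f"fast GPU speedup: {fast:.2f}x, slow GPU speedup: {slow:.2f}x")
```

The same overhead is a large slice of a short token time and a rounding error in a long one, so removing it pays off mostly on the fast chip. That matches the direction of the paper's result, though not necessarily its magnitudes.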
Call me old-fashioned, but I remember when "faster hardware equals faster performance" was a reasonable assumption. Those days are apparently over.
This is the context you need to understand why a startup called Majestic Labs thinks it can challenge Nvidia with a fundamentally different approach. IEEE Spectrum covered their new Prometheus server, and the specs are, well, audacious.
128 terabytes of memory. That's not a typo. That's over 60 times more than Nvidia's DGX B300. Their pitch is simple: current AI servers are "greatly over-provisioning on compute and starving on memory."
Now, I've watched enough startups promise to dethrone incumbents to be skeptical. Most of them end up as footnotes or acquisition targets. But the technical argument here is at least coherent.
"You get this shoreline at the compute die where you can put your HBM. If you wanted to put more, you can't," explains Sha Rabii, Majestic's co-founder. It's a physical constraint. High-bandwidth memory needs to sit millimeters from the processor. That limits how much you can actually install.
Majestic's solution is proprietary copper cables that work up to a meter, paired with custom memory aggregation chips. "It's an endpoint for that high-speed interface and fans out to many, many commodity DRAM chips," Rabii says. They're claiming memory bandwidth up to 25.6 terabytes per second using LPDDR6 instead of HBM.
The compute side uses something they call Ignite, a custom chip combining ARM cores with RISC-V vector and tensor cores. Twelve of these per server. They haven't released specific performance numbers, which, you know, is always a bit of a yellow flag. But the architecture at least makes sense on paper.
Here's where my skepticism kicks in harder. You can build the most elegant hardware in the world, but if developers have to rewrite their entire stack to use it, you've already lost. Nvidia's moat isn't just silicon; it's CUDA and the more than a decade of software infrastructure built on top of it.
Majestic knows this. "We're trying to reduce friction as much as possible in every aspect of our customer adoption, whether it's physical or software," Rabii acknowledges. They're promising PyTorch, vLLM, and Triton compatibility without code changes. Existing models run as-is, supposedly.
I'll believe it when I see it in production. But at least they're saying the right things.
The economic pitch is bold: "Our customers' capital expenditure will come down by, depending on the workload, 10 to 50 times, and the power consumption comes down by a similar amount." That's a massive claim. If true, it changes the math for anyone deploying physical AI at scale. If not, well, add it to the pile of startup promises that didn't survive contact with reality.
Let's bring this back to what actually matters for the robotics industry. The arXiv paper's findings have immediate implications.
First, if you're buying H100s for robot inference, you're probably wasting money. And dropping down to a cheaper GPU and quantizing isn't a free fix: the paper shows that common quantization paths on those cheaper GPUs don't deliver the expected speedups, with GPTQ on ExLlamaV2 the notable exception. So the optimization path is narrow and requires careful engineering.
Second, the "memory wall" is real and it's not going away with faster GPUs. This is a fundamental architectural problem. Throwing more compute at it doesn't help when you're starving for memory bandwidth and the overhead is eating your gains.
Third, and this is the part that remains unclear, we don't actually know if Majestic's approach will work at scale. They're shipping servers this year, apparently, but real-world performance data is thin. The Prometheus fits four units per rack at up to 120 kilowatts with liquid cooling. That's a lot of infrastructure to bet on an unproven architecture.
I've watched enough hardware cycles to know that the technically superior solution doesn't always win. VHS beat Betamax. x86 outlasted everything. Sometimes good enough plus ecosystem wins over elegant but orphaned.
What I find genuinely interesting about this moment is that we might be watching the limits of the GPU-centric approach become visible. Nvidia built an empire on the assumption that more parallel compute is always the answer. For training massive models and serving millions of users in data centers, they're probably still right.
But for physical AI? For the robot in your warehouse or the car trying not to hit pedestrians? The workload is fundamentally different. It's memory-dominated, latency-critical, and doesn't benefit from batching. The arXiv paper calls this the "Physical AI Inference Gap," which is a nice way of saying the industry optimized for the wrong thing.
Majestic isn't the only company noticing this. I'd expect more startups and even established players to start exploring memory-centric architectures over the next few years. Whether any of them can actually challenge Nvidia's dominance is a different question entirely.
For now, the practical advice is boring but true: benchmark your actual workload on your actual hardware. Don't assume that the most expensive GPU is the right choice. And maybe, just maybe, consider that the kids building robots today are running into constraints that the AI hype cycle hasn't fully acknowledged yet.
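If you want a starting point for that benchmarking, here's a minimal timing harness. It's a sketch, not a rigorous tool: `step` is a hypothetical callable standing in for whatever produces one token in your stack, and the fake step below just sleeps so the example runs anywhere. On a real GPU, `step` needs to wait for the device to finish (in PyTorch, for example, by calling `torch.cuda.synchronize()`) before it returns, or you'll only be timing the launch.

```python
import statistics
import time


def benchmark_decode(step, warmup=5, iters=50):
    """Time repeated calls to `step`, a hypothetical one-token decode callable."""
    # Warm-up calls are discarded so one-time setup costs don't skew the samples.
    for _ in range(warmup):
        step()

    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        step()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    median_ms = statistics.median(samples_ms)
    # Nearest-rank style index for the 95th percentile, clamped to the last sample.
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    return {
        "median_ms": median_ms,
        "p95_ms": samples_ms[p95_index],
        "tokens_per_s": 1000.0 / median_ms if median_ms > 0 else float("inf"),
    }


def fake_step():
    """Stand-in for a real decode step; sleeps for about 2 ms."""
    time.sleep(0.002)


stats = benchmark_decode(fake_step, warmup=2, iters=20)
print(f"median {stats['median_ms']:.2f} ms, p95 {stats['p95_ms']:.2f} ms, "
      f"{stats['tokens_per_s']:.0f} tokens/s")
```

Reporting a tail percentile alongside the median matters here because a robot cares about the slow tokens, not just the typical one.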
If you want to argue about any of this, my email's on the about page. I still prefer it to Slack.