The Factuality Problem Nobody Wants to Talk About
Google and OpenAI just released benchmarks showing their best models get basic facts wrong 30-40% of the time. That's... not great.
Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Here's a number that stopped me cold: OpenAI's o1-preview, their most capable reasoning model, gets simple factual questions wrong about 40% of the time.
Forty percent. On questions with clear, verifiable answers.
I've been thinking about this since both Google DeepMind and OpenAI quietly dropped new factuality benchmarks in the past few weeks. The timing feels deliberate, like two companies simultaneously admitting to a problem they've been dancing around. And honestly, the results are worse than I expected.
What the benchmarks actually show
Let me back up. Google's FACTS Benchmark Suite and OpenAI's SimpleQA are both trying to answer the same question: when you ask an AI a straightforward factual question, how often does it just... make stuff up?
SimpleQA is the simpler of the two (hence the name, I guess). Short questions, single correct answers, stuff you could verify with a quick search. Things like "What year did X happen?" or "Who founded Y company?" The kind of questions where there's no ambiguity, no room for interpretation.
Google's FACTS suite is more comprehensive. It tests across different domains, different question types, different levels of complexity. But the core goal is the same: figure out how much you can actually trust what these models tell you.
The results are, tbh, pretty sobering. Even the best models are hovering somewhere in the 60-70% accuracy range on these benchmarks. That means if you ask a frontier AI model ten simple factual questions, it'll probably get three or four wrong.
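To make the "three or four out of ten" arithmetic concrete, here's a minimal sketch. It assumes each question is an independent trial with a fixed accuracy, which is a simplification I'm introducing for illustration, not something either benchmark claims. The function names are mine.

```python
from math import comb


def expected_wrong(n_questions: int, accuracy: float) -> float:
    """Average number of wrong answers if each question is right with probability `accuracy`."""
    return n_questions * (1 - accuracy)


def prob_at_least_k_wrong(n_questions: int, accuracy: float, k: int) -> float:
    """Binomial probability of getting k or more questions wrong.

    Assumes questions are independent, which real question sets may not be.
    """
    p_wrong = 1 - accuracy
    return sum(
        comb(n_questions, i) * p_wrong**i * accuracy ** (n_questions - i)
        for i in range(k, n_questions + 1)
    )


if __name__ == "__main__":
    # The 60-70% accuracy range quoted above, applied to ten questions.
    for accuracy in (0.6, 0.7):
        print(accuracy, expected_wrong(10, accuracy), prob_at_least_k_wrong(10, accuracy, 1))
```

Under that independence assumption, even at 70% accuracy the chance of getting all ten questions right is 0.7 to the tenth power, a bit under 3%. So a clean run of ten correct answers would be the surprise, not the norm.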
You might be wondering why this matters when we've known about hallucinations forever. Fair point. But there's something different about seeing it quantified this precisely. It's one thing to know models sometimes make things up. It's another to see that the error rate on basic facts is comparable to flipping a weighted coin.
The calibration problem is actually scarier
Here's what really got me though. It's not just that models get things wrong. It's that they don't know when they're wrong.
OpenAI's benchmark specifically measures what they call calibration: basically, whether the model's stated confidence matches its actual accuracy. A well-calibrated model that says it's 80% sure should be right about 80% of the time, and it would say "I'm not sure" when it's likely to be wrong. These models... don't really do that.
They'll state incorrect facts with the same confident tone they use for correct ones. There's no reliable signal that tells you "hey, I'm guessing here." Which means you, the user, have no way to know when to double-check and when to trust.
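If you want a feel for what "calibration" means in numbers, here's a rough sketch of one common way to measure it: group answers by stated confidence, then compare the average confidence in each group to how often those answers were actually right. This is a generic expected-calibration-error calculation, not necessarily how OpenAI computes it for SimpleQA. The function names, the bin count, and the sample data are all made up for illustration.

```python
from collections import defaultdict


def calibration_table(predictions, n_bins=5):
    """Group (confidence, was_correct) pairs into equal-width confidence bins.

    Returns a list of (mean_confidence, accuracy, count) tuples, one per non-empty bin.
    """
    bins = defaultdict(list)
    for confidence, correct in predictions:
        # Clamp the index so a confidence of exactly 1.0 lands in the top bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))

    rows = []
    for index in sorted(bins):
        items = bins[index]
        mean_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((mean_conf, accuracy, len(items)))
    return rows


def expected_calibration_error(predictions, n_bins=5):
    """Weighted average gap between confidence and accuracy across bins (0 means perfectly calibrated)."""
    total = len(predictions)
    return sum(
        count / total * abs(mean_conf - accuracy)
        for mean_conf, accuracy, count in calibration_table(predictions, n_bins)
    )


if __name__ == "__main__":
    # Hypothetical data: a model that claims 90% confidence but is right half the time.
    overconfident = [(0.9, True), (0.9, False), (0.9, True), (0.9, False)]
    print(expected_calibration_error(overconfident))
```

The gap between the two columns is the thing to watch. A model that says "90% sure" and is right half the time has a large gap in that bin, and that's the failure mode described above: the confident tone carries no information.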
Sources
- FACTS Benchmark Suite: Systematically evaluating the factuality of large language models · Google DeepMind
- Introducing SimpleQA · OpenAI Blog