The Factuality Problem Nobody Wants to Talk About

Google and OpenAI just released benchmarks showing their best models get basic facts wrong 30-40% of the time. That's... not great.

25 May 2026 · 5 min read

Here's a number that stopped me cold: OpenAI's o1-preview, their most capable reasoning model, gets simple factual questions wrong about 40% of the time.

Forty percent. On questions with clear, verifiable answers.

I've been thinking about this since both Google DeepMind and OpenAI quietly dropped new factuality benchmarks in the past few weeks. The timing feels deliberate, like two companies simultaneously admitting to a problem they've been dancing around. And honestly, the results are worse than I expected.

What the benchmarks actually show

Let me back up. Google's FACTS Benchmark Suite and OpenAI's SimpleQA are both trying to answer the same question: when you ask an AI a straightforward factual question, how often does it just... make stuff up?

SimpleQA is the simpler of the two (hence the name, I guess). Short questions, single correct answers, stuff you could verify with a quick search. Things like "What year did X happen?" or "Who founded Y company?" The kind of questions where there's no ambiguity, no room for interpretation.

Google's FACTS suite is more comprehensive. It tests across different domains, different question types, different levels of complexity. But the core goal is the same: figure out how much you can actually trust what these models tell you.

The results are, tbh, pretty sobering. Even the best models are hovering somewhere in the 60-70% accuracy range on these benchmarks. That means if you ask a frontier AI model ten simple factual questions, it'll probably get three or four wrong.
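To make that arithmetic concrete, here's a rough back-of-the-envelope sketch. It assumes every question is an independent coin flip with the same accuracy, which real benchmarks don't guarantee, and the function names are mine, not anything from SimpleQA or FACTS.

```python
from math import comb


def expected_wrong(n_questions: int, accuracy: float) -> float:
    """Expected number of wrong answers, assuming each question is
    answered correctly with the same probability `accuracy`."""
    return n_questions * (1.0 - accuracy)


def prob_at_least_k_wrong(n_questions: int, accuracy: float, k: int) -> float:
    """Probability of k or more wrong answers under a simple binomial model
    (independent questions, identical accuracy). This is an illustrative
    assumption, not how either benchmark reports results."""
    p_wrong = 1.0 - accuracy
    return sum(
        comb(n_questions, i) * p_wrong**i * accuracy ** (n_questions - i)
        for i in range(k, n_questions + 1)
    )


if __name__ == "__main__":
    # The 60-70% accuracy range quoted above, applied to ten questions.
    for acc in (0.60, 0.70):
        wrong = expected_wrong(10, acc)
        any_wrong = prob_at_least_k_wrong(10, acc, 1)
        print(f"accuracy={acc:.0%}: expect {wrong:.1f} wrong, "
              f"P(at least one wrong)={any_wrong:.1%}")
```

Under those assumptions, even at 70% accuracy the chance of getting all ten right is under 3%, so at least one wrong answer in any batch of ten is close to a sure thing.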

