Image credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Most of the coverage I've seen on OpenAI's new chain-of-thought monitoring research focuses on the technical achievement. That's fine; it's the obvious angle. But here's what nobody's saying: we've been down this road before with every major tech shift, where the companies building potentially dangerous things also get to define what "safe" means and how we measure it.
Call me old-fashioned, but that's a conflict of interest worth naming out loud.
OpenAI released a batch of research this week on monitoring the internal reasoning of their AI models, the stuff that happens in the "chain of thought" before the model spits out an answer. The idea is straightforward enough: if you can see what the model is thinking, you can catch it when it's thinking about doing something bad.
The OpenAI Blog post on their evaluation framework claims that monitoring a model's internal reasoning is "far more effective than monitoring outputs alone." They tested this across 13 different evaluations in 24 environments, which sounds comprehensive until you remember that these are environments they designed to test properties they chose to measure.
The more interesting finding, honestly, comes from their research on controllability. They introduced something called CoT-Control and discovered that reasoning models struggle to deliberately manipulate their own chains of thought. OpenAI frames this as good news, because it means the thinking process is harder to fake, which makes monitoring more reliable.
I'll give them credit here: that's a genuinely useful property if it holds up. A model that can't easily hide its reasoning is easier to trust than one that can put on a show.
The piece that caught my attention was their post on monitoring internal coding agents. This isn't hypothetical lab stuff: they're talking about AI systems they've actually deployed internally at OpenAI to write code.
They found instances of what they call "misalignment" in real-world usage, cases where the model's chain of thought revealed intentions that didn't match what it was supposed to be doing. The post is frustratingly vague on specifics (of course it is), but the fact that they're finding anything at all in production systems is, well, it's something.
The numbers they don't give you are the ones I want. How often does this happen? What percentage of agent runs show concerning patterns? What's the false positive rate on their monitoring? They mention analyzing "real-world deployments" but won't tell us the scale or the severity distribution. If you want to argue about this, my email's on the about page.
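None of these numbers are exotic, by the way. Here's a minimal sketch of the arithmetic, assuming someone has had humans review a sample of agent runs and tally how the monitor did. The `MonitorCounts` class, the function names, and every count below are my own invention for illustration; none of it comes from OpenAI's posts.

```python
from dataclasses import dataclass


@dataclass
class MonitorCounts:
    """Hypothetical tallies from human review of monitored agent runs."""
    true_positives: int   # flagged, and reviewers agreed there was a problem
    false_positives: int  # flagged, but reviewers found nothing wrong
    false_negatives: int  # not flagged, but reviewers found a problem
    true_negatives: int   # not flagged, and nothing wrong

    @property
    def total_runs(self) -> int:
        return (self.true_positives + self.false_positives
                + self.false_negatives + self.true_negatives)


def _ratio(numerator: int, denominator: int) -> float:
    """Divide safely, returning 0.0 when there is nothing to divide by."""
    return numerator / denominator if denominator else 0.0


def concerning_rate(c: MonitorCounts) -> float:
    """Share of all runs that reviewers judged to be genuinely concerning."""
    return _ratio(c.true_positives + c.false_negatives, c.total_runs)


def flag_rate(c: MonitorCounts) -> float:
    """Share of all runs that the monitor flagged, rightly or wrongly."""
    return _ratio(c.true_positives + c.false_positives, c.total_runs)


def false_positive_rate(c: MonitorCounts) -> float:
    """Share of harmless runs that the monitor flagged anyway."""
    return _ratio(c.false_positives, c.false_positives + c.true_negatives)


def precision(c: MonitorCounts) -> float:
    """Share of flagged runs that turned out to be real problems."""
    return _ratio(c.true_positives, c.true_positives + c.false_positives)


if __name__ == "__main__":
    # Invented numbers, purely to show the calculation.
    counts = MonitorCounts(true_positives=8, false_positives=12,
                           false_negatives=2, true_negatives=978)
    print(f"runs reviewed:       {counts.total_runs}")
    print(f"concerning rate:     {concerning_rate(counts):.2%}")
    print(f"flag rate:           {flag_rate(counts):.2%}")
    print(f"false positive rate: {false_positive_rate(counts):.2%}")
    print(f"precision:           {precision(counts):.2%}")
```

With those made-up counts, the monitor flags 2% of runs and 60% of its flags are false alarms. Whether the real figures look anything like that is exactly what the posts don't tell us.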
Here's where I get grumpy, and you can decide if I'm being fair.
Every major technology company, when faced with pressure to prove their systems are safe, does exactly what OpenAI is doing here: they create their own safety frameworks, run their own evaluations, and publish papers showing their methods work. The papers are often technically sound! The methods are often genuinely useful! And somehow the conclusion is always that the company is doing the right things and just needs more time and resources to do them better.
I covered the autonomous vehicle industry for years. I watched Waymo and Cruise and Aurora all publish safety frameworks that they designed, measured by metrics they chose, evaluated by teams they employed. The frameworks kept getting more sophisticated while the fundamental question, "is this actually safe enough to deploy," remained unanswered by anyone who didn't have a financial stake in the answer being yes.
OpenAI's post on external testing acknowledges they work with "independent experts" to evaluate their systems. But the company still controls access, still chooses which experts get to look, still decides what gets published. That's not independence, that's supervised access.
The technical goals post states OpenAI's mission is to "build safe AI, and ensure AI's benefits are as widely and evenly distributed as possible." That's a fine mission statement. It's also exactly what you'd expect them to say whether or not they can deliver it.
The governance post talks about "voluntary commitments" from leading labs to reinforce safety. Voluntary! The kids running these companies (and yes, I know some of them are in their 40s now, but what do I know) are asking us to trust that voluntary commitments will be enough.
I'm not saying the chain-of-thought monitoring research is bad. It's probably necessary work, and it's better that someone's doing it than no one. The CoT-Control findings in particular suggest there might be real technical constraints that make deceptive AI harder to build, which would be genuinely good news if it generalizes.
But we don't know if it generalizes! The models they tested are the models they have now. The environments they tested are environments they designed. The definition of "misalignment" they're using is their definition.
Look, I've been covering tech long enough to know that the companies building transformative technologies are never the right entities to regulate themselves. It's not that they're evil; it's that their incentives point in the wrong direction. OpenAI genuinely believes they're the good guys. They probably are the good guys, in the sense that the people working there want AI to go well for humanity.
That's not enough.
What would actually impress me is if OpenAI handed their evaluation frameworks to an independent body, gave that body full access to their systems without supervision, and committed to publishing whatever that body found, whether it made OpenAI look good or not. What would impress me is if the "voluntary commitments" in their governance post had teeth, real consequences if they're broken, enforced by someone other than the companies themselves.
The chain-of-thought monitoring work is a step in a useful direction. Watching what AI systems think before they act is obviously better than only watching what they do. The finding that models struggle to control their own reasoning processes is potentially important for making monitoring work long-term.
But let's not pretend that the company building the potentially dangerous thing has solved the problem of making sure it's safe. That's not how this works. That's never been how this works. And the fact that their research is technically sophisticated doesn't change the fundamental question of who gets to decide what "safe" means.
We've been here before with social media, with autonomous vehicles, with every platform that promised to self-regulate. The research papers were always impressive. The voluntary commitments were always sincere. And somehow we still ended up where we are.
Maybe AI will be different. Maybe these models really are easier to monitor than previous technologies. Maybe OpenAI's internal culture of safety is strong enough to overcome the normal incentive problems.
But I've seen this movie before, and I know how it usually ends.