OpenAI's GPT-5 safety strategy is extensive, but I've seen this playbook before
The company has released a mountain of documentation on how it's keeping its most powerful models in check. The real question is whether any of it matters when things go wrong.
I spent most of last week reading through OpenAI's various system cards and technical reports for the GPT-5 family, and I have to tell you, my eyes started glazing over somewhere around page forty. Not because the material is bad (it's actually pretty thorough), but because I've been reading documents like these since before most of the kids at OpenAI were born.
The company has been busy. GPT-5.2 dropped recently, along with a specialized coding model called GPT-5.1-Codex-Max, a pair of open-weight safety models, and a whole new approach to what they're calling "safe-completions." There are also updated safety metrics for GPT-5.1 Instant and Thinking. That's a lot of model names to keep track of, and frankly I'm not sure why they need this many variants, but what do I know.
The documentation is extensive. I'll give them that. But extensive documentation and actual safety are two different things, and I've seen this movie before.
The most interesting shift is something OpenAI calls "output-centric safety training," which they describe in a report titled From hard refusals to safe-completions. The basic idea is that instead of having the model just refuse to answer sensitive questions (which is annoying and often counterproductive), they're training it to provide helpful responses that are also safe.
This is, in a way, an admission that the old approach didn't work. Anyone who's used ChatGPT knows the frustration of asking a legitimate question about, say, chemistry or security vulnerabilities and getting a prim refusal. Meanwhile, actual bad actors just jailbreak the thing anyway. So OpenAI is trying to thread a needle here, making the model more useful for normal people while still preventing misuse.
Whether this actually works remains unclear. The company provides metrics but they're measuring against their own benchmarks, which is sort of like grading your own homework.
The GPT-5.1-Codex-Max system card is where things get genuinely interesting, and also genuinely concerning. This is a model designed to write and execute code autonomously, which means it can actually do things in the world rather than just talk about doing things.
OpenAI has implemented what they call "agent sandboxing" and "configurable network access," which basically means the model runs in a contained environment and you can control whether it can reach the internet. They've also done specialized safety training for prompt injections, which is when someone tries to trick the model into doing something it shouldn't by hiding instructions in the input.
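For readers who want something concrete, here is a minimal sketch of what "configurable network access" could look like in principle. To be clear, this is my own illustration and not OpenAI's implementation: the `NetworkPolicy` class, its `mode` values, and the `permits` method are all hypothetical names I made up. The only idea it demonstrates is deny-by-default with an optional allowlist of hosts.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass(frozen=True)
class NetworkPolicy:
    """Hypothetical policy object; NOT OpenAI's actual configuration."""

    mode: str = "off"  # assumed modes: "off", "allowlist", or "open"
    allowed_hosts: frozenset = field(default_factory=frozenset)

    def permits(self, url: str) -> bool:
        """Decide whether the sandboxed agent may contact this URL."""
        if self.mode == "open":
            return True
        if self.mode == "allowlist":
            # hostname is None for malformed URLs, which we treat as denied
            host = urlparse(url).hostname or ""
            return host in self.allowed_hosts
        # "off" or any unrecognized mode: deny by default
        return False


# Example: the agent may fetch packages, and nothing else.
policy = NetworkPolicy(mode="allowlist", allowed_hosts=frozenset({"pypi.org"}))
print(policy.permits("https://pypi.org/simple/"))       # allowed host
print(policy.permits("https://evil.example/exfil"))     # not on the list
```

Note what a check like this does not do: it only decides which requests get made through the front door. It says nothing about whether the sandbox itself holds.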
Call me old-fashioned, but I remember when we worried about software having too many permissions. Now we're building AI systems that can write their own code and execute it, and we're relying on sandboxing to keep things contained. The history of sandboxing in computer security is, shall we say, not encouraging! Sandboxes get escaped. It's basically what they're for, from an attacker's perspective.
OpenAI also announced something called SafetyKit, which uses GPT-5 to do content moderation and compliance enforcement. The pitch is that it's more accurate than "legacy safety systems," which I assume means the previous generation of content filters that everyone hated.
This is the self-driving car hype cycle all over again, honestly. The promise is always that AI will be better than humans at some task, and sometimes it is, but the failure modes are different and often worse. A human content moderator might miss something, but they're not going to suddenly start flagging everything with the word "grape" because of some weird training artifact.
I'm not saying SafetyKit is bad. I genuinely don't know yet, because the company didn't disclose exact accuracy figures in the announcement. I'm saying that replacing one set of problems with a different set of problems isn't the same as solving the problems.
Okay here's something I didn't expect to say: the gpt-oss-safeguard technical report describes something genuinely useful. These are open-weight models (meaning anyone can download and run them) that are specifically trained to evaluate content against a provided policy.
So instead of having a single set of rules baked into the model, you can tell the model "here's my company's content policy" and it will evaluate content against that specific policy. This is more flexible than previous approaches and it means different organizations can have different standards without needing to train their own models from scratch.
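The shape of that idea is easy to sketch. What follows is a toy, under loud assumptions: the prompt format, the `build_prompt` and `evaluate` functions, and especially `toy_model` are all hypothetical stand-ins of mine, not the gpt-oss-safeguard interface. The stand-in "model" just looks for terms listed after `banned:` in the policy, where a real model would actually read and reason about the policy text. The point is only that the policy arrives as an input alongside the content.

```python
from typing import Callable


def build_prompt(policy: str, content: str) -> str:
    """Pair the caller's policy with the content to judge (made-up format)."""
    return (
        "Evaluate the content against the policy. Answer VIOLATION or OK.\n\n"
        f"POLICY:\n{policy}\n\nCONTENT:\n{content}"
    )


def evaluate(policy: str, content: str, model: Callable[[str], str]) -> bool:
    """Return True if the model says the content violates the policy."""
    answer = model(build_prompt(policy, content))
    return answer.strip().upper().startswith("VIOLATION")


def toy_model(prompt: str) -> str:
    """Placeholder for a real policy-following model. Purely illustrative:
    it flags content containing any term listed after 'banned:'."""
    policy_part, content = prompt.split("\n\nCONTENT:\n", 1)
    if "banned:" not in policy_part:
        return "OK"
    terms = policy_part.split("banned:", 1)[1].split(",")
    hit = any(t.strip().lower() in content.lower() for t in terms if t.strip())
    return "VIOLATION" if hit else "OK"


# Same post, two organizations, two different verdicts.
post = "Join our casino night this Friday"
print(evaluate("banned: gambling, casino", post, toy_model))
print(evaluate("banned: spoilers", post, toy_model))
```

Swap `toy_model` for a call to whatever model you actually run and the surrounding code doesn't change, which is roughly the appeal.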
The 120-billion-parameter version is presumably more capable than the 20-billion-parameter version, though OpenAI compares both against the underlying gpt-oss models as a baseline rather than reporting against absolute benchmarks. The report is technical enough that I trust they're being honest about the limitations, which is more than I can say for some of the marketing material.
The GPT-5.1 system card addendum includes something I haven't seen before: evaluations for "mental health and emotional reliance." This is OpenAI acknowledging, finally, that people are forming relationships with these things.
I've been saying for years that the parasocial relationship problem with AI is going to be bigger than the misinformation problem, and it looks like the companies are starting to take it seriously. Or at least they're starting to measure it, which is the first step.
What they're actually doing about it, well, the addendum is light on details. They mention updated safety metrics but not specific interventions. This feels like an area where we're going to learn a lot of hard lessons over the next few years.
Look, I've been covering tech since the 90s and I've watched the safety theater cycle play out multiple times. Company releases powerful new technology. Company releases extensive documentation about how safe it is. Problems emerge anyway. Company releases more documentation. Rinse and repeat.
I'm not saying OpenAI is being dishonest. The documentation is real and the safety work appears genuine. The GPT-5.2 system card explicitly states that the models were trained on publicly available internet data plus partner data plus user-provided data, which is at least transparent about the training sources.
But documentation is not the same as safety. Process is not the same as outcomes. And we won't actually know how safe these systems are until they've been deployed at scale for a while and we can see what breaks.
The young founders I talk to are always surprised when I'm skeptical about safety claims. They think I'm being cynical. But I've just been around long enough to remember when nuclear power was going to be too cheap to meter, when social media was going to democratize information, when self-driving cars were going to eliminate traffic deaths by 2020.
Maybe this time is different. Maybe OpenAI has figured out how to make AI systems that are genuinely safe at scale. Either way, if you want to argue with me about it, my email's on the about page.