Photo credit: Lottie animation by Centre Robotics (LottieFiles Free, used with credit).
Codex-1 is trained using reinforcement learning on real-world coding tasks. It generates code that "closely mirrors human style," adheres to instructions, and runs tests iteratively until they pass. That's the pitch from OpenAI's latest system card addendum, and honestly, it's not nothing.
But here's the thing: I've been covering tech since the 90s, and I've watched at least three generations of tools that were supposed to make programmers obsolete. CASE tools in the early 90s. Visual Basic's drag-and-drop revolution. Low-code platforms that were going to let "citizen developers" build enterprise software. Each time, the tools got absorbed into workflows, made some tasks easier, and programming jobs kept growing. Call me old-fashioned, but I'm skeptical this time is fundamentally different.
Let's start with the technical architecture, because OpenAI published a surprisingly detailed breakdown of what they call the "agent loop." The Codex CLI orchestrates models, tools, prompts, and performance using something called the Responses API. That's a lot of jargon, so let me translate.
Codex is essentially a sophisticated automation layer. You give it a task (fix this bug, add this feature, refactor this module), and it spins up a cloud sandbox, writes code, runs your test suite, reads the output, adjusts, and repeats until the tests pass. The "agent" part means it's not just generating code in one shot; it's iterating. Trying things. Failing and trying again.
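For readers who think better in code, here is a minimal sketch of that write, test, adjust cycle. To be clear, this is my own toy illustration, not OpenAI's implementation: `agent_loop`, `toy_propose`, and `toy_run_tests` are hypothetical names, and the "model" is a hard-coded stand-in that gets the answer wrong once before correcting itself.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class AgentResult:
    passed: bool
    attempts: int
    final_code: Optional[str]
    history: List[str] = field(default_factory=list)


def agent_loop(
    propose: Callable[[str, List[str]], str],
    run_tests: Callable[[str], Tuple[bool, str]],
    task: str,
    max_attempts: int = 5,
) -> AgentResult:
    """Propose code, run the tests, feed failures back, and repeat
    until the tests pass or the attempt budget runs out."""
    history: List[str] = []
    for attempt in range(1, max_attempts + 1):
        code = propose(task, history)
        ok, output = run_tests(code)
        if ok:
            return AgentResult(True, attempt, code, history)
        # Remember what went wrong so the next proposal can use it.
        history.append(output)
    return AgentResult(False, max_attempts, None, history)


def toy_propose(task: str, history: List[str]) -> str:
    # Hypothetical stand-in for a model call: the first guess is wrong,
    # and it only changes its answer after seeing one failure.
    if not history:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def toy_run_tests(code: str) -> Tuple[bool, str]:
    # Hypothetical stand-in for a sandboxed test run.
    namespace: dict = {}
    exec(code, namespace)
    if namespace["add"](2, 3) != 5:
        return False, "add(2, 3) returned the wrong value"
    return True, "1 passed"


result = agent_loop(toy_propose, toy_run_tests, "implement add")
print(result.passed, result.attempts)
```

The attempt budget is the design choice worth noticing: a loop that only stops on green tests needs some other way to give up.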
This is genuinely more sophisticated than what we had two years ago! The reinforcement learning approach, where the model was trained on actual coding tasks rather than just predicting the next token in a code file, seems to produce more coherent multi-step behavior. The system card mentions that codex-1 is "a version of OpenAI o3 optimized for software engineering," which suggests they've done substantial fine-tuning beyond the base reasoning model.
But (and this is a big but), we don't have hard numbers on success rates, task complexity limits, or failure modes. OpenAI's blog post is heavy on architecture diagrams and light on benchmarks. The system card addendum mentions that Codex "adheres precisely to instructions" and achieves "passing results," but passing on what? Simple unit tests? Complex integration scenarios? Edge cases that would trip up a junior developer? We don't know yet.
I keep thinking about autonomous vehicles when I read this stuff. Around 2016, 2017, every major automaker and a dozen startups were promising fully self-driving cars by 2020. The demos were impressive! Highway driving, lane changes, parking. The technology was real. But the gap between "works in controlled conditions" and "works reliably in the messy real world" turned out to be enormous.
We're now in 2025, and Waymo operates in a handful of geofenced cities. Cruise had that whole pedestrian-dragging incident and scaled way back. Tesla's "Full Self-Driving" still requires driver supervision. The technology improved dramatically, but the timeline predictions were off by a decade or more.
Coding agents feel similar to me. The demos are impressive (they always are). The underlying models have genuinely improved. But software engineering isn't just writing code that passes tests. It's understanding what the tests should be. It's knowing when requirements are ambiguous and need clarification. It's maintaining systems over years as requirements shift. It's dealing with legacy code that has no tests and undocumented assumptions baked in.
I'm not saying Codex can't help with any of this. I'm saying we should be humble about extrapolating from "runs tests until passing" to "replaces software engineers."
To be fair to OpenAI, they're not explicitly claiming Codex will replace programmers. The framing is more about augmentation, giving developers a "cloud-based coding agent" that handles routine tasks. That's a more modest and probably more accurate pitch.
The technical innovations seem real. According to the OpenAI blog post, the Responses API provides a unified interface for model calls, tool execution, and state management. The agent loop handles context accumulation (remembering what it tried before), error recovery, and performance optimization. This is solid engineering work.
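To make "context accumulation" and "error recovery" a little less abstract, here is a rough sketch of the general idea as I understand it. It does not use the Responses API and makes no claim about how Codex does it; `run_tool_safely` and the `divide` tool are hypothetical names I made up for illustration. The point is simply that a failed tool call becomes one more entry in the running context instead of crashing the loop.

```python
from typing import Callable, Dict, List


def run_tool_safely(
    tools: Dict[str, Callable[[str], str]],
    name: str,
    arg: str,
    context: List[dict],
) -> None:
    """Run one tool call and append the outcome to the shared context.
    Failures are recorded as observations rather than raised."""
    try:
        output = tools[name](arg)
        context.append({"tool": name, "ok": True, "output": output})
    except Exception as exc:
        # Error recovery: keep the error text so a later step can react to it.
        context.append(
            {"tool": name, "ok": False, "output": f"{type(exc).__name__}: {exc}"}
        )


# Hypothetical toy tool: divides 10 by the given integer string.
tools = {"divide": lambda s: str(10 / int(s))}

context: List[dict] = []
run_tool_safely(tools, "divide", "0", context)  # fails, gets recorded
run_tool_safely(tools, "divide", "2", context)  # succeeds on the retry

for entry in context:
    print(entry)
```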
The training approach is also interesting. Using reinforcement learning on real coding tasks, rather than just supervised learning on code repositories, should theoretically produce a model that's better at the iterative, trial-and-error nature of actual development. The claim that generated code "closely mirrors human style and PR preferences" suggests they've paid attention to the social aspects of software development, not just correctness.
But here's what remains unclear: how does this perform on codebases the model hasn't seen during training? What happens when the task requires understanding business context that isn't in the code? How does it handle security-sensitive code where a subtle bug could be catastrophic? These aren't gotcha questions. They're the real challenges that determine whether a tool is useful for production work or just impressive for demos.
I've been watching the discourse around AI coding tools, and there's a generational split that's kind of fascinating. A lot of younger founders and developers are extremely bullish, talking about 10x productivity gains and the imminent obsolescence of traditional software engineering. Some of the more experienced folks I talk to (mostly via email, because I'm old and prefer it) are more measured.
This isn't because the kids are naive and the veterans are wise. It's that different people are solving different problems. If you're building a greenfield app with modern tooling, clear requirements, and a well-tested architecture, these tools probably do help a lot. If you're maintaining a 15-year-old Java monolith with spotty documentation and tests that only pass on the third Tuesday of months with R in them, your mileage may vary.
The system card mentions that Codex was trained to "iteratively run tests until passing results are achieved." That's great if you have good tests. But a huge amount of production software has inadequate test coverage, and writing good tests is often harder than writing the code itself. An agent that optimizes for "tests pass" could potentially make things worse if the tests don't actually validate correct behavior.
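Here is a contrived example of what I mean, written by me for illustration (the function and test names are hypothetical). The discount function is wrong, but the one test that exists happens to use a price of 100, where subtracting 10 and subtracting 10 percent give the same answer. An agent whose only goal is a green test run would have no reason to look further.

```python
def apply_discount(price: float, percent: float) -> float:
    # Buggy on purpose: treats the percentage as an absolute amount.
    return price - percent


def weak_test() -> bool:
    # Passes by coincidence: 10% of 100 is also exactly 10.
    return apply_discount(100.0, 10.0) == 90.0


def strong_test() -> bool:
    # A second data point exposes the bug: 10% off 200 should be 180.
    return apply_discount(200.0, 10.0) == 180.0


print("weak test passes:", weak_test())
print("strong test passes:", strong_test())
```

The weak test passes and the strong one fails, with no change to the code under test. The only difference is whether the test actually pins down the intended behavior.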
Here's my prediction, and you can email me in two years to tell me I was wrong (my email's on the about page). Codex and tools like it will become standard parts of the development workflow. They'll handle boilerplate, routine refactoring, simple bug fixes, and test generation. Junior developers will use them heavily. Senior developers will use them selectively.
But the "end of programming" takes will age poorly. The bottleneck in software development has never been typing speed or even code generation. It's understanding requirements, making architectural decisions, debugging subtle issues, and maintaining systems over time. AI tools will help with some of this, but they won't eliminate the need for human judgment.
The historical parallel that keeps coming to mind is spreadsheets. VisiCalc and Lotus 1-2-3 didn't eliminate accountants; they made them more productive and changed what accounting work looked like. AI coding tools will probably do something similar for software engineering. That's a real transformation! But it's not the sci-fi scenario where we all become prompt engineers and the machines write all the code.
OpenAI has built something technically impressive with Codex. The agent loop architecture is well-designed, the training approach is sensible, and the product seems genuinely useful for certain tasks. But we're still in the early innings here, and the gap between "impressive demo" and "reliable production tool" has historically been wider than the hype suggests.
I could be wrong. Maybe this time really is different. But I've seen enough tech cycles to know that the boring, incremental reality usually beats the revolutionary predictions. We'll see.