GPT-5.2-Codex
Posted by meetpateltech 1 day ago
Comments
Comment by mccoyb 1 day ago
Codex is so so good at finding bugs and little inconsistencies, it's astounding to me. Where Claude Code is good at "raw coding", Codex/GPT5.x are unbeatable in terms of careful, methodical finding of "problems" (be it in code, or in math).
Yes, it takes longer (quality, not speed please!) -- but the things that it finds consistently astound me.
Comment by sinatra 1 day ago
So much so that now I rely completely on Codex for code reviews and actual coding. I will pick higher quality over speed every day. Please don’t change it, OpenAI team!
Comment by F7F7F7 1 day ago
I’m in Claude Code so often (x20 Max) and I’m so comfortable with my environment setup with hooks (for guardrails and context) that I haven’t given Codex a serious shot yet.
Comment by SkyPuncher 1 day ago
It's often not that a different model is better (well, it still has to be a good model). It's that the different chat has a different objective - and will identify different things.
Comment by pietz 22 hours ago
While I prefer the way Claude speaks and writes code, there is no doubt that whatever Codex does is more thorough.
Comment by sinatra 1 day ago
Comment by shinycode 1 day ago
Comment by AmazingTurtle 1 day ago
Comment by ifwinterco 1 day ago
Comment by mccoyb 1 day ago
I consistently run into limits with CC (Opus 4.5) -- but even though Codex seems to be spending significantly more tokens, it just seems like the quota limit is much higher?
Comment by Computer0 1 day ago
Comment by Aurornis 1 day ago
Managing context goes a long way, too. I clear context for every new task and keep the local context files up to date with key info to get the LLM on target quickly
Comment by girvo 1 day ago
Aggressively recreating your context is still the best way to get the best results from these tools too, so it has a secondary benefit.
Comment by heliumtera 1 day ago
Comment by conradev 1 day ago
My take is that the overlap is strongest with engineering management. If you can learn how to manage a team of human engineers well, that translates to managing a team of agents well.
Comment by lukan 17 hours ago
None of that knowledge will become useless; only working around current limitations of agents will.
Comment by miek 1 day ago
Comment by neom 1 day ago
Comment by fragmede 22 hours ago
The other skill is in knowing exactly when to roll up your sleeves and do it the old fashioned way. Which things they're good/useful for, and which things they aren't.
Comment by theonething 1 day ago
Comment by Aurornis 19 hours ago
If I want to start a new task, I /clear and then tell it to re-read the CLAUDE.md document where I put all of the quick context: Description of the project, key goals, where to find key code, reminders for tools to use, and so on. I aggressively update this file as I notice things that it’s always forgetting or looking up. I know some people have the LLM update their context file but I just do it myself with seemingly better results.
Using /compact burns through a lot of your usage quota and retains a lot of things you may not need. Giving it directions like “starting a new task doing ____, only keep necessary context for that” can help, but hitting /clear and having it re-read a short context primer is faster and uses less quota.
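For illustration, a trimmed-down primer along these lines (the project details here are invented, just to show the shape of it):
    # CLAUDE.md (example; project details invented)
    ## Project
    Invoicing service. FastAPI backend in app/, React admin UI in web/.
    ## Key locations
    - app/services/billing.py: invoice calculations
    - tests/: pytest suite, run with `make test`
    ## Reminders
    - Use the Makefile targets, not raw pip/npm commands
    - Never edit migrations/ without asking first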
Comment by dionian 1 day ago
Comment by joquarky 1 day ago
I do wish that ChatGPT had a toggle next to each project file instead of having to delete and reupload to toggle or create separate projects for various combinations of files.
Comment by dionian 6 hours ago
Comment by hadlock 1 day ago
Comment by andai 1 day ago
So if you look at the total cost of running the benchmark, it's surprisingly similar to other models -- the higher price per token is offset by the significantly fewer tokens required to complete a task.
See "Cost to Run Artificial Analysis Index" and "Intelligence vs Output Tokens" here
https://artificialanalysis.ai/
...With the obligatory caveat that benchmarks are largely irrelevant for actual real world tasks and you need to test the thing on your actual task to see how well it does!
Comment by tejohnso 1 day ago
I don't understand why not. People pay for quality all the time, and often they're begging to pay for quality, it's just not an option. Of course, it depends on how much more quality is being offered, but it sounds like a significant amount here.
Comment by golly_ned 1 day ago
In my mind, they're hardly making any money compared to how much they're spending, and are relying on future modeling and efficiency gains to be able to reduce their costs but are pursuing user growth and engagement almost fully -- the more queries they get, the more data they get, the bigger a data moat they can build.
Comment by erik 1 day ago
All the money they keep raising goes to R&D for the next model. But I don't see how they ever get off that treadmill.
Comment by ithkuil 1 day ago
Or is it already saturated?
Comment by mbesto 20 hours ago
It almost certainly is not. Until we know what the useful life of NVIDIA GPUs is, it's impossible to determine whether this is profitable or not.
Comment by panarky 18 hours ago
The marginal cost of an API call is small relative to what users pay, and utilization rates at scale are pretty high. You don't need perfect certainty about GPU lifespan to see that the spread between cost-per-token and revenue-per-token leaves a lot of room.
And datacenter GPUs have been running inference workloads for years now, so companies have a good idea of rates of failure and obsolescence. They're not throwing away two-year-old chips.
Comment by mbesto 17 hours ago
How do you know this?
> You don't need perfect certainty about GPU lifespan to see that the spread between cost-per-token and revenue-per-token leaves a lot of room.
You can't even speculate about this spread without at least a rough idea of cost-per-token. Currently, the cost-per-token is total paper math.
> And datacenter GPUs have been running inference workloads for years now,
And inference resource intensity is a moving target: what happens if a new model comes out that requires 2x the resources?
> They're not throwing away two-year-old chips.
Maybe, but they'll be replaced either (a) by a higher-performance GPU that can deliver the same results with less energy, less physical density, and less cooling, or (b) when the extended support costs become financially untenable.
Comment by nimchimpsky 1 day ago
everyone seems to assume this, but it's not like it's a company run by dummies, or has dummy investors.
They are obviously making an awful lot of revenue.
Comment by alwillis 1 day ago
> everyone seems to assume this, but it's not like it's a company run by dummies, or has dummy investors.
It has nothing to do with their management or investors being "dummies" but the numbers are the numbers.
OpenAI has data center rental costs approaching $620 billion, which is expected to rise to $1.4 trillion by 2033.
Annualized revenue is expected to be "only" $20 billion this year.
$1.4 trillion is 70x current revenue.
So unless they execute their strategy perfectly, hit all of their projections, and neither the stock market nor the economy collapses, making a profit in the foreseeable future is highly unlikely.
[1]: "OpenAI's AI money pit looks much deeper than we thought. Here's my opinion on why this matters" - https://diginomica.com/openais-ai-money-pit-much-deeper-we-t...
Comment by Daneel_ 1 day ago
That's what the investors are chasing, in my opinion.
Comment by zozbot234 1 day ago
Comment by mbesto 20 hours ago
It's not hard to sell $10 worth of products if you spend $20. Profit is more important than revenue.
Comment by troupo 1 day ago
They are drowning in debt and go into more and more ridiculous schemes to raise/get more money.
--- start quote ---
OpenAI has made $1.4 trillion in commitments to procure the energy and computing power it needs to fuel its operations in the future. But it has previously disclosed that it expects to make only $20 billion in revenues this year. And a recent analysis by HSBC concluded that even if the company is making more than $200 billion by 2030, it will still need to find a further $207 billion in funding to stay in business.
https://finance.yahoo.com/news/openai-partners-carrying-96-b...
--- end quote ---
Comment by zozbot234 1 day ago
Comment by energy123 1 day ago
Comment by mattio 14 hours ago
I did find some slip-ups in 5.2: in a refactor of a client header where I removed two header properties, 5.2 forgot to remove those from the toArray method of the class. Was using 5.2 on medium (default).
Comment by stared 23 hours ago
Comment by baseonmars 1 day ago
Comment by mkagenius 1 day ago
Comment by hugh-avherald 11 hours ago
Comment by smoe 1 day ago
Comment by hatefulmoron 1 day ago
I only ever use it on the high reasoning mode, for what it's worth. I'm sure it's even less of a problem if you turn it down.
Comment by Foobar8568 1 day ago
Comment by nl 1 day ago
The OpenAI token limits seem more generous than the Anthropic ones too.
Comment by rbancroft 1 day ago
Comment by nl 1 day ago
I think it is widely accepted that Anthropic is doing very well in enterprise adoption of Claude Code.
In most of those cases that is paid via API key not by subscription so the business model works differently - it doesn't rely on low usage users subsidizing high usage users.
OTOH OpenAI is way ahead on consumer usage - which also includes Codex even if most consumers don't use it.
I don't think it matters - just make use of the best model at the best price. At the moment Codex 5.2 seems best at the mid-price range, while Opus seems slightly stronger than Codex Max (but too expensive to use for many things).
Comment by jvermillard 1 day ago
Comment by apitman 1 day ago
Comment by gnatolf 1 day ago
Comment by tgtweak 1 day ago
Comment by rane 1 day ago
Comment by jvermillard 1 day ago
Comment by kilroy123 1 day ago
Comment by mccoyb 1 day ago
Experiencing that repeatedly motivated me to use it as a reviewer (which another commenter noted), a role which it is (from my experience) very good at.
I basically use it to drive Claude Code, which will nuke the codebase with abandon.
Comment by kilroy123 1 day ago
Comment by JamesSwift 1 day ago
Comment by fragmede 22 hours ago
I had it whip this up to try and avoid this, while still running it in yolo mode (which is still not recommended).
https://gist.github.com/fragmede/96f35225c29cf8790f10b1668b8...
Comment by baq 1 day ago
Comment by johnnyfived 1 day ago
Comment by garbagecoder 1 day ago
Comment by josephg 1 day ago
I asked codex to take a look. It took a couple minutes, but it managed to track the issue down using a bunch of tricks I've never seen before. I was blown away. In particular, it reran qemu with different flags to get more information about a CPU fault I couldn't see. Then got a hex code of the instruction pointer at the time of the fault, and used some tools I didn't know about to map that pointer to the lines of code which were causing the problem. Then took a read of that part of the code and guessed (correctly) what the issue was. I guess I haven't worked with operating systems much, so I haven't seen any of those tricks before. But, holy cow!
It's tempting to just accept the help and move on, but today I want to go through what it did in detail, including all the tools it used, so I can learn to do the same thing myself next time.
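For anyone curious about the address-to-source step, it boils down to addr2line against a binary built with debug info. A minimal sketch in Python (the address and paths are made up; substitute the fault address from your crash report and your own kernel ELF compiled with -g):
    import subprocess

    # Both values are illustrative, not from a real crash.
    fault_ip = "0x80201a3c"
    kernel_elf = "build/kernel.elf"

    out = subprocess.run(
        ["addr2line", "-e", kernel_elf, "-f", "-C", fault_ip],
        capture_output=True, text=True, check=True,
    )
    # Prints the function name and file:line for the faulting instruction,
    # e.g. "page_fault_handler" / "src/mm/fault.c:142".
    print(out.stdout)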
Comment by varjag 16 hours ago
Comment by heliumtera 1 day ago
edit: username joke, don't get me banned
Comment by echelon 1 day ago
(unrelated, but piggybacking on requests to reach the teams)
If anyone from OpenAI or Google is reading this, please continue to make your image editing models work with the "previz-to-render" workflow.
Image edits should strongly infer pose and blocking as an internal ControlNet, but should be able to upscale low-fidelity mannequins, cutouts, and plates/billboards.
OpenAI kicks ass at this (but could do better with style controls - if I give a Midjourney style ref, use it) :
https://imgur.com/gallery/previz-to-image-gpt-image-1-x8t1ij...
https://imgur.com/a/previz-to-image-gpt-image-1-5-3fq042U
Google fails the tests currently, but can probably easily catch up :
Comment by tananaev 1 day ago
One surprising thing that Codex helped with is procrastination. I'm sure many people have had this feeling: you have some big task and you don't quite know where to start. Just send it to Codex. It might not get it right, but it's almost always a good starting point that you can quickly iterate on.
Comment by jackschultz 1 day ago
Same thought here, both about the procrastination of not knowing where to start and about getting stuck in the middle and not knowing where to go. That literally never happens anymore. You have discussions with it to do the planning and weigh different implementation options, you get to the end with a good design description, and then what's the point of writing the code yourself? With that design, it's going to write it quickly and match what was agreed.
Comment by nextaccountic 1 day ago
(here I am remembering a time I had no computer and would program data structures in OCaml with pen and paper, then would go to university the next day to try it. Often times it worked the first try)
Comment by jackschultz 1 day ago
> Emil concluded his article like this:
> JustHTML is about 3,000 lines of Python with 8,500+ tests passing. I couldn’t have written it this quickly without the agent.
> But “quickly” doesn’t mean “without thinking.” I spent a lot of time reviewing code, making design decisions, and steering the agent in the right direction. The agent did the typing; I did the thinking.
> That’s probably the right division of labor.
> I couldn’t agree more. Coding agents replace the part of my job that involves typing the code into a computer. I find what’s left to be a much more valuable use of my time.
Comment by culopatin 1 day ago
Comment by manmal 1 day ago
Comment by gaigalas 1 day ago
If you leave an agent for hours trying to increase coverage by percentage without further guiding instructions you will end up with lots of garbage.
In order to achieve this, you need several distinct loops. One that creates tests (there will be garbage), one that consolidates redundant tests, one that parametrizes repetitive tests, and so on.
Agents create redundant tests for all sorts of reasons. Maybe they're trying a hard to reach line and leave several attempts behind. Or maybe they "get creative" and try to guess what is uncovered instead of actually following the coverage report, etc.
Less capable models are actually better at doing this. They're faster, don't "get creative" with weird ideas mid-task, and cost less. Just make them work one test at a time. Spawn, do one test that verifiably increases overall coverage, exit. Once you reach a threshold, start the consolidating loop: pick a redundant pair of tests, consolidate, exit. And so on...
Of course, you can use a powerful model and babysit it as well. A few disambiguating questions and interruptions will guide it well. If you want truly unattended runs though, it's damn hard to get stable results.
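A rough sketch of what those two loops can look like, assuming a coverage.py-instrumented pytest suite and some non-interactive agent CLI. The `codex exec` call is just a placeholder; swap in whatever harness you actually use:
    import subprocess
    import xml.etree.ElementTree as ET

    def coverage_percent():
        # Run the suite under coverage and read the overall line rate.
        subprocess.run(["coverage", "run", "-m", "pytest", "-q"], check=False)
        subprocess.run(["coverage", "xml"], check=True)
        root = ET.parse("coverage.xml").getroot()
        return float(root.get("line-rate")) * 100

    def run_agent(prompt):
        # Placeholder invocation; substitute your own agent CLI/harness.
        subprocess.run(["codex", "exec", prompt], check=False)

    TARGET = 85.0

    # Loop 1: one spawn, one verifiable test, exit.
    before = coverage_percent()
    while before < TARGET:
        run_agent("Read coverage.xml, add exactly ONE test for an uncovered "
                  "line, run the suite, then stop.")
        after = coverage_percent()
        if after <= before:
            break  # the agent didn't move the needle; stop and inspect manually
        before = after

    # Loop 2: consolidation pass, again one small step per spawn.
    for _ in range(10):
        run_agent("Find ONE pair of redundant or near-duplicate tests and merge "
                  "or parametrize them without reducing coverage, then stop.")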
Comment by manmal 11 hours ago
Comment by tlarkworthy 1 day ago
Comment by elbear 1 day ago
Comment by jackschultz 19 hours ago
People see LLMs and tons of tests written in the same sentence and think that shows how models love writing pointless tests, rather than realizing that the tests are standard, human-written ones that show the code the model wrote is validated by a currently trusted source.
It shows that writing comments that give human readers the right context is _very_ similar to how we need to interact with LLMs. And if we fail to communicate with humans, clearly we're going to fail with models.
Comment by elbear 1 hour ago
Comment by wahnfrieden 1 day ago
Skill issue... And perhaps the wrong model + harness
Comment by scottyah 1 day ago
Comment by zamadatix 1 day ago
The only hint you can dig out is whether they might have feasibility limits around it. E.g. "I can fly first class all the time (if I limit the number of flights and spend an unreasonable portion of my wealth on tickets)" is typically a less useful interpretation than "I can fly first class all the time (frequently, without concern, because I'm very well off)", but you have to figure out which one they are trying to say (which isn't always easy).
Comment by wahnfrieden 1 day ago
Comment by girvo 1 day ago
It's so fascinating to me that the thread above this one on this page says the opposite, and the funniest thing is I'm sure you're both right. What a wild world we live in; I'm not sure how one is supposed to objectively analyse the performance of these things.
Comment by AstroBen 1 day ago
A full week of that should give you a pretty good idea
Maybe some models just suit particular styles of prompting that do or don't match what you're doing
Comment by ssl-3 1 day ago
This is very similar to how humans behave. Most people are great at a small number of things, and there's always a larger set of things that we may individually be pretty terrible at.
The bots are the same way, except: instead of billions of people who each have their own skillsets and personalities, we've got a small handful of distinct bots from different companies.
And of course: Lies.
When we ask Bob or Lisa for help with a thing that they don't understand very well, they usually will try to set reasonable expectations. ("Sorry, ssl-3, I don't really understand ZFS very well. I can try to get the SLOG -- whatever that is -- to work better with this workload, but I can't promise anything.")
Bob or Lisa may figure it out. They'll gather up some background and work on it, bring in outside help if that's useful, and probably tread lightly. This will take time. But they probably won't deliberately lie [much] about what they expect from themselves.
But when the bot is asked to do a thing that it doesn't understand very well, it's chipper as fuck about it. ("Oh yeah! Why sure I can do that! I'm well-versed in -everything-! [Just hold my beer and watch this!]")
The bot will then set forth to do the thing. It might fuck it all up with wild abandon, but it doesn't care: It doesn't feel. It doesn't understand expectations. Or cost. Or art. Or unintended consequences.
Or, it might get it right. Sometimes, amazingly-right.
But it's impossible to tell going in whether it's going to be good, or bad: Unlike Bob or Lisa, the bot always heads into a problem as an overly-ambitious pack of lies.
(But the bot is very inexpensive to employ compared to Bob or Lisa, so we use the bot sometimes.)
Comment by 9dev 1 day ago
Like, do you run a proper experiment where you hand the same task to multiple models several times and compare the results? Not snark by the way, I’m asking in earnest how you pick one model over another.
Comment by embedding-shape 1 day ago
This is what I do. I have a little TUI that fires off Claude Code, Codex, Gemini, Qwen Coder and AMP in separate containers for most tasks I do (although I've started to use AMP less and less), and returns the last message of what they replied and/or a git diff of what exactly they did. Then I compare them side by side. If all of them got something wrong, I update the prompt and fire them off again. Always start from zero, and always include the full context of what you're doing in the first message; they're all non-interactive sessions.
Sometimes I do 3x Codex instead of different agents, just to double-check that all of them would do the same thing. If they go off and do different things from each other, I know the initial prompt isn't specific/strict enough, and again iterate.
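A bare-bones version of that flow, using throwaway git worktrees rather than containers. The non-interactive agent invocations (`codex exec`, `claude -p`) are assumptions; check what your installed CLIs actually support and adjust:
    import concurrent.futures
    import os
    import subprocess
    import tempfile

    # Agents to race; commands are assumptions, not guaranteed flags.
    AGENTS = {
        "codex": ["codex", "exec"],
        "claude": ["claude", "-p"],
    }

    def run_one(name, base_cmd, prompt):
        # Each agent works in its own disposable git worktree so edits don't collide.
        workdir = os.path.join(tempfile.mkdtemp(prefix="agent-race-"), name)
        subprocess.run(["git", "worktree", "add", "--detach", workdir], check=True)
        subprocess.run(base_cmd + [prompt], cwd=workdir, check=False)
        diff = subprocess.run(["git", "diff"], cwd=workdir,
                              capture_output=True, text=True).stdout
        return name, workdir, diff

    def compare(prompt):
        with concurrent.futures.ThreadPoolExecutor() as pool:
            futures = [pool.submit(run_one, n, cmd, prompt) for n, cmd in AGENTS.items()]
            for f in concurrent.futures.as_completed(futures):
                name, workdir, diff = f.result()
                print(f"===== {name} ({workdir}) =====")
                print(diff or "(no changes)")
        # Remember to `git worktree remove` the directories afterwards.

    if __name__ == "__main__":
        compare("Full context up front: <task description, constraints, file hints>")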
Comment by dotancohen 1 day ago
Honestly, I'd love to try that. My Gmail username is the same as my HN username.
Comment by nl 1 day ago
- Claude
- Codex
- Kilocode
- Amp
- Mistral Vibe
Very vibe coded though.
Comment by handfuloflight 1 day ago
Comment by versteegen 1 day ago
Comment by energy123 1 day ago
GPT-5.2 Thinking (with extended thinking selected) is significantly better in my testing on software problems with 40k context.
I attribute this to thinking time, with GPT-5.2 Thinking I can coax 5 minutes+ of thinking time but with Gemini 3.0 Pro it only gives me about 30 seconds.
The main problem with the Plus sub in ChatGPT is you can't send more than 46k tokens in a single prompt, and attaching files doesn't help either because the VM blocks the model from accessing the attachments if there's ~46k tokens already in the context.
Comment by enraged_camel 1 day ago
Gemini 3 and Gemini 3 Flash identified the root cause and nailed the fix. GPT 5.1 Codex misdiagnosed the issue and attempted a weird fix despite my prompt saying “don’t write code, simply investigate.”
I run these tests regularly, and Codex has not impressed me. Not even once. At best it’s on par, but most of the time it just fails miserably.
Languages: JavaScript, Elixir, Python
Comment by paustint 1 day ago
The codex agent ran for a long time and created and executed a bunch of python scripts (according to the output thinking text) to compare the translations and found a number of possible issues. I am not sure where the scripts were stored or executed, our project doesn't use python.
Then I fed the output of the issues codex found to claude for a second "opinion". Claude said that the feedback was obviously from someone that knew the native language very well and agreed with all the feedback.
I was really surprised at how long Codex was thinking and analyzing - probably 10 minutes. (This was ~1+mo ago, I don't recall exactly what model)
Claude is pretty decent IMO - amp code is better, but seems to burn through money pretty quick.
Comment by tmikaeld 1 day ago
Comment by thek3nger 1 day ago
Comment by freedomben 1 day ago
So yeah, I use codex a lot and like it, but it has some really bad blind spots.
Comment by jillesvangurp 1 day ago
Heh. It's about the same as an efficient compilation or integration testing process that is long enough to let it do its thing while you go and browse Hacker News.
IMHO, making feedback loops faster is going to be key to improving success rates with agentic coding tools. They work best if the feedback loop is fast and thorough. So compilers, good tests, etc. are important. But it's also important that all of that runs quickly. It's almost an even split between reasoning and tool invocations for me. And it is rather trigger happy with the tool invocations, wasting a lot of time finding out that a naive approach was indeed naive before fixing it in several iterations. Good instructions help (Agents.md).
Focusing attention on just making builds fast and solid is a good investment in any case. Doubly so if you plan on using agentic coding tools.
Comment by wahnfrieden 1 day ago
The key is to adapt to this by learning how to parallelize your work, instead of the old way of doing things where devs are expected to focus on and finish one task at a time (per lean manufacturing principles).
I find now that painfully slow builds are no longer a serious issue for me. Because I'm rotating through 15-20 agents across 4-6 projects so I always have something valuable to progress on. One of these projects and a few of these agents are clear priorities I return to sooner than the others.
Comment by anabis 1 day ago
The Roomba effect is real. The AI models do all the heavy implementation work, and when one asks me to set up and execute tests, I feel obliged to get to it ASAP.
Comment by BinaryIgor 1 day ago
Comment by cmrdporcupine 19 hours ago
On its own, as sole author, I find Codex overcomplicates things. It will riddle your code with unnecessary helper functions and objects and pointless abstractions.
It is however useful for doing a once over for code review and finding the things that Claude rushed through.
Comment by mohsen1 22 hours ago
SWE-Bench (Pro / Verified)
Model | Pro (%) | Verified (%)
--------------------+---------+--------------
GPT-5.2-Codex | 56.4 | ~80
GPT-5.2 | 55.6 | ~80
Claude Opus 4.5 | n/a | ~80.9
Gemini 3 Pro | n/a | ~76.2
And for terminal workflows, where agentic steps matter: Terminal-Bench 2.0
Model | Score (%)
--------------------+-----------
Claude Opus 4.5 | ~60+
Gemini 3 Pro | ~54
GPT-5.2-Codex | ~47
So yes, GPT-5.2-Codex is good, but when you put it next to its real competitors:
- Claude is still ahead on strict coding + terminal-style tasks
- Gemini is better for huge context + multimodal reasoning
- GPT-5.2-Codex is strong but not clearly the new state of the art across the board
It feels a bit odd that the page only shows internal numbers instead of placing them next to the other leaders.
Comment by qwesr123 21 hours ago
And I don't think your Terminal-Bench 2.0 scores are accurate. Per the latest benchmarks, Opus 4.5 is at 59% and GPT-5.2-Codex is at 64%.
See the charts at the bottom of https://marginlab.ai/blog/swe-bench-deep-dive/ and https://marginlab.ai/blog/terminal-bench-deep-dive/
Comment by scellus 21 hours ago
And then there's the part of models that is hard to measure. Opus has some sort of HAL-like smoothness I don't see in other models, but meanwhile, I haven't tried gpt-5.2 for coding yet. (Neither Gemini 3 Pro; I'm not claiming superiority of Opus, just that something in practical usability is hard to measure.)
Comment by thedougd 18 hours ago
Comment by blitz_skull 19 hours ago
My rule of thumb with OpenAI is, if they don’t publish their benchmarks beside Anthropic’s numbers it’s because they’re still not caught up.
So far my rule of thumb has held true.
Comment by shanev 1 day ago
My only gripe is I wish they'd publish Codex CLI updates to homebrew the same time as npm :)
Comment by SamDc73 1 day ago
Claude still tends to add "fluff" around the solution and over-engineer, not that the code doesn't work, it's just that it's ugly
Comment by lemming 1 day ago
Comment by mrcwinn 1 day ago
Comment by allovertheworld 1 day ago
Comment by kordlessagain 1 day ago
It ships with 300+ MCP tools (crawl, Google search, Gmail/GCal/GDrive, Slack, scheduling, web indexing, embeddings, transcription, and more). Many came from tools I originally built for Claude Desktop—OpenAI’s MCP has been stable across 20+ versions so I prefer it.
I will note I usually run this in Danger mode, but because it runs in a container it doesn't have access to ENVs I don't want it messing with, and I keep it in a directory I'm OK with it changing or poking about in.
Headless browser setup for the crawl tools: https://github.com/DeepBlueDynamics/gnosis-crawl.
My email is in my profile if anyone needs help.
Comment by marktolson 1 day ago
Comment by kordlessagain 18 hours ago
Those scripts are for running the docker command with all the ENV vars and settings. Whatever does that does NOT have to be Powershell if you don't want it to be.
Comment by cachius 22 hours ago
Comment by kordlessagain 18 hours ago
Comment by derrasterpunkt 1 day ago
Comment by kordlessagain 18 hours ago
Comment by pertymcpert 1 day ago
Comment by kordlessagain 18 hours ago
Comment by freedomben 1 day ago
Comment by tptacek 1 day ago
Comment by prettyblocks 1 day ago
Comment by neom 1 day ago
Comment by prettyblocks 18 hours ago
Comment by neom 18 hours ago
Comment by freedomben 1 day ago
Comment by mapontosevenths 1 day ago
> "In parallel, we’re piloting invite-only trusted access to upcoming capabilities and more permissive models for vetted professionals and organizations focused on defensive cybersecurity work. We believe that this approach to deployment will balance accessibility with safety."
Comment by hiAndrewQuinn 1 day ago
Comment by JacobAsmuth 1 day ago
Comment by Uehreka 1 day ago
Comment by whimsicalism 1 day ago
Comment by freedomben 1 day ago
Comment by flir 1 day ago
Comment by whimsicalism 1 day ago
Comment by abigail95 1 day ago
Comment by ACCount37 1 day ago
Sure, you can also use the same tools to find attack surfaces preemptively, but let's be honest, most wouldn't.
Comment by bilbo0s 1 day ago
Scary good.
But the good ones are not open. It's not even a matter of money. I know at OpenAI they are invite only for instance. Pretty sure there's vetting and tracking going on behind those invites.
Comment by artursapek 1 day ago
Comment by tptacek 1 day ago
Comment by hhh 1 day ago
Comment by nikanj 1 day ago
Comment by user34283 1 day ago
So far it's rarely been the leading frontier model, but at least it's not full of dumb guardrails that block many legitimate use cases in order to prevent largely imagined harm.
You can also use Grok without sign in, in a private window, for sensitive queries where privacy matters.
A lot of liberals badmouth the model for obviously political reasons, but it's doing an important job.
Comment by julienfr112 1 day ago
Comment by aaa_aaa 1 day ago
Comment by alecco 18 hours ago
I really dislike Altman, but honestly Codex is far superior. And their models are better or at least as good. And they are much better for backend/low level/C++/CUDA IME. Codex CLI is simpler and lean. Perhaps because Codex was rewritten in Rust earlier this year while CC is still in TypeScript.
Comment by ajpikul 10 hours ago
Comment by PostOnce 1 day ago
I watched a man struggle for 3 hours last night to "prove me wrong" that Gemini 3 Pro can convert a 3000 line C program to Python. It can't do it. It can't even do half of it, it can't understand why it can't, it's wrong about what failed, it can't fix it when you tell it what it did wrong, etc.
Of course, in the end, he had an 80 line Python file that didn't work, and if it did work, it's 80 lines, of course it can't do what the 3000 line C program is doing. So even if it had produced a working program, which it didn't, it would have produced a program that did not solve the problems it was asked to solve.
The AI can't even guess that its 80 line output is probably wrong just based on the size of it, like a human instantly would.
AI doesn't work. It hasn't worked, and it very likely is not going to work. That's based on the empirical evidence and repeatable tests we are surrounded by.
The guy last night said that sentiment was copium, before he went on a 3 hour odyssey that ended in failure. It'll be sad to watch that play out in the economy at large.
Comment by sedawkgrep 19 hours ago
Recently I tried working with both ChatGPT and Gemini (AI Studio) to take a really badly written PHP website and refactor it into MVC components. I've worked strictly through the web UI as this only involved a few files.
While both provided great guidance around how to approach disentangling this monolithic code, GPT failed miserably, generating code that was syntactically incorrect, and then doubling-down on it by insisting that it was correct and that errors were faults lying elsewhere. It literally generated code that lacked closing parenthesis and brackets.
In contrast, Gemini generated a perfectly working MVC version from the start. In both instances I did intentionally keep it on track to only separate the code into MVC and NOT to optimize anything, but it worked the first try. I've then taken it through subsequent refactorings and it's done superbly in that role.
So I can't speak to how well this works for large code bases, much less agentically. (my initial, very focused MVC refactor was about 1300 lines.) But when giving it a very specific task with strict guidance and rules, my results with Gemini were fantastic.
Comment by user34283 1 day ago
I'm not sure what your "empirical evidence and repeatable tests" is supposed to be. The AI not successfully converting a 3000 line C program to Python, in a test you probably designed to fail, doesn't strike me as particularly relevant.
Also, I suspect that AI could most likely guess that 80 lines of Python aren't correctly replicating 3000 lines of C, if you prompted it correctly.
Comment by discreteevent 23 hours ago
For some definition of "works". This seems to be yours:
> I'd go further and say vibe coding it up, testing the green case, and deploying it straight into the testing environment is good enough. The rest we can figure out during testing, or maybe you even have users willing to beta-test for you.
> This way, while you're still on the understanding part and reasoning over the code, your competitor already shipped ten features, most of them working.
> Ok, that was a provocative scenario. Still, nowadays I am not sure you even have to understand the code anymore. Maybe having a reasonable belief that it does work will be sufficient in some circumstances.
Comment by user34283 22 hours ago
It's interesting how this workflow appears to almost offend some users here.
I get it, we all don't like sloppy code that does not work well or is not maintainable.
I think some developers will need to learn to give control away rather than trying to understand every line of code in their project - depending of course on the environment and use case.
Also worth keeping in mind that even if you think you understand all the code in your project - as far as that is even possible in larger projects with multiple developers - there are still bugs anyway. And a few months later, your memory might be fuzzy in any case.
Comment by stavros 22 hours ago
Given my personal experience, and how much more productive AI has made me, it seems to me that some people are just using it wrong. Either that, or I'm delusional, and it doesn't actually work for me.
Comment by SamPatt 18 hours ago
It's not hard to spend a few hours testing out models / platforms and learning how to use them. I would argue this has been true for a long time, but it's so obviously true now that I think most of those people are not acting in good faith.
Comment by blibble 20 hours ago
Comment by hombre_fatal 19 hours ago
The naysaying seems to mostly come from people coping with the writing they see on the wall with their anecdote about some goalpost-moving challenge designed for the LLM to fail (which they never seem to share with us). And if their low effort attempt can't crack LLMs, then nobody can.
It reminds me of HN ten years ago where you'd still run into people claiming that Javascript is so bad that anybody who thinks they can create good software with it is wrong (trust them, they've supposedly tried). Acting like they're so preoccupied with good engineering when it's clearly something more emotional.
Meanwhile, I've barely had to touch code ever since Opus 4.5 dropped. I've started wondering if it's me or the machine that's the background agent. My job is clearly shifting into code review and project management while tabbing between many terminals.
As LLMs keep improving, there's a moment where it's literally more work to find the three files you need to change than to just instruct someone to do it, and what changes the game is when you realize it's creating output you don't even need to edit anymore.
Comment by nozzlegear 16 hours ago
Curiously enough, those people are still around and writing good software without javascript. And I say that as someone who generally enjoys modern JS.
> Meanwhile, I've barely had to touch code ever since Opus 4.5 dropped. I've started wondering if it's me or the machine that's the background agent. My job is clearly shifting into code review and project management while tabbing between many terminals.
Why not cut out the middleman and have Opus 4.5 do the code review and project management too?
Comment by blibble 18 hours ago
> with their anecdote about some goalpost-moving challenge designed for the LLM to fail (which they never seem to share with us).
literally what the boosters do on every single post!
"no no, the top model last week was complete dogshit, but this new one is world changing! no you can't see my code!"
10/10 for the best booster impression I've seen this year!
Comment by user34283 20 hours ago
Comment by blibble 19 hours ago
faster, fewer bugs, better output
leaving more time for shitposting
Comment by wickedsight 1 day ago
Comment by tptacek 1 day ago
Most of the time spent in vulnerability analysis is automatable grunt work. If you can just take that off the table, and free human testers up to think creatively about anomalous behavior identified for them, you're already drastically improving effectiveness.
Comment by mvkel 1 day ago
We've come a long way since gpt-3.5, and it's rewarding to see people who are willing to change their cached responses
Comment by scottyah 1 day ago
Comment by aunty_helen 10 hours ago
They're getting slaughtered by the more focused Anthropic team who decided they will have the best coding model.
Given how bad things have been going recently (5.2 chat bombing and being behind, opus being the code GOAT, G team dominating media, Grok existing and meta / the Chinese dominating opensource), they should niche to the general purpose llm before that's all they're left with by market forces.
I'm still pretty sour they didn't have the vision at the time to build an ecosystem around them and instead went for those building the ecosystem on them.
Comment by CjHuber 1 day ago
Especially in the CLI, it seems it's way too eager to start writing code; nothing can stop it, not even the best Agents.md.
Asking it a question or telling it to check something doesn‘t mean it should start editing code, it means answer the question. All models have this issue to some degree, but codex is the worst offender for me.
Comment by w-m 1 day ago
Comment by JeremyNT 1 day ago
I see people gushing over these codex models but they seem worse than the big gpt models in my own actual use (i.e. I'll give the same prompt to gpt-5.1 and gpt-5.1-codex and codex will give me functional but weird/ugly code, whereas gpt-5.1 code is cleaner)
Comment by embedding-shape 1 day ago
I feel the same. CodexTheModel (why have two things named the same way?!) is a good deal faster than the other models, and probably on the "fast/accuracy" scale it sits somewhere else, but most code I want to be as high quality as possible, and the base models do seem better at that than CodexTheModel.
Comment by drdrey 18 hours ago
Comment by 6thbit 1 day ago
What has somewhat worked for me atm is to ask it to only update an .md plan file and act on that file alone; that seems to appease its eagerness to write files.
Comment by flir 1 day ago
Comment by nowittyusername 1 day ago
Comment by NitpickLawyer 1 day ago
Yeah, this makes sense. There's a fine line between good enough to do security research and good enough to be a prompt kiddie on steroids. At the same time, aligning the models for "safety" would probably make them worse overall, especially when dealing with security questions (i.e. analyse this code snippet and provide security feedback / improvements).
At the end of the day, after some KYC I see no reason why they shouldn't be "in the clear". They get all the positive news (i.e. our gpt666-pro-ultra-krypto-sec found a CVE in openBSD stable release), while not being exposed to tabloid style titles like "a 3 year old asked chatgpt to turn on the lights and chatgpt hacked into nasa, news at 5"...
Comment by simianwords 1 day ago
Comment by wahnfrieden 1 day ago
Comment by crazylogger 1 day ago
Comment by larrymcp 1 day ago
> GPT‑5.2-Codex has stronger cybersecurity capabilities than any model we’ve released so far. These advances can help strengthen cybersecurity at scale, but they also raise new dual-use risks that require careful deployment.
I'm curious what they mean by the dual-use risks.
Comment by dpoloncsak 1 day ago
Comment by runtimepanic 1 day ago
Comment by pixl97 1 day ago
Comment by throwaway127482 1 day ago
Comment by whynotminot 1 day ago
Comment by baq 1 day ago
Comment by tgtweak 1 day ago
Comment by szundi 1 day ago
Comment by k_bx 1 day ago
Comment by lacoolj 1 day ago
Comment by cthalupa 1 day ago
Humans are very suggestible.
Comment by k_bx 1 day ago
Comment by cthalupa 1 day ago
It was indeed!
> very likely I just read it
I find myself doing it frequently. I'm even sure that's why I used indeed above after reading your comment - it wasn't intentional!
Comment by k_bx 1 day ago
Comment by lacoolj 1 day ago
curious if this, or coincidence
Comment by enraged_camel 1 day ago
Comment by exacube 1 day ago
"The most advanced agentic coding model for professional software engineers"
Comment by koakuma-chan 1 day ago
Comment by cj 1 day ago
Comment by koakuma-chan 1 day ago
Comment by HarHarVeryFunny 1 day ago
If you want to compare the weakest models from both companies then Gemini Flash vs GPT Instant would seem to be best comparison, although Claude Opus 4.5 is by all accounts the most powerful for coding.
In any case, it will take a few weeks for any meaningful test comparisons to be made, and in the meantime it's hard not to see any release from OpenAI since they announced "Code Red" (aka "we're behind the competition") a few days ago as more marketing than anything else.
Comment by koakuma-chan 1 day ago
Gemini 3 Pro is a great foundation model. I use as a math tutor, and it's great. I previously used Gemini 2.5 Pro as a math tutor, and Gemini 3 Pro was a qualitative improvement over that. But Gemini 3 Pro sucks at being a coding agent inside a harness. It sucks at tool calling. It's borderline unusable in Cursor because of that, and likely the same in Antigravity. A few weeks ago I attended a demo of Antigravity that Google employees were giving, and it was completely broken. It got stuck for them during the demo, and they ended up not being able to show anything.
Opus 4.5 is good, and faster than GPT-5.2, but less reliable. I use it for medium difficulty tasks. But for anything serious—it's GPT 5.2
Comment by postalcoder 1 day ago
Just yesterday, in Antigravity, while applying changes, it deleted 500 lines of code and replaced it with a `<rest of code goes here>`. Unacceptable behavior in 2025, lol.
Comment by misiti3780 1 day ago
Comment by HarHarVeryFunny 19 hours ago
Comment by koakuma-chan 18 hours ago
Comment by Mkengin 1 day ago
Comment by BeetleB 1 day ago
Comment by HarHarVeryFunny 19 hours ago
Opus 4.5 seems different - Anthropic's best coding model, but also their frontier general purpose model.
Comment by walthamstow 1 day ago
Comment by Tostino 1 day ago
Comment by koakuma-chan 1 day ago
Comment by speedgoose 1 day ago
Comment by nunodonato 1 day ago
Comment by koakuma-chan 1 day ago
Comment by dkdcio 1 day ago
Comment by koakuma-chan 1 day ago
Comment by dkdcio 1 day ago
again I’m not saying Codex is worse, they’re just different and claiming the only one you actively use is the best is a stretch
edit: also FWIW, I initially dismissed Claude Code at launch, then loved Codex when it released. never really liked Cursor. now I primarily use Claude Code given I found Codex slow and less “reliable” in a sense, but I try to try all 3 and keep up with the changes (it is hard)
Comment by koakuma-chan 1 day ago
Such as?
> again I’m not saying Codex is worse, they’re just different and claiming the only one you actively use is the best is a stretch
I am testing all models in Cursor.
> I initially dismissed Claude Code at launch, then loved Codex when it released. never really liked Cursor
I also don't actually like Cursor. It's a VSCode fork, and a mediocre harness. I am only using it because my company refuses to buy anything else, because Cursor has all models, and it appears to them that it's not worth having anything else.
Comment by dkdcio 1 day ago
> Such as?
changelog is here: https://github.com/anthropics/claude-code/blob/main/CHANGELO...
glhf
btw you started this thread with pure vibes, no evidence:
> I can confirm GPT 5.2 is better than Gemini and Claude. GPT 5.2 Codex is probably even better.
I’m saying you’re wrong. N=2, 1 against 1, one of us is making a much less bold claim
Comment by koakuma-chan 1 day ago
> “prompting”/harness that improves how it actually performs
Is an abstract statement without any meaningful details.
Comment by nunodonato 1 day ago
Comment by koakuma-chan 1 day ago
Comment by NoveltyEngine 1 day ago
Comment by mejutoco 1 day ago
Comment by koakuma-chan 1 day ago
Comment by nunodonato 1 day ago
Comment by HumanOstrich 1 day ago
Comment by Mkengin 1 day ago
Comment by dworks 1 day ago
I had assumed OpenAI was irrelevant, but 5.1 has been so much better than Gemini.
Comment by postalcoder 1 day ago
On top of that, the Codex CLI team is responsive on github and it's clear that user complaints make their way to the team responsible for fine tuning these models.
I run bake offs on between all three models and GPT 5.2 generally has a higher success rate of implementing features, followed closely by Opus 4.5 and then Gemini 3, which has troubles with agentic coding. I'm interested to see how 5.2-codex behaves. I haven't been a fan of the codex models in general.
Comment by jbm 1 day ago
(Also, I can't imagine who is blessed with so much spare time that they would look down on an assistant that does decent work)
Comment by embedding-shape 1 day ago
Yeah, it feels really strange sometimes. Bumping up against something that Codex seemingly can't work out, and you give it to Claude and suddenly it's easy. And you continue with Claude and eventually it gets stuck on something, and you try Codex which gets it immediately. My guess would be that the training data differs just enough for it to have an impact.
Comment by extr 1 day ago
But if you want that last 10%, codex is vital.
Edit: Literally after I typed this, I had this happen. Codex 5.2 reports a P1 bug in a PR. I look closely, and I'm not actually sure it's a "bug". I take it to Claude. Claude agrees it's more of a product behavioral opinion on whether or not to persist garbage data, and offers its own product opinion that I probably want to keep it the way it is. Codex 5.2 meanwhile stubbornly accepts the view it's a product decision but won't seem to offer its own opinion!
Comment by deaux 1 day ago
Comment by enraged_camel 1 day ago
It's because performance degrades over longer conversations, which decreases the chance that the same conversation will result in a solution, and increases the chance that a new one will. I suspect you would get the same result even if you didn't switch to a different model.
Comment by XenophileJKO 1 day ago
They just have different strengths and weaknesses.
Comment by grimgrin 1 day ago
which is analogous to taking your problem to another model and ideally feeding it some sorta lesson
i guess this is a specific example but one i play out a lot. starting fresh with the same problem is unusual for me. usually has a lesson im feeding it from the start
Comment by qsort 1 day ago
Comment by EnPissant 1 day ago
- Planning mode. Codex is extremely frustrating. You have to constantly tell it not to edit when you talk to it, and even then it will sometimes just start working.
- Better terminal rendering (Codex seems to go for a "clean" look at the cost of clearly distinguished output)
- It prompts you for questions using menus
- Sub-agents don't pollute your context
Comment by dingnuts 1 day ago
since nobody (other than that paper) has been trying to measure output, everything is based on feelings and fashion, like you say.
I'm still raw dogging my code. I'll start using these tools when someone can measure the increase in output. Leadership at work is beginning to claim they can, so maybe the writing is on the wall for me. They haven't shown their methodology for what they are measuring, just telling everyone they "can tell"
But until then, I can spot too many psychological biases inherent in their use to trust my own judgement, especially when the only real study done so far on this subject shows that our intuition lies about this.
And in the meantime, I've already lost time investigating reasonable looking open source projects that turned out to be 1) vibe coded and 2) fully non functional even in the most trivial use. I'm so sick of it. I need a new career
Comment by abshkbh 1 day ago
Comment by lacoolj 1 day ago
Comment by qwesr123 1 day ago
Comment by whimsicalism 1 day ago
Comment by Mkengin 1 day ago
Comment by mistercheph 1 day ago
Comment by dbbk 1 day ago
Comment by Mkengin 1 day ago
Comment by gizmodo59 1 day ago
Comment by lkt 1 day ago
Comment by sigmar 1 day ago
Comment by trunnell 1 day ago
Comment by dist-epoch 1 day ago
Comment by MallocVoidstar 1 day ago
Comment by kingstnap 1 day ago
Just safety nerds being gatekeepers.
Comment by trunnell 1 day ago
They did the same thing for gpt-5.1-codex-max (code name “arcticfox”), delaying its availability in the API and only allowing it to be used by monthly plan users, and as an API user I found it very annoying.
Comment by 79a6ed87 1 day ago
This is a privacy and security risk. Your code diffs and prompts are there (seemingly) forever. Best you can do is "archive" them, which is a fancy word for "put it somewhere else so it doesn't clutter the main page".
Comment by Leynos 1 day ago
I use it because it works out cheaper than Codex Cloud and gives you greater flexibility. Although it doesn't have 5.2-codex yet.
Comment by tgtweak 1 day ago
Comment by Leynos 1 day ago
Comment by sunaookami 1 day ago
Comment by moralestapia 1 day ago
Comment by 79a6ed87 1 day ago
Comment by zenburnmyface 1 day ago
Comment by moralestapia 1 day ago
Unsure where that could be if you're using Windows.
You know what would be fun to try? Give Codex full access and then ask it to delete that folder, lol.
Comment by rolymath 1 day ago
Comment by throwuxiytayq 1 day ago
Then again, I wouldn't put much trust into OpenAI's handling of information either way.
Comment by hereme888 19 hours ago
Comment by hmate9 1 day ago
Comment by Alifatisk 1 day ago
> [ADD/LINK TO ROLLOUT THAT DISCOVERED VULNERABILITY]
What’s up with these in the article?
Comment by OldGreenYodaGPT 1 day ago
Comment by jasonthorsness 1 day ago
Comment by fellowniusmonk 1 day ago
I find that it pattern matches incorrectly with a very narrow focus and will ignore real, documented differences even when they're explicitly highlighted in the prompt text (this is X crdt algo, not Y crdt algo).
I've canceled my subscription; the idea that on any larger edit it will just start wrecking nuance and then refuse to accept prompts that point this out is an extremely dangerous form of target fixation.
Comment by pillefitz 1 day ago
Comment by fellowniusmonk 1 day ago
Comment by seneca 1 day ago
Comment by GenerWork 1 day ago
Comment by seneca 1 day ago
Comment by misiti3780 1 day ago
Comment by ChrisMarshallNY 1 day ago
Translation: "Hey y'all! Get ready for a tsunami of AI-generated CVEs!"
Comment by fragmede 1 day ago
Comment by catigula 1 day ago
Comment by phplovesong 23 hours ago
Comment by tonyhart7 1 day ago
Comment by bgwalter 1 day ago
I can imagine what the vetting looks like: The professionals are not allowed to disclose that the models don't work.
EDIT: It must really hurt that ORCL is down 40% from its high due to overexposure in OpenAI.
Comment by fragmede 1 day ago
I have https://gist.github.com/fragmede/96f35225c29cf8790f10b1668b8... as a guard against that, for anyone that's stupid enough like me to run it in yolo mode and wants to copy it.
Codex also has command line options so you can specifically prohibit running rm in bash, so look those up too.
Comment by monster_truck 1 day ago
Comment by chollida1 1 day ago
What about 2 weeks before Christmas?
Comment by speedgoose 1 day ago
Comment by mistercheph 1 day ago
Comment by tptacek 1 day ago
Comment by mistercheph 1 day ago
Comment by ianberdin 1 day ago
The models are so good, unbelievably good. And getting better weekly, including pricing.
Comment by whinvik 1 day ago
Comment by famahar 1 day ago
Comment by whinvik 1 day ago
But Opus 4.5 exists.