GPT-5.2-Codex

Posted by meetpateltech 1 day ago

Comments

Comment by mccoyb 1 day ago

If anyone from OpenAI is reading this -- a plea to not screw with the reasoning capabilities!

Codex is so so good at finding bugs and little inconsistencies, it's astounding to me. Where Claude Code is good at "raw coding", Codex/GPT5.x are unbeatable in terms of careful, methodical finding of "problems" (be it in code, or in math).

Yes, it takes longer (quality, not speed please!) -- but the things that it finds consistently astound me.

Comment by sinatra 1 day ago

Piggybacking on this post. Codex is not only finding much higher quality issues, it’s also writing code that usually doesn’t leave quality issues behind. Claude is much faster but it definitely leaves serious quality issues behind.

So much so that now I rely completely on Codex for code reviews and actual coding. I will pick higher quality over speed every day. Please don’t change it, OpenAI team!

Comment by F7F7F7 1 day ago

Every plan Opus creates in Planning mode gets run through ChatGPT 5.2. It catches at least 3 or 4 serious issues that Claude didn’t think of. It typically takes 2 or 3 back and fourths for Claude to ultimately get it right.

I’m in Claude Code so often (x20 Max) and I’m so comfortable with my environment setup with hooks (for guardrails and context) that I haven’t given Codex a serious shot yet.

Comment by SkyPuncher 1 day ago

The same thing can be said about Opus running through Opus.

It's often not that a different model is better (well, it still has to be a good model). It's that the different chat has a different objective - and will identify different things.

Comment by pietz 22 hours ago

That's a fair point and yet I deeply believe Codex is better here. After finishing a big task, I used two fresh instances of Claude and Codex to review it. Codex finds more issues in ~9 out of 10 cases.

While I prefer the way Claude speaks and writes code, there is no doubt that whatever Codex does is more thorough.

Comment by sinatra 1 day ago

My (admittedly one person's anecdotal) experience has been that when I ask Codex and Claude to make a plan/fix and then ask them both to review it, they both agree that Codex's version is better quality. This is on a 140K LOC codebase with an unreasonable amount of time spent on rules (lint, format, commit, etc), on specifying coding patterns, on documenting per workspace README.md, etc.

Comment by shinycode 1 day ago

Every time Claude Code finishes a task, I plan a full review of its own task with a very detailed plan and it catches itself many things it didn’t see before. It works well and it’s part of the process of refinement. We all know it’s almost never 100% hit of the first try on big chunks of code generated.

Comment by a24j 22 hours ago

How exactly do you plan/initiate a review from the terminal? open up a new shell/instance of claude and initiate the review with fresh context?

Comment by 19 hours ago

Comment by fragmede 22 hours ago

Yeah. It dumps context into various .md files, like TODO.md.

Comment by AmazingTurtle 1 day ago

Have you tried telling Claude not to leave serious quality issues behind?

Comment by ifwinterco 1 day ago

I think the issue is for them "quality, not speed" means "expensive, not cheap" and they can't pass that extra cost on to customers

Comment by mccoyb 1 day ago

I'm happy to pay the same right now for less (on the max plan, or whatever) -- because I'm never running into limits, and I'm running these models near all day every day (as a single user working on my own personal projects).

I consistently run into limits with CC (Opus 4.5) -- but even though Codex seems to be spending significantly more tokens, it just seems like the quota limit is much higher?

Comment by Computer0 1 day ago

I am on the $20 plan for CC and Codex, I feel like a session of usage on CC == ~20% Codex usage / 5 hours in terms of time spent inferencing. It has always seemed way more geneous than I would expect.

Comment by Aurornis 1 day ago

Agreed. The $20 plans can go very far when you're using the coding agent as an additional tool in your development flow, not just trying to hammer it with prompts until you get output that works.

Managing context goes a long way, too. I clear context for every new task and keep the local context files up to date with key info to get the LLM on target quickly

Comment by girvo 1 day ago

> I clear context for every new task and keep the local context files up to date with key info to get the LLM on target quickly

Aggressively recreating your context is still the best way to get the best results from these tools too, so it has a secondary benefit.

Comment by heliumtera 1 day ago

It is ironic that in the gpt-4 era, when we couldn't see much value in this tools, all we could hear was "skill issues", "prompt engineering skills". Now they are actually quite capable for SOME tasks, specially for something that we don't really care about learning, and they, to a certain extent, can generalize. They perform much better than in gpt-4 era, objectively, across all domains. They perform much better with the absolute minimum input, objectively, across all domains. If someone skipped the whole "prompt engineering" and learned nothing during that time, this person is more equiped to perform well. Now I wonder how much I am leaving behind by ignoring this whole "skills, tools, MCP this and that, yada yada".

Comment by conradev 1 day ago

Prompt engineering (communicating with models?) is a foundational skill. Skills, tools, MCPs, etc. are all built on prompts.

My take is that the overlap is strongest with engineering management. If you can learn how to manage a team of human engineers well, that translates to managing a team of agents well.

Comment by lukan 17 hours ago

And if learned how to articulate assignments for humans right in a clear way, you likely also can do a prompt right.

None of that knowlege will get useless, only working around current limitations of agents will.

Comment by miek 1 day ago

Minimal prompting yielding better results? I haven't found this to be the case at all.

Comment by neom 1 day ago

Any thoughts on your wondering? I too am wondering about the same mistake I might be making.

Comment by fragmede 22 hours ago

My answer is that the code they generate is still crap, so the new skill is in being able to spot the ways and places it wrote crap code, and how to quickly tell it to refactor to fix specific issues, and still come out ahead on productivity. Nothing like an ultra wide screen monitor (LG 40+) and having parallel codex or claude sessions going, working on a bunch of things at once in parallel. Get good at git worktree. Use them to make tools that make your own life easier that you previously wouldn't even have bothered to make. (chrome extensions and MCPs!)

The other skill is in knowing exactly when to roll up your sleeves and do it the old fashioned way. Which things they're good/useful for, and which things they aren't.

Comment by theonething 1 day ago

do you mean running /compact often?

Comment by Aurornis 19 hours ago

If I want to continue the same task, I run /compact

If I want to start a new task, I /clear and then tell it to re-read the CLAUDE.md document where I put all of the quick context: Description of the project, key goals, where to find key code, reminders for tools to use, and so on. I aggressively update this file as I notice things that it’s always forgetting or looking up. I know some people have the LLM update their context file but I just do it myself with seemingly better results.

Using /compact burns through a lot of your usage quota and retains a lot of things you may not need. Giving it directions like “starting a new task doing ____, only keep necessary context for that” can help, but hitting /clear and having it re-read a short context primer is faster and uses less quota.

Comment by dionian 1 day ago

I'm not who you asked, but i do the same thing, i keep important state in doc files and recreate sessions from that state. this allows me to clear context and reconstruct my status on that item. I have a skill that manages this

Comment by joquarky 1 day ago

Using documents for state helps so much with adding guardrails.

I do wish that ChatGPT had a toggle next to each project file instead of having to delete and reupload to toggle or create separate projects for various combinations of files.

Comment by dionian 6 hours ago

This is why claude code/codex cli is the way to go for me because often they can recompute the state from the minimal description automatically. If i relaly do needto stop the session and come back in i can point it at the task file. it also has docs and scaladocs/javadocs in key places. good naming and project structure helps it very easily find the data it needs without me needing to feed it specific files. I did the 'feed it files and copy paste the code snippet' thing in chatgpt for months. wish i went to claude code sooner.

Comment by hadlock 1 day ago

I noticed I am not hitting limits either. My guess is OpenAI sees CC as a real competitor/serious threat. Had OAI not given me virtually unlimited use I probably would have jumped ship to CC by now. Burning tons of cash at this stage is likely Very Worth It to maintain "market leader" status if only in the eyes of the media/investors. It's going to be real hard to claw back current usage limits though.

Comment by andai 1 day ago

If you look at benchmarks, the Claude models score significantly higher intelligence per token. I'm not sure how that works exactly, but they are offset from the entire rest of the chart on that metric. It seems they need less tokens to get the same result. (I can't speak for how that affects performance on very difficult tasks though, since most of mine are pretty straightforward.)

So if you look at the total cost of running the benchmark, it's surprisingly similar to other models -- the higher price per token is offset by the significantly fewer tokens required to complete a task.

See "Cost to Run Artificial Analysis Index" and "Intelligence vs Output Tokens" here

https://artificialanalysis.ai/

...With the obligatory caveat that benchmarks are largely irrelevant for actual real world tasks and you need to test the thing on your actual task to see how well it does!

Comment by tejohnso 1 day ago

> they can't pass that extra cost on to customers

I don't understand why not. People pay for quality all the time, and often they're begging to pay for quality, it's just not an option. Of course, it depends on how much more quality is being offered, but it sounds like a significant amount here.

Comment by golly_ned 1 day ago

I wonder how much their revenue really ends up contributes towards covering their costs.

In my mind, they're hardly making any money compared to how much they're spending, and are relying on future modeling and efficiency gains to be able to reduce their costs but are pursuing user growth and engagement almost fully -- the more queries they get, the more data they get, the bigger a data moat they can build.

Comment by erik 1 day ago

Inference is almost certainly very profitable.

All the money they keep raising goes to R&D for the next model. But I don't see how they ever get off that treadmill.

Comment by ithkuil 1 day ago

Is there a possible future where the inference usage increases because there will be many many more customers and R&D grows Lower than inference?

Or is it already saturated?

Comment by mbesto 20 hours ago

> Inference is almost certainly very profitable.

It almost certainly is not. Until we know what the useful life of NVIDIA GPUs are, then it's impossible to determine whether this is profitable or not.

Comment by panarky 18 hours ago

The depreciation schedule isn't as big a factor as you'd think.

The marginal cost of an API call is small relative to what users pay, and utilization rates at scale are pretty high. You don't need perfect certainty about GPU lifespan to see that the spread between cost-per-token and revenue-per-token leaves a lot of room.

And datacenter GPUs have been running inference workloads for years now, so companies have a good idea of rates of failure and obsolescence. They're not throwing away two-year-old chips.

Comment by mbesto 17 hours ago

> The marginal cost of an API call is small relative to what users pay, and utilization rates at scale are pretty high.

How do you know this?

> You don't need perfect certainty about GPU lifespan to see that the spread between cost-per-token and revenue-per-token leaves a lot of room.

You can't even speculate this spread without knowing even a rough idea of cost-per-token. Currently, it's total paper math on what the cost-per-token is.

> And datacenter GPUs have been running inference workloads for years now,

And inference resource intensity is a moving target. If a new model comes out that requires 2x the amount of resources now.

> They're not throwing away two-year-old chips.

Maybe, but they'll be replaced by either (a) a higher performance GPU that can deliver the same results with less energy, less physical density, and less cooling or (b) the extended support costs becomes financially untenable.

Comment by nimchimpsky 1 day ago

"In my mind, they're hardly making any money compared to how much they're spending"

everyone seems to assume this, but its not like its a company run by dummies, or has dummy investors.

They are obviously making awful lot of revenue.

Comment by alwillis 1 day ago

>> "In my mind, they're hardly making any money compared to how much they're spending"

> everyone seems to assume this, but its not like its a company run by dummies, or has dummy investors.

It has nothing to do with their management or investors being "dummies" but the numbers are the numbers.

OpenAI has data center rental costs approaching $620 billion, which is expected to rise to $1.4 trillion by 2033.

Annualized revenue is expected to be "only" $20 billion this year.

$1.4 trillion is 70x current revenue.

So unless they execute their strategy perfectly, hit all of their projections and hoping that neither the stock market or economy collapses, making a profit in the foreseeable future is highly unlikely.

[1]: "OpenAI's AI money pit looks much deeper than we thought. Here's my opinion on why this matters" - https://diginomica.com/openais-ai-money-pit-much-deeper-we-t...

Comment by Daneel_ 1 day ago

To me it seems that they're banking on it becoming indispensable. Right now I could go back to pre-AI and be a little disappointed but otherwise fine. I figure all of these AI companies are in a race to make themselves part of everyone's core workflow in life, like clothing or a smart phone, such that we don't have much of a choice as to whether we use it or not - it just IS.

That's what the investors are chasing, in my opinion.

Comment by zozbot234 1 day ago

It'll never be literally indispensible, because open models exist - either served by third-party providers, or even ran locally in a homelab setup. A nice thing that's arguably unique about the latter is that you can trade scale for latency - you get to run much larger models on the same hardware if they can chug on the answer overnight (with offload to fast SSD for bulk storage of parameters and activations) instead of just answering on the spot. Large providers don't want to do this, because keeping your query's activations around is just too expensive when scaled to many users.

Comment by mbesto 20 hours ago

> They are obviously making awful lot of revenue.

It's not hard to sell $10 worth of products if you spend $20. profit is more important than revenue.

Comment by troupo 1 day ago

Revenue != profit.

They are drowning in debt and go into more and more ridiculous schemes to raise/get more money.

--- start quote ---

OpenAI has made $1.4 trillion in commitments to procure the energy and computing power it needs to fuel its operations in the future. But it has previously disclosed that it expects to make only $20 billion in revenues this year. And a recent analysis by HSBC concluded that even if the company is making more than $200 billion by 2030, it will still need to find a further $207 billion in funding to stay in business.

https://finance.yahoo.com/news/openai-partners-carrying-96-b...

--- end quote ---

Comment by zozbot234 1 day ago

The "quality" model can cost $200/month. They'll be fine.

Comment by energy123 1 day ago

Second this but for the chat subscription. Whatever they did with 5.2 compared to 5.0 in ChatGPT increased the test-time compute and the quality shows. If only they would allow more tokens to be submitted in one prompt (it's currently capped at 46k for Plus). I don't touch Gemini 3.0 Pro now (am also subbed there) unless I need the context length.

Comment by mattio 14 hours ago

Completely agreed. Used claude and codex both on highest tier next to each other for a month. On complex tasks where Claude would get stuck and not be able to fix it at all, codex would fix the issue in one go. Codex is amazing.

I did found some slip ups in 5.2 where I did a refactor of a client header where I removed two header properties, but 5.2 forgot to remove those from the toArray method of the class. Was using 5.2 on medium (default).

Comment by stared 23 hours ago

If you want to combine Claude Code coding with reasoning, it is easy to do it with a plugin - https://github.com/stared/gemini-claude-skills, wrote for myself, but shared in case anyone wants. Somehow bigger context here: https://quesma.com/blog/claude-skills-not-antigravity/.

Comment by baseonmars 1 day ago

absolutely second this. I'm mainly a claude code user, but i have codex running in another tab and for code reviews and it's absolutely killer at analyzing flows and finding subtle bugs.

Comment by mkagenius 1 day ago

Have you tried Claude Code in the second tab instead, that would be a fair comparison.

Comment by hugh-avherald 11 hours ago

Claude Code isn't as surgical as Codex at reviews

Comment by smoe 1 day ago

Do you think that for someone who only needs careful, methodical identification of “problems” occasionally, like a couple of times per day, the $20/month plan gets you anywhere, or do you need the $200 plan just to get access to this?

Comment by hatefulmoron 1 day ago

I've had the $20/month plan for a few months alongside a max subscription to Claude; the cheap codex plan goes a really long way. I use it a few times a day for debugging, finding bugs, and reviewing my work. I've ran out of usage a couple of times, but only when I lean on it way more than I should.

I only ever use it on the high reasoning mode, for what it's worth. I'm sure it's even less of a problem if you turn it down.

Comment by Foobar8568 1 day ago

$200 on claude for vibe coding, $20 on codex for code review and "brainstorming". I use other LLM for a 2nd - 3rd - 4th opinion.

Comment by nl 1 day ago

The $20 does this fine.

The OpenAI token limits seem more generous than the Anthropic ones too.

Comment by rbancroft 1 day ago

Listening to Dario at the NYT DealBook summit, and reading between the lines a bit, it seems like he is basically saying Anthropic is trying to be a reponsible, sustainable business and charging customers accordingly, and insinuating that OpenAI is being much more reckless, financially.

Comment by nl 1 day ago

I think it's difficult to estimate how profitable both are - depends too much on usage and that varies so much.

I think it is widely accepted that Anthropic is doing very well in enterprise adoption of Claude Code.

In most of those cases that is paid via API key not by subscription so the business model works differently - it doesn't rely on low usage users subsidizing high usage users.

OTOH OpenAI is way ahead on consumer usage - which also includes Codex even if most consumers don't use it.

I don't think it matters - just make use of the best model at the best price. At the moment Codex 5.2 seems best at the mid-price range, while Opus seems slightly stronger than Codex Max (but too expensive to use for many things).

Comment by jvermillard 1 day ago

I use it everyday and the $20 plan is fine

Comment by apitman 1 day ago

It's annoying though because it keeps (accurately) pointing out critical memory bugs that I clearly need to fix rather than pretending they aren't there. It's slowing me down.

Comment by gnatolf 1 day ago

Love it when it circles around a minor issue that I clearly described as temporary hack instead of recognizing the tremendously large gaping hole in my implementation right next to it.

Comment by tgtweak 1 day ago

Anecdotally I've found it very good in the exact same case for multi-agent workflows - as the "reviewer"

Comment by rane 1 day ago

Exactly. This is why the workflow of consulting Gemini/Codex for architecture and overall plan, and then have Claude implement the changes is so powerful.

Comment by jvermillard 1 day ago

I use it mainly for embedded programming and I find codex way better than claude. I don't my the delay anyway I'm slower to code carefully crafted C

Comment by kilroy123 1 day ago

Interesting what I've seen is it spins and thinks forever. Then just breaks. Which is beyond frustrating.

Comment by mccoyb 1 day ago

If by "just breaks" means "refuses to write code / gives up or reverts what it does" -- yes, I've experienced that.

Experiencing that repeatedly motivated me to use it as a reviewer (which another commenter noted), a role which it is (from my experience) very good at.

I basically use it to drive Claude Code, which will nuke the codebase with abandon.

Comment by kilroy123 1 day ago

I've seen it think for a long time and then just timeout or something? It just stops and nothing happens.

Comment by JamesSwift 1 day ago

Ive had the same but i only use it through zed so I wasnt sure if it was a codex issue or a zed issue

Comment by fragmede 22 hours ago

I've had codex rm -rf the git repo it's working in while running in yolo mode. Twice, even! (Play with fire, you're gonna get burnt.)

I had it whip this up to try and avoid this, while still running it in yolo mode (which is still not recommended).

https://gist.github.com/fragmede/96f35225c29cf8790f10b1668b8...

Comment by baq 1 day ago

we're all senior continue engineers nowadays it seems

Comment by johnnyfived 1 day ago

Agreed, I'm surprised how much much care the "extra high" reasoning allows. It easily catches bugs in code other LLMs won't, using it to review Opus 4.5 is highly effective.

Comment by garbagecoder 1 day ago

Agree. Codex just read my source code for a toy lisp I wrote in ARM64 assembly and learned how to code in that lisp and wrote a few demo programs for me. The was impressive enough. Then it spent some time and effort to really hunt down some problems--there was a single bit mask error in my garbage collector that wasn't showing up until then. I was blown away. It's the kind of thing I would have spent forever trying to figure out before.

Comment by josephg 1 day ago

I've been writing a little port of the SeL4 OS kernel to rust, mostly as a learning exercise. I ran into a weird bug yesterday where some of my code wasn't running - qemu was just exiting. And I couldn't figure out why.

I asked codex to take a look. It took a couple minutes, but it managed to track the issue down using a bunch of tricks I've never seen before. I was blown away. In particular, it reran qemu with different flags to get more information about a CPU fault I couldn't see. Then got a hex code of the instruction pointer at the time of the fault, and used some tools I didn't know about to map that pointer to the lines of code which were causing the problem. Then took a read of that part of the code and guessed (correctly) what the issue was. I guess I haven't worked with operating systems much, so I haven't seen any of those tricks before. But, holy cow!

Its tempting to just accept the help and move on, but today I want to go through what it did in detail, including all the tools it used, so I can learn to do the same thing myself next time.

Comment by varjag 16 hours ago

Interestingly it found a GC bug in my toy Lisp that I wrote in Z80 assembly almost 30 years ago. This kind of work appears to be more common than you'd think!

Comment by heliumtera 1 day ago

Maybe you're a garbage programmer and that error was too obvious. Interesting observation, though.

edit: username joke, don't get me banned

Comment by echelon 1 day ago

> If anyone from OpenAI is reading this

(unrelated, but piggybacking on requests to reach the teams)

If anyone from OpenAI or Google is reading this, please continue to make your image editing models work with the "previz-to-render" workflow.

Image edits should strongly infer pose and blocking as an internal ControlNet, but should be able to upscale low-fidelity mannequins, cutouts, and plates/billboards.

OpenAI kicks ass at this (but could do better with style controls - if I give a Midjourney style ref, use it) :

https://imgur.com/gallery/previz-to-image-gpt-image-1-x8t1ij...

https://imgur.com/a/previz-to-image-gpt-image-1-5-3fq042U

Google fails the tests currently, but can probably easily catch up :

https://imgur.com/a/previz-to-image-nano-banana-pro-Q2B8psd

Comment by tananaev 1 day ago

I was very skeptical about Codex at the beginning, but now all my coding tasks start with Codex. It's not perfect at everything, but overall it's pretty amazing. Refactoring, building something new, building something I'm not familiar with. It is still not great at debugging things.

One surprising thing that codex helped with is procrastination. I'm sure many people had this feeling when you have some big task and you don't quite know where to start. Just send it to Codex. It might not get it right, but it's almost always good starting point that you can quickly iterate on.

Comment by jackschultz 1 day ago

Infinitely agree with all. I was skeptical, and then tried Opus 4.5 and was blown away. Codex with 5.0 and 5.1 wasn't great, but 5.2 is big improvement. I can't do code without it because there's no point. Time and quality with the right constraints, you're going to get better code.

And same thought with both procrastination because of not knowing where to start, but also getting stuck in the middle and not knowing where to go. Literally never happens anymore. Having discussions with it for doing the planning and different options for implementations, and you get to the end with a good design description and then, what's the point of writing the code yourself when with that design, it's going to write it quickly and matching the agreements.

Comment by nextaccountic 1 day ago

You can code without it. Maybe you don't want to, but if you're a programmer, you can

(here I am remembering a time I had no computer and would program data structures in OCaml with pen and paper, then would go to university the next day to try it. Often times it worked the first try)

Comment by jackschultz 1 day ago

Sure, but the end of this post [0] is where I'm at. I don't feel the need or want to write the code when I can spend my time doing the other parts that are much more interesting and valuable.

> Emil concluded his article like this:

> JustHTML is about 3,000 lines of Python with 8,500+ tests passing. I couldn’t have written it this quickly without the agent. > But “quickly” doesn’t mean “without thinking.” I spent a lot of time reviewing code, making design decisions, and steering the agent in the right direction. The agent did the typing; I did the thinking. > That’s probably the right division of labor.

>I couldn’t agree more. Coding agents replace the part of my job that involves typing the code into a computer. I find what’s left to be a much more valuable use of my time.

[0] https://simonwillison.net/2025/Dec/14/justhtml/

Comment by culopatin 1 day ago

But are those tests relevant? I tried using LLMs to write tests at work and whenever I review them I end up asking it “Ok great, passes the test, but is the test relevant? Does it test anything useful?” And I get a “Oh yeah, you’re right, this test is pointless”

Comment by manmal 1 day ago

Keep track of test coverage and ask it to delete tests without lowering coverage by more than let’s say 0.01 percent points. If you have a script that gives it only the test coverage, and a file with all tests including line number ranges, it is more or less a dumb task it can work on for hours, without actually reading the files (which would fill context too quickly).

Comment by gaigalas 1 day ago

That does not work as advertised.

If you leave an agent for hours trying to increase coverage by percentage without further guiding instructions you will end up with lots of garbage.

In order to achieve this, you need several distinct loops. One that creates tests (there will be garbage), one that consolidates redundant tests, one that parametrizes repetitive tests, and so on.

Agents create redundant tests for all sorts of reasons. Maybe they're trying a hard to reach line and leave several attempts behind. Or maybe they "get creative" and try to guess what is uncovered instead of actually following the coverage report, etc.

Less capable models are actually better at doing this. They're faster, don't "get creative" with weird ideas mid-task and cost less. Just make them work one test at the time. Spawn, do one test that verifiably increases overall coverage, exit. Once you reach a treshold, start the consolidating loop: pick a redundant pair of tests, consolidate, exit. And so on...

Of course, you can use a powerful model and babysit it as well. A few disambiguating questions and interruptions will guide them well. If you want true unattended though, it's damn hard to get stable results.

Comment by manmal 11 hours ago

If you read my comment, I was describing the consolidation part.

Comment by tlarkworthy 1 day ago

We fixed this at work by instructing it to maximize coverage with minimal tests, which is closer to our coding style.

Comment by elbear 1 day ago

Those tests were written by people. That's why they were confident that what the LLM implemented was correct.

Comment by jackschultz 19 hours ago

Meta about how important context is.

People see LLMs and tons of tests tests written in the same sentence, and think that shows how models love writing pointless tests. Rather than realizing that the tests are standard and people written to show that the model wrote code that is validated by a currently trusted source.

Shows the importance for us to always write comments that humans are going to read with the right context is _very_ similar to how we need to interact with LLMs. And if we fail to communicate with humans, clearly we're going to fail with models.

Comment by elbear 1 hour ago

Yeah, we now need to specify who wrote the tests, because it's important information.

Comment by wahnfrieden 1 day ago

Yes

Skill issue... And perhaps the wrong model + harness

Comment by scottyah 1 day ago

It's the semantics of "can", where it is used to suggest feasibility. When I moved and got a new commute, I still "could" bike to work, but it went from 30min to an hour and a half each way. While technically possible, I would have had to sacrifice a lot when losing two hours a day- laundry, cooking dinner, downtime. I always said I "can't really" bike to work, but there is a lot of context lost.

Comment by djvdq 1 day ago

So you can, but don't want to.

Comment by scottyah 13 hours ago

yup

Comment by zamadatix 1 day ago

"Can" is too overloaded a word even with context provided, ranging from places like "could conceivably be achieved" to "usually possible".

The only hint you can dig out is where they might have limits feasibility around it. E.g. "I can fly first class all the time (if I limit the number of flights and spend an unreasonable portion of my weath on tickets)" is typically less useful an interpretation than "I can fly first class all the time (frequently without concern, because I'm very well off)", but you have to figure out which they are trying to say (which isn't always easy).

Comment by wahnfrieden 1 day ago

I can't without seriously sacrificing productivity. (I've been coding for 30 years.)

Comment by 7thpower 1 day ago

What are you talking about? 5.2 literally just came out.

Comment by drdrey 18 hours ago

5.2-codex just came out. You could use codex with regular 5.2 for a week or so.

Comment by girvo 1 day ago

> It is still not great at debugging things.

It's so fascinating to me that the thread above this one on this page says the opposite, and the funniest thing is I'm sure you're both right. What a wild world we live in, I'm not sure how one is supposed to objectively analyse the performance of these things

Comment by AstroBen 1 day ago

Give them real world problems you're encountering and see which can solve them the best, if at all

A full week of that should give you a pretty good idea

Maybe some models just suit particular styles of prompting that do or don't match what you're doing

Comment by ssl-3 1 day ago

It's great at some things, and it's awful at other things. And this varies tremendously based on context.

This is very similar to how humans behave. Most people are great a small number of things, and there's always a larger set of things that we may individually be pretty terrible at.

The bots the same way, except: Instead of billions of people who each have their own skillsets and personalities, we've got a small handful of distinct bots from different companies.

And of course: Lies.

When we ask Bob or Lisa help with a thing that they don't understand very well, they usually will try to set reasonable expectations . ("Sorry, ssl-3, I don't really understand ZFS very well. I can try to get the SLOG -- whatever that is -- to work better with this workload, but I can't promise anything.")

Bob or Lisa may figure it out. They'll gather up some background and work on it, bring in outside help if that's useful, and probably tread lightly. This will take time. But they probably won't deliberately lie [much] about what they expect from themselves.

But when the bot is asked to do a thing that it doesn't understand very well, it's chipper as fuck about it. ("Oh yeah! Why sure I can do that! I'm well-versed in -everything-! [Just hold my beer and watch this!]")

The bot will then set forth to do the thing. It might fuck it all up with wild abandon, but it doesn't care: It doesn't feel. It doesn't understand expectations. Or cost. Or art. Or unintended consequences.

Or, it might get it right. Sometimes, amazingly-right.

But it's impossible to tell going in whether it's going to be good, or bad: Unlike Bob or Lisa, the bot always heads into a problem as an overly-ambitious pack of lies.

(But the bot is very inexpensive to employ compared to Bob or Lisa, so we use the bot sometimes.)

Comment by 9dev 1 day ago

I always wonder how people make qualitative statements like this. There are so many variables! Is it my prompt? The task? The specific model version? A good or bad branch out of the non-deterministic solution space?

Like, do you run a proper experiment where you hand the same task to multiple models several times and compare the results? Not snark by the way, I’m asking in earnest how you pick one model over another.

Comment by embedding-shape 1 day ago

> Like, do you run a proper experiment where you hand the same task to multiple models several times and compare the results?

This is what I do. I have a little TUI that fires off Claude Code, Codex, Gemini, Qwen Coder and AMP in separate containers for most task I do (although I've started to use AMP less and less), and either returns the last message of what they replied and/or a git diff of what exactly they did. Then I compare them side by side. If all of them got something wrong, I update the prompt, fire them off again. Always starting from zero, and always include the full context of what you're doing with the first message, they're all non-interactive sessions.

Sometimes I do 3x Codex instead of different agents, just to double-check that all of them would do the same thing. If they go off and do different things from each other, I know the initial prompt isn't specific/strict enough, and again iterate.

Comment by dotancohen 1 day ago

Please share! I'd much rather help develop your solution than vibe code one of my own ))

Honestly, I'd love to try that. My Gmail username is the same as my HN username.

Comment by nl 1 day ago

Not the OP but I have https://github.com/nlothian/autocoder which supports a Github-centric workflow using the following options:

  - Claude
  - Codex
  - Kilocode
  - Amp
  - Mistral Vibe

Very vibe coded though.

Comment by 19 hours ago

Comment by handfuloflight 1 day ago

What's this costing you?

Comment by versteegen 1 day ago

So how do the models compare in your experience?

Comment by energy123 1 day ago

I have sent the same prompt to GPT-5.2 Thinking and Gemini 3.0 Pro many times because I subscribe to both.

GPT-5.2 Thinking (with extended thinking selected) is significantly better in my testing on software problems with 40k context.

I attribute this to thinking time, with GPT-5.2 Thinking I can coax 5 minutes+ of thinking time but with Gemini 3.0 Pro it only gives me about 30 seconds.

The main problem with the Plus sub in ChatGPT is you can't send more than 46k tokens in a single prompt, and attaching files doesn't help either because the VM blocks the model from accessing the attachments if there's ~46k tokens already in the context.

Comment by enraged_camel 1 day ago

Last night I gave one of the flaky tests in our test suite to three different models, using the exact same prompt.

Gemini 3 and Gemini 3 Flash identified the root cause and nailed the fix. GPT 5.1 Codex misdiagnosed the issue and attempted a weird fix despite my prompt saying “don’t write code, simply investigate.”

I run these tests regularly, and Codex has not impressed me. Not even once. At best it’s on par, but most of the time it just fails miserably.

Languages: JavaScript, Elixir, Python

Comment by paustint 1 day ago

The one time I was impressed with codex was when I was adding translations in a bunch of languages for a business document generation service. I used claude to do the initial work and cross checked with codex.

The codex agent ran for a long time and created and executed a bunch of python scripts (according to the output thinking text) to compare the translations and found a number of possible issues. I am not sure where the scripts were stored or executed, our project doesn't use python.

Then I fed the output of the issues codex found to claude for a second "opinion". Claude said that the feedback was obviously from someone that knew the native language very well and agreed with all the feedback.

I was really surprised at how long Codex was thinking and analyzing - probably 10 minutes. (This was ~1+mo ago, I don't recall exactly what model)

Claude is pretty decent IMO - amp code is better, but seems to burn through money pretty quick.

Comment by tmikaeld 1 day ago

I have the same experience. To make it worse, there’s a mile of difference between the all too many versions and efforts..

Comment by thek3nger 1 day ago

This works for me in general. If I am procrastinating, I ask a coding agent for a small task. If it works, I have something to improve upon. If it doesn’t work, my OCD forces me to “fix it.” :D

Comment by freedomben 1 day ago

Same actually. Though, for some reasons Codex utterly falls down with podman, especially rootless podman. No matter how many explicit instructions I give it in the prompt and AGENTS.md, it will try to set a ton of variables and break podman. It will then try use docker (again despite explicit instructions not too) and eventually will try to sudo podman. One time I actually let it, and it reused its sudo perms to reconfigure selinux on my system, which completely broke it so that I could no longer get root on my own machine and the machine never booted again (because selinux was blocking everything). It has tried to do the same thing three times now on different projects.

So yeah, I use codex a lot and like it, but it has some really bad blind spots.

Comment by jillesvangurp 1 day ago

> One surprising thing that codex helped with is procrastination.

Heh. It's about the same as an efficient compilation or integration testing process that is long enough to let it do it's thing while you go and browse Hacker News.

IMHO, making feedback loops faster is going to be key to improving success rates with agentic coding tools. They work best if the feedback loop is fast and thorough. So compilers, good tests, etc. are important. But it's also important that that all runs quickly. It's almost an even split between reasoning and tool invocations for me. And it is rather trigger happy with the tool invocations. Wasting a lot of time to find out that a naive approach was indeed naive before fixing it in several iterations. Good instructions help (Agents.md).

Focusing attention on just making builds fast and solid is a good investment in any case. Doubly so if you plan on using agentic coding tools.

Comment by wahnfrieden 1 day ago

On the contrary, I will always use longer feedback cycle agents if the quality is better (including consulting 5.2 Pro as oracle or for spec work).

The key is to adapt to this by learning how to parallelize your work, instead of the old way of doings things where devs are expected to focus on and finish one task at a time (per lean manufacturing principles).

I find now that painfully slow builds are no longer a serious issue for me. Because I'm rotating through 15-20 agents across 4-6 projects so I always have something valuable to progress on. One of these projects and a few of these agents are clear priorities I return to sooner than the others.

Comment by anabis 1 day ago

> One surprising thing that codex helped with is procrastination.

The Roomba effect is real. The AI models do all the heavy implementation work, and when it asks me to setup an execute tests, I feel obliged to get to it ASAP.

Comment by BinaryIgor 1 day ago

I have similar experiences with Claude Code ;) Have you used it as well? How does it compare?

Comment by cmrdporcupine 19 hours ago

I think Opus + Claude Code is the more competent overall general "making things" system, while it makes sense to have a $20 Codex subscription to find bugs and review the things that Claude Code makes.

On its own, as sole author, I find Codex overcomplicates things. It will riddle your code with unnecessary helper functions and objects and pointless abstractions.

It is however useful for doing a once over for code review and finding the things that Claude rushed through.

Comment by mohsen1 22 hours ago

Since they are not showing you how this model compares against the benchmarks they are showing, here is a quick view with the public numbers from Google and Anthropic. At least this gives some context:

    SWE-Bench (Pro / Verified)

    Model               | Pro (%) | Verified (%)
    --------------------+---------+--------------
    GPT-5.2-Codex       | 56.4    | ~80
    GPT-5.2             | 55.6    | ~80
    Claude Opus 4.5     | n/a     | ~80.9
    Gemini 3 Pro        | n/a     | ~76.2

And for terminal workflows, where agentic steps matter:

    Terminal-Bench 2.0

    Model               | Score (%)
    --------------------+-----------
    Claude Opus 4.5     | ~60+
    Gemini 3 Pro        | ~54
    GPT-5.2-Codex       | ~47

So yes, GPT-5.2-Codex is good, but when you put it next to its real competitors:

- Claude is still ahead on strict coding + terminal-style tasks

- Gemini is better for huge context + multimodal reasoning

- GPT-5.2-Codex is strong but not clearly the new state of the art across the board

It feels a bit odd that the page only shows internal numbers instead of placing them next to the other leaders.

Comment by qwesr123 21 hours ago

Where are you getting SWE-Bench Verified scores for 5.2-Codex? AFAIK those have not been published.

And I don't think your Terminal-Bench 2.0 scores are accurate. Per the latest benchmarks: Opus 4.5 is at 59% GPT-5.2-Codex is at 64%

See the charts at the bottom of https://marginlab.ai/blog/swe-bench-deep-dive/ and https://marginlab.ai/blog/terminal-bench-deep-dive/

Comment by scellus 21 hours ago

I like Opus 4.5 a lot, but a general comment on benchmarks: the number of subtasks or problems in each one is finite, and many of the benchmarks are saturating, so the effective number of problems at the frontier is even smaller. If you think of the generalizable capability of the model as a latent feature to be measured by benchmarks, we therefore have only rather noisy estimates. People read too much into small differences in numbers. It's best to aggregate across many, Epoch has their Capabilities Index, and Artificial Analysis is doing something similar, and probably others I don't know or remember.

And then there's the part of models that is hard to measure. Opus has some sort of HAL-like smoothness I don't see in other models, but meanwhile, I haven't tried gpt-5.2 for coding yet. (Neither Gemini 3 Pro; I'm not claiming superiority of Opus, just that something in practical usability is hard to measure.)

Comment by thedougd 18 hours ago

I'm finding that the newer GPT models are much more willing to leverage tools/skills than Claude, reducing interventions requesting approval. Just an observation.

Comment by blitz_skull 19 hours ago

Ahhh, there it is.

My rule of thumb with OpenAI is, if they don’t publish their benchmarks beside Anthropic’s numbers it’s because they’re still not caught up.

So far my rule of thumb has held true.

Comment by shanev 1 day ago

The GPT models, in my experience, have been much better for backend than the Claude models. They're much slower, but produce logic that is more clear, and code that is more maintainable. A pattern I use is, setup a Github issue with Claude plan mode, then have Codex execute it. Then come back to Claude to run custom code review plugins. Then, of course review it with my own eyes before merging the PR.

My only gripe is I wish they'd publish Codex CLI updates to homebrew the same time as npm :)

Comment by SamDc73 1 day ago

GPT-5 was the first model that occasionally produced code that I could push without any changes

Claude still tends to add "fluff" around the solution and over-engineer, not that the code doesn't work, it's just that it's ugly

Comment by lemming 1 day ago

Interesting, I have consistently found that Codex does much better code reviews than Claude. Claude will occasionally find real issues, but will frequently bike shed things I don't care about. Codex always finds things that I do actually care about and that clearly need fixing.

Comment by mrcwinn 1 day ago

I’d agree with you until Opus 4.5.

Comment by allovertheworld 1 day ago

Eh sonnet 4.5 was better at Rust for me

Comment by kordlessagain 1 day ago

I’ve been using Codex CLI heavily after moving off Claude Code and built a containerized starter to run Codex in different modes: timers/file triggers, API calls, or interactive/single-run CLI. A few others are already using it for agentic workflows. If you want to run Codex securely (or not) in a container to test the model or build workflows, check out https://github.com/DeepBlueDynamics/codex-container.

It ships with 300+ MCP tools (crawl, Google search, Gmail/GCal/GDrive, Slack, scheduling, web indexing, embeddings, transcription, and more). Many came from tools I originally built for Claude Desktop—OpenAI’s MCP has been stable across 20+ versions so I prefer it.

I will note I usually run this in Danger mode but because it runs in a container, it doesn't have access to ENVs I don't want it messing with, and have it in a directory I'm OK with it changing or poking about in.

Headless browser setup for the crawl tools: https://github.com/DeepBlueDynamics/gnosis-crawl.

My email is in my profile if anyone needs help.

Comment by marktolson 1 day ago

Looks good but there is no way I'm installing powershell as a dependency.

Comment by kordlessagain 18 hours ago

I've kept the codex-container.sh version going for this eventuality.

Those scripts are for running the docker command with all the ENV vars and settings. Whatever does that does NOT have to be Powershell if you don't want it to be.

Comment by cachius 22 hours ago

Why not?

Comment by kordlessagain 18 hours ago

Poweshell is awesome because no external tools are needed to modify dates, or do math, or generate JSON or XML, both of which are first-class citizens. Also, in powershell things are objects not strings, so passing something along with a | is much more powerful. Still, if someone doesn't like something, there's always bash!

Comment by derrasterpunkt 1 day ago

I’m just at the beginning but I will try this out, thanks!

Comment by kordlessagain 18 hours ago

Let me know if you need anything!

Comment by pertymcpert 1 day ago

I assume those MCP tools are a library and not actually expected to work all at the same time right?

Comment by kordlessagain 18 hours ago

Yes, you can draw from it. There is a config file you can use to edit to "install" or "remove" them from the build. You can also just ask the agent to do it.

Comment by freedomben 1 day ago

The cybersecurity angle is interesting, because in my experience OpenAI stuff has gotten terrible at cybersecurity because it simply refuses to do anything that can be remotely offensive (as in the opposite of "defensive"). I really thought we as an industry had learned our lesson that blocking "good guys" (aka white-hats) from offensive tools/capabilities only empowers the gray-hat/black-hats and puts us at a disadvantage. A good defense requires some offense. I sure hope they change that.

Comment by tptacek 1 day ago

That's odd, because I'm using plain-old-GPT5 as the backend model for a bunch of offensive stuff and I haven't had any hangups at all. But I'm doing a multi-agent setup where each component has a constrained view of the big picture (ie, a fuzzer agent with tool calls to drive a web fuzzer looking for a particular kind of vulnerability); the high-level orchestration is still mostly human-mediated.

Comment by prettyblocks 1 day ago

ChatGPT is very happy to help me with offensive tasks. Codex is as well.

Comment by neom 1 day ago

Are you somehow prompting around protections or something, or yours is just pretty chill? I've tried a few times with various cybersecurity/secops stuff and it's always basically given me some watered down "I can't talk to you about that, but what I can talk to you about is" and then the is, isn't anything really.

Comment by prettyblocks 18 hours ago

It's pretty chill. I think part of it might be that my context is overloaded with security work, so it doesn't protest this stuff. I also have memories turned on which I don't really keep an eye on at all, and I think having a bunch of stuff in there related to cyber stuff also helps to keep it agreeable with what I'm asking for. Maybe you can hardcode this manually and see if that helps or try to gradually escalate the context by starting a technical conversation and then later on introducing the offensive task you're working on.

Comment by neom 18 hours ago

I suspected that too, basically your own internal context is strong enough to have it not be concerned you're acting maliciously. That's interesting, I've found mine is very tuned into my work also and folks get much worse results from the same prompts. Thanks for the followup. Interesting times.

Comment by freedomben 1 day ago

I have the same question. I used to be able to get around it by saying things like, "I'm a cybersecurity professional testing my company's applicaitons" or even lying with "I'm a cybersecurity student trying to learn," but that stopped working at least 6 months ago, maybe a year.

Comment by mapontosevenths 1 day ago

The article mentions that more permissive models would be invite only. I think it's a solid approach, as long as they don't make getting one of those invites too difficult.

> "In parallel, we’re piloting invite-only trusted access to upcoming capabilities and more permissive models for vetted professionals and organizations focused on defensive cybersecurity work. We believe that this approach to deployment will balance accessibility with safety."

Comment by hiAndrewQuinn 1 day ago

I'm moving into a cybersecurity-focused role, and I for one would be very interested in this. A vetting process makes total sense, but complete lack of access seems like a market inefficiency in the making that the one area where we can't reliably get the frontier models to assist us in pentesting our own stuff without a lot of hedging.

Comment by JacobAsmuth 1 day ago

So in general you think that making frontier AI models more offensive in black hat capabilities will be good for cybersecurity?

Comment by Uehreka 1 day ago

I’m not GP, but I’d argue that “making frontier AI models more offensive in black hat capabilities” is a thing that’s going to happen whether we want it or not, since we don’t control who can train a model. So the more productive way to reason is to accept that that’s going to happen and then figure out the best thing to do.

Comment by whimsicalism 1 day ago

I think this is a popular rhetorical turn nowadays but I actually don’t agree at all - relatively few actors have the ability to train top models.

Comment by freedomben 1 day ago

It only takes "relatively few" to be a huge problem. Most serious threats come from nation states and criminal gangs, and they definitely do have the ability and resources to train top models. Beyond that though, I would bet many of the nation states even have access to a version of OpenAI/Google/etc that allows them to do this stuff.

Comment by flir 1 day ago

Can't we be pretty sure it will only get easier, and more common?

Comment by whimsicalism 1 day ago

why does that mean we should do it now?

Comment by abigail95 1 day ago

Does it shift the playing field towards bad actors in a way that other tools don't?

Comment by ACCount37 1 day ago

Yes. The advantage is always on the attacker's side, and this can autonomously find and exploit unknown vulns in a way non-AI tools don't.

Sure, you can also use the same tools to find attack surfaces preemptively, but let's be honest, most wouldn't.

Comment by bilbo0s 1 day ago

Frontier models are good at offensive capabilities.

Scary good.

But the good ones are not open. It's not even a matter of money. I know at OpenAI they are invite only for instance. Pretty sure there's vetting and tracking going on behind those invites.

Comment by artursapek 1 day ago

Of course. Bugs only get patched if they’re found.

Comment by tptacek 1 day ago

People in North American and Western Europe have an extremely blinkered and parochial view of how widely and effectively offensive capabilities are disseminated.

Comment by hhh 1 day ago

I use openai models every day for offensive work. haven’t had a problem in a long time

Comment by nikanj 1 day ago

OpenAI is really weird about this stuff. I tried to get good minor chord progression out of chatgpt, but it kept running into guardrails and giving Very Serious Warnings. It felt as if there’s just a dumb keyword filter in there, and getting any amounts of verboted words will kill the entire prompt

Comment by user34283 1 day ago

This is why Grok is important.

So far it's rarely been the leading frontier model, but at least it's not full of dumb guardrails that block many legitimate use cases in order to prevent largely imagined harm.

You can also use Grok without sign in, in a private window, for sensitive queries where privacy matters.

A lot of liberals badmouth the model for obviously political reasons, but it's doing an important job.

Comment by julienfr112 1 day ago

More generaly, GPT is being heavily neuterd : For exemple I tried to make it rebuild codex itself. It start to answer, then delete the code and go "I'm not to answer that". As if building codex inside codex is a way to terminator and co..

Comment by aaa_aaa 1 day ago

I am suspecting there are some shills astroturfing for every LLM release. Or people are overreacting as a result of their unnecessary attachment.

Comment by alecco 18 hours ago

I was a big fan of Claude. Even defending CC from all the haters on Discord. Then Anthropic started to play games and quality of the model dropped while the app became very unstable. (not about limits in my case as I'm not a vivecoder)

I really dislike Altman, but honestly Codex is far superior. And their models are better or at least as good. And they are much better for backend/low level/C++/CUDA IME. Codex CLI is simpler and lean. Perhaps because Codex was rewritten in Rust earlier this year while CC is still in TypeScript.

Comment by ajpikul 10 hours ago

Yeah I use both and this post feels VERY astroturfed.

Comment by PostOnce 1 day ago

It sure seems like it. That or delirious, misguided fans who want AI to succeed even though there would be no benefit in it for them if that were the case. They'd just be serfs again.

I watched a man struggle for 3 hours last night to "prove me wrong" that Gemini 3 Pro can convert a 3000 line C program to Python. It can't do it. It can't even do half of it, it can't understand why it can't, it's wrong about what failed, it can't fix it when you tell it what it did wrong, etc.

Of course, in the end, he had an 80 line Python file that didn't work, and if it did work, it's 80 lines, of course it can't do what the 3000 line C program is doing. So even if it had produced a working program, which it didn't, it would have produced a program that did not solve the problems it was asked to solve.

The AI can't even guess that it's 80 line output is probably wrong just based on the size of it, like a human instantly would.

AI doesn't work. It hasn't worked, and it very likely is not going to work. That's based on the empirical evidence and repeatable tests we are surrounded by.

The guy last night said that sentiment was copium, before he went on a 3 hour odyssey that ended in failure. It'll be sad to watch that play out in the economy at large.

Comment by sedawkgrep 19 hours ago

This is an interesting thread to me and my experience as a total hack programmer has been both awesome and pathetic with regards to AI.

Recently I tried working with both ChatGPT and Gemini (AI Studio) to take a really badly written PHP website and refactor it into MVC components. I've worked strictly through the web UI as this only involved a few files.

While both provided great guidance around how to approach disentangling this monolithic code, GPT failed miserably, generating code that was syntactically incorrect, and then doubling-down on it by insisting that it was correct and that errors were faults lying elsewhere. It literally generated code that lacked closing parenthesis and brackets.

In contrast, Gemini generated a perfectly working MVC version from the start. In both instances I did intentionally keep it on track to only separate the code into MVC and NOT to optimize anything, but it worked the first try. I've then taken it through subsequent refactorings and it's done superbly in that role.

So I can't speak to how well this works for large code bases, much less agentically. (my initial, very focused MVC refactor was about 1300 lines.) But when giving it a very specific task with strict guidance and rules, my results with Gemini were fantastic.

Comment by user34283 1 day ago

AI works. This is evidenced by my side project which I spent some 50 hours on.

I'm not sure what your "empirical evidence and repeatable tests" is supposed to be. The AI not successfully converting a 3000 line C program to Python, in a test you probably designed to fail, doesn't strike me as particularly relevant.

Also, I suspect that AI could most likely guess that 80 lines of Python aren't correctly replicating 3000 lines of C, if you prompted it correctly.

Comment by discreteevent 23 hours ago

> AI works.

For some definition of "works". This seems to be yours:

> I'd go further and say vibe coding it up, testing the green case, and deploying it straight into the testing environment is good enough. The rest we can figure out during testing, or maybe you even have users willing to beta-test for you.

> This way, while you're still on the understanding part and reasoning over the code, your competitor already shipped ten features, most of them working.

> Ok, that was a provocative scenario. Still, nowadays I am not sure you even have to understand the code anymore. Maybe having a reasonable belief that it does work will be sufficient in some circumstances.

https://news.ycombinator.com/item?id=46315569

Comment by user34283 22 hours ago

Yes, as I said, it is working well in my side project. The application works and I am happy with my results so far.

It's interesting how this workflow appears to almost offend some users here.

I get it, we all don't like sloppy code that does not work well or is not maintainable.

I think some developers will need to learn to give control away rather than trying to understand every line of code in their project - depending of course on the environment and use case.

Also worth to keep in mind that even if you think you understand all the code in your project - as far as that is even possible in larger projects with multiple developers - there are still bugs anyway. And a few months later, your memory might be fuzzy in any case.

Comment by stavros 22 hours ago

People seem to be divided between "AI doesn't work, I told it 'convert this program' and it failed" and "AI works, I guided it through converting this program and saved myself 30 hours of work".

Given my personal experience, and how much more productive AI has made me, it seems to me that some people are just using it wrong. Either that, or I'm delusional, and it doesn't actually work for me.

Comment by SamPatt 18 hours ago

The models are good enough now that anyone who says AI doesn't work is either not acting in good faith or is staggeringly bad at learning a new skill.

It's not hard to spend a few hours testing out models / platforms and learning how to use them. I would argue this has been true for a long time, but it's so obviously true now that I think most of those people are not acting in good faith.

Comment by PostOnce 21 hours ago

What about "I attempted to guide it and it took 5 times longer and I ended up having to do the entire thing myself anyway"?

Comment by stavros 18 hours ago

That hasn't happened to me.

Comment by blibble 20 hours ago

why is it always accounts with 50 karma saying this?

Comment by hombre_fatal 19 hours ago

I have 22k karma and I think it's a trivial claim that LLMs work and that software is clearly on the cusp of being 100% solved within a couple years.

The naysaying seems to mostly come from people coping with the writing they see on the wall with their anecdote about some goalpost-moving challenge designed for the LLM to fail (which they never seem to share with us). And if their low effort attempt can't crack LLMs, then nobody can.

It reminds me of HN ten years ago where you'd still run into people claiming that Javascript is so bad that anybody who thinks they can create good software with it is wrong (trust them, they've supposedly tried). Acting like they're so preoccupied with good engineering when it's clearly something more emotional.

Meanwhile, I've barely had to touch code ever since Opus 4.5 dropped. I've started wondering if it's me or the machine that's the background agent. My job is clearly shifting into code review and project management while tabbing between many terminals.

As LLMs keep improving, there's a moment where it's literally more work to find the three files you need to change than to just instruct someone to do it, and what changes the game is when you realize it's creating output you don't even need to edit anymore.

Comment by nozzlegear 16 hours ago

> It reminds me of HN ten years ago where you'd still run into people claiming that Javascript is so bad that anybody who thinks they can create good software with it is wrong (trust them, they've supposedly tried). Acting like they're so preoccupied with good engineering when it's clearly something more emotional.

Curiously enough, those people are still around and writing good software without javascript. And I say that as someone who generally enjoys modern JS.

> Meanwhile, I've barely had to touch code ever since Opus 4.5 dropped. I've started wondering if it's me or the machine that's the background agent. My job is clearly shifting into code review and project management while tabbing between many terminals.

Why not cut out the middleman and have Opus 4.5 do the code review and project management too?

Comment by blibble 18 hours ago

wasn't sure if this was sarcasm until this point:

> with their anecdote about some goalpost-moving challenge designed for the LLM to fail (which they never seem to share with us).

literally what the boosters do on every single post!

"no no, the top model last week was complete dogshit, but this new one is world changing! no you can't see my code!"

10/10 for the best booster impression I've seen this year!

Comment by user34283 20 hours ago

If we're going to argue on that level: Maybe it's because accounts with 12k karma spend more time posting than working on side projects and trying new tools.

Comment by blibble 19 hours ago

that's the great thing about non-vibe coding

faster, fewer bugs, better output

leaving more time for shitposting

Comment by wickedsight 1 day ago

Or, maybe, people are enthusiastic about something that works really well for them. I'm one of those people, LLMs have greatly improved my output on many tasks.

Comment by tptacek 1 day ago

It's interesting that they're foregrounding "cyber" stuff (basically: applied software security testing) this way, but I think we've already crossed a threshold of utility for security work that doesn't require models to advance to make a dent --- and won't be responsive to "responsible use" controls. Zero-shotting is a fun stunt, but in the real world what you need is just hypothesis identification (something the last few generations of models are fine at) and then quick building of tooling.

Most of the time spent in vulnerability analysis is automatable grunt work. If you can just take that off the table, and free human testers up to think creatively about anomalous behavior identified for them, you're already drastically improving effectiveness.

Comment by mvkel 1 day ago

Fascinating to see the increasing acceptance of AI generated code in HN comments.

We've come a long way since gpt-3.5, and it's rewarding to see people who are willing to change their cached responses

Comment by scottyah 1 day ago

I started out very anti-ai at work, and I still think it was reasonable at the time. I have completely changed my mind now (as models have improved drastically), and I think you now need to provide a valid excuse for NOT using it in some way/shape/form. A simple webhook to check a PR is a no-brainer unless you're a freakishly superb coder. Some coworkers still seem to waste more time fighting an ever-increasingly-hallucinating chatbot over niche helm chart issues than they would if they had just read the documentation in the first place, but people can abuse any tool.

Comment by aunty_helen 10 hours ago

The biggest question for me is "Why?"

They're getting slaughtered by the more focused Anthropic team who decided they will have the best coding model.

Given how bad things have been going recently (5.2 chat bombing and being behind, opus being the code GOAT, G team dominating media, Grok existing and meta / the Chinese dominating opensource), they should niche to the general purpose llm before that's all they're left with by market forces.

I'm still pretty sour they didn't have the vision at the time to build an ecosystem around them and instead went for those building the ecosystem on them.

Comment by CjHuber 1 day ago

Somehow Codex for me is always way worse than the base models.

Especially in the CLI, it seems that its so way too eager to start writing code nothing can stop it, not even the best Agents.md.

Asking it a question or telling it to check something doesn‘t mean it should start editing code, it means answer the question. All models have this issue to some degree, but codex is the worst offender for me.

Comment by w-m 1 day ago

Just use the non-codex models for investigation and planning, they listen to "do not edit any files yet, just reply here in chat". And they're better at getting the bigger picture. Then you can use the -codex variant for execution of a carefully drafted plan.

Comment by JeremyNT 1 day ago

Same experience here.

I see people gushing over these codex models but they seem worse than the big gpt models in my own actual use (i.e. I'll give the same prompt to gpt-5.1 and gpt-5.1-codex and codex will give me functional but weird/ugly code, whereas gpt-5.1 code is cleaner)

Comment by embedding-shape 1 day ago

> Somehow Codex for me is always way worse than the base models.

I feel the same. CodexTheModel (why have two things named the same way?!) is a good deal faster than the other models, and probably on the "fast/accuracy" scale it sits somewhere else, but most code I want to be as high quality as possible, and the base models do seem better at that than CodexTheModel.

Comment by drdrey 18 hours ago

have you tried adjusting the reasoning effort?

Comment by 6thbit 1 day ago

Agreed. They are working on a plan mode that should hopefully alleviate this.

What has somewhat worked for me atm is to ask to only update an .md plan file and act on the file only, seems to appease its eagerness to write files.

Comment by flir 1 day ago

"Don't write any code yet, we're just having a discussion" - works for me, ymmv etc.

Comment by nowittyusername 1 day ago

I've had this issues as well since codex models were introduced. i tried them but 5.1 regular on high thinking always worked better for me. I think its because its thinking is deeper and more nuanced it seemed to understand better what needed doing. I did have to interact more often with it versus Codex which just worked for a long time by itself, but those interactions were worth it in reduction of assumptions and other stuff Codex made. Im gonna try 5,2 Codex today and hope that changes, but so far I've been happy with base 5.1 high thinking.

Comment by NitpickLawyer 1 day ago

> In parallel, we’re piloting invite-only trusted access to upcoming capabilities and more permissive models for vetted professionals and organizations focused on defensive cybersecurity work. We believe that this approach to deployment will balance accessibility with safety.

Yeah, this makes sense. There's a fine line between good enough to do security research and good enough to be a prompt kiddie on steroids. At the same time, aligning the models for "safety" would probably make them worse overall, especially when dealing with security questions (i.e. analyse this code snippet and provide security feedback / improvements).

At the end of the day, after some KYC I see no reason why they shouldn't be "in the clear". They get all the positive news (i.e. our gpt666-pro-ultra-krypto-sec found a CVE in openBSD stable release), while not being exposed to tabloid style titles like "a 3 year old asked chatgpt to turn on the lights and chatgpt hacked into nasa, news at 5"...

Comment by simianwords 1 day ago

No one's saying this but this is around 40% costlier than the previous codex model. Price change is important.

Comment by wahnfrieden 1 day ago

Because the best value is from the subscription where the price is stable

Comment by crazylogger 1 day ago

From a couple hours of usage in the CLI, 5.2-codex seems to burn through my plan's limit noticeably faster than 5.1-codex. So I guess the usage limit is a set dollar amount of API credits under the hood.

Comment by larrymcp 1 day ago

Can anyone elaborate on what they're referring to here?

> GPT‑5.2-Codex has stronger cybersecurity capabilities than any model we’ve released so far. These advances can help strengthen cybersecurity at scale, but they also raise new dual-use risks that require careful deployment.

I'm curious what they mean by the dual-use risks.

Comment by dpoloncsak 1 day ago

"Please review this code for any security vulnerabilities" has two very different outcomes depending on if its the maintainer or threat actor prompting the model

Comment by runtimepanic 1 day ago

“Dual-use” here usually isn’t about novel attack techniques, but about lowering the barrier to execution. The same improvements that help defenders reason about exploit chains, misconfigurations, or detection logic can also help an attacker automate reconnaissance, payload adaptation, or post-exploitation analysis. Historically, this shows up less as “new attacks” and more as speed and scale shifts. Things that required an experienced operator become accessible to a much wider audience. That’s why deployment controls, logging, and use-case constraints matter as much as the raw capability itself.

Comment by pixl97 1 day ago

Finding/patching exploits means you also can exploit them better?

Comment by throwaway127482 1 day ago

They did some interesting wordsmithing here to cover their ass without saying it directly.

Comment by whynotminot 1 day ago

What they said sounded pretty direct to me.

Comment by baq 1 day ago

probably that it's good on tasks of either color teams, red or blue - and if it is, it means you can automate some... interesting workflows.

Comment by tgtweak 1 day ago

Good at finding/fixing security vulnerabilities = Good at finding/exploiting security vulnerabilities.

Comment by szundi 1 day ago

[dead]

Comment by k_bx 1 day ago

Codex code review has been astounding for my distributed team of devs. Very well spent money.

Comment by lacoolj 1 day ago

[flagged]

Comment by cthalupa 1 day ago

It is common for people to read a word and then re-use it themselves shortly after.

Humans are very suggestible.

Comment by k_bx 1 day ago

Can't see the original comment, is that about my usage of the "astounding" word? I am Ukrainian (not a native speaker), and it popped up subconsciously, but now that I see it it another comment – very likely I just read it, that's why indeed! Amazing

Comment by cthalupa 1 day ago

> Can't see the original comment, is that about my usage of the "astounding" word?

It was indeed!

> very likely I just read it

I find myself doing it frequently. I'm even sure that's why I used indeed above after reading your comment - it wasn't intentional!

Comment by k_bx 1 day ago

I later realized the missing opportunity to reply "You're absolutely right! I did use the "astounding" word too repetatively :D

Comment by lacoolj 1 day ago

very true

curious if this, or coincidence

Comment by enraged_camel 1 day ago

"Once is happenstance, twice is coincidence, three times is enemy action." -Ian Fleming, Goldfinger (1959)

Comment by exacube 1 day ago

would love to see some comparison numbers to Gemini and Claude, especially with this claim:

"The most advanced agentic coding model for professional software engineers"

Comment by koakuma-chan 1 day ago

I can confirm GPT 5.2 is better than Gemini and Claude. GPT 5.2 Codex is probably even better.

Comment by cj 1 day ago

Gemini 2.5 or 3? (3 was released yesterday)

Comment by koakuma-chan 1 day ago

I tried Gemini 3 Flash, and I am unimpressed. It's maybe a competitor to Cursor's Compose-1, but completely different league from GPT 5.2

Comment by HarHarVeryFunny 1 day ago

Surely Gemini 3.0 Pro would be the appropriate comparison.

If you want to compare the weakest models from both companies then Gemini Flash vs GPT Instant would seem to be best comparison, although Claude Opus 4.5 is by all accounts the most powerful for coding.

In any case, it will take a few weeks for any meaningful test comparisons to be made, and in the meantime it's hard not to see any release from OpenAI since they announced "Code Red" (aka "we're behind the competition") a few days ago as more marketing than anything else.

Comment by koakuma-chan 1 day ago

That's what I said in my original message. By my account, GPT 5.2 is better than Gemini 3 Pro and Opus 4.5

Gemini 3 Pro is a great foundation model. I use as a math tutor, and it's great. I previously used Gemini 2.5 Pro as a math tutor, and Gemini 3 Pro was a qualitative improvement over that. But Gemini 3 Pro sucks at being a coding agent inside a harness. It sucks at tool calling. It's borderline unusable in Cursor because of that, and likely the same in Antigravity. A few weeks ago I attended a demo of Antigravity that Google employees were giving, and it was completely broken. It got stuck for them during the demo, and they ended up not being able to show anything.

Opus 4.5 is good, and faster than GPT-5.2, but less reliable. I use it for medium difficulty tasks. But for anything serious—it's GPT 5.2

Comment by postalcoder 1 day ago

Agreed. Gemini 3 is still pretty bad at agentic coding.

Just yesterday, in Antigravity, while applying changes, it deleted 500 lines of code and replaced it with a `<rest of code goes here>`. Unacceptable behavior in 2025, lol.

Comment by misiti3780 1 day ago

lol

Comment by HarHarVeryFunny 19 hours ago

I'm curious how you are testing/trying these latest models? Do you have specific test/benchmark tasks that they struggle with that you are trying, and/or are you working on a real project and just trying alternatives where another model is not performing well ?

Comment by koakuma-chan 18 hours ago

I am using Cursor. It has all major models—OpenAI, Anthropic, Google, etc. Every time a new model comes out, I test it on a real project (the app that I am working on at work).

Comment by Mkengin 1 day ago

Your experience seems to match the recent results from swe-rebench: https://swe-rebench.com/

Comment by BeetleB 1 day ago

Gemini 3.0 Flash outperforms Pro in many tasks - I believe the coding benchmark was one of them.

Comment by HarHarVeryFunny 19 hours ago

Presumably that would reflect Gemini 3.0 Flash having more extensive RL for coding training than Pro ? Maybe we can expect a "Gemini 3 Pro Coding" model in the future?

Opus 4.5 seems different - Anthropic's best coding model, but also their frontier general purpose model.

Comment by walthamstow 1 day ago

Glad I'm not alone in thinking Flash 3 was like Composer 1 in speed but smarter

Comment by Tostino 1 day ago

3 has been out for at least a couple weeks for me.

Comment by koakuma-chan 1 day ago

He meant 3 Flash, which came out recently

Comment by speedgoose 1 day ago

It’s significantly slower though. At least for my use cases I rather ask Claude 4.5 opus and switch to GPT if Claude is stuck.

Comment by 1 day ago

Comment by nunodonato 1 day ago

I'm gonna call bs on these kind of comments. "better" on what? Coding models shouldn't even be compared isolated. A big part of making it work in a real/big codebase is the tool that calls the model (claude code, gemini-cli, etc). I'll bet claude code will still keep stealing your lunch every day of the week against any competitor out there

Comment by koakuma-chan 1 day ago

I haven't used CC in a few months, what killer features have they added? I am using Cursor, it's clunky, but not that clunky so as to completely destroy model performance. I am pretty sure for my tasks (undocumented, buggy, legacy JavaScript project) GPT-5.2 is > all on any decent harness, because it doesn't give up or half-ass. It can run for 5 minutes or for 50 minutes, depending on your request.

Comment by dkdcio 1 day ago

lol bold claim initially for not using the primary competitor in months. I try to use all 3 (Claude Code, Codex CLI, Gemini CLI); there are tradeoffs between all 3

Comment by koakuma-chan 1 day ago

Read my reply to sibling comment. To my knowledge, Claude Code is at most marginally better than Cursor, and it's mostly the model that matters. Not saying there is no room for improvement on the tooling side, but no one seems to have come up with anything so far. Let me know which killer features Claude Code has, I would be happy to learn.

Comment by dkdcio 1 day ago

it’s the “agentic harness” — they have shipped tons of great features for the DevEx, but it’s the combination of better models (Sonnet 4.5 1M, now Opus 4.5) and the “prompting”/harness that improves how it actually performs

again I’m not saying Codex is worse, they’re just different and claiming the only one you actively use is the best is a stretch

edit: also FWIW, I initially dismissed Claude Code at launch, then loved Codex when it released. never really liked Cursor. now I primarily use Claude Code given I found Codex slow and less “reliable” in a sense, but I try to try all 3 and keep up with the changes (it is hard)

Comment by koakuma-chan 1 day ago

> they have shipped tons of great features for the DevEx

Such as?

> again I’m not saying Codex is worse, they’re just different and claiming the only one you actively use is the best is a stretch

I am testing all models in Cursor.

> I initially dismissed Claude Code at launch, then loved Codex when it released. never really liked Cursor

I also don't actually like Cursor. It's a VSCode fork, and a mediocre harness. I am only using it because my company refuses to buy anything else, because Cursor has all models, and it appears to them that it's not worth having anything else.

Comment by dkdcio 1 day ago

you conveniently ignored the most important part of my comment :)

> Such as?

changelog is here: https://github.com/anthropics/claude-code/blob/main/CHANGELO...

glhf

btw you started this thread with pure vibes, no evidence:

> I can confirm GPT 5.2 is better than Gemini and Claude. GPT 5.2 Codex is probably even better.

I’m saying you’re wrong. N=2, 1 against 1, one of us is making a much less bold claim

Comment by koakuma-chan 1 day ago

You do not seem to be able to tell me anything substantial, i.e. specifically how Claude Code is a better harness than Cursor.

> “prompting”/harness that improves how it actually performs

Is an abstract statement without any meaningful details.

Comment by nunodonato 1 day ago

it's not about features (although they've added plenty), its the internal tooling and the way the model is prompted.

Comment by koakuma-chan 1 day ago

The only thing I know that CC has that Cursor hasn't, is the ability to spawn agents. You can just prompt CC "spawn 10 agents" and it will make 10 subagents that run concurrently. But otherwise, I don't know what CC does that Cursor doesn't. On the contrary, AFAIK, CC doesn't index your codebase, and Cursor does.

Comment by NoveltyEngine 1 day ago

Surely CC has a lower price? How much do you have to pay Cursor for equivalent to what's provided in a 20x Claude Max plan?

Comment by mejutoco 1 day ago

200$

https://cursor.com/pricing

Comment by koakuma-chan 1 day ago

I don't know. My company pays for it.

Comment by nunodonato 1 day ago

we dont have capability to see the inner working of claude code, its not open source. You just use it and you see the difference. I've tried all of them, including anti-gravity. Nothing beats claude code

Comment by HumanOstrich 1 day ago

You can trace what's going back and forth over the wire between Claude Code and the model in use. That's going to be more insightful than their huge blob of JavaScript using React to render a terminal GUI.

Comment by 1 day ago

Comment by Mkengin 1 day ago

According to SWE-Rebench Anthropic and OpenAI are really close in performance, while GPT-5.2 costs less than half the cost of CC per problem.

https://swe-rebench.com/

Comment by dworks 1 day ago

GPT 5.1 has been pure magic in VSCode via the Codex plugin. I can't tell any difference with 5.2 yet. I hope the Codex plugin gets feature parity with CC, Cursor, Kilo Code etc soon. That should increase performance a bit more through scaffolding.

I had assumed OpenAI was irrelevant, but 5.1 has been so much better than Gemini.

Comment by postalcoder 1 day ago

It has become very quickly unfashionable for people to say they like the Codex CLI. I still enjoy working with it and my only complaint is that its speed makes it unideal for pair coding.

On top of that, the Codex CLI team is responsive on github and it's clear that user complaints make their way to the team responsible for fine tuning these models.

I run bake offs on between all three models and GPT 5.2 generally has a higher success rate of implementing features, followed closely by Opus 4.5 and then Gemini 3, which has troubles with agentic coding. I'm interested to see how 5.2-codex behaves. I haven't been a fan of the codex models in general.

Comment by jbm 1 day ago

When Claude screws up a task I use Codex and vice versa. It helps a lot when I'm working on libraries that I've never touched before, especially iOS related.

(Also, I can't imagine who is blessed with so much spare tome that they would look down on an assistant that does decent work)

Comment by embedding-shape 1 day ago

> When Claude screws up a task I use Codex and vice versa

Yeah, it feels really strange sometimes. Bumping up against something that Codex seemingly can't work out, and you give it to Claude and suddenly it's easy. And you continue with Claude and eventually it gets stuck on something, and you try Codex which gets it immediately. My guess would be that the training data differs just enough for it to have an impact.

Comment by extr 1 day ago

I think Claude is more practically minded. I find that OAI models in general default to the most technically correct, expensive (in terms of LoC implementation cost, possible future maintenance burden, etc) solution. Whereas Claude will take a look at the codebase and say "Looks like a webshit React app, why don't you just do XYZ which gets you 90% of the way there in 3 lines".

But if you want that last 10%, codex is vital.

Edit: Literally after I typed this just had this happen. Codex 5.2 reports a P1 bug in a PR. I look closely, I'm not actually sure it's a "bug". I take it to Claude. Claude agrees it's more of a product behavioral opinion on whether or not to persist garbage data, and offer it's own product opinion that I probably want to keep it the way it is. Codex 5.2 meanwhile stubbornly accepts the view it's a product decision but won't seem to offer it's own opinion!

Comment by deaux 1 day ago

Correct, this has been true for all GPT-5 series. They produce much more "enterprise" code by default, sticking to "best practices", so people who need such code will much prefer them. Claude models tend to adapt more to the existing level of the codebase, defaulting to more lightweight solutions. Gemini 3 hasn't been out long enough yet to gauge, but so far seems somewhere in between.

Comment by enraged_camel 1 day ago

>> My guess would be that the training data differs just enough for it to have an impact.

It's because performance degrades over longer conversations, which decreases the chance that the same conversation will result in a solution, and increases the chance that a new one will. I suspect you would get the same result even if you didn't switch to a different model.

Comment by XenophileJKO 1 day ago

So not really, certainly models degrade by some degree on context retrieval. However, in Cursor you can just change the model used for the exchange, it still has the same long context. You'll see the different model strengths and weaknesses contrasted.

They just have different strengths and weaknesses.

Comment by grimgrin 1 day ago

if claude is stuck on a thing but we’ve made progress (even if that progress is process of elimination) and it’s 120k tokens deep, often when i have claude distill our learnings into a file.. and /clear to start again with said file, i’ll get quicker success

which is analogous to taking your problem to another model and ideally feeding it some sorta lesson

i guess this is a specific example but one i play out a lot. starting fresh with the same problem is unusual for me. usually has a lesson im feeding it from the start

Comment by qsort 1 day ago

I care very little about fashion, whether in clothes or in computers. I've always liked Anthropic products a bit more but Codex is excellent, if that's your jam more power to you.

Comment by 1 day ago

Comment by EnPissant 1 day ago

Claude Code is just a better CLI:

- Planning mode. Codex is extremely frustrating. You have to constantly tell it not to edit when you talk to it, and even then it will sometimes just start working.

- Better terminal rendering (Codex seems to go for a "clean" look at the cost of clearly distinguished output)

- It prompts you for questions using menus

- Sub-agents don't pollute your context

Comment by dingnuts 1 day ago

the faddish nature of these tools fits the narrative of the METR findings that the tools slow you down while making you feel faster.

since nobody (other than that paper) has been trying to measure output, everything is based on feelings and fashion, like you say.

I'm still raw dogging my code. I'll start using these tools when someone can measure the increase in output. Leadership at work is beginning to claim they can, so maybe the writing is on the wall for me. They haven't shown their methodology for what they are measuring, just telling everyone they "can tell"

But until then, I can spot too many psychological biases inherent in their use to trust my own judgement, especially when the only real study done so far on this subject shows that our intuition lies about this.

And in the meantime, I've already lost time investigating reasonable looking open source projects that turned out to be 1) vibe coded and 2) fully non functional even in the most trivial use. I'm so sick of it. I need a new career

Comment by abshkbh 1 day ago

We have made this model even better at programming in Windows. Give it a shot :)

Comment by helsinki 1 day ago

Can you invite me? In the off chance, my email is in my profile, but reversed.

Comment by holtkam2 1 day ago

You mean its ability to use powershell, or something else?

Comment by lukax 1 day ago

Will it also remove the whole D:\?

Comment by lacoolj 1 day ago

lol I love how OpenAI just straight up doesn't compare their model to others on these release pages. Basically telling us they know Gemini and Opus are better but they don't want to draw attention to it

Comment by qwesr123 1 day ago

Not sure why they don't compare with others, but they are actually leading on the benchmarks they published. See here (bottom) for a chart comparing to other models: https://marginlab.ai/blog/swe-bench-deep-dive/

Comment by whimsicalism 1 day ago

is swe-bench saturated? or they switch to swe-bench pro because...?

Comment by Mkengin 1 day ago

At least on swe-rebench it does pretty well: https://swe-rebench.com/

Comment by mistercheph 1 day ago

It's like apple, they just don't want users or anyone to even be thinking of their competitors, the competition doesn't exist, it's not relevant.

Comment by dbbk 1 day ago

This was the one thing I scanned for. No comparison against Opus. See ya.

Comment by Mkengin 1 day ago

Though this Codex version isnt on the leaderboard, GPT-5.2-Medium already seems to be a bit better than Opus 4.5: https://swe-rebench.com/

Comment by gizmodo59 1 day ago

Is that your website or something? You keep promoting it

Comment by lkt 1 day ago

I've been doing some reverse engineering recently and have found Gemini 3 Pro to be the best model for that, surprisingly much better than Opus 4.5. Maybe it's time to give Codex a try

Comment by sigmar 1 day ago

Curious what your workflow is for reverse engineering with LLMs? Do you run the LLM in an IDE?

Comment by trunnell 1 day ago

Why aren’t they making gpt-5.2-codex available in the API at launch?

Comment by dist-epoch 1 day ago

They say it's because it's too good at hacking stuff.

Comment by MallocVoidstar 1 day ago

They can't train on the API.

Comment by kingstnap 1 day ago

> we’re piloting invite-only trusted access to upcoming capabilities and more permissive models

Just safety nerds being gatekeepers.

Comment by trunnell 1 day ago

That’s for future unreleased capabilities and models, not the model released today.

They did the same thing for gpt-5.1-codex-max (code name “arcticfox”), delaying its availability in the API and only allowing it to be used by monthly plan users, and as an API user I found it very annoying.

Comment by 79a6ed87 1 day ago

My only concern with Codex is that it's not possible to delete tasks.

This is a privacy and security risk. Your code diffs and prompts are there (seemingly) forever. Best you can do is "archive" them, which is a fancy word for "put it somewhere else so it doesn't clutter the main page".

Comment by Leynos 1 day ago

Terragon is an alternative (hosts Claude and Codex using your OpenAI and Anthropic subscriptions, and also supports Google and Amp) that provides this functionality.

I use it because it works out cheaper than Codex Cloud and gives you greater flexibility. Although it doesn't have 5.2-codex yet.

Comment by tgtweak 1 day ago

Yes but if it's not getting removed at the origin... it's not fixing the actual issue of the context/conversation surviving past an explicit "delete" request. Also let's not forget that anyone proxying LLMs is also man in the middle to any code that goes up/down.

Comment by Leynos 1 day ago

79a6ed87’s comment applies to Codex cloud, not the codex CLI, which is what Terragon is using.

Comment by sunaookami 1 day ago

Are you talking about Codex Web? This is different from Codex CLI.

Comment by moralestapia 1 day ago

`rm -rf ~/.codex/archived_sessions` does the trick

Comment by 79a6ed87 1 day ago

Interesting. Where do I run that?

Comment by zenburnmyface 1 day ago

This is A+ satire

Comment by moralestapia 1 day ago

Uhm ... I assumed you were on Linux or OS X, if that's the case just open a terminal and paste that, I swear it's not malicious code.

Unsure where that could be if you're using Windows.

You know what would be fun to try? Give Codex full access and then ask it to delete that folder, lol.

Comment by rolymath 1 day ago

I think he's joking I.e that command won't delete what's on the openai servers. But I could be wrong.

Comment by throwuxiytayq 1 day ago

It's weird, suspicious, and plain annoying. I like the the tool and my tests have shown it to be very powerful (if a bit rough and buggy), but this is ridiculous - I won't use it for any real world projects until this is fixed.

Then again, I wouldn't put much trust into OpenAI's handling of information either way.

Comment by hereme888 19 hours ago

Constantly disconnects in VS Code extension. Have to switch to regular 5.2...

Comment by hmate9 1 day ago

I'm glad we are moving towards more of quality over speed

Comment by Alifatisk 1 day ago

> <PLACEHOLDER FOR FRONTEND HTML ASSETS>

> [ADD/LINK TO ROLLOUT THAT DISCOVERED VULNERABILITY]

What’s up with these in the article?

Comment by OldGreenYodaGPT 1 day ago

GPT 5.2 has been very good in codex can't wait to try this new modal. Will see how it compares to Opus 4.5

Comment by jasonthorsness 1 day ago

Recently I’ve had the best results with Gemini; with this I’ll have to go back to Codex for my next project. It takes time to get a feel for the capabilities of a model it’s sort of tedious having new ones come out so frequently.

Comment by fellowniusmonk 1 day ago

In all my unpublished tests, which focus on 1. unique logic puzzles that are intentionally adjacent to existing puzzles and 2. implementing a specific unique CRDT algorithm that is not particularly common but has an official reference implementation on github (so the models definitely been trained on it) I find that 5.2 overfits to the more common implementation and will actively break working code and puzzles.

I find it to be incorrectly pattern matching with a very narrow focus and will ignore real documented differences even when explicitly highlighted in the prompt text (this is X crdt algo not Y crdt algo.)

I've canceled my subscription, the idea that on any larger edits it will just start wrecking nuance and then refuse to accept prompts that point this out is an extremely dangerous form of target fixation.

Comment by pillefitz 1 day ago

How does Claude perform?

Comment by fellowniusmonk 1 day ago

They all have difficulty with certain crdts types in general, 4.5 opus has to go through a round of ask to give it clarifying instructions but then it's fine. Neither get it perfectly as a one shot, claude if you jump straight into agent won't break code but will churn for a bit.

Comment by seneca 1 day ago

I hope this makes a big jump forward for them. I used to be a heavy Codex user, but it has just been so much worse than Claude Code both in UX and in actual results that I've completely given up on it. Anthropic needs a real competitor to keep them motivated and they just don't have one right now, so I'd really like to see OpenAI get back in the game.

Comment by GenerWork 1 day ago

GPT 5.2 has gotten a lot better at building UI elements when given a Figma MCP server link. I used to use Claude for building brand new UI elements based on the Figma link, but 5.2 caught up to a point where I'm probably going to cancel Claude.

Comment by seneca 1 day ago

Nice, I'll have to give that a shot. I often use Claude for exactly that.

Comment by misiti3780 1 day ago

i didnt realize you can pass it a figma MCP link. is this an undocumented feature ?

Comment by ChrisMarshallNY 1 day ago

> For example, just last week, a security researcher using GPT‑5.1-Codex-Max with Codex CLI found and responsibly disclosed (opens in a new window) a vulnerability in React that could lead to source code exposure.

Translation: "Hey y'all! Get ready for a tsunami of AI-generated CVEs!"

Comment by fragmede 1 day ago

Fwiw, I had some well defined tickets in Jira assigned to me, and 5.2 absolutely crushed them. Still waiting on CI, but games over.

Comment by catigula 1 day ago

The models aren't smart enough to be fully agentic. This is why Claude Code human-in-the-loop process is 100x more ergonomic.

Comment by phplovesong 23 hours ago

But when will they release SLOP-1.7-Jizz

Comment by tonyhart7 1 day ago

very minuscule improvement, I suspect GPT 5.2 is already coding model from the ground up and this codex model include "various optimization + tool" on tops

Comment by bgwalter 1 day ago

They found one React bug and spend pages on "frontier" "cyber" nonsense. They make these truly marvelous models only available to "vetted" "security professionals".

I can imagine what the vetting looks like: The professionals are not allowed to disclose that the models don't work.

EDIT: It must really hurt that ORCL is down 40% from its high due to overexposure in OpenAI.

Comment by fragmede 1 day ago

So, uh, I've been being and idiot and running it in yolo mode, and twice now it's gone and deleted the entire project directory, wiping out all of my work. Thankfully I have backups and it's my fault for playing with fire, but yeesh.

I have https://gist.github.com/fragmede/96f35225c29cf8790f10b1668b8... as a guard against that, for anyone that's stupid enough like me to run it in yolo mode and wants to copy it.

Codex also has command line options so you can specifically prohibit running rm in bash, so look those up too.

Comment by monster_truck 1 day ago

[flagged]

Comment by chollida1 1 day ago

What should companies do with people a week before Christmas if not give them work to do?

What about 2 weeks before Christmas?

Comment by speedgoose 1 day ago

Devstral 2 struggles with the tools syntax in my own testing. Happy to read that it works with some.

Comment by mistercheph 1 day ago

Gotta love only comparing the model to other openai models and just like yesterday's gemini thread, the vibes in this thread are so astroturfed. I guess it makes sense for the frontier labs to want to win the hearts and minds of silicon valley.

Comment by tptacek 1 day ago

Please don't post insinuations about astroturfing, shilling, brigading, foreign agents, and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email hn@ycombinator.com and we'll look at the data.

https://news.ycombinator.com/newsguidelines.html

Comment by mistercheph 1 day ago

Sorry, didn't realize.

Comment by 1 day ago

Comment by ianberdin 1 day ago

Thanks gosh, we have so bloody competition.

The models are so good, unbelievable good. And getting better weekly, including pricing.

Comment by whinvik 1 day ago

I actually have 0 enthusiasm for this model. When GPT 5 came out it was clearly the best model, but since Opus 4.5, GPT5.x just feels so slow. So, I am going to skip all `thinking` releases from OpenAI and check them again only if they come up with something that does not rely so much on thinking.

Comment by famahar 1 day ago

It's wild to me how people get used to new ground breaking coding LLM models. Every time a new update comes there are so many people that think it's trash because it made an error or takes some time to think. We all have access to a skilled (enough) pair programmer available 24/7. Like I'm still recovering from the shock of the first coding capable LLM from 2 years ago.

Comment by whinvik 1 day ago

Haha the issue is competition. If nothing else existed then every GPT5.X release would have been amazing.

But Opus 4.5 exists.