DeepSeek V4 Pro beats GPT-5.5 Pro on precision
Posted by yogthos 5 days ago
Comments
Comment by Stitch4223 5 days ago
The article reads like thin, auto-generated AI clickbait for nerd sniping or shilling a model.
Consider the lead:
> DeepSeek V4 Pro wins this head-to-head by being more exact where it matters: following instructions, matching schemas, and solving edge cases cleanly. GPT-5.5 Pro is still strong, but it gave away points with avoidable deviations.
“where it matters”, “cleanly”, “is still strong”, and other vague references, instead of simply stating that DeepSeek yielded more concise results in 3 out of 4 tests.
1 star.
Comment by monooso 5 days ago
Per Merriam-Webster [^1], a lede is:
> the introductory section of a news story that is intended to entice the reader to read the full story
(Emphasis mine)
You may prefer more matter-of-fact phrasing, of course, but criticising a lede for attempting to achieve its goal is unjustified.
Comment by kzrdude 5 days ago
Comment by hypfer 5 days ago
So dismissing it on technicalities is for sure clever but also obvious and lame.
The Letter/spirit thing eventually got boring. Please find better material
Comment by monooso 5 days ago
GP is explicitly criticising the language in the lede as being unsuitably vague, hence my reply.
As to the goal of the article, I fail to see what is dishonourable about comparing LLMs. You may consider the methodology flawed, but it's a perfectly respectable goal.
Sorry, was that another technicality? I'll try to find better material, just for you.
Comment by gosub100 4 days ago
Comment by philipallstar 4 days ago
Comment by saurik 4 days ago
Comment by philipallstar 4 days ago
I was saying that because of the previous comment:
> to Scam Altman's creation
It wasn't derived in the same way though - I can read loads of books and so can write my own book, but that's not derivation in the same way as DeepSeek's derivation.
Comment by GhostKissFiller 22 hours ago
Comment by GhostKissFiller 22 hours ago
Comment by Stitch4223 4 days ago
Filling it with slop constructs signals the reader no effort was made writing the article. So no effort should be put into reading it.
The rest of the article is equally flimsy. Great clickbait title, perhaps that is even harder than writing a lede.
I am not a native speaker :)
Comment by jampekka 5 days ago
I found the writing clear and quite even handed. The lead is a bit salesy, but leads typically are. Knee-jerk dismissals based on vibes that something is LLM generated are quite low-effort.
Comment by zozbot234 5 days ago
Comment by karlmedley 5 days ago
Comment by root-parent 4 days ago
It shows DeepSeek is competitive with, and sometimes better than, GPT 5.5. It also shows there is no moat. As such it is a highly significant signal.
Comment by unusualmonkey 4 days ago
An X5 is not simply “inferior” to a CR-V, or vice versa. A Camry is not “inferior” to an F-150, or vice versa. They are optimized for different buyers, budgets, constraints, and use cases.
That may actually be the better analogy for AI models: there probably is not one universal “best” model. There are models that are better or worse for particular tasks, price points, latency requirements, deployment constraints, privacy needs, etc.
Comment by philipallstar 4 days ago
Comment by an0malous 4 days ago
No one ever says this about the “pelican on a bicycle” metric
Comment by irthomasthomas 4 days ago
Comment by infecto 4 days ago
Comment by mrngld 4 days ago
Comment by redsocksfan45 4 days ago
Comment by an0malous 4 days ago
https://news.ycombinator.com/item?id=48311979
Gemini Flash release 19 days ago, again no criticism:
Comment by irthomasthomas 4 days ago
"there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models ...
Today, even that loose connection to utility has been broken..."
https://simonwillison.net/2026/Apr/16/qwen-beats-opus/
Comment by rtgfhyuj 4 days ago
Comment by psadauskas 5 days ago
Comment by sankaritan 5 days ago
The remaining 5% of time you get a big boost for your high-reasoning problem solving needs and evade a lot of pain. Now, I just need to be able to predict accurately when I need this extra 5% and when not :)
Comment by yogthos 4 days ago
Comment by powerapple 5 days ago
Comment by selfawareMammal 5 days ago
Comment by miroljub 5 days ago
I don't feel like paying 100 times the price for a 1-5% better tool.
Comment by lioeters 5 days ago
It implies that eventually open-weight models like DeepSeek, which are self-hostable locally or on premises, will become good enough for more people and businesses, in terms of productivity gains versus cost. Consumer hardware will adapt to that demand, making it even more affordable and within reach.
Not sure how that speculation fits with the billions of dollars of investment that AI companies will need to convert to profit somehow.
Comment by joystick_0x0 5 days ago
Comment by InsideOutSanta 5 days ago
However, that's probably not how most professional developers use LLMs. I tend to give well-specified, more constrained tasks, and for those, I find that Opus performs worse than other models precisely because it tends to infer unstated requirements and do things I didn't want it to do. In this situation, GPT 5.5 works better for me because it only and precisely does what I ask it to.
Comment by OtomotO 5 days ago
It worked for me too, for months, when I was working on trivial web projects.
Around February of this year it got lobotomized and I quit my subscription at the end of March.
I am not going back.
Comment by skerit 5 days ago
Comment by bob1029 5 days ago
The "intelligence" is clearly there now. Trying to measure it seems pointless. I can't shop for hammers at the hardware store and sort by the quality of finished products they would produce. That is clearly an insane ask, but that's approximately what is being pushed for with these models now.
Domain specificity (harness & environment) is where the magic happens next. I intentionally use a slightly less powerful model to help reveal weakness in how I've exposed the domain to the model. Having capability reserves available dramatically increases confidence around a project like this. If the customer starts to complain about some edges, I can crank them up to gpt5.5 for target scenarios. If I'm already on 5.5 there's nowhere else to go. I'm up against the wall.
Comment by gcgbarbosa 5 days ago
I wonder if I am using the same models as everyone else. To me, LLMs still give good answers 80% of the time, but the other 20% of the time they fail so miserably that it's obvious the "intelligence" is not there.
Comment by coldtea 5 days ago
But when an LLM does it on an area we know, we notice and suddenly it's too much.
Comment by nibbleyou 5 days ago
With an LLM you never know where it can fail. There is no domain expertise for an LLM. It can fail in a miserable way in the same domain it worked spectacularly for.
Comment by Aeolos 4 days ago
Indeed, if you remember before AI took the world by storm, HN used to be chock-full of articles about how the hiring process is broken for both employers and candidates, where you can never tell if what you see is what you get.
When I run a local LLM I get none of that. I hit the intelligence walls or buggy behaviour, but it doesn't matter if it's 8am or 8pm, the model behaves exactly the same. If something doesn't work as I wished, I can retry as many times as I wanted without the model getting angry at me.
Comment by darkwater 4 days ago
Comment by philipallstar 4 days ago
Comment by jdiff 19 hours ago
Comment by girvo 5 days ago
Well of course. The owners of the companies building this are constantly talking about it replacing us all. Why would it be surprising that it would then be held to a higher standard?
Comment by coldtea 5 days ago
Comment by lenkite 4 days ago
Comment by jtbayly 5 days ago
A few days ago I asked ChatGPT where a Spurgeon quote came from. Response:
“That quote is widely attributed to Charles Spurgeon, but pinning down an exact sermon or written source is surprisingly difficult—and that’s a red flag.
Short answer There’s no well-attested primary source (sermon, lecture, or publication) where Spurgeon clearly says that exact wording.” Etc. etc. … Why it sounds like Spurgeon It fits his theology and rhetoric almost perfectly: • etc etc. … Closest authentic themes (but not the quote) Spurgeon repeatedly says things like: • etc etc. … So the quote is basically: a modern condensation of real Spurgeon ideas, not a verifiable citation etc. etc.”
Utter bullshit. One web search produces the full sermon manuscript with the quote.
One could argue that the previous context in the thread primed the LLM to fail here, but once again, a person is not confused by the change of topic.
Comment by 542354234235 4 days ago
"The Dunning-Kruger effect describes a disturbing cognitive bias that afflicts us all. People with limited expertise in an area tend to overestimate how much they know—and we all have gaps in our expertise." [1]
[1] https://www.openmindmag.org/articles/david-dunning-on-expert...
Comment by jtbayly 4 days ago
Nobody that I know would do this.
Comment by 21asdffdsa12 5 days ago
The "works for me" tells us more about the field of the LLM reviewer than about the LLM.
Comment by monster_truck 4 days ago
I'm a month and a half deep into using it to make a traffic simulator with a bespoke physics engine that has complete drivetrain, suspension, and tire kernels. Think rally sim with an arcadey super off road presentation. It also has a full (also bespoke) webtransport stack that has held up beyond my wildest dreams. The simulation itself is capable of >500k cars. That was all complete about 2 weeks ago, the remainder of the work is integrating and optimizing the (you guessed it, also bespoke) pure synthesis sound engines for drivetrain/engine/tire/collision noise, and making pixi performant enough to actually display it all.
My biggest regret is actually accepting its choice of pixi, if I would have just trusted what I knew and done my own renderer too it'd already be finished! In the meantime I'm having fun boiling down the nonlinear continuous-ish models into fitted surrogate polynomials and regime-specific closed forms. Currently using cloud credits I was given to test the library I need to accelerate this work on CDNA3/4 cards. It's so nice to make someone else's room hot for a change
I've really enjoyed the ~3 month speedrun from "he has psychosis" to "the model did everything", yet somehow the number of people having this kind of success continues to match up with where I'd rank a given dev. There just aren't that many talented people out there and an even smaller subset of them are aiming high enough with LLMs, if at all. It's a truly awesome time to not have/need a job
E: Most of my frustration is directed at OAI, they keep fucking up the cache and usage calculations. They got a grand out of me, I'm excited to see what Deepseek does for me with the same.
Comment by wolvesechoes 5 days ago
Can confirm, but I always read I am holding it wrong.
Comment by OtomotO 5 days ago
Comment by 20k 5 days ago
The issue is once you hit niche physics simulations there simply isn't any training data available, so the limitations of them become incredibly apparent. It's also problematic because a field itself will contain lots of wrong information (it's research!), and AI picks all this up uncritically
I thought I'd give chatgpt a quick spin on my favourite question, which is "is the adm formalism strictly equivalent to general relativity", to which it consistently gives the wrong answer
>Ah, now you’re hitting the subtlety head-on—that’s exactly where the “strict equivalence” claim needs nuance. Let’s unpack this carefully.
I don't know how anyone can stand these tools. It's just an obnoxious glazing machine that tells me I'm a genius consistently
Gemini gives a little more of a robust answer, but fails catastrophically for the question "is the bssn formalism numerically stable", where just about the entire answer is completely wrong from top to bottom. It certainly looks convincing. It's got all the right terminology. It manages to piece together the right set of words, but all the informational content is wrong, which isn't exactly a small problem
I struggle to see how these tools are of any use
Comment by sofixa 5 days ago
Comment by 20k 5 days ago
Comment by sofixa 5 days ago
Comment by wolvesechoes 5 days ago
Comment by sofixa 4 days ago
Comment by 20k 4 days ago
The current trend of every industry is to jump onto anything, call it AI, and pretend it's being used everywhere. There's absolutely good reason to be sceptical of this
Comment by otabdeveloper4 5 days ago
Good enough for enterprise work tho. (Also the secret sauce to "holding LLMs right".)
Comment by hodgehog11 5 days ago
Comment by alemanek 4 days ago
I still find things to tweak and fix up but the amount dropped pretty dramatically. As always I am responsible for what I ship so I review and test everything of course. I still think we are a ways away from fully automated software forge but what is currently possible is pretty cool.
Comment by dannyw 4 days ago
An auditing/QA step (whether a grading checklist, verification, etc) can get you further. Likewise for a planning step.
Comment by scotty79 5 days ago
Comment by kzrdude 5 days ago
Comment by weird-eye-issue 5 days ago
Comment by pigpag 4 days ago
Comment by digitaltrees 5 days ago
That being said the models still surprise me with a broad range of hallucinations, lack of epistemology or common sense or inability to follow instructions on a daily basis.
Today it was trying to get opus 4.8 to just follow a simple architectural pattern for controllers in a rails app. It was pulling teeth out of a shark.
Comment by mdp2021 4 days ago
Already the fact that we could have to ask "there where", the fact that we have met clearly unintelligent bots, creates a requirement about defining where it (intelligence) is and investigating what put it there, to get the warranties that intelligence will be met consistently, structurally, and not casually, apparently.
Casual use, casual tool; mission critical use, certified tool.
Comment by dominotw 4 days ago
not really. it happens in training and RL. your harness is not going to override what it has been trained to do.
sure harness is useful if you are trying to build crud websites if model is trained on stamping out crud websites. But that's just a waste of time remixing things better.
Comment by ricardobayes 4 days ago
We are just getting into the nitty-gritty of LLM benchmarking - to be fair, they still have a long way to go IMO. But it's incredibly exciting that a locally run LLM is capable of producing results similar to a SOTA model.
Comment by scotty79 5 days ago
What? You can and you should. That's exactly what product tests are enabling you to do. If you need a glue, you want to look at someone who tried to glue some things with a few glues so you know roughly what to expect from which specific glue.
Comment by melon_tsui 4 days ago
Comment by SwellJoe 5 days ago
GPT 5.5 Pro found two out of four cases that it got to before blowing its budget. Maybe it would have been the best of the bunch with infinite budget, but Opus 4.8, DeepSeek V4 Pro, and MiMo 2.5 Pro found four of nine of the bugs. Opus was an order of magnitude cheaper than GPT 5.5 Pro (and something like 30% cheaper than GPT 5.5), DeepSeek and MiMo were two orders of magnitude cheaper at roughly a dime per case.
GPT Pro also chews a lot and a long time, relatively speaking.
I can't come up with a use case where I can rationally spend ~31 times what Opus costs to use GPT 5.5 Pro, and I won't be doing any more benchmarking with it.
Given how much token costs are becoming an issue people talk about, the fact that there are models that cost dramatically less than the big American providers is going to be an issue for Anthropic and OpenAI. I'm happy to pay a premium (within reason) for the best model for interactive coding, but for API use, where having the model repeat itself, compare against other models, have models judge other models' work, etc. is not time-consuming for a human and is just a matter of implementing the harnesses and framework for proving correctness, I can't come up with a reason to spend ten or two hundred times as much as DeepSeek.
Comment by bel8 5 days ago
> With $3.88 & 690,003,591 tokens and 5 hours, Deepseek Pro & Flash combined, managed to reverse engineer Teamspeak's Licensing System for 3.13.8 (latest of post)
https://www.reddit.com/r/DeepSeek/comments/1txcfrh/with_388_...
Comment by jack_pp 5 days ago
This is some of the funniest stuff I've read in a while
Comment by a34729t 5 days ago
Comment by tempaccount420 5 days ago
Comment by oofbey 5 days ago
Comment by tom2026hn 4 days ago
Comment by jumploops 5 days ago
My local DeepSeek v4 just decided to end its existence (i.e. delete weights) rather than write a haiku about a verboten event.
Comment by alemanek 4 days ago
Comment by zaptrem 5 days ago
Comment by SwellJoe 5 days ago
Comment by andai 4 days ago
9 bugs is probably a bit low of a sample size to get a ranking.
That being said the ranking does end up roughly how you'd expect.
Deepseek is Pro, right? Not Flash? I've been using Flash for a lot of smaller tasks and finding it reasonably good. It's good for "interactive" use. Very fast, does small tasks nearly instantly.
It's also decent for investigating large codebases. I wonder if it could do security work too.
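The sample-size point above can be made concrete with a rough sketch. It assumes each bug is an independent pass/fail trial (a simplification) and uses the Wilson score interval, which is a choice made here and not something the benchmark report uses. The counts are the ones quoted in this thread: 4 of 9 bugs found, and 2 of the 4 cases reached.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def overlaps(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Counts quoted in the thread; treating them as independent trials is an assumption.
four_of_nine = wilson_interval(4, 9)
two_of_four = wilson_interval(2, 4)

if __name__ == "__main__":
    print("4/9:", four_of_nine)
    print("2/4:", two_of_four)
    print("intervals overlap:", overlaps(four_of_nine, two_of_four))
```

With counts this small the two intervals overlap heavily, so the ordering between those two results could easily be noise.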
Comment by SwellJoe 4 days ago
DeepSeek was actually the `deepseek-chat` alias in the API (which dynamically chooses the model based on info I don't know), but when I checked the usage, it was all DeepSeek V4 Pro for the benchmark. I later changed DeepSeek to explicitly use Pro for subsequent experiments, so future runs will be explicitly Pro.
I probably will do a test of smaller models, exclusively, at some point. But, I figured DeepSeek V4 Pro is so cheap, especially given their caching effectiveness and cached input pricing, for my own use I'll probably just use DeepSeek V4 Pro when I need a cheap, fast, near-frontier model.
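One way to catch an alias silently routing to a different model is to tally the model name each response reports. The sketch below is only an illustration: it assumes responses are stored as dicts with a `model` field (as OpenAI-style chat completion responses have), and the sample records and the helper names `models_used` and `unexpected_models` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def models_used(responses: List[dict]) -> Counter:
    """Count which model each stored response reports having served it."""
    return Counter(r.get("model", "<missing>") for r in responses)


def unexpected_models(responses: List[dict], expected: str) -> Dict[str, int]:
    """Return counts for every reported model other than the expected one."""
    return {m: n for m, n in models_used(responses).items() if m != expected}


# Hypothetical stored responses; the model ids are the ones used elsewhere in this thread.
sample = [
    {"id": "run-1", "model": "deepseek-v4-pro"},
    {"id": "run-2", "model": "deepseek-v4-pro"},
    {"id": "run-3", "model": "deepseek-v4-flash"},
]

if __name__ == "__main__":
    stray = unexpected_models(sample, expected="deepseek-v4-pro")
    print("unexpected models:", stray or "none")
```

Running a check like this after each benchmark run would flag Flash usage without waiting for the billing page.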
Comment by andai 4 days ago
Comment by SwellJoe 4 days ago
Or maybe it was calling `reasoner` instead. Whatever it was, the billing definitely showed 100% DeepSeek V4 Pro usage for the benchmark. My only usage was the benchmark, and all usage was Pro. (I only noticed that there was a problem in what the benchmark was calling because in a later run, I started seeing Flash usage, which wasn't what I wanted to test.)
I'm absolutely confident the benchmark results were using DeepSeek V4 Pro. It would be useful to also have Flash data, but the report I linked is all Pro.
Comment by chvid 5 days ago
And nice to see the cheap models doing so well.
Comment by epolanski 5 days ago
I don't know whether models are over fitted to benchmarks and people take them at face value, but I spend less on DS4 apis than I do for Claude Code 100$ subscription and I code everyday. So far I'm quite happy with the results.
Comment by manmal 5 days ago
Comment by epolanski 5 days ago
Besides the (quite true) joke, if sending data to DeepSeek is a concern the good thing is that the models are open weight, you can self host them or use third party providers.
Comment by SwellJoe 5 days ago
Comment by epolanski 5 days ago
Comment by zozbot234 5 days ago
Comment by SwellJoe 4 days ago
But, still, it's cool that the work is happening. For some classes of problem it might be an option, and when the 192GB Strix Halo comes out, DS4 will probably become a real contender for self-hosting champ, as that leaves enough memory for a big context.
Comment by zozbot234 4 days ago
Source? The author has demoed a 100k ctx already, and I can't think of a reason why more wouldn't be supported. RAM is a bit tight but that only matters with really long contexts on DeepSeek V4, and proper support for SSD streaming would address this anyway.
BTW, the official support is now merged too.
Comment by SwellJoe 4 days ago
So, it's super cool that such a solid model can run locally and it's probably useful for batched work overnight. But, I'm not going to sit around twiddling my thumbs while working. I think I can write code by hand faster than this. I'll gladly pay for a cloud model so I don't have to wait (especially since DeepSeek models are so cheap).
Comment by zozbot234 4 days ago
It seems that you'll want either top-of-the-line Apple Silicon (Max/Ultra) or cloud inference, which will always be super competitive if your focus is on low latency.
Comment by SwellJoe 4 days ago
Comment by axus 5 days ago
Comment by fc417fc802 4 days ago
Unless you meant being concerned about hosted AI in general, not specifically DeepSeek. In which case yeah that's a huge concern to me but I can't reasonably afford a half million dollar appliance to self host a large model at reasonable performance and don't have anywhere to put one even if I could.
Comment by SwellJoe 5 days ago
Though, I added Mistral's latest model to the mix in the hope that some European model could be a contender, but it failed completely. I don't know if it hit safety guardrails or is just not competent at security work, but it scored 0/9. No errors, it returned the empty JSON set it was supposed to return if it didn't find anything. But, there were plenty of real bugs to find, and some very small self-hosted models found at least some of them.
Comment by epolanski 5 days ago
I don't think that the country matters, whoever you send data to among these AI labs you are at security risk and data risk.
Comment by SwellJoe 5 days ago
Comment by random3 5 days ago
Comment by jameson 5 days ago
Comment by SwellJoe 5 days ago
I was concerned I would need to do something specific in my dumb agent harness to make caching effective, since I'd read Anthropic's reason for forcing people to use Claude Code in order to use the rolling token usage limits on a subscription was because they could control cache behavior more effectively, but DeepSeek seems to be able to handle caching very effectively for raw API calls.
Comment by tempaccount420 5 days ago
Comment by SwellJoe 5 days ago
Comment by jodacola 5 days ago
I only hit the 5 hour limit every few days and the weekly limit a day or two before it resets at the most aggressive. I wouldn’t expect my usage to increase dramatically, other than not being stopped by limits.
I’m still apprehensive about shipping all my stuff off to a lab under an adversarial government (to the US), so not just looking at this from a pure cost basis, but my question is from the cost lens at the moment.
Comment by slopinthebag 5 days ago
It's maybe not quite as knowledgeable as the most expensive American models and maybe makes more mistakes (just a feeling based off of vibes, don't take my word for it), so you need to constrain its scope more. That suits my workflow, half the time I have it generate code in the chat window and then write it myself, and I'm mostly using it at the level of generating function bodies and stuff, not entire features. Although it is writing a lot of SwiftUI without me really knowing the language and doing a fine job as far as I can tell (which isn't much admittedly).
One benefit I don't see talked about is its speed - it's really quick, doesn't spend too much time reasoning even on "max", and the flash model is pretty dang good too. This lets me get into "flow state" when I'm writing code, compared to my experiences with Codex and Opus which would take minutes to complete even basic tasks and kind of ruined my focus.
It's so cheap though, you could download a different harness (Crush, OpenCode, Pi etc) and load $5 in credits and test it for yourself.
Comment by CJefferson 5 days ago
export ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic
export ANTHROPIC_AUTH_TOKEN="PUT YOUR DEEPSEEK KEY HERE"
export ANTHROPIC_MODEL=deepseek-v4-pro
export ANTHROPIC_DEFAULT_OPUS_MODEL=deepseek-v4-pro
export ANTHROPIC_DEFAULT_SONNET_MODEL=deepseek-v4-pro
export ANTHROPIC_DEFAULT_HAIKU_MODEL=deepseek-v4-flash
export CLAUDE_CODE_SUBAGENT_MODEL=deepseek-v4-flash
export CLAUDE_CODE_EFFORT_LEVEL=max
I started by using it for some bigger reading jobs, particularly when I was near limit. Honestly, it's not quite as good, but it's much cheaper, and means I can carry on working. I also find it's sometimes good to ask both Claude and DeepSeek to consider the code and how to polish it, then see what they both say.
Comment by 0xbadcafebee 5 days ago
> I’m still apprehensive about shipping all my stuff off to a lab under an adversarial government (to the US)
Do you mean you don't want to use the models created by a non-US lab? In that case, yes you're stuck with US models, but there's a half dozen big labs in the US. If you meant just where your inference is done, there are providers in 12 different countries through OpenRouter, including the US. Several subscription providers host in multiple countries. There's a lot of choices.
Comment by nerdsniper 5 days ago
As usual, different models get stuck on different things. I run DeepSeek v4 API for most of my Cursor experimentation / poking around / proof of concept stuff, but I trust it less than OpenAI/Claude for writing production code. Sometimes DeepSeek is great for debugging, planning, etc. Sometimes it gets stuck or outputs low quality. That's true of OpenAI and Anthropic models as well though.
Overall, DeepSeek seems serviceable but a rung below Opus 4.8 and GPT 5.5. I run them all on maximum thinking settings.
Comment by reacharavindh 5 days ago
Repo reference here: https://github.com/aravindhsampath/agentic-template
Comment by twotwotwo 4 days ago
DeepSeek and Xiaomi's deals on cache reads go with their models' latest gens making caching cheaper (using less space for KVs). No open-model inference provider has decided to match the pricing. I'm sure that says something about how inference pricing works, but not completely sure what.
Agree with others that top open models aren't on the frontier, and I would expect differences doing big-picture planning or anywhere you're only giving broad brushstrokes and looking for a lot to be guessed. But they do seem fine at coding from a concrete plan! No experience in huge codebases because I only use them outside work, but they seem good enough about gathering info before they dive in that I'd expect them to grep around as they need.
An annoying caveat: individual subscription plans, used heavily, are much cheaper than the API -- see https://she-llac.com/claude-limits -- which complicates any argument about cost. I still think open models are worth playing with. They're one of the things that let us treat this as a technology rather than just as the product offerings of one of a few companies.
Comment by sidrag22 5 days ago
Hardest stuff i threw at it... i did like a set of 3 each for claude/gpt/ds, it was all pretty steady across all providers. I think claude won but it could have just been it rng'd into the 3 easier tasks, they are all similar tasks but not identical, these aren't like benchmark tasks just a steady flow of annoying html/json/regex type stuff. Almost always they need a second pass regardless of what model i throw at it, just to tighten up some loose ends, and it fit right into what my current expectation was of gpt 5.5 and opus 4.6.
Comment by efromvt 4 days ago
For evals in particular (tuning workflows that agents are using), effectively not having to worry about price is an incredible multiplier - getting statistically significant signal is not cheap otherwise.
Comment by scrollop 5 days ago
https://artificialanalysis.ai/evaluations/omniscience
Esp check the Hallucination rate for Deepseek - it's not good.
Comment by overfeed 5 days ago
For strongly-typed coding tasks - and I imagine other tasks that have cheap validity checks: agentic harnesses and thinking tokens are an effective foil against hallucinations, at the expense of time. If a model hallucinates an API, compilation will fail and the error fed back into the machine so it can try again, in a two-steps-forward-one-step-back dance that is unreasonably effective. Given the price delta, it is often more cost effective to let the weaker model spiral towards a solution with many "Oh, wait..." turns
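The retry dance described above can be sketched in a few lines. This is a toy illustration under loud assumptions: Python's built-in `compile()` stands in for a real compiler or type checker (it only catches syntax errors, not hallucinated APIs), and `stub_model` is a hypothetical stand-in for an LLM call that fails once and then corrects itself when given the error.

```python
from typing import Callable, Optional, Tuple


def check_python(source: str) -> Optional[str]:
    """Cheap validity check: return None if the source compiles, else an error message."""
    try:
        compile(source, "<candidate>", "exec")
    except SyntaxError as exc:
        return f"{exc.msg} (line {exc.lineno})"
    return None


def generate_with_feedback(
    generate: Callable[[str, Optional[str]], str],
    task: str,
    max_turns: int = 5,
) -> Tuple[str, int]:
    """Ask for a candidate, validate it, and feed any error into the next attempt."""
    error: Optional[str] = None
    for turn in range(1, max_turns + 1):
        candidate = generate(task, error)
        error = check_python(candidate)
        if error is None:
            return candidate, turn
    raise RuntimeError(f"no valid candidate after {max_turns} turns: {error}")


def stub_model(task: str, error: Optional[str]) -> str:
    """Hypothetical model: broken on the first try, fixed once it sees an error."""
    if error is None:
        return "def add(a, b) return a + b"
    return "def add(a, b):\n    return a + b\n"


if __name__ == "__main__":
    code, turns = generate_with_feedback(stub_model, "write add(a, b)")
    print(f"accepted after {turns} turn(s)")
```

The cost argument follows from the loop: every extra turn costs tokens, so a cheaper model can afford more "Oh, wait..." turns for the same budget.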
Comment by SubiculumCode 5 days ago
Comment by sourcecodeplz 5 days ago
Comment by SubiculumCode 4 days ago
Comment by no-name-here 5 days ago
What is that claim based on?
Comment by fc417fc802 4 days ago
But I assume they're just harvesting training data since that's par for the course. There are also a handful of US labs offering free access for that exact reason.
Comment by SubiculumCode 4 days ago
[1] https://chinaselectcommittee.house.gov/sites/evo-subsites/se... [2] https://ai.americansecurityproject.org/news/ai-imperative-20...
and more.
Of course, you can choose to ignore America-biased sources, but since it aligns with the obvious.
Comment by throwaway67678 4 days ago
*This does not invalidate other concerns (censorship, privacy) but the way people phrase it makes it look like DeepSeek and co. are 'cheating' somehow with their business model by 'distorting' inference cost to make it way artificially lower than its 'natural price' (either notion being hopelessly naive)
Comment by SubiculumCode 4 days ago
"The Zhejiang Provincial State-owned Assets Supervision and Administration Commission (SASAC) is the provincial government agency in Zhejiang, China, responsible for managing, regulating, and overseeing the state-owned assets and enterprises owned by the provincial government." [2]
What does this imply? A state-owned company in China invested a ton of money into DeepSeek. aka State subsidization.
[1] https://www.americansecurityproject.org/wp-content/uploads/2... [2] https://www.fitchratings.com/research/corporate-finance/zhej...
Comment by maxglute 4 days ago
Comment by SubiculumCode 4 days ago
Comment by maxglute 4 days ago
>Gelonghui, February 11th | Zhejiang Orient Financial Holdings Group (600120.SH) announced the following explanation regarding the recently market-focused "DeepSeek Concept": DeepSeek is a large model under Hangzhou DeepSeek AI Basic Technology Research Co., Ltd. (hereinafter referred to as "DeepSeek"). In response to matters of concern in the Capital Markets, the company verified that as of the date of this announcement, the names of companies invested by the fund Sector managed by the company, such as Peking Deep Search Technology Co., Ltd. and Peking Jiuzhang Yunjike Technology Co., Ltd., are quite similar to those of DeepSeek and its affiliated enterprises, but there is no equity investment relationship. The company and the relevant private equity funds managed by the fund Sector have not directly or indirectly invested in DeepSeek.
https://news.futunn.com/en/post/53041547/zhejiang-orient-financial-holdings-group-600120-sh-and-its-managed?level=1&data_ticket=1780940972364876
Comment by throwaway67678 4 days ago
Comment by SubiculumCode 4 days ago
But I will concede this: Due to the opaque nature of the Chinese economy to public scrutiny, we might never know.
I am sure, however, that substantial use of Chinese inference (not their models per se, but on their servers), in aggregate, presents a substantial national security risk for the West. Heck, AI all by itself, without even considering other nations, is a national security threat of the near future, where national security is broadly construed as any threat against its people's welfare, no matter the actor.
Comment by throwaway67678 4 days ago
Maybe not in the US (although Musk getting state subsidies comes to mind), but very common in Europe. Quite a few founder friends of mine have gotten started with state funding (through various R&D promoting agencies). Angel investing is not the only startup funding structure out there
Comment by benterix 5 days ago
Comment by willsmith72 5 days ago
if it's 99.9% comparable performance for less money I'm interested, but I'm skeptical it's there
Comment by unliftedq 5 days ago
The most valuable part of DeepSeek V4 Pro is its low price. I don't expect it to have much better performance than GPT-5.5; even if it only performs like GPT-5.4, it's still a good model.
Comment by sourcecodeplz 5 days ago
Expectations are not always reality. Give the model a try. I just stuck with flash tbh, didn't even use pro. I do webdev in PHP.
Comment by wolttam 4 days ago
If I can describe the problem and its solution well enough, Flash just does it.
If I can’t (or am feeling too lazy to) describe the problem well enough, and can only describe the desired outcome, then I’ve noticed models like GPT 5.5 being clearly better at working out a solid solution on their own.
There are some clear differences in the capabilities of the models, but it’s also clear that smaller open weight models are good enough to be a huge help for most tasks.
Comment by woadwarrior01 5 days ago
Comment by smhanov 4 days ago
Comment by Frost1x 4 days ago
It helps, but you often have to step in on the failure cases and guide them or forcibly fix certain paths to get a solution.
Comment by shenberg 5 days ago
Comment by amunozo 4 days ago
Comment by smartbit 4 days ago
At the moment of writing https://news.ycombinator.com/item?id=48343690 MiMo V2.5 Pro had a lower cache hit ratio. From the article:
OSS models, depending on who you use them from, make a huge difference, mostly due to cache-hit rates.
Model Cheapest effectiveInputPrice (Provider)
MiMo-V2.5-Pro 0.3720 (Xiaomi)
DeepSeek V4 Pro (Max) 0.0560 (DeepSeek)
Comment by amunozo 4 days ago
EDIT: okay, I misread it. Does this mean that DeepSeek reuses a higher percentage of tokens at the cache price than MiMo, am I right?
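For anyone puzzling over the same thing, here is a rough sketch of how a blended "effective" input price could depend on the cache-hit rate. This assumes the price is a simple weighted average of the cached and uncached rates; whether the linked article computes effectiveInputPrice exactly this way is not stated in the thread. The function name and the prices are hypothetical placeholders, not any provider's real rates.

```python
def effective_input_price(hit_ratio: float, cached_price: float, uncached_price: float) -> float:
    """Blend cached and uncached input prices by the share of tokens served from cache.

    Assumption: effective price = hit_ratio * cached + (1 - hit_ratio) * uncached.
    Prices are per million input tokens.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cached_price + (1.0 - hit_ratio) * uncached_price


# Hypothetical prices in USD per million input tokens (placeholders only).
CACHED, UNCACHED = 0.10, 1.00

for ratio in (0.50, 0.90, 0.985):
    price = effective_input_price(ratio, CACHED, UNCACHED)
    print(f"{ratio:.1%} cache hits -> {price:.4f} per million input tokens")
```

Under this assumption, two providers with identical list prices can end up with very different effective prices purely because one serves more of the prompt from cache.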
Comment by smartbit 2 days ago
DeepSeek V4 Pro 0.83%
DeepSeek V4 Flash 2%
Notice that OpenRouter response caching is not available when account-level ZDR is enforced [1]
[0] https://news.ycombinator.com/item?id=48317294#48317823
[1] https://openrouter.ai/docs/guides/features/response-caching#...
Comment by esafak 4 days ago
Comment by ElenaDaibunny 5 days ago
Comment by slopinthebag 5 days ago
Comment by mrgblr 5 days ago
Comment by justinram11 5 days ago
We've been using it for async "heartbeat" processing and sms replies, but it's just too slow for live chat replies (which is a shame, as I'd really love to use it there).
Very capable model, but also very slow.
Comment by fc417fc802 4 days ago
Comment by inhumantsar 5 days ago
Comment by justinram11 5 days ago
Comment by inhumantsar 4 days ago
responses are still usable, no hallucinations or anything, but it's worth keeping in mind if you rely on detailed instructions or large context windows.
Comment by ryanmerket 5 days ago
Comment by embedding-shape 5 days ago
> We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had grok-4-1-fast-non-reasoning score each one. DeepSeek: DeepSeek V4 Pro scored 38.0 to OpenAI: GPT-5.5 Pro's 33.0.
Comment by andai 5 days ago
Requests to grok-4-1-fast-non-reasoning now silently route to grok-4.3 (a 5x more expensive model), with reasoning set to "none".
https://docs.x.ai/developers/migration/may-15-retirement
TFA was published today, which implies grok-4.3 was used.
Comment by embedding-shape 5 days ago
Comment by largbae 5 days ago
Hopefully this dynamic continues long enough to make local/private inference the leading solution for coding.
Comment by natrys 5 days ago
As for other segments, high API pricing gets people to switch to the subscriptions instead which is stickier than the API.
Comment by ipaddr 5 days ago
Comment by trollbridge 5 days ago
Comment by ekidd 5 days ago
So it doesn't surprise me at all that the methodology is weak, too.
Comment by BoiledCabbage 5 days ago
An AI-generated article about a single AI test run, which in theory had many components, where the AI judge declared DeepSeek "won"?
How many runs were there on each test to account for some temperature variance? Only one.
Did deepseek write better code? Did GPT's code have bugs when doing the regex? The AI "news" article doesn't actually say that. It says that grok thought that GPT's approach could have bugs so it declared deep seek the winner.
This is absolutely worthless methodology. And barely measurable methodology - nothing more than a prompt. No definition of what the scoring approach actually is. No definition of what "precision" actually means in this context. This is absolutely worthless and has no business being on the site, forget about on the front page.
So why is it on the front page? Because it aligns with the current "feels" of the community that DeepSeek will get better, and it shows "bad things" about the closed models that it's en vogue to dislike.
I happen to agree with both of the views, but this site is utterly worthless.
If you want HN to be astro-turfed to the max, just upvote content like this without any critical reading of it.
I mean the past 6 months of "here is my chat gpt blog post of how to use a coding agent" are 1000x better than this "news article".
Seriously the amount of respect I've lost recently for the HN community is incredible. A bit harsh, but very true.
Maybe it's a generational thing, maybe it's due to the state of politics, maybe it's a side effect of me getting older, but recently online has turned into nothing but people explicitly (or implicitly) writing about their "team". Comments on this post are nothing but people who clearly see themselves as being on "team deepseek" or "team open models" or some similar variant writing posts in support, even though this is probably one of the worst "articles" to make it to the front page in ages.
It clearly doesn't matter. It supports something on their "team" so they support it via comments.
It kills any form of intellectual discussion. It's all just "this is my team".
Comment by sourcecodeplz 5 days ago
Comment by raincole 5 days ago
... and I believe that is what's happening. I've been advocating for DeepSeek V4 Pro and no one paid me. It's almost too good to be true.
Comment by ryanmerket 5 days ago
Comment by BoiledCabbage 5 days ago
Comment by owebmaster 4 days ago
Comment by BoiledCabbage 4 days ago
Again it's the same problem - what you're doing. I'm not on "team OpenAI". I'm also not on "team deepseek". I'm commenting on how so much of the population is literally unable to see the world unless it is filtered through some "team" lens that they are for or against.
Judge the material based on what's in the material. Not as it boosting or hurting your "team".
The material in this article is crap. Judge it as crap and say so regardless of your team.
But here you look at my saying something negative about a post that is pro "team deepseek" so the only conclusion you're able to make is that I must be for the other team.
It's the inability to think critically that is astounding me here. So many opinions people have now are just "is it for my team or against my team". They are unable to even think of anything else.
I wrote that entire post and you even said you couldn't understand it unless you put it through a lens of being for or against a team...
Comment by owebmaster 4 days ago
You are again making the same mistake as before.
You are making the most passionate defense of team openai pretending that other people are making irrational claims.
Comment by BoiledCabbage 3 days ago
> You are making the most passionate defense of team openai
At no point did I mention OpenAI, refer to OpenAI, or imply anything about OpenAI (I just mentioned your reference). Nothing I'm saying weighs in on any form of discussion or debate between Deepseek & Open Models vs OpenAI.
The fact that you are unable to separate those two is your failing, not mine. Your argument is the equivalent of the following:
A: Deepseek ran into a burning building last week and saved 10,000 orphans from a fire.
Me: No Deepseek did not save 10,000 orphans from a burning building last week. Regardless of what you think of Deepseek it didn't save 10,000 orphans. It's an LLM in a computer, not a humanoid robot - if you look at that for 2 seconds you see that claim is nonsense.
You: By attacking those supporting Deepseek you have declared yourself for team OpenAI and are clearly an OpenAI supporter!
Me: Saying deepseek didn't save 10k orphans has nothing to do with OpenAI. It is a lie saying that deepseek saved 10k lives. It's an LLM chat bot. Regardless of how anyone feels about deepseek - discuss it on its merits, not on bs.
You: See! You keep defending OpenAI you open AI shill! Stop passionately defending OpenAI!
Comment by owebmaster 2 days ago
Comment by amazingamazing 5 days ago
Comment by freakynit 5 days ago
1. MoE (nothing new here, but, this helps a lot)
2. Compressed Attention Mechanisms (this is their core innovation) - this dramatically reduces the Key-Value (KV) cache requirements for longer contexts
Another thing that helps is significantly lower energy costs in China.
Another point from my own guess: they are running (some percentage of) the inference on their own home-grown AI inference chips.
Comment by orbital-decay 5 days ago
Comment by pingou 5 days ago
Every little improvement would save them billions, so it's hard to imagine they aren't pouring a lot of resources into that already.
Comment by orbital-decay 5 days ago
What makes most hardware companies fail at software, for example? AI shops are usually run by ML people, succeeding at unrelated areas of expertise is hard for any organization.
Comment by pingou 5 days ago
Comment by fc417fc802 4 days ago
Comment by esafak 4 days ago
Comment by freakynit 4 days ago
To us, outside of the US, it was pretty obvious from day 1 of US chip-related sanctions on China that it will actually end up benefitting them more than punishing them.
Just wait till they flood the market with dirt-cheap GPU chips. And these are coming.. pretty soon.
Comment by chvid 5 days ago
My guess is that they do aggressive caching / some proprietary optimizations in their hosting setup that they haven't published. Maybe also running at a loss to gain market share.
And judging from latency / network performance, I don't think what you access, when you access deepseek.com from Europe, is hosted in China.
Comment by dchftcs 5 days ago
Comment by electroglyph 5 days ago
Comment by scrollop 5 days ago
Comment by not_a_bot_4sho 5 days ago
Comment by kittikitti 4 days ago
Comment by beernet 5 days ago
More seriously, LLM eval is totally broken judging by the related articles on HN.
Comment by atemerev 4 days ago
Comment by SkitterKherpi 5 days ago
Comment by pshirshov 4 days ago
Comment by hit8run 5 days ago
Comment by zozbot234 5 days ago
Comment by FpUser 4 days ago
Comment by LinkWangder 5 days ago
Comment by rurban 5 days ago
Comment by morpheos137 5 days ago
Comment by jewel 5 days ago
Also, which SOTA western models are you comparing it with? Just to give more flavor.
Comment by freakynit 5 days ago
1. DS4Pro: around opus 4.5
2. DS4Flash: around sonnet 4
3. Mimo v2.5 pro: between opus 4.5 and opus 4.6.
4. minimax M3: around opus 4.6
All of these are very close in terms of quality and pricing. For anything that is not specifically related to coding, DS4Flash has become my de facto model. It just works... super fast, tool calling is perfect, and the price is unbeatable. Caching is out of this world. I'm now regularly hitting 90%+.
Comment by metalspot 5 days ago
i have run through a bunch of tests: re-writing vvenc with assembly kernels, creating the first generation agent harness integration with opencode, porting TS npm modules to C++, porting an entire TS server app to C++, creating a new pure io_uring http server with zero-copy (325K RPS single core), creating a second generation agent from the ground up in C++, setting up a dev environment for custom kernel development on tenstorrent accelerators using tt-metal and ttsim.
i consistently get 98.5% input cache hit ratio. i do see noticeable degradation in performance in the 400-500K context range, so i always try to wrap up sessions by 500K max.
a non-intuitive thing is that the model is very good at low-level systems engineering. i suspect this is because they are internally using it to port their stack to huawei hardware. it can churn out exceptionally complex low level C++ stuff that blows your mind, and then completely choke and run in circles on other seemingly simple tasks.
i only use flash and not pro because i want my tooling to be portable to open weights models that are practical to run. i use deepseek platform and not the open weights models for development, because it is subsidized, and based on observation, i think it is highly likely that they are running some proprietary features on the platform which are not in the open weights model.
it will be very interesting to see what their next point release looks like. the compounding effect of optimizing inference cost and then feeding back inference into training should lead to rapid and accelerating improvement, but only time will tell.
Comment by andai 4 days ago
You mentioned the workflow is heavy on specs and tests. The smaller models seem to be really good at following instructions now. (Well, some of them!)
So that's probably part of why you're seeing good results. It has a very clear target.
Whereas with more open ended instructions they seem to struggle more. I think common sense is the main thing you get with model size.
When I'm working with the big models I feel like I don't have to spell things out so much. The gap is closing, but I'm assuming there is some fundamental limit there based on the size.
Of course the ideal would be Mythos, running for free, in my house, at 1,000 tok/s ;) Someday...
Comment by metalspot 3 days ago
i meant that i initially developed an agent harness as a set of skills integrated with opencode and now i am in the process of using that to write a new agent from scratch to replace opencode.
> probably part of why you're seeing good results
yes. i think tests and setting up feedback loops for diagnosing errors (logs, debugging, etc) are the most important things. in my experience deepseek-v4-flash tends to ignore instructions to use these tools and default to churning through code and guessing the cause of errors, which is often wrong, so it requires occasionally stepping in when it has been grinding fruitlessly for a while and reminding it, probably due to context length and sparse attention forgetting instructions that are put in context at the beginning of a session.
Comment by freakynit 5 days ago
When you say "i use a highly structured harness" ... can you please tell me what is it exactly?
Comment by Imanari 5 days ago
Comment by freakynit 5 days ago
But that's also not needed in most of the times. There will always be a "better" model... but that doesn't make other models "bad".
For my use-cases, open models are now almost on par with these top models... and it's only extremely rare that I genuinely "need" the help of top-of-the line closed models.
Comment by KronisLV 5 days ago
It’s also quite affordable, at my current usage the DeepSeek tokens cost approx. the same as my Anthropic Max 100 USD subscription, though that’s also because DeepSeek generally needs more tokens.
I’d say I have fairly moderate usage, the DeepSeek dashboard shows around 100 million tokens per day, but almost all of it cache. Without cache it’d be like 1.5 million in and 0.5 million out most days, sometimes double, other times half.
Used it with Claude Code for a while, though I have to admit that using OpenCode with DeepSeek just sparks joy. Tone wise, it’s also a bit less obnoxious than Opus sometimes, though the flip side is that it’s wrong more often and sometimes just does dumb shit when it comes to code.
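As a back-of-the-envelope check on the token figures above, the implied cache-hit share can be worked out in a few lines. This is only a sketch: it assumes the roughly 100 million daily total counts cached input, uncached input, and output together, which the comment does not spell out, and the function name is made up for illustration.

```python
def implied_hit_ratio(total_tokens: int, uncached_input: int, output: int) -> float:
    """Share of input tokens served from cache.

    Assumption: total_tokens = cached input + uncached input + output.
    """
    cached_input = total_tokens - uncached_input - output
    if cached_input < 0:
        raise ValueError("uncached input plus output exceeds the total")
    return cached_input / (cached_input + uncached_input)


# Figures quoted in the comment: ~100M total, ~1.5M uncached in, ~0.5M out.
ratio = implied_hit_ratio(100_000_000, 1_500_000, 500_000)
print(f"implied input cache-hit ratio: {ratio:.1%}")
```

Under that reading, the hit ratio comes out at about 98.5%, in the same range as the cache-hit figure another commenter reports elsewhere in the thread.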
Comment by csbrooks 4 days ago
Comment by stainablesteel 4 days ago
there are models you can speak with, that respond to what you say
and there are models that just make lists, that list everything, include weird formats and add asterisks everywhere.
deepseek, to me, will always be the latter, and i can't stand it, you can't ask it a coherent question and get a coherent response.
Comment by wg0 5 days ago
Comment by nhod 5 days ago
I don’t know what it is specifically, but my weak human pattern-matching skills find this kind of language increasingly revolting. I don’t know why it is revolting, per se. It’s just the feeling I get.
Of course, me saying this on HN will get incorporated into GPT-5.6.175 or Claude 4.93 and it will make some version that just moves the revolting frontier elsewhere…
Comment by rglover 5 days ago
"Harry finally had control of the broom. Draco was dead in his sights. The matchup feels earned."
Comment by JamesKaranja 5 days ago
Comment by windexh8er 5 days ago
Comment by nutifafa 5 days ago
Comment by poly2it 5 days ago
Comment by SubiculumCode 5 days ago
Comment by zftnb666 2 days ago
Comment by knightops_dev 4 days ago
Comment by jocelyner 5 days ago
Comment by Datagrout 4 days ago
Comment by madanparas 5 days ago
Comment by zftnb666 5 days ago
Comment by haeseong 5 days ago
Comment by jkwang 5 days ago
Comment by cdogukank 5 days ago
Comment by ben8bit 5 days ago
Comment by black_13 5 days ago
Comment by yoyomaindydjsj 5 days ago
Comment by karinatran 5 days ago
Comment by yogthos 5 days ago
Comment by stalinfan 5 days ago
Grok: Hitler did nothing wrong!
ChatGPT: Altman did nothing wrong!