Gemini 3 Flash: Frontier intelligence built for speed
Posted by meetpateltech 17 hours ago
Docs: https://ai.google.dev/gemini-api/docs/gemini-3
Developer Blog: https://blog.google/technology/developers/build-with-gemini-...
Model Card [pdf]: https://deepmind.google/models/model-cards/gemini-3-flash/
Gemini 3 Flash in Search AI mode: https://blog.google/products/search/google-ai-mode-update-ge...
Deepmind Page: https://deepmind.google/models/gemini/flash/
Comments
Comment by samyok 16 hours ago
I have been playing with it for the past few weeks, it’s genuinely my new favorite; it’s so fast and it has such vast world knowledge that it’s more performant than Claude Opus 4.5 or GPT 5.2 extra high, for a fraction (basically an order of magnitude less!!) of the inference time and price
Comment by thecupisblue 16 hours ago
After reading your comment I ran my product benchmark against 2.5 flash, 2.5 pro and 3.0 flash.
The results are better AND the response times have stayed the same. What an insane gain - especially considering the price compared to 2.5 Pro. I'm about to get much better results for 1/3rd of the price. Not sure what magic Google did here, but I would love to hear a more technical deep dive comparing what they do differently in the Pro and Flash models to achieve such performance.
Also wondering, how did you get early access? I'm using the Gemini API quite a lot and have quite a nice internal benchmark suite for it, so I would love to toy with the new ones as they come out.
Comment by lancekey 9 hours ago
Examples from the wild are a great learning tool, anything you’re able to share is appreciated.
Comment by theshrike79 1 hour ago
And it shouldn't be shared publicly so that the models won't learn about it accidentally :)
Comment by ggsp 1 hour ago
Comment by m00dy 4 hours ago
Comment by lambda 12 hours ago
I periodically ask them questions about topics that are subtle or tricky, and somewhat niche, that I know a lot about, and find that they frequently provide extremely bad answers. There have been improvements on some topics, but there's one benchmark question that I have that just about every model I've tried has completely gotten wrong.
Tried it on LMArena recently, got a comparison between Gemini 2.5 flash and a codenamed model that people believe was a preview of Gemini 3 flash. Gemini 2.5 flash got it completely wrong. Gemini 3 flash actually gave a reasonable answer; not quite up to the best human description, but it's the first model I've found that actually seems to mostly correctly answer the question.
So, it's just one data point, but at least for my one fairly niche benchmark problem, Gemini 3 Flash has successfully answered a question that none of the others I've tried have (I haven't actually tried Gemini 3 Pro, but I'd compared various Claude and ChatGPT models, and a few different open weights models).
So, guess I need to put together some more benchmark problems, to get a better sample than one, but it's at least now passing a "I can find the answer to this in the top 3 hits in a Google search for a niche topic" test better than any of the other models.
Still a lot of things I'm skeptical about in all the LLM hype, but at least they are making some progress in being able to accurately answer a wider range of questions.
Comment by prettyblocks 11 hours ago
Comment by lambda 9 hours ago
So I want to have a general idea of how good it is at this.
I found something that was niche, but not super niche; I could easily find a good, human written answer in the top couple of results of a Google search.
But until now, all LLM answers I've gotten for it have been complete hallucinated gibberish.
Anyhow, this is a single data point, I need to expand my set of benchmark questions a bit now, but this is the first time that I've actually seen progress on this particular personal benchmark.
Comment by ozim 2 hours ago
Get an API key and try to use it for classification of text or images. Say you have an Excel file with 10k somewhat random-looking entries that you want to classify, or filter down to the 10 that matter to you - use an LLM.
Get it to do audio transcription. You can now just talk and it will take notes for you at a level that wasn't possible earlier; without training on anyone's voice, it can handle anyone's voice.
Fixing up text is of course also big.
Data classification is easy for an LLM. Data transformation is a bit harder but still great. Creating new data is hard, so for things like answering questions where it has to generate stuff from thin air it will hallucinate like a madman.
The things LLMs are good at are used in the background by people building actually useful software on top of them, but those problems aren't seen by the general public, who only sees a chat box.
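As a rough illustration of that classification workflow - a minimal sketch assuming the public Gemini generateContent REST endpoint, with a made-up model id and made-up spreadsheet entries:

    import os
    import requests

    # Hypothetical entries pulled from a spreadsheet export.
    entries = ["Refund not processed after 3 weeks",
               "Love the new dashboard!",
               "App crashes on login"]

    url = ("https://generativelanguage.googleapis.com/v1beta/models/"
           "gemini-3-flash-preview:generateContent")  # model id is an assumption
    headers = {"x-goog-api-key": os.environ["GEMINI_API_KEY"]}

    labels = []
    for entry in entries:
        prompt = ("Classify this customer note as one of: bug, billing, praise, other. "
                  "Answer with the label only.\n\n" + entry)
        body = {"contents": [{"parts": [{"text": prompt}]}]}
        resp = requests.post(url, headers=headers, json=body, timeout=30)
        resp.raise_for_status()
        # First candidate's text should be the bare label.
        labels.append(resp.json()["candidates"][0]["content"]["parts"][0]["text"].strip())

    print(list(zip(entries, labels)))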
Comment by katzenversteher 3 hours ago
I know without the ability to search it's very unlikely the model actually has accurate "memories" about these things, I just hope one day they will actually know that their "memory" is bad or non-existent and they will tell me so instead of hallucinating something.
Comment by illiac786 3 hours ago
Maybe the scale is different with genAI and there are some painful learnings ahead of us.
Comment by mikepurvis 9 hours ago
Comment by ComputerGuru 9 hours ago
Comment by coldtea 8 hours ago
After all, it's the same search engine team that didn't care about its search results - its main draw - actively going to shit for over a decade.
Comment by vitorgrs 8 hours ago
They probably use an old Flash Lite model, something super small, and just summarize the search...
Comment by mikepurvis 4 hours ago
Comment by ozim 10 hours ago
Basically making sense of unstructured data is super cool. I can get 20 people to write an answer however they feel like it and the model can convert it to structured data - something I would otherwise have to spend time on, or I would have to make a form with mandatory fields that annoy the audience.
I am already building useful tools with the help of models. Asking tricky or trivia questions is fun and games. There are much more interesting ways to use AI.
Comment by DrewADesign 7 hours ago
Comment by andai 10 hours ago
Which also implies that (for most tasks), most of the weights in a LLM are unnecessary, since they are spent on memorizing the long tail of Common Crawl... but maybe memorizing infinite trivia is not a bug but actually required for the generalization to work? (Humans don't have far transfer though... do transformers have it?)
Comment by lambda 9 hours ago
Comment by andai 21 minutes ago
Kinda sounds like you're testing two things at the same time then, right? The knowledge of the thing (was it in the training data and was it memorized?) and the understanding of the thing (can they explain it properly even if you give them the answer in context).
Comment by TeodorDyakov 11 hours ago
Comment by Turskarama 10 hours ago
Comment by lambda 10 hours ago
Obviously, the fact that I've done Google searches and tested the models on these means that their systems may have picked up on them; I'm sure that Google uses its huge dataset of Google searches and its search index as inputs to its training, so Google has an advantage here. But, well, that might be why Google's new models are so much better: they're actually taking advantage of some of this massive dataset they've had for years.
Comment by grog454 9 hours ago
What's the value of a secret benchmark to anyone but the secret holder? Does your niche benchmark even influence which model you use for unrelated queries? If LLM authors care enough about your niche (they don't) and fake the response somehow, you will learn on the very next query that something is amiss. Now that query is your secret benchmark.
Even for niche topics it's rare that I need to provide more than 1 correction or knowledge update.
Comment by nl 8 hours ago
The reason I don't disclose isn't generally that I think an individual person is going to read my post and update the model to include it. Instead it is because if I write "I ask the question X and expect Y" then that data ends up in the training corpus of new LLMs.
However, one set of my benchmarks is a more generalized type of test (think a parlor-game type thing) that actually works quite well. That set is the kind of thing that could be learnt via reinforcement learning very well, and just mentioning it could be enough for a training company or data provider company to try it. You can generate thousands of verifiable tests - potentially with verifiable reasoning traces - quite easily.
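To make the "thousands of verifiable tests" point concrete, here is a toy generator of my own invention (not the parent's actual benchmark): each question is produced programmatically, so the expected answer is known and scoring is mechanical.

    import random

    def make_question(rng: random.Random):
        # Pick a few words and ask which comes last alphabetically -
        # trivially verifiable, and easy to generate in the thousands.
        words = rng.sample(["orchid", "zephyr", "quartz", "lantern", "mosaic",
                            "harbor", "velvet", "ember", "tundra", "saffron"], k=5)
        question = "Which of these words comes last alphabetically? " + ", ".join(words)
        answer = max(words)  # alphabetical max is the ground truth
        return question, answer

    rng = random.Random(42)
    tests = [make_question(rng) for _ in range(1000)]
    print(tests[0])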
Comment by grog454 7 hours ago
For fun: https://chatgpt.com/s/t_694361c12cec819185e9850d0cf0c629
Comment by eru 7 hours ago
Comment by grog454 7 hours ago
1. What is the purpose of the benchmark?
2. What is the purpose of publicly discussing a benchmark's results but keeping the methodology secret?
To me it's in the same spirit as claiming to have defeated alpha zero but refusing to share the game.
Comment by nl 6 hours ago
2. I discussed that up-thread, but https://github.com/microsoft/private-benchmarking and https://arxiv.org/abs/2403.00393 discuss some further motivation for this if you are interested.
> To me it's in the same spirit as claiming to have defeated alpha zero but refusing to share the game.
This is an odd way of looking at it. There is no "winning" at benchmarks, it's simply that it is a better and more repeatable evaluation than the old "vibe test" that people did in 2024.
Comment by grog454 5 hours ago
I don't understand the value of a public post discussing their results beyond maybe entertainment. We have to trust you implicitly and have no way to validate your claims.
> There is no "winning" at benchmarks, it's simply that it is a better and more repeatable evaluation than the old "vibe test" that people did in 2024.
Then you must not be working in an environment where a better benchmark yields a competitive advantage.
Comment by eru 4 hours ago
In principle, we have ways: if nl's reports consistently predict how public benchmarks will turn out later, they can build up a reputation. Of course, that requires that we follow nl around for a while.
Comment by nl 6 hours ago
> A secret benchmark is: Useful for internal model selection
That's what I'm doing.
Comment by Turskarama 9 hours ago
Comment by theshrike79 1 hour ago
Comment by akoboldfrying 7 hours ago
Example: You are probably already aware that almost any metric that you try to use to measure code quality can be easily gamed. One possible strategy is to choose a weighted mixture of metrics and conceal the weights. The weights can even change over time. Is it perfect? No. But it's at least correlated with code quality -- and it's not trivially gameable, which puts it above most individual public metrics.
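A toy sketch of that "weighted mixture with concealed weights" idea - the metric names and weights below are invented for illustration, not a real scoring system:

    # Private weights; perturb or regenerate them periodically so the
    # composite score stays hard to game.
    SECRET_WEIGHTS = {"cyclomatic_complexity": -0.40,
                      "test_coverage": 0.35,
                      "duplication_ratio": -0.15,
                      "review_comment_density": 0.10}

    def quality_score(metrics: dict) -> float:
        # Weighted sum over whichever metrics are present; unknown keys are ignored.
        return sum(SECRET_WEIGHTS.get(name, 0.0) * value
                   for name, value in metrics.items())

    print(quality_score({"cyclomatic_complexity": 12.0,
                         "test_coverage": 0.8,
                         "duplication_ratio": 0.05,
                         "review_comment_density": 0.3}))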
Comment by grog454 6 hours ago
Will someone (or some system) see my query and think "we ought to improve this"? I have no idea since I don't work on these systems. In some instances involving random sampling... probably yes!
This is the second reason I find the idea of publicly discussing secret benchmarks silly.
Comment by grog454 4 hours ago
Comment by kridsdale3 10 hours ago
Comment by jacobn 10 hours ago
I guess they get such a large input of queries that they can only realistically check and therefore use a small fraction? Though maybe they've come up with some clever trick to make use of it anyway?
Comment by nl 8 hours ago
Comment by jerojero 10 hours ago
You don't train on your test data because you need it to check whether training is actually improving things or not.
Comment by energy123 10 hours ago
Comment by lambda 9 hours ago
I'll need to find a new one, or actually put together a set of questions to use instead of just a single benchmark.
Comment by _heimdall 10 hours ago
Comment by vitaflo 7 hours ago
Comment by fragmede 10 hours ago
Comment by pretzellogician 10 hours ago
Comment by mips_avatar 11 hours ago
Comment by andai 9 hours ago
Comment by ComputerGuru 9 hours ago
For each you can use it as "instant", supposedly without thinking (though these are all exclusively reasoning models), or specify a reasoning amount (low, medium, high, and now xhigh - though if you don't specify it defaults to none). OR you can use the -chat version, which is also "no thinking" but in practice performs markedly differently from the regular version with thinking off (not more or less intelligent, but it has a different style and answering method).
Comment by mips_avatar 9 hours ago
Comment by eru 7 hours ago
Comment by mips_avatar 6 hours ago
Comment by strangegecko 5 hours ago
Coming up with all that fluff would keep my brain busy, meaning there's actually no additional breathing room for thinking about an answer.
Comment by eru 4 hours ago
> Coming up with all that fluff would keep my brain busy, meaning there's actually no additional breathing room for thinking about an answer.
It gets a lot easier with practice: your brain caches a few of the typical fluff routines.
Comment by danpalmer 11 hours ago
The only non-TPU fast models I'm aware of are things running on Cerebras, which can be much faster because of their CPUs, and Grok, which has a super fast mode, but they have a cheat code of ignoring guardrails and making up their own world knowledge.
Comment by nl 7 hours ago
Where are you getting that? All the citations I've seen say the opposite, eg:
> Inference Workloads: NVIDIA GPUs typically offer lower latency for real-time inference tasks, particularly when leveraging features like NVIDIA's TensorRT for optimized model deployment. TPUs may introduce higher latency in dynamic or low-batch-size inference due to their batch-oriented design.
https://massedcompute.com/faq-answers/
> The only non-TPU fast models I'm aware of are things running on Cerebras, which can be much faster because of their CPUs, and Grok, which has a super fast mode, but they have a cheat code of ignoring guardrails and making up their own world knowledge.
Both Cerebras and Grok have custom AI-processing hardware (not CPUs).
The knowledge grounding thing seems unrelated to the hardware, unless you mean something I'm missing.
Comment by danpalmer 5 hours ago
The citation link you provided takes me to a sales form, not an FAQ, so I can't see any further detail there.
> Both Cerebras and Grok have custom AI-processing hardware (not CPUs).
I'm aware of Cerebras' custom hardware. I agree with the other commenter here that I haven't heard of Grok having any. My point about knowledge grounding was simply that Grok may be achieving its latency with guardrail/knowledge/safety trade-offs instead of custom hardware.
Comment by nl 3 hours ago
I don't see any latency comparisons in the link
Comment by danpalmer 2 hours ago
https://jax-ml.github.io/scaling-book/gpus/#gpus-vs-tpus-at-...
Re: Groq, that's a good point, I had forgotten about them. You're right they too are doing a TPU-style systolic array processor for lower latency.
Comment by mips_avatar 6 hours ago
Comment by danpalmer 2 hours ago
Comment by jrk 7 hours ago
Comment by danpalmer 5 hours ago
Comment by eru 7 hours ago
Comment by campers 3 hours ago
They do have a priority tier at double the cost, but I haven't seen any benchmarks on how much faster that actually is.
The flex tier was an underrated feature in GPT-5: batch pricing with a regular API call. GPT-5.1 using flex priority is an amazing price/intelligence tradeoff for non-latency-sensitive applications, without needing the extra plumbing of most batch APIs.
Comment by mips_avatar 3 hours ago
Comment by simonw 11 hours ago
Comment by mips_avatar 9 hours ago
Comment by yakbarber 8 hours ago
Comment by mips_avatar 8 hours ago
Comment by eru 7 hours ago
Comment by mips_avatar 6 hours ago
Comment by eru 2 hours ago
Comment by behnamoh 10 hours ago
It's a lost battle. It'll always be cheaper to use an open source model hosted by others like together/fireworks/deepinfra/etc.
I've been maining Mistral lately for low latency stuff and the price-quality is hard to beat.
Comment by mips_avatar 9 hours ago
Comment by TacticalCoder 9 hours ago
Turns out becoming a $4 trillion company first with ads (Google), then owning everybody on the AI-front could be the winning strategy.
Comment by pplonski86 11 minutes ago
Comment by kartayyar 7 hours ago
https://github.com/Roblox/open-game-eval/blob/main/LLM_LEADE...
Comment by seany62 4 hours ago
Comment by scrollop 13 hours ago
Comment by tallclair 10 hours ago
Comment by giancarlostoro 13 hours ago
Comment by spaceman_2020 1 hour ago
They always had the best talent, but with Brin at the helm, they also have someone with the organizational heft to drive them towards a single goal
Comment by toomuchtodo 13 hours ago
Comment by outside1234 12 hours ago
Comment by TacticalCoder 8 hours ago
Markets seem to be in a "show me the OpenAI money" mood at the moment.
And even financial commentators who don't necessarily know a thing about AI can realize that Gemini 3 Pro and now Gemini 3 Flash are giving ChatGPT a run for its money.
Oracle and Microsoft have other sources of revenue, but for those really drinking the OpenAI Kool-Aid, including OpenAI itself, I sure as heck don't know what the future holds.
My safe bet however is that Google ain't going anywhere and shall keep progressing on the AI front at an insane pace.
Comment by eru 7 hours ago
[0] At least the guys who publish where you or me can read them.
Comment by guelo 11 hours ago
This story also shows the market corruption of Google's monopolies, but a judge recently gave them his stamp of approval so we're stuck with it for the foreseeable future.
Comment by deegles 8 hours ago
Comment by taytus 4 hours ago
Comment by mingusrude 2 hours ago
Comment by behnamoh 9 hours ago
I ask this question about Nazi Germany. They adopted the Blitzkrieg strategy and expanded unsustainably, but it was only a matter of time until powers with infinite resources (US, USSR) put an end to it.
Comment by goobatrooba 9 hours ago
The most obvious decision points were betraying the USSR and declaring war on the US (no one has really been able to pin down the reason, but presumably it was to get Japan to attack the Soviets from the other side, which then didn't happen). Another could have been to consolidate after the surrender/capitulation of France, rather than continuing to attack further.
Comment by eru 7 hours ago
Not saying that the Nazi strategy was without flaws, of course. But your specific critique is a bit too blunt.
Comment by SoftTalker 6 hours ago
Comment by eru 4 hours ago
Comment by jack_riminton 11 hours ago
/s
Comment by user34283 11 minutes ago
Abandoning our most useful sense, vision, is a recipe for a flop.
Comment by kqr 1 hour ago
[1]: https://entropicthoughts.com/haiku-4-5-playing-text-adventur...
Comment by mmaunder 15 hours ago
Comment by behnamoh 9 hours ago
I think it's bad naming on Google's part. "Flash" implies low quality - fast but not good enough. I get a less negative feeling from "mini" models.
Comment by pietz 9 hours ago
Comment by taytus 4 hours ago
Mini - small, incomplete, not good enough
Flash - good, not great, fast, might miss something.
Comment by nemonemo 9 hours ago
Comment by behnamoh 6 hours ago
Comment by esafak 16 hours ago
Comment by epolanski 16 hours ago
Comment by unsupp0rted 15 hours ago
Comment by jasonjmcghee 2 hours ago
Comment by piokoch 1 hour ago
BTW: I have the same impression, Claude was working better for me for coding tasks.
Comment by bovermyer 11 hours ago
I have not worked with Sonnet enough to give an opinion there.
Comment by ZuoCen_Liu 6 hours ago
Comment by yunohn 11 hours ago
Comment by hexasquid 9 hours ago
Waiting for Apple to say "sorry folks, bad year for iPhone"
Comment by eru 7 hours ago
Comment by OrangeMusic 2 hours ago
Comment by freedomben 16 hours ago
Comment by samyok 15 hours ago
Comment by Davidzheng 15 hours ago
Comment by rat9988 15 hours ago
Comment by encroach 14 hours ago
Comment by tonymet 13 hours ago
Comment by tonyhart7 13 hours ago
Claude has been a coding model from the start, but GPT is more and more becoming a coding model too
Comment by Imustaskforhelp 13 hours ago
I hope open source AI models catch up to Gemini 3 / Gemini 3 Flash. Or Google open sources it, but let's be honest, Google isn't open sourcing Gemini 3 Flash. I guess the best bets in open source nowadays are probably GLM or DeepSeek Terminus, or maybe Qwen/Kimi too.
Comment by leemoore 11 hours ago
Comment by ralusek 11 hours ago
Comment by leemoore 9 hours ago
Claude Code just caught up to Cursor (no. 2) in revenue and, based on trajectories, is about to pass GitHub Copilot (no. 1) in a few more months. They just locked down Deloitte with 350k seats of Claude Enterprise.
In my Fortune 100 financial company they just finished crushing OpenAI in a broad enterprise-wide evaluation. Google Gemini was never in the mix, never on the table and still isn't. Every one of our engineers has 1k a month allocated in Claude tokens for Claude enterprise and Claude code.
There is one leader with enterprise. There is one leader with developers. And Google has nothing to make a dent: not Gemini 3, not Gemini CLI, not Antigravity, not Gemini. There is no Code Red for Anthropic. They have clear target markets and nothing from Google threatens those.
Comment by Karrot_Kream 9 hours ago
> Google Gemini was never in the mix, never on the table and still isn’t. Every one of our engineers has 1k a month allocated in Claude tokens for Claude enterprise and Claude code.
Does that mean y'all never evaluated Gemini at all or just that it couldn't compete? I'd be worried that prior performance of the models prejudiced stats away from Gemini, but I am a Claude Code and heavy Anthropic user myself so shrug.
Comment by user34283 8 hours ago
Enterprise will follow.
I don't see any distinction in target markets - it's the same market.
Comment by Imustaskforhelp 2 hours ago
Also, I don't really use agentic tasks, but I'm not sure whether Gemini 3 / 3 Flash have MCP support or skills support for agentic tasks.
If not, those feel like very low-hanging fruit, and something Google could do to try to win the agentic-tasks market from Claude as well.
Comment by user34283 1 hour ago
So far they seem faster with Flash, and with less corruption of files using the Edit tool - or at least it recovered faster.
Comment by siva7 11 hours ago
Comment by Uehreka 12 hours ago
Comment by xbmcuser 12 hours ago
For me the bigger concern, which I have mentioned on other AI-related topics, is that AI is eating all the production of computer hardware, so we should be worrying about hardware prices getting out of hand and making it harder for the general public to run open source models. Hence I am rooting for China to reach parity on node size and crash PC hardware prices.
Comment by FuckButtons 11 hours ago
Comment by Imustaskforhelp 2 hours ago
And now I am saying the same for gemini 3 flash.
I still feel the same way though; sure, there is an increase, but I somewhat believe that Gemini 3 is good enough and the returns on training from now on might not be worth that much imo. But I'm not sure, and I can be wrong; I usually am.
Comment by eru 7 hours ago
So I don't think we are on any sigmoid curve or so. Though if you plot the performance of the best model available at any point in time against time on the x-axis, you might see a sigmoid curve, but that's a combination of the logarithm and the amount of effort people are willing to spend on making new models.
(I'm not sure about it specifically being the logarithm. Just any curve that has rapidly diminishing marginal returns that nevertheless never go to zero, ie the curve never saturates.)
Comment by baq 11 hours ago
Comment by eru 7 hours ago
If Google released their weights today, it would technically be open weight; but I doubt you'd have an easy time running the whole Gemini system outside of Google's datacentres.
Comment by Workaccount2 12 hours ago
Comment by waffletower 11 hours ago
Comment by Gigachad 12 hours ago
Comment by Workaccount2 12 hours ago
Pretty much every person in the first (and second) world is using AI now, and only a small fraction of those people are writing software. This is also reflected in OAI's report from a few months ago, which found programming to be only 4% of tokens.
Comment by int_19h 10 hours ago
Comment by aleph_minus_one 11 hours ago
This sounds like you live in a huge echo chamber. :-(
Comment by chpatrick 3 hours ago
Comment by lukan 10 hours ago
Apart from my very old grandmothers, I don't know anyone not using AI.
Comment by pests 10 hours ago
Comment by lukan 10 hours ago
Just googling means you use AI nowadays.
Comment by eru 7 hours ago
Remember, really back in the day the A* search algorithm was part of AI.
If you had shown anyone in the 1970s a box that, given a query, pinpoints the right document that answers that question (aka Google search in the early 2000s), they definitely would have called it AI.
Comment by SoftTalker 5 hours ago
Comment by jauntywundrkind 16 hours ago
I've been playing around with other models recently (Kimi, GPT Codex, Qwen, others) to try to better appreciate the difference. I knew there was a big price difference, but watching myself feeding dollars into the machine rather than nickels has also instilled in me quite the reverse appreciation.
I only assume "if you're not getting charged, you are the product" has to be somewhat in play here. But when working on open source code, I don't mind.
Comment by happyopossum 15 hours ago
Comment by minraws 12 hours ago
Comment by jauntywundrkind 7 hours ago
Comment by jauntywundrkind 13 hours ago
I tried to be quite clear about showing my work here. I agree that 17x is much closer to a single order of magnitude than two. But 60x is, to me, enough of the way to 100x that I don't feel bad saying it's nearly two orders (it's 1.78 orders of magnitude). To me, your complaint feels rigid and ungenerous.
My post is showing as -1, but I stand by it right now. Arguing over the technicalities here (is 1.78 close enough to 2 orders to count?) feels beside the point to me: DeepSeek is vastly more affordable than nearly everything else, putting even Gemini 3 Flash here to shame. And I don't think people are aware of that.
I guess for my own reference, since I didn't do it the first time: at $0.50/$3.00 / M-i/o, Gemini 3 Flash here is 1.8x & 7.1x (1e1.86) more expensive than DeepSeek.
Comment by KoolKat23 11 hours ago
Otherwise, if it's a short prompt or answer, a SOTA (state of the art) model will be cheap anyway, and if it's a long prompt/answer, it's way more likely to be wrong and a lot more time/human cost is spent on checking/debugging any issue or hallucination, so again SOTA is better.
Comment by lukan 10 hours ago
Or for any privacy/IP protection at all? There is zero privacy, when using cloud based LLM models.
Comment by Workaccount2 10 hours ago
Comment by mistercheph 7 hours ago
Comment by dfsegoat 11 hours ago
...and all of that done without any GPUs as far as I know! [1]
[1] - https://www.uncoveralpha.com/p/the-chip-made-for-the-ai-infe...
(tldr: afaik Google trained Gemini 3 entirely on tensor processing units - TPUs)
Comment by poopiokaka 13 hours ago
Comment by Sincere6066 16 hours ago
Comment by moffkalast 11 hours ago
Comment by __jl__ 16 hours ago
They are pushing the prices higher with each release though: API pricing is up to $0.5/M for input and $3/M for output
For comparison:
Gemini 3.0 Flash: $0.50/M for input and $3.00/M for output
Gemini 2.5 Flash: $0.30/M for input and $2.50/M for output
Gemini 2.0 Flash: $0.15/M for input and $0.60/M for output
Gemini 1.5 Flash: $0.075/M for input and $0.30/M for output (after price drop)
Gemini 3.0 Pro: $2.00/M for input and $12/M for output
Gemini 2.5 Pro: $1.25/M for input and $10/M for output
Gemini 1.5 Pro: $1.25/M for input and $5/M for output
I think image input pricing went up even more.
Correction: It is a preview model...
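To put those per-million-token prices into per-request terms, here's a quick back-of-the-envelope calculator; the example request size is made up, and the prices are just the list prices quoted above:

    # Prices in $ per million tokens (input, output), from the list above.
    PRICES = {"gemini-3-flash": (0.50, 3.00),
              "gemini-2.5-flash": (0.30, 2.50),
              "gemini-3-pro": (2.00, 12.00),
              "gemini-2.5-pro": (1.25, 10.00)}

    def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
        in_price, out_price = PRICES[model]
        return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

    # Example: an 8k-token prompt with a 1k-token reply.
    for model in PRICES:
        print(f"{model}: ${request_cost(model, 8_000, 1_000):.5f} per request")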
Comment by mips_avatar 16 hours ago
Comment by srameshc 16 hours ago
Comment by KoolKat23 11 hours ago
Comment by YetAnotherNick 15 hours ago
Comment by AuthError 15 hours ago
Comment by martythemaniak 14 hours ago
Comment by uluyol 16 hours ago
Comment by __jl__ 15 hours ago
Google has been discontinuing older models after several months of transition period so I would expect the same for the 2.5 models. But that process only starts when the release version of 3 models is out (pro and flash are in preview right now).
Comment by misiti3780 14 hours ago
Comment by jsnell 13 hours ago
You really need to look at the cost per task. artificialanalysis.ai has a good composite score, measures the cost of running all the benchmarks, and has a 2D intelligence vs. cost graph.
Comment by misiti3780 12 hours ago
Comment by deaux 4 hours ago
Comment by deaux 4 hours ago
Tried a lot of them and settled on this one, they update instantly on model release and having all models on one page is the best UX.
Comment by int_19h 10 hours ago
Comment by rrhartjr 6 hours ago
Comment by RobinL 13 hours ago
Presumably a big motivation for them is to be first to get something good and cheap enough that they can serve it to every Android device, ahead of whatever the OpenAI/Jony Ive hardware project will be, and way ahead of Apple Intelligence. Speaking for myself, I would pay quite a lot for a truly 'AI first' phone that actually worked.
Comment by exegete 12 hours ago
Comment by willis936 8 hours ago
Comment by cmckn 5 hours ago
Comment by floundy 7 hours ago
From a business perspective it’s a smart move (inasmuch as “integrating AI” is the default which I fundamentally disagree with) since Apple won’t be left holding the bag on a bunch of AI datacenters when/if the AI bubble pops.
I don’t want to lose trust in Apple, but I literally moved away from Google/Android to try and retain control over my data and now they’re taking me… right back to Google. Guess I’ll retreat further into self-hosting.
Comment by willis936 7 hours ago
As long as Apple doesn't take any crazy left turns with their privacy policy then it should be relatively harmless if they add in a google wrapper to iOS (and we won't need to take hard right turns with grapheneOS phones and framework laptops).
Comment by bitpush 3 hours ago
Did you forget all the Apple Intelligence stuff? They were never "ignoring" it; if anything they talked a big talk, and then failed so hard.
The whole iPhone 16 was marketed as an AI-first phone (including on billboards). They had full-length ads running touting AI benefits.
Apple was never "ignoring" or "sitting AI out". They were very much in it. And they failed.
Comment by hugi 37 minutes ago
Comment by skerit 11 hours ago
Comment by RobinL 11 hours ago
Comment by nowittyusername 3 hours ago
Comment by eldenring 3 hours ago
Comment by anukin 12 hours ago
Comment by Workaccount2 11 hours ago
Stuff like:
"Open Chrome, new tab, search for xyz, scroll down, third result, copy the second paragraph, open whatsapp, hit back button, open group chat with friends, paste what we copied and send, send a follow-up laughing tears emoji, go back to chrome and close out that tab"
All while being able to just quickly glance at my phone. There is already a tool like this, but I want the parsing/understanding of an LLM and super fast response times.
Comment by KoolKat23 11 hours ago
On a related note, why would you want to break your tasks down to that level? Surely it should be smart enough to do some of that without you asking, and you could just state your end goal.
Comment by pests 10 hours ago
Comment by nielsbot 19 minutes ago
Comment by pylotlight 8 hours ago
Comment by TeMPOraL 2 hours ago
Comment by procaryote 11 hours ago
Comment by CamperBob2 4 hours ago
Comment by TeMPOraL 1 hour ago
Plus, if the above worked, the higher level interactions could trivially work too. "Go to event details", "add that to my calendar".
FWIW, I'm starting to embrace using Gemini as general-purpose UI for some scenarios just because it's faster. Most common one, "<paste whatever> add to my calendar please."
Comment by qnleigh 9 hours ago
Comment by ipsum2 8 hours ago
Comment by Palmik 3 hours ago
Comment by qnleigh 2 hours ago
I do pay special attention to what the most negative comments say (which in this case are unusually positive). And people discussing performance on their own personal benchmarks.
Comment by awestroke 3 hours ago
Comment by Palmik 2 hours ago
Comment by clarkmoreno 2 hours ago
Comment by fariszr 16 hours ago
Is there an OSS model that's better than 2.0 flash with similar pricing, speed and a 1m context window?
Edit: this is not the typical flash model, it's actually an insane value if the benchmarks match real world usage.
> Gemini 3 Flash achieves a score of 78%, outperforming not only the 2.5 series, but also Gemini 3 Pro. It strikes an ideal balance for agentic coding, production-ready systems and responsive interactive applications.
The replacement for the old Flash models will probably be 3.0 Flash Lite then.
Comment by thecupisblue 16 hours ago
So if 2.5 Pro was good for your use case, you just got a better model for about 1/3rd of the price, but it might hurt the wallet a bit more if you currently use 2.5 Flash and want an upgrade - which is fair tbh.
Comment by aoeusnth1 16 hours ago
Comment by sosodev 11 hours ago
It's extremely fast on good hardware, quite smart, and can support up to 1m context with reasonable accuracy
Comment by mips_avatar 15 hours ago
Comment by scrollop 13 hours ago
Comment by fullstackwife 16 hours ago
Comment by fariszr 16 hours ago
Comment by Workaccount2 11 hours ago
Gemini 3 pro got 20%, and everyone else has gotten 0%. I saw benchmarks showing 3 flash is almost trading blows with 3 pro, so I decided to try it.
Basically it is an image showing a dog with 5 legs, an extra one photoshopped onto its torso. Every model counts 4, and Gemini 3 Pro, while also counting 4, said the dog had "large male anatomy". However it failed a follow-up, saying 4 again.
3 Flash counted 5 legs on the same image, however I added a distinct "tattoo" to each leg as an assist. These tattoos didn't help 3 Pro or the other models.
So it is the first out of all the models I have tested to count 5 legs on the "tattooed legs" image. It still counted only 4 legs on the image without the tattoos. I'll give it 1/2 credit.
Comment by Valakas_ 1 hour ago
Comment by simonsarris 16 hours ago
With this release the "good enough" and "cheap enough" intersect so hard that I wonder if this is an existential threat to those other companies.
Comment by bgirard 16 hours ago
Comment by azuanrb 16 hours ago
Comment by bgirard 15 hours ago
Comment by rolisz 13 hours ago
In my experience, to get the best performance out of different models, they need slightly different prompting.
Comment by NamlchakKhandro 10 hours ago
There's a plugin for everything that mimics anything the others are doing
Comment by azuanrb 1 hour ago
I see all of these tools as IDEs. Whether someone locks into VS Code, JetBrains, Neovim, or Sublime Text comes down to personal preference. Everyone works differently, and that is completely fine.
Comment by nevir 13 hours ago
Maybe someday future models will all behave similarly given the same prompt, but we're not quite there yet
Comment by NamlchakKhandro 10 hours ago
Comment by theLiminator 16 hours ago
Comment by orourke 16 hours ago
Comment by inquirerGeneral 16 hours ago
Comment by calflegal 16 hours ago
Comment by nprateem 16 hours ago
Comment by catigula 15 hours ago
Comment by gaigalas 15 hours ago
Opus and Sonnet are slower than Haiku. For lots of less sophisticated tasks, you benefit from the speed.
All vendors do this. You need smaller models that you can rapid-fire for lots of other reasons than vibe coding.
Personally, I actually use more smaller models than the sophisticated ones. Lots of small automations.
Comment by dimitri-vs 7 hours ago
Comment by gaigalas 1 hour ago
Think beyond interfaces. I'm talking about rapid-firing hundreds of small agents and having zero human interaction with them. The feedback is deterministic (non agentic) and automated too.
Comment by alex1138 14 hours ago
You say good enough. Great, but what if I as a malicious person were to just make a bunch of internet pages containing things that are blatantly wrong, to trick LLMs?
Comment by calflegal 14 hours ago
Comment by floundy 7 hours ago
So Reddit?
I’d imagine the AI companies have all the “pre AI internet” data they scraped very carefully catalogued.
Comment by szundi 16 hours ago
Comment by cakealert 1 hour ago
After Gemini 3.0 the OpenAI damage control crews all drowned.
Not only is it vastly better, it's also free.
I find this particular benchmark to be in agreement with my experiences: https://simple-bench.com
Comment by kingstnap 16 hours ago
I'm speculating, but Google might have figured out some training magic trick to balance out the information storage in model capacity. That, or this flash model has a huge number of parameters or something.
Comment by scrollop 13 hours ago
https://artificialanalysis.ai/evaluations/omniscience
Prepare to be amazed
Comment by albumen 11 hours ago
Can someone explain how Gemini 3 Pro/Flash then do so well in the overall Omniscience: Knowledge and Hallucination benchmark?
Comment by wasabi991011 6 hours ago
One hypothesis is that Gemini 3 Flash refuses to answer when unsure less often than other models, but when it is sure it is also more likely to be correct. This is consistent with it having the best accuracy score.
Comment by Wyverald 9 hours ago
> In the Hallucination Rate vs. AA-Omniscience Index chart, it’s not in the most desirable quadrant
This doesn't mean much. As long as Gemini 3 has a high hallucination rate (higher than at least 50% of the others), it's not going to be in the most desirable quadrant by definition.
For example, let's say a model answers 99 out of 100 questions correctly. The 1 wrong answer it produces is a hallucination (i.e. confidently wrong). This amazing model would have a 100% hallucination rate as defined here, and thus not be in the most desirable quadrant. But it should still have a very high Omniscience Index.
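A small worked illustration of that definition as I read it from this thread (the hallucination rate is computed only over questions the model gets wrong or declines, so confident wrong answers count against abstentions) - my reading, not the benchmark's actual code:

    def hallucination_rate(correct: int, wrong: int, abstained: int) -> float:
        # Of the questions not answered correctly, how many did the model
        # answer anyway (and get wrong) instead of declining?
        missed = wrong + abstained
        return wrong / missed if missed else 0.0

    # The example above: 99 correct, 1 confidently wrong, 0 abstentions.
    print(hallucination_rate(99, 1, 0))   # 1.0 -> a "100% hallucination rate"
    # A model that declines whenever unsure looks much better on this metric:
    print(hallucination_rate(80, 2, 18))  # 0.1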
Comment by int_19h 10 hours ago
That's what MoE is for. It might be that with their TPUs, they can afford lots of params, just so long as the activated subset for each token is small enough to maintain throughput.
Comment by tanh 15 hours ago
Comment by leumon 14 hours ago
Comment by GaggiX 16 hours ago
More experts with a lower percentage of active ones -> more sparsity.
Comment by mmaunder 15 hours ago
Now, imagine for a moment they had also vertically integrated the hardware to do this.
Comment by JumpCrisscross 14 hours ago
The most terrifying thing would be Google expanding its free tiers.
Comment by wasabi991011 6 hours ago
Granted, this doesn't give API access, only what Google calls their "consumer AI products", but it makes a huge difference when ChatGPT only allows a handful of document uploads and deep research queries per day.
Comment by Davidzheng 7 hours ago
Comment by avazhi 15 hours ago
Then you realise you aren't imagining it.
Comment by iwontberude 14 hours ago
Google is great on the data science alone; everything else is an afterthought
Comment by avazhi 14 hours ago
"And then imagine Google designing silicon that doesn’t trail the industry."
I'm def not a Google stan generally, but uh, have you even been paying attention?
Comment by mmaunder 14 hours ago
Comment by avazhi 13 hours ago
Comment by iwontberude 14 hours ago
TPUs on the other hand are ASICs, we are more than familiar with the limited application, high performance and high barriers to entry associated with them. TPUs will be worthless as the AI bubble keeps deflating and excess capacity is everywhere.
The people who don't have a rudimentary understanding are the wall street boosters that treat it like the primary threat to Nvidia or a moat for Google (hint: it is neither).
Comment by simonw 16 hours ago
It's 1/4 the price of Gemini 3 Pro ≤200k and 1/8 the price of Gemini 3 Pro >200k - notable that the new Flash model doesn’t have a price increase after that 200,000 token point.
It’s also twice the price of GPT-5 Mini for input, half the price of Claude 4.5 Haiku.
Comment by caminanteblanco 16 hours ago
I assume that these are just different reasoning levels for Gemini 3, but I can't even find mention of there being 2 versions anywhere, and the API doesn't even mention the Thinking-Pro dichotomy.
Comment by peheje 16 hours ago
Fast = Gemini 3 Flash without thinking (or very low thinking budget)
Thinking = Gemini 3 flash with high thinking budget
Pro = Gemini 3 Pro with thinking
Comment by sunaookami 15 hours ago
>Fast = 3 Flash
>Thinking = 3 Flash (with thinking)
>Pro = 3 Pro (with thinking)
Comment by caminanteblanco 12 hours ago
Comment by flakiness 16 hours ago
- "Thinking" is Gemini 3 Flash with higher "thinking_level"
- Pro is Gemini 3 Pro. It doesn't mention "thinking_level" but I assume it is set to high-ish.
Comment by lysace 15 hours ago
When I ask Gemini 3 Flash this question, the answer is vague but agency comes up a lot. Gemini thinking is always triggered by a query.
This seems like a higher-level programming issue to me. Turn it into a loop. Keep the context. Those two things make it costly for sure. But does it make it an AGI? Surely Google has tried this?
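The "turn it into a loop, keep the context" idea, sketched very roughly; call_model is a hypothetical stand-in for an actual model API, not a real function:

    def call_model(context: list[str]) -> str:
        # Hypothetical: send the whole running context to the model
        # and return its next reply. In practice this is an API call.
        raise NotImplementedError

    def agent_loop(goal: str, max_steps: int = 20) -> list[str]:
        context = [f"Goal: {goal}"]
        for _ in range(max_steps):
            reply = call_model(context)   # the model sees everything so far
            context.append(reply)         # keep the context, as suggested
            if "DONE" in reply:           # let the model decide when to stop
                break
        return context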
Comment by dcre 12 hours ago
Comment by CamperBob2 12 hours ago
Which obviously opens up a can of worms regarding who should have authority to supply the "right answer," but still... lacking the core capability, AGI isn't something we can talk about yet.
LLMs will be a part of AGI, I'm sure, but they are insufficient to get us there on their own. A big step forward but probably far from the last.
Comment by bananaflag 10 hours ago
Problem is that when we realize how to do this, we will have each copy of the original model diverge in wildly unexpected ways. Like we have 8 billion different people in this world, we'll have 16 gazillion different AIs. And all of them interacting with each other and remembering all those interactions. This world scares me greatly.
Comment by andai 9 hours ago
Comment by lysace 9 hours ago
Comment by criley2 12 hours ago
- An AGI wouldn't hallucinate, it would be consistent, reliable and aware of its own limitations
- An AGI wouldn't need extensive re-training, human reinforced training, model updates. It would be capable of true self-learning / self-training in real time.
- An AGI would demonstrate real genuine understanding and mental modeling, not pattern matching over correlations
- It would demonstrate agency and motivation, not be purely reactive to prompting
- It would have persistent integrated memory. LLMs are stateless and driven by the current context.
- It should even demonstrate consciousness.
And more. I agree that what we've designed is truly impressive and simulates intelligence at a really high level. But true AGI is far more advanced.
Comment by versteegen 34 minutes ago
I disagreed with most of your assertions even before I hit the last point. This is just about the most extreme thing you could ask for. I think very few AI researchers would agree with this definition of AGI.
Comment by waffletower 11 hours ago
I don't believe the "consciousness" qualification is at all appropriate, as I would argue that it is a projection of the human machine's experience onto an entirely different machine with a substantially different existential topology -- relationship to time and sensorium. I don't think artificial general intelligence is a binary label which is applied if a machine rigidly simulates human agency, memory, and sensing.
Comment by lysace 12 hours ago
Comment by xpil 13 hours ago
Comment by strstr 8 hours ago
Comment by testfrequency 12 hours ago
Their retention controls for both consumer and business suck. It’s the worst of any of the leaders.
Comment by ComputerGuru 9 hours ago
Comment by outside2344 15 hours ago
Comment by scrollop 13 hours ago
Comment by pawelduda 7 hours ago
Comment by Gigachad 12 hours ago
Comment by niek_pas 46 minutes ago
Comment by radicality 10 hours ago
For that reason I still find chatgpt way better for me, many things I ask it first goes off to do online research and has up to date information - which is surprising as you would expect Google to be way better at this. For example, was asking Gemini 3 Pro recently about how to do something with a “RTX 6000 Blackwell 96GB” card, and it told me this card doesn’t exist and that I probably meant the rtx 6000 ada… Or just today I asked about something on macOS 26.2, and it told me to be cautious as it’s a beta release (it’s not). Whereas with chatgpt I trust the final output more since it very often goes to find live sources and info.
Comment by leemoore 9 hours ago
That epistemic calibration is something they are capable of thinking through if you point it out. But they aren't trained to stop and ask/check themselves on how confident they have a right to be. This is a metacognitive interrupt that is socialized into girls between 6 and 9 and into boys between 11 and 13. That kind of metacognitive interrupt - calibrating to the confidence level your knowledge actually warrants - is a cognitive skill that models aren't taught and humans learn socially by pissing off other humans. It's why we get pissed off at models when they correct us with old bad data. Our anger is the training tool to stop that behavior. It's just that they can't take in that training signal at inference time.
Comment by andai 8 hours ago
They think GPT-5 won't be released until the distant future, but what they don't realize is we have already arrived ;)
Comment by jaigupta 6 hours ago
Trying to use the Gemini CLI is such a pain. I bought GDP Premium and configured GCP, set up environment variables, enabled preview features in the CLI and did all the dance around it, and it won't let me use Gemini 3. Why the hell am I even trying so hard?
Comment by jdanbrown 5 hours ago
Then you just have to find a coding tool that works with OpenRouter. Afaik claude/codex/cursor don’t, at least not without weird hacks, but various of the OSS tools do — cline, roo code, opencode, etc. I recently started using opencode (https://github.com/sst/opencode), which is like an open version of claude code, and I’ve been quite happy with it. It’s a newer project so There Will Be Bugs, but the devs are very active and responsive to issues and PRs.
Comment by Palmik 3 hours ago
Not to mention that for coding, it's usually more cost efficient to get whatever subscription the specific model provider offers.
Comment by jaigupta 4 hours ago
Comment by qingcharles 4 hours ago
Comment by android521 5 hours ago
Comment by zhyder 16 hours ago
Comment by jug 8 hours ago
Comment by zurfer 14 hours ago
Comment by edvinasbartkus 13 hours ago
thinkingConfig: { thinkingLevel: "low", }
More about it here https://ai.google.dev/gemini-api/docs/gemini-3#new_api_featu...
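For reference, here is roughly what that looks like as a raw REST call in Python; the thinkingConfig/thinkingLevel field names come from the linked docs, but the exact request shape and model id here are a sketch rather than copied from them:

    import os
    import requests

    url = ("https://generativelanguage.googleapis.com/v1beta/models/"
           "gemini-3-flash-preview:generateContent")  # model id is an assumption
    body = {
        "contents": [{"parts": [{"text": "Give me three quick regex tips."}]}],
        # A lower thinking level trades reasoning depth for latency and cost.
        "generationConfig": {"thinkingConfig": {"thinkingLevel": "low"}},
    }
    resp = requests.post(url, json=body,
                         headers={"x-goog-api-key": os.environ["GEMINI_API_KEY"]},
                         timeout=30)
    print(resp.json()["candidates"][0]["content"]["parts"][0]["text"])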
Comment by zurfer 13 hours ago
On that note it would be nice to get these benchmark numbers based on the different reasoning settings.
Comment by retropragma 13 hours ago
Comment by bobviolier 11 hours ago
Comment by Tiberium 13 hours ago
Comment by andai 9 hours ago
Comment by primaprashant 16 hours ago
For comparison, from 2.5 Pro ($1.25 / $10) to 3 Pro ($2 / $12), there was a 60% increase in input token pricing and a 20% increase in output token pricing.
Comment by simonw 16 hours ago
> Gemini 3 Flash is able to modulate how much it thinks. It may think longer for more complex use cases, but it also uses 30% fewer tokens on average than 2.5 Pro.
Comment by prvc 3 hours ago
Comment by meetpateltech 16 hours ago
Developer Blog: https://blog.google/technology/developers/build-with-gemini-...
Model Card [pdf]: https://deepmind.google/models/model-cards/gemini-3-flash/
Gemini 3 Flash in Search AI mode: https://blog.google/products/search/google-ai-mode-update-ge...
Comment by simonw 16 hours ago
Comment by meetpateltech 15 hours ago
For example, the Gemini 3 Pro collection: https://blog.google/products/gemini/gemini-3-collection/
But having everything linked at the bottom of the announcement post itself would be really great too!
Comment by simonw 13 hours ago
Comment by minimaxir 16 hours ago
Comment by rohitpaulk 16 hours ago
Comment by FergusArgyll 16 hours ago
Comment by SyrupThinker 16 hours ago
Just avoiding/fixing that would probably speed up a good chunk of my own queries.
Comment by robrenaud 16 hours ago
Summarize recent working arxiv url
And then it tells me the date is from the future and it simply refuses to fetch the URL.
Comment by tootyskooty 16 hours ago
Flash is meant to be a model for lower-cost, latency-sensitive tasks. Long thinking times will push TTFT >> 10s (often unacceptable) and also won't really be that cheap.
Comment by happyopossum 14 hours ago
Comment by rw2 3 hours ago
Comment by jug 16 hours ago
Comment by whinvik 16 hours ago
Turns out Gemini 3 Flash is pretty close. The Gemini CLI is not as good but the model more than makes up for it.
The weird part is Gemini 3 Pro is nowhere near as good an experience. Maybe because it's just so slow.
Comment by scrollop 13 hours ago
Might be using Flash for my MCP research/transcriber/minor-tasks model over Haiku now, though (will test of course)
Comment by __jl__ 15 hours ago
Comment by diamondfist25 15 hours ago
Well worth every penny now
Comment by vanviegen 1 hour ago
Comment by Obertr 15 hours ago
The image model they have released is much worse than Nano Banana Pro; a Ghibli moment did not happen.
Their GPT 5.2 is obviously overfit on benchmarks - that's the consensus of many developers and friends I know. So Opus 4.5 is staying on top when it comes to coding.
The weight of the ads money from Google, plus the general direction and founder sense of Brin, brought the massive Google giant back to life. None of my companies' workflows run on OAI GPT right now. Even though we love their agent SDK, after the Claude agent SDK it feels like peanuts.
Comment by avazhi 15 hours ago
This has been true for at least 4 months and yeah, based on how these things scale and also Google's capital + in-house hardware advantages, it's probably insurmountable.
Comment by drawnwren 14 hours ago
Comment by mmaunder 15 hours ago
Edit: And just to add an example: OpenAI's Codex CLI billing is easy for me. I just sign up for the base package, and then add extra credits which I automatically use once I'm through my weekly allowance. With Gemini CLI I'm using my OAuth account, and then having to rotate API keys once I've used that up.
Also, Gemini CLI loves spewing out its own chain of thought when it gets into a weird state.
Also Gemini CLI has an insane bias to action that is almost insurmountable. DO NOT START THE NEXT STAGE still has it starting the next stage.
Also Gemini CLI has been terrible at visibility on what it's actually doing at each step - although that seems a bit improved with this new model today.
Comment by ewoodrich 6 hours ago
Comment by mips_avatar 14 hours ago
Comment by vanviegen 1 hour ago
Comment by mmaunder 14 hours ago
Comment by visarga 13 hours ago
Comment by mips_avatar 11 hours ago
Comment by GenerWork 14 hours ago
Comment by gpt5 14 hours ago
It's when it becomes difficult, like in the coding case that you mentioned, that we can see that OpenAI still has the lead. The same is true for the image model: prompt adherence is significantly better than Nano Banana, especially for more complex queries.
Comment by int_19h 10 hours ago
Comment by GenerWork 11 hours ago
Comment by fellowniusmonk 13 hours ago
My logic test and trying to get an agent to develop a certain type of ** implementation (that is published and thus the model is trained on to some limited extent) really stress test models, 5.2 is a complete failure of overfitting.
Really really bad in an unrecoverable infinite loop way.
It helps when you have existing working code that you know a model can't be trained on.
It doesn't actually evaluate the working code it just assumes it's wrong and starts trying to re-write it as a different type of **.
Even linking it to the explanation and the git repo of the reference implementation it still persists in trying to force a different **.
This is the worst model since pre o3. Just terrible.
Comment by int32_64 14 hours ago
Comment by crazygringo 14 hours ago
But for anyone using LLM's to help speed up academic literature reviews where every detail matters, or coding where every detail matters, or anything technical where every detail matters -- the differences very much matter. And benchmarks serve just to confirm your personal experience anyways, as the differences between models becomes extremely apparent when you're working in a niche sub-subfield and one model is showing glaring informational or logical errors and another mostly gets it right.
And then there's a strong possibility that as experts start to say "I always trust <LLM name> more", that halo effect spreads to ordinary consumers who can't tell the difference themselves but want to make sure they use "the best" -- at least for their homework. (For their AI boyfriends and girlfriends, other metrics are probably at play...)
Comment by smashed 14 hours ago
In fact, so far they consistently fail in exactly these scenarios, glossing over random important details whenever you double-check results in depth.
You might have found models, prompts or workflows that work for you though, I'm interested.
Comment by bitpush 14 hours ago
We've seen this movie before. Snapchat was the darling. Infact, it invented the entire category and was dominating the format for years. Then it ran out of time.
Now very few people use Snapchat, and it has been reduced to a footnote in history.
If you think I'm exaggerating, that just proves my point.
Comment by decimalenough 14 hours ago
Comment by bitpush 10 hours ago
I never said Snapchat is dead. It still lives on, but it is a shell of the past. They had no moat, and the competitors caught up (Instagram, Whatsapp and even LinkedIn copied Snapchat with stories .. and rest is history)
Comment by xbmcuser 14 hours ago
Comment by int_19h 10 hours ago
Comment by rfw300 14 hours ago
Comment by holler 14 hours ago
Comment by macNchz 14 hours ago
Comment by Obertr 14 hours ago
Just go outside the bubble, plus ask somewhat older people.
Comment by ewoodrich 6 hours ago
They are both Android/Google Search users so all it really took was "sure I guess I'll try that" in response to a nudge from Google. For me personally I have subscriptions to Claude/ChatGPT/Gemini for coding but use Gemini for 90% of chatbot questions. Eventually I'll cancel some of them but will probably keep Gemini regardless because I like having the extra storage with my Google One plan bundle. Google having a pre-existing platform/ecosystem is a huge advantage imo.
Comment by nimchimpsky 14 hours ago
Comment by fullstick 13 hours ago
Comment by jay_kyburz 14 hours ago
Comment by dieortin 15 hours ago
Comment by novok 14 hours ago
Founders are special, because they are not beholden to this social support network to stay in power, and founders have a mythos that socially supports their actions beyond their pure power position. The only others they are beholden to are their co-founders, and in some cases major investor groups. This gives them the ability to disregard this social balance because they are not dependent on it to stay in power. Their power source is external to the organization, while everyone else's is internal to it.
This gives them a very special "do something" ability that nobody else has. It can lead to failures (Zuck & Oculus, Snapchat Spectacles) or successes (Steve Jobs, Gemini AI), but either way, it allows them to actually "do something".
Comment by JumpCrisscross 13 hours ago
Of course they are. Founders get fired all the time. As often as non-founder CEOs purge competition from their peers.
> The only others they are beholden to are their co-founders, and in some cases major investor groups
This describes very few successful executives. You can have your co-founders and investors on board, if your talent and customers hate you, they’ll fuck off.
Comment by ryoshu 14 hours ago
Comment by HarHarVeryFunny 14 hours ago
The merger happened in April 2023.
Gemini 1.0 was released in Dec 2023, and the progress since then has been rapid and impressive.
Comment by raincole 14 hours ago
Ghibli moment was only about half a year ago. At that moment, OpenAI was so far ahead in terms of image editing. Now it's behind for a few months and "it can't be reversed"?
Comment by Obertr 14 hours ago
Comment by BoredPositron 14 hours ago
Comment by JumpCrisscross 14 hours ago
Kara Swisher recently compared OpenAI to Netscape.
Comment by Andrex 6 hours ago
Maybe we'll get some awesome FOSS tech out of its ashes?
Comment by JumpCrisscross 4 hours ago
Comment by baq 15 hours ago
Comment by louiereederson 14 hours ago
The reason this matters is that slowing velocity raises the risk of featurization, which undermines LLMs as a standalone consumer category. The cost efficiency of the Flash models reinforces this, since Google can embed LLM functionality directly into Search (and search-like queries are probably 50% of ChatGPT usage, per their July user study). I think model capability was saturated for the average consumer use case months ago, if not longer, so distribution is really what matters, and Search dwarfs LLMs in this respect.
https://techcrunch.com/2025/12/05/chatgpts-user-growth-has-s...
Comment by encroach 14 hours ago
Comment by Obertr 14 hours ago
Comment by gdhkgdhkvff 14 hours ago
Comment by encroach 14 hours ago
Comment by raincole 14 hours ago
Comment by yieldcrv 14 hours ago
so they get lapped a few times and then drop a fantastic new model out of nowhere
the same is going to happen to Google again, Anthropic again, OpenAI again, Meta again, etc
they're all shuffling the same talent around; it's California, that's how it goes. The companies have the same institutional knowledge - at least regarding their consumer-facing options.
Comment by random9749832 15 hours ago
Comment by CuriouslyC 14 hours ago
Comment by viraptor 14 hours ago
Comment by CuriouslyC 10 hours ago
Comment by NitpickLawyer 14 hours ago
Out of all the big-4 labs, Google is the last I'd suspect of benchmaxxing. Their models have generally under-benched and over-delivered on real-world tasks for me, ever since 2.5 Pro came out.
Comment by nightski 14 hours ago
Comment by novok 14 hours ago
Comment by acheong08 16 hours ago
Pipe dream right now, but 50 years from now? Maybe.
Comment by incognito124 16 hours ago
https://deepmind.google/models/gemini-robotics/
Previous discussions: https://news.ycombinator.com/item?id=43344082
Comment by iamgopal 16 hours ago
Comment by bearjaws 16 hours ago
Google keeps their models very "fresh", and I tend to get more correct answers when asking about Azure or O365 issues; ironically, Copilot will talk about now-deleted or deprecated features more often.
Comment by sv123 16 hours ago
Comment by djeastm 16 hours ago
Comment by golem14 13 hours ago
Comment by xnx 16 hours ago
Comment by walthamstow 16 hours ago
Comment by tempaccount420 11 hours ago
The model is very hard to work with as is.
Comment by bennydog224 16 hours ago
-> 2.5 Flash Lite is super fast & cheap (~1-1.5s inference), but gives poor-quality responses.
-> 2.5 Flash gives high-quality responses, but is fairly expensive & slow (5-7s inference).
I really just need something in between Flash and Flash Lite on cost and performance. Right now, users have to wait up to 7s for a quality response.
Comment by k8sToGo 15 hours ago
Comment by Tiberium 13 hours ago
Comment by Fiveplus 16 hours ago
Comment by bayarearefugee 11 hours ago
It's great that they have these new fast models, but the release hype has made Gemini Pro pretty much unusable for hours.
"Sorry, something went wrong"
random sign-outs
random garbage replies, etc
Comment by dandiep 14 hours ago
Comment by scrollop 13 hours ago
Just do it.
I use a service where I have access to all the SOTA models and many open-source models, so I can change models within chats and use MCPs: e.g. start a chat with Opus doing a search via the Perplexity and Grok DeepSearch MCPs plus Google Search, run the next query with GPT-5 Thinking xhigh, and the next with Gemini 3 Pro, all in the same conversation. It's fantastic! I can't imagine going back to being locked into one (or two) companies. I have nothing to do with the guys who run it (the hosts of the podcast This Day in AI), but if you're interested, have a look in the simtheory.ai Discord.
I don't know how people who use only one service manage...
Comment by dandiep 13 hours ago
Comment by alach11 15 hours ago
Comment by lbhdc 15 hours ago
Comment by jiggawatts 13 hours ago
Comment by gorbot 6 hours ago
Comment by jtrn 16 hours ago
Skatval is a small local area where I live, so I know when it's bullshitting. Usually, I get a long-winded answer that is PURE Barnum statement, like "Skatval is a rural area known for its beautiful fields and mountains" and bla bla bla.
Even with minimal thinking (it seems to do none), it gives an extremely good answer. I am really happy about this.
I also noticed it had VERY good scores on tool-use, terminal, and agentic stuff. If that is TRUE, it might be awesome for coding.
I'm tentatively optimistic about this.
Comment by amunozo 16 hours ago
Comment by peterldowns 14 hours ago
Comment by kingstnap 16 hours ago
Comment by jtrn 10 hours ago
This might be fun for vibecoding, where you just let it go crazy and don't stop until an MVP is working, but I'm actually afraid to turn on agent mode with this now.
If it were just over-eager, that would be fine, but it's also not LISTENING to my instructions. As in the previous example, I didn't ask it to install a testing framework; I asked it for options fitting my project. And this happened many times. It feels like it treats user prompts/instructions as "suggestions for topics that you can work on."
Comment by gustavoaca1997 5 hours ago
Comment by doomerhunter 16 hours ago
Hoping that the local ones (the Gemma line) progressively keep up.
Comment by speedgoose 16 hours ago
Comment by anonym29 16 hours ago
Comment by Workaccount2 16 hours ago
Comment by robertwt7 6 hours ago
Comment by alooPotato 11 hours ago
Comment by zone411 12 hours ago
Comment by SubiculumCode 14 hours ago
Comment by simonw 10 hours ago
I also had it summarize this thread on Hacker News about itself:
https://gist.github.com/simonw/b0e3f403bcbd6b6470e7ee0623be6...
llm \
-f hn:46301851 -m "gemini-3-flash-preview" \
-s 'Summarize the themes of the opinions expressed here.
For each theme, output a markdown header.
Include direct "quotations" (with author attribution) where appropriate.
You MUST quote directly from users when crediting them, with double quotes.
Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece'
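(For anyone wanting to reproduce this, a plausible setup sketch; the plugin and key names below are assumed from their GitHub repos rather than verified here:)
pip install llm                          # the LLM CLI itself
llm install llm-gemini llm-hacker-news   # Gemini models + the hn: fragment loader
llm keys set gemini                      # paste a Gemini API key when prompted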
Where the `-f hn:xxxx` bit resolves via this plugin: https://github.com/simonw/llm-hacker-news
Comment by elvin_d 14 hours ago
Comment by user_7832 16 hours ago
1. Has anyone actually found 3 Pro better than 2.5 (on non-code tasks)? I struggle to find a difference beyond the quicker reasoning time and fewer tokens.
2. Has anyone found any non-thinking models better than 2.5 or 3 Pro? So far I find the thinking ones significantly ahead of non-thinking models (from any company, for that matter).
Comment by Workaccount2 16 hours ago
Comment by Davidzheng 16 hours ago
Comment by tmaly 16 hours ago
Comment by hubraumhugo 16 hours ago
Comment by onraglanroad 15 hours ago
Comment by apparent 15 hours ago
Comment by SubiculumCode 14 hours ago
Comment by WhereIsTheTruth 16 hours ago
Comment by peheje 16 hours ago
Comment by echelon 16 hours ago
I do feel like it's not an entirely accurate caricature (recency bias? limited context?), but it's close enough.
Good work!
You should do a "show HN" if you're not worried about it costing you too much.
Comment by Tiberium 16 hours ago
Comment by BeetleB 16 hours ago
I don't view this as a "new Flash" but as "a much cheaper Gemini 3 Pro/GPT-5.2"
Comment by Tiberium 16 hours ago
Comment by int_19h 10 hours ago
Comment by zzleeper 16 hours ago
Comment by jexe 15 hours ago
Comment by poplarsol 16 hours ago
Comment by croemer 13 hours ago
Comment by hereme888 8 hours ago
Comment by mmaunder 11 hours ago
Firstly, 3 Flash is wicked fast and seems to be very smart for a low-latency model, and it's a rush just watching it work. Much like the YOLO mode that exists in Gemini CLI, Flash 3 seems to YOLO into solutions without fully understanding all the angles, e.g. why something was intentionally designed in a way that at first glance may look wrong but ended up that way through hard-won experience. Codex GPT-5.2 xhigh, on the other hand, does consider more angles.
It's a hard come-down off the high of using it for the first time because I really really really want these models to go that fast, and to have that much context window. But it ain't there. And turns out for my purposes the longer chain of thought that codex gpt 5.2 xhigh seems to engage in is a more effective approach in terms of outcomes.
And I hate that reality because having to break a lift into 9 stages instead of just doing it in a single wicked fast run is just not as much fun!
Comment by sunaookami 15 hours ago
Comment by raybb 11 hours ago
Comment by sunaookami 4 hours ago
Comment by tanh 16 hours ago
Comment by Def_Os 14 hours ago
Comment by inshard 3 hours ago
Comment by MillionOClock 3 hours ago
Comment by inshard 3 hours ago
Comment by agentifysh 13 hours ago
It's almost as good as 5.2 and 4.5, but way faster and cheaper.
Comment by FergusArgyll 16 hours ago
Comment by NitpickLawyer 16 hours ago
Comment by jonathan_h 13 hours ago
I think part of what enables a monopoly is absence of meaningful competition, regardless of how that's achieved -- significant moat, by law or regulation, etc.
I don't know to what extent Google has been rent-seeking and not innovating, but Google doesn't have the luxury to rent-seek any longer.
Comment by deskamess 15 hours ago
Comment by concinds 16 hours ago
Comment by incrudible 15 hours ago
Comment by heliophobicdude 15 hours ago
Comment by JeremyHerrman 16 hours ago
I'm more excited to see 3 Flash Lite. Gemini 2.5 Flash Lite needs a lot more steering than regular 2.5 Flash, but it is a very capable model, and combined with the 50% batch-mode discount it is CHEAP ($0.05 input / $0.20 output per 1M tokens).
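Back-of-the-envelope at those batch rates, with a purely hypothetical workload of 10k requests at roughly 800 input and 200 output tokens each (my own illustrative numbers, not from the thread):
echo "scale=2; (10000*800/1000000)*0.05 + (10000*200/1000000)*0.20" | bc
# prints .80, i.e. roughly 80 cents for the whole batch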
Comment by jeppebemad 16 hours ago
Comment by summerlight 15 hours ago
Comment by nickvec 16 hours ago
Comment by evandena 12 hours ago
Comment by i_love_retros 11 hours ago
Comment by blitz_skull 10 hours ago
Comment by walthamstow 16 hours ago
Comment by jeffbee 16 hours ago
Comment by GaggiX 16 hours ago
Also, I don't see it written in the blog post, but Flash supports more granular settings for reasoning: minimal, low, medium, and high (like the OpenAI models), while Pro only has low and high.
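A minimal sketch of how that setting might be passed over the REST API; the thinkingConfig/thinkingLevel field names are my reading of the Gemini 3 docs rather than anything confirmed in this thread, so double-check them against the current API reference:
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-3-flash-preview:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"parts": [{"text": "Explain CRDTs in two sentences."}]}],
    "generationConfig": {"thinkingConfig": {"thinkingLevel": "minimal"}}
  }'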
Comment by minimaxir 16 hours ago
> Matches the “no thinking” setting for most queries. The model may think very minimally for complex coding tasks. Minimizes latency for chat or high throughput applications.
I'd prefer a hard "no thinking" rule over what this is.
Comment by skerit 16 hours ago
Wasn't this the case with the 2.5 Flash models too? I remember being very confused at that time.
Comment by JohnnyMarcone 15 hours ago
To me it seems like the big model has been "look what we can do", and the smaller model is "actually use this one though".
Comment by jug 16 hours ago
Comment by prompt_god 13 hours ago
Comment by timpera 15 hours ago
Also, I hate that I cannot put the Google models into a "Thinking" mode like in ChatGPT. When I set GPT-5.1 Thinking on a legal task and tell it to check and cite all sources, it takes 10+ minutes to answer, but it does check everything and cites all its sources in the text; whereas the Gemini models, even 3 Pro, always answer after a few seconds and never cite their sources, making it impossible to click through and check the answer. That makes them unusable for these tasks. (I have the $20 subscription for both.)
Comment by happyopossum 14 hours ago
Definitely has not been my experience using 3 Pro in Gemini Enterprise - in fact, just yesterday it took so long to do a similar task that I thought something was broken. Nope, just re-checking a source.
Comment by timpera 14 hours ago
Just tried once again with the exact same prompt: GPT-5.1-Thinking took 12m46s and Gemini 3.0 Pro took about 20 seconds. The latter obviously has a dramatically worse answer as a result.
(Also, the thinking trace is not in the correct language, and it doesn't seem to show which sources were read at which steps; there is only a "Sources" tab at the end of the answer.)
Comment by jijji 15 hours ago
Comment by anonym29 16 hours ago
Comment by oklahomasports 14 hours ago
Comment by anonym29 14 hours ago
You're not doing anything wrong. Everyone knows what you're doing. You have no secrets to hide.
Yet you value your privacy anyway. Why?
Also - I have no problem using Anthropic's cloud-hosted services. Being opposed to some cloud providers doesn't mean I'm opposed to all cloud providers.
Comment by happyopossum 14 hours ago
Anthropic - one of GCP’s largest TPU customers? Good for you.
https://www.anthropic.com/news/expanding-our-use-of-google-c...
Comment by moralestapia 16 hours ago
Comment by FpUser 10 hours ago
I am playing with Gemini 3, and the more I do, the more I find it disappointing when discussing both tech and non-tech subjects compared to ChatGPT. On non-tech topics it seems heavily indoctrinated, and when it cannot "prove" its point it abruptly cuts off the conversation. When asked why, it says: formatting issues. Did it attend weasel courses?
It is fast, I'll grant it that.
Comment by retinaros 13 hours ago
I just always thought the taste of the GPT or Claude models was more interesting in a professional context, and their end-user chat experience more polished.
Are there obvious enterprise use cases where the Gemini models shine?
Comment by andrepd 16 hours ago
Comment by mschulkind 16 hours ago
Comment by i_love_retros 11 hours ago
Comment by yieldcrv 8 hours ago
I hate adding -preview to my model environment variable
Comment by pancodecake 3 hours ago
Comment by inquirerGeneral 16 hours ago
Comment by Lucasjohntee 15 hours ago
Comment by pancodecake 6 hours ago
Comment by lyy123 7 hours ago
Comment by imvetri 16 hours ago
Comment by Tepix 16 hours ago
Comment by alex1138 10 hours ago
Comment by alex1138 8 hours ago
Comment by jdthedisciple 13 hours ago
ChatGPT still has 81% market share as of this very moment, vs Gemini's ~2%, and arguably still provides the best UX and branding.
Everyone and their grandma knows "ChatGPT"; who outside the developer bubble has even heard of Gemini Flash?
Yeah, I don't think that dynamic is switching any time soon.
Comment by int_19h 10 hours ago
Comment by riku_iki 13 hours ago
where did you get this from?
Comment by scrollop 13 hours ago
Comment by Topfi 11 hours ago
No matter the model, AI Overview/Results in Google are just hallucinated nonsense, only providing roughly equivalent information to what is in the linked sources as a coincidence, rather than due to actually relying on them.
Whether it's DuckDuckGo, Kagi, Ecosia or anything else, they are all objectively and verifiably better search engines than Google as of today.
This isn't new either, nor has it gotten better. AI Overview has been and continues to be a mess that makes it very clear to me that anyone claiming Google still has the "best" search engine, results-wise, is lying to themselves. Anyone saying Google Search in 2025 is good or even usable is objectively and verifiably wrong, and claiming DDG or Kagi offer less usable results is equally unfounded.
Either finally fix your models so they adhere to and properly quote sources, like your competitors somehow manage to, or, preferably, stop forcing this into Search.
Comment by sabareesh 9 hours ago
Comment by joecarpenter 9 hours ago
Gemini 3 Flash scored +13 in the test, more correct answers than incorrect.
Comment by andai 9 hours ago
Edit: Huh... It does score highest in "Omniscience", but also very high in Hallucination Rate (where higher score is worse)...
Comment by sabareesh 9 hours ago