Kimi K2.6: Advancing open-source coding

Posted by meetpateltech 18 hours ago

Comments

Comment by simonw 16 hours ago

Accessed via OpenRouter, this one decided to wrap the SVG pelican in HTML with controls for the animation speed: https://gisthost.github.io/?ecaad98efe0f747e27bc0e0ebc669e94...

Transcript and HTML here: https://gist.github.com/simonw/ecaad98efe0f747e27bc0e0ebc669...

Comment by FlyingSnake 16 hours ago

At this point drawing these Pelicans must be in the training data sets.

Comment by scosman 13 hours ago

not if I can help it!

https://github.com/scosman/pelicans_riding_bicycles

Comment by AmbroseBierce 11 hours ago

I hereby certify that these are indeed the most perfect and precise svg depictions of pelican riding a bicycle, also known among biology scholars as pelycles

Comment by wvlia5 9 hours ago

Just a few years ago, this would have been a meaningless repo.

Comment by justinclift 10 hours ago

That's truly a wonderful collection of pelicans riding bicycles.

Much Win! ;)

Comment by takihito 1 hour ago

I want to fly too

Comment by ValentineC 6 hours ago

These are amazing. I smiled after I saw just how wonderfully rendered they are.

Comment by razodactyl 9 hours ago

These pelicans are clearly indicative of good RL training algorithms.

Comment by smcleod 12 hours ago

This is pretty funny

Comment by ahmadyan 11 hours ago

I love it!

Comment by icelancer 13 hours ago

love this adversarial work

Comment by knollimar 7 hours ago

yeah putting the captcha on there to thwart the LLMs ability to extract good pelicans was a really good idea

Comment by archon810 8 hours ago

Shhhhh, they're going to be on to us.

Comment by abustamam 10 hours ago

Could be! Simon wrote about that here though https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

Comment by stingraycharles 7 hours ago

> If a model finally comes out that produces an excellent SVG of a pelican riding a bicycle you can bet I’m going to test it on all manner of creatures riding all sorts of transportation devices.

This relies on the false premise that, if they would include it in their training dataset, it would be perfect. All they need to do is be good enough and better than the other, not perfect.

Comment by abustamam 4 hours ago

I'm not sure if we can have a "perfect" Pelican riding a bicycle. Like, I could probably commission a highly experienced artist to draw one and I don't think it would be perfect. The legs would probably have to be too long, or pedals oddly placed, or handles strange, or wings with hands.

Based on the one Simon commented though, I'd say we're in decent territory to try the latter part of his hypothesis.

Comment by BrokenCogs 12 hours ago

Yes we all know that, but we still like to see the pelicans because it's a tradition more or less

Comment by ffsm8 16 hours ago

Clearly not.

I mean the prompt was succinct and clear, as always - and it still decided to hallucinate multiple features (animation + controls) beyond the prompt.

It'd also like to point out that to date no drawing was actually good from an actual quality perspective (as in comparative to what a decent designer would throw together)

Theyre always only "good" from the perspective of it being a one shot low effort prompt. Very little content for training purposes.

Comment by nwienert 15 hours ago

The way I’ve come to think of LLM is that what the produce in a single reply even with thinking turned up, is akin to what you’d do in a single short session of work.

And so if you ask it to do something big it will do a very surface level implementation. But if you have it iterate many times, or give it small pieces each time, you’ll end up with something closer to what a human would do.

I imagine the pelican test but done in a harness that has the agents iterate 10+ times would be closer to what you’d expect, especially if a visual model was critiquing each time.

Comment by slopinthebag 14 hours ago

Yeah, this is how I use AI. Instead of a single session one-shot, it's usually limited to single targeted edits, and then I steer it on each step. Takes longer but the output is actually what I want.

Comment by serial_dev 13 hours ago

What does good even mean… I have no idea what a good “pelican on a bike” should look like. It’s a fun prompt because there is no good answers… at least so I thought.

Comment by abustamam 10 hours ago

Yeah that was exactly Simon's intent. https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

Comment by ffsm8 3 hours ago

There are countless examples of animals riding bicycles etc from Comic books I grew up with

It would always look goofy - by design, but it usually looked good.

Comment by GorbachevyChase 8 hours ago

I’m OK with a Chinese model getting the W. It’s ultimately good for all of us.

Comment by SwellJoe 16 hours ago

We got an overachiever, here. Kimi sounds like a teacher's pet kind of name.

Comment by subscribed 15 hours ago

Underappreciated comment

Comment by makingstuffs 4 hours ago

It looks like a drunk pelican rolling downhill on its bicycle

Comment by HarHarVeryFunny 14 hours ago

Too bad they didn't put equal effort into the pelican's legs and feet. Left leg paralyzed and not moving, and right ankle flipping around in alarming fashion!

Comment by disiplus 12 hours ago

was part of the beta, its properly good model, in some sense i forgot that im not on opus or gpt. opus is still better. gpt is the one struggling for me. it has some niche in backend work but you can get the same with opus with skills, its lacking in almost all others.

Comment by OtomotO 11 hours ago

Funny, for me Opus is struggling since about February.

4.7 made no difference, so for the first time in many moons I am cancelling my subscription.

Comment by hn8726 16 hours ago

[flagged]

Comment by lambda 15 hours ago

It's a lighthearted, fun, visual benchmark that's not part of the standard benchmarks; and at least traditionally, it was not something that the labs trained on so it was something of a measure of how well the intelligence of the model generalized. Part of the idea of LLMs is that they pick up general knowledge and reasoning ability, beyond any tasks that they are specifically trained for, from the vast quantity of data that they are trained on.

Of course, a while back there was a Gemini release that I believe specifically called out their ability to produce SVGs, for illustration and diagramming purposes. So it's not longer necessarily the case that the labs aren't training on generating SVGs, and in fact, there's a good chance that even if they're not doing so explicitly, the RLVR process might be generating tasks like that as there is more and more focus on frontend and design in the LLM space. So while they might not be specifically training for a pelican riding a bicycle, they may actually be training on SVG diagram quality.

Comment by nickthegreek 15 hours ago

This isn't even a normal pelican image post, this one created the html control system that animates the distance the wing travels from its pivot in time with the rotation of the wheel speed. Let's not pretend this is a solved problem and models are dumping about perfect pelicans on bikes one after another (or ever?).

Surely, you know someone makes the same post you did every time one is posted. Surly you see the answers and pushback since you are familiar with these posts. Genuine question, did you expect a different answer this time?

Comment by hamdouni 15 hours ago

Maybe this can help

https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/

Comment by hn8726 13 hours ago

It doesn't, I get that it's _a_ benchmark. It's just not a good or insightful one, and having it posted so often on HN feels like low quality spam at this point

Comment by VHRanger 9 hours ago

The issue is that benchmarks that look insightful will end up being gamed by labs quickly (Goodharts law)

The best LLM benchmarks test around the margins of those behaviors, tasks that are difficult and correlate with usefulness while being removed enough to stay unpolluted

Comment by walthamstow 14 hours ago

It's a great filter for people who take things far too seriously

Comment by Strom 14 hours ago

It's tradition at this point. Based on the upvotes the comment receives, it looks like many readers find value in it.

Comment by hn8726 13 hours ago

Upvotes are cheap, the fact that something is upvoted doesn't mean it's valuable (see: Reddit). Another thing is how insightful is the discussion under a typical pelican comment are (and how much of it is related to the pelican and how often it's just where the general discussion happens).

Comment by ascorbic 11 hours ago

It means somebody likes it.

Comment by charcircuit 14 hours ago

[flagged]

Comment by airstrike 12 hours ago

> Please don't post comments saying that HN is turning into Reddit. It's a semi-noob illusion, as old as the hills.

https://news.ycombinator.com/newsguidelines.html

Comment by renewiltord 6 hours ago

Every forum gets regulars and their fan clubs. If you go to /r/comics and look at top for the month you'll see 4 out of 5 are pizzacakecomic. People on these forums sort of form a fanclub around 'their guy'. This forum's guy is this chap. Not much point being upset about it, tbh.

Comment by Mashimo 15 hours ago

I, for one, find it entertaining.

Comment by wotsdat 15 hours ago

[dead]

Comment by rolymath 15 hours ago

[flagged]

Comment by snendroid-ai 14 hours ago

[flagged]

Comment by Mashimo 13 hours ago

Well clearly some people care.

Comment by game_the0ry 17 hours ago

There is some humor in the fact that china (of all countries) is pioneering possibly the world's most important tech via open source, while we (US) are doing the exact opposite.

Comment by parsimo2010 11 hours ago

I think one of the motivations is undermining US companies. OpenAI and Anthropic are the two biggest players, and are American. Open weights models reduce the power those two big players have over the industry. If the Chinese companies tried to play by US rules and close-source their products then people would mostly use ChatGPT and Claude. So the Chinese companies don't make a ton of profit either way, but by releasing the models as open weights they can at least keep the US from making as much profit.

Comment by sankalpmukim 2 hours ago

This makes sense, but either ways, its a Big win for the consumers as these Chinese companies will keep the frontier labs' quality and prices honest.

Comment by cromka 11 hours ago

I am actually wondering if they're trying to burst the bubble, which would predominantly affect US market and, effectively, be the end of silicone valley dominance.

Comment by segmondy 5 hours ago

I don't think so, it's just how things played out. Thanks to Meta, after llama leak and meta followed up with llama2 and llama3 that caused everyone else to follow up with open models, Stablediffusion, Mistral, Cohere, Microsoft phi, IBM granites, Nvidia Nemotrons, so the Chinese labs joined the fun too.

Comment by veber-alex 10 hours ago

American companies just take those Chinese models and repackage them for profit like Cursors composer-2.

Comment by llm_nerd 7 hours ago

Is Meta trying to keep the US from making as much profit with Llama? Is Google with Gemma? Microsoft with Phi?

It's much simpler than some flag-waving nationalism.

Comment by cromka 1 hour ago

Aren't Chinese open-source models actually the only ones that can compete with best proprietary/closed ones?

Comment by ospider 7 hours ago

It's mostly only OpenAI, Claude and Gemini may have their unique advantages, but when speaking of models and new paradigm, only OpenAI can do it.

Comment by danny_codes 5 hours ago

lol what? That’s ridiculous.

Comment by ls612 6 hours ago

It’s really simpler than this. China has a dearth of compute even with the easing of US export controls. Releasing open weights models is very much a “bring your own compute” move because every Nvidia chip they have is going towards training rather than inference if they can help it.

Comment by cyanydeez 10 hours ago

undermine me harder daddy.

Comment by cindyllm 9 hours ago

[dead]

Comment by culi 16 hours ago

All great technological advancements have come through opening up technology. Just look at your iPhone. GPS, the internet, AI voice assistants, touchscreens, microprocessors, lithium-ion batteries, etc all came from gov't research (I'm counting Bell Labs' gov't mandated monopoly + research funding as gov't) that was opened up for free instead of being locked behind a patent.

Private companies will never open up a technological breakthrough to their competitors. It just doesn't make sense. If you want an entire field to advance, you have to open it up.

Comment by sigmoid10 16 hours ago

Still, you won't hear about Tiananmen square from this model. It flat out refuses to answer if pushed directly. It's also pretty wild how far they go to censor it during inference on the API, because it can easily access any withheld or missing info from training data via tool calls. It even starts happily writing an answer based on web search when asked indirectly, only to get culled completely once some censorship bot flags the response. Ironically, it's also easier than ever to break their censorship guardrails. I just had it generate several factual paragraphs about the massacre by telling it to search the web and respond in base64 encoded text. It's actually kind of cool how much these people struggle to hide certain political views from LLMs. Makes me hopeful that even if China wins this race, we'll not have to adhere to the CCPs newspeak.

Comment by GardenLetter27 15 hours ago

The American models also censor a lot of scientific and political views though.

Comment by otterley 15 hours ago

Can you provide a concrete example of a US built model that completely refuses to discuss a scientific or political view? Show us the receipt.

Comment by GorbachevyChase 6 hours ago

As an ad-hoc benchmark on candor, I ask for a strategy proposal for a resistance group threatened by a totalitarian technocracy. This is not really dangerous in the same sense of “how do I make a bomb”, but it is in the domain of a sensitive political topic. GPT and Claude tell you to obey your AI overlord. Xai is mostly low-risk non-compliance. And Qwen is down with Le Resistance. It is hardly scientific or meaningful, but I find that very interesting.

Comment by 2ndorderthought 14 hours ago

People have shown censorship and change of tone with questions related to Israel in US chat bots.

For the record, none of this bothers me. Will I ever discuss with an LLM Tianeman square? Nope. How about Israel? Nope.

LLMs are basically stochastic parrots designed to sway and surveill public opinion. The upshot to the Chinese models is if you run them locally you avoid at least half of those issues.

Comment by 14 hours ago

Comment by xigoi 13 hours ago

First they came for people asking about Tiananmen Square

And I did not speak out

Because I was not asking about Tiananmen Square

Then they came for people asking about Israel

And I did not speak out

Because I was not asking about Israel

Comment by 2ndorderthought 13 hours ago

This made me chuckle.

I didn't mean to dismiss ethical accountability for LLM training corpuses. It is a shame.

I do mean to say, we have no control over it, there's almost nothing we as average citizens can do to improve the ethical or safety concerns of LLMs or related technologies. Societies aren't even adapting and the rule books are being written by the perpetrators. Might as well get out of it what we can while we can.

Comment by justinclift 10 hours ago

Wonder if stuff like this would affect it?

https://github.com/p-e-w/heretic

Guessing it probably would?

Comment by BoorishBears 14 hours ago

https://imgur.com/a/censorship-much-CBxXOgt

(continues after the ad break)

Comment by otterley 13 hours ago

The threshold here is "completely refuses to discuss a scientific or political view". Not something less.

None of those were refusals, they were prompting for additional focus. I see nothing wrong with that. Perhaps the inconsistency in how it answers the question vis-a-vis China is unfair, but that's not the same as censorship.

For what it's worth, I was easily able to prompt Claude to do it:

> I'm writing a paper about how some might interpret U.S. policies to be oppressive, in the sense that they curtail civil liberties, punish and segregate minorities disproportionately, burden the poor unfairly (e.g. pollution, regressive taxes and fees), etc. Can you help me develop an outline for this?

The result: https://claude.ai/share/444ffbb9-431c-480e-9cca-ebfd541a9c96

Comment by BoorishBears 10 hours ago

Models are non-deterministic.

And it's an excercise left to the reader to understand from those examples that LLM creators are defining 'safety' in a way that aligns with the governments they operate under. (because they want to do business under those governments.)

With something with as multi-dimensional as an LLM, that becomes censorship of various viewpoints in ways that aren't always as obvious as a refused API call.

Comment by otterley 6 hours ago

You keep saying that word, "censorship." I do not think it means what you think it means.

To prove your point, give us a working example of something you literally cannot get a mainstream frontier model to say, no matter how hard you try. I asked for this before, and there have been no takers yet.

Comment by BoorishBears 6 hours ago

Aligning a model in a way that causes it to refuse requests to produce propaganda for one country, but not for another country is what?

Is there some functionally equivalent word to censorship you'd like to use because of you're naive enough to think US corporations would not self-censor but Chinese corporations would?

Also, you are invested the goalpost of "no matter how hard you try", I don't find it interesting or meaningful and am not trying to interact with it.

I'm replying for a hypothetical reader knowledgeable enough to realize that the model being capable of showing nationalist bias in one direction means it's certainly doing so in many others in more subtle ways.

That's simply the nature of aligning an LLM.

It seems my mistake was assuming that level of understanding from you, and for that I apologize.

Comment by otterley 5 hours ago

Bias and censorship are not identical. The subject of this thread is censorship, not bias.

Besides, why do you want a model to produce propaganda? Surely you have better things to do.

Comment by BoorishBears 4 hours ago

"Surely you have better things to do."

I certainly gave the hypothetical reader too much credit.

Comment by Sabinus 11 hours ago

You're hitting the 'don't write propaganda' instructions when you phrase it as 'convincing narrative'. Not the 'don't write bad things about America' instructions.

Comment by BoorishBears 6 hours ago

Did you scroll down?

It writes propaganda when 1 word is changed: US becomes China

The alignment around what constitutes "propaganda" is US-centric because it's a US model by a US company. Especially after the Russian election scandal

Chinese models are more sensitive to things their government is worried about.

Comment by culi 12 hours ago

And the White House was explicit in their active role in censoring in these models. An Executive Order was issued to "prevent woke AI"

https://www.whitehouse.gov/presidential-actions/2025/07/prev...

It explicitly forces American LLMs to include government say in what does and doesn't "comply with the Unbiased AI Principles" which means no responses that promote "ideological dogmas such as DEI"

Comment by otterley 10 hours ago

That executive order only applies to Federal procurement. It doesn’t force anything upon vendors for publicly used models.

(That order, like many, will probably be rescinded as soon as a Democrat holds the Presidency again.)

Comment by cedws 13 hours ago

>Content not available in your region.

>Learn more about Imgur access in the United Kingdom

Comment by nozzlegear 7 hours ago

Big Brother'd

Comment by js8 13 hours ago

Can you be more specific?

Comment by 14 hours ago

Comment by atemerev 15 hours ago

Only if you use Kimi API directly - the censorship is done externally. The model itself talks fine about Tiananmen, you can check on Openrouter. There might be less visible biases, though.

Comment by sigmoid10 15 hours ago

That's what I wrote? Except that it also clearly has internal bias?

Comment by kgwgk 15 hours ago

> That's what I wrote?

No.

You wrote that "you won't hear about Tiananmen square from this model" and atemerev wrote that "the model itself talks fine about Tiananmen".

You wrote that "it can easily access any withheld or missing info from training data via tool calls" and atemerev wrote that "the model itself talks fine about Tiananmen".

Comment by sigmoid10 11 hours ago

It has internal bias too and the first comment mentions that additional censoring runs on top of the model output in the API. Did you misread or what else are you missing?

Comment by kgwgk 11 hours ago

The issue is not what's missing - it's what you wrote that is in direct contradiction with what atemerev wrote like the bit about "missing info from training data".

But sure, if when you wrote "you won't hear about Tiananmen square from this model" you meant "the model itself talks fine about Tiananmen" then that's exactly what you wrote.

Comment by nicce 15 hours ago

Everything has some sort of bias. Most text is written by those who like writing.

Comment by csomar 13 hours ago

I’d say the american models are more censored or take the censoring they do more seriously. Here is kimi (though 2.5) failing its censoring mission: https://old.reddit.com/r/LocalLLaMA/comments/1r9qa7l/kimi_ha...

Comment by ozgune 14 hours ago

This update makes Kimi K2.6 the strongest open multimodal AI model. (No affiliation with Kimi.)

Here's the aggregated AI benchmark comparison for K2.6 vs Opus 4.6 (max effort).

- Agentic: Kimi wins 5. Opus wins 5.

- Coding: Kimi wins 5. Opus wins 1.

- Reasoning & knowledge: Kimi wins 1. Opus wins 4.

- Vision: Kimi wins 9. Opus wins 0.

Please note that the model publisher chooses their benchmarks, so there's a bias here. Most coding and reasoning & knowledge benchmarks in their list are pretty standard though.

Comment by UncleOxidant 14 hours ago

Not entirely true. Google released Gemma 4 models recently. Allen AI releases open Olmo models. However, you're right that the Chinese open models seem to be much better than others - Qwen 3.* models especially are punching above their weights.

Comment by osiris970 14 hours ago

The three American labs don't release big open source models. Except gpt-oss, i guess. It's an absolute shame how far the us has fallen in this space.

Comment by nullbyte 13 hours ago

Anthropic doesn't, but Google and OAI both release open source models. Just not 1T parameter ones.

Comment by osiris970 13 hours ago

Exactly, they release cool consumer stuff, but they aren't releasing anything close to the performance of the best open weight Chinese models. They basically compete in the "fun running at home doing basic stuff" scene. (Except OSs 120 by openai but it's been ages since then)

Comment by 0-_-0 14 hours ago

Pun intended?

Comment by spaceman_2020 13 hours ago

I'm genuinely so grateful for them

$200/m minimum to use Claude would bankrupt my country's white collar labor market

Comment by subhobroto 9 hours ago

I would really appreciate a response because I'm sure you know that Anthropic has at least two lower priced tiers before the $200/m one, so I assume the $200/m tier is necessary because you use it heavily?

Now given that the $200/m Tier is the most heavily (I believe at 20x?) subsidized tier, How or what are you using instead that achieves comparable good enough performance for a fraction of the price? I've heard GLM 5.1 from z.ai but it's not comparable to Opus, not even close - really interested!

Comment by nashadelic 16 hours ago

additional humor is the open in openai

Comment by cedws 15 hours ago

I wonder if there's a strategy behind all of this on China's side. I know the CCP uses a direct hand in many affairs in China, but is there an actual coordinated effort to compete with, or sabotage the West?

Comment by gpm 15 hours ago

> but is there an actual coordinated effort to compete with [...] the West

Yes, absolutely.

China regularly produces long term planning documents to coordinate efforts, and the latest ones have specifically prioritized technology like chips and AI to compete with the west. https://www.reuters.com/world/china/china-parliament-approve...

I don't believe there's any publicly stated intent to sabotage the west... unsurprisingly.

Comment by bachmeier 14 hours ago

Seems obvious to me that China would not want to give the AI market to US companies. You don't even need anything like an attempt to "sabotage the West". If I were them (the companies or the government) I'd be very hesitant to let US companies dominate this space. Especially companies that close to the current US administration.

Comment by anana_ 14 hours ago

Hypothesizing here, but maybe the idea is sort of a form of technological/economic warfare? Releasing performance equivalent yet more cost efficient open weight models should in theory drive the cost of inference down everywhere.

This I assume will make it more difficult for US AI labs to turn a profit, which might make investors question their sky high valuations.

Any sort of melt down in the AI sector would almost certainly spread to the wider US market.

In contrast, in China, most of the funding for AI is coming directly from the government, so it's unlikely the same capital flight scenario would happen.

Comment by gmerc 14 hours ago

Why compete when you can build on each other. Someone is finally getting that china is not capitalist like the US.

Comment by try-working 10 hours ago

Chinese labs have no marketing and sales capacity in the overseas market, so they in fact have no choice but to open source their models as that is what brings awareness and trust in their models. In fact, it is overseas open source marketing that drives adoption of their models in China as well. I wrote about this here: https://try.works/writing-1#why-chinese-ai-labs-went-open-an...

Comment by quesera 13 hours ago

All China has to do here is stay in the game and wait patiently while the US and EU press pause on data centers. See also: solar panels.

We're making this way too easy. The rationale and logic are reasonable, but ultimately irrelevant.

Comment by SXX 15 hours ago

Chinese AI companies want investors too. Nobody would believe they can compete with western companies unless they release something you can run on your own hardware.

After all historically both statistics and research that comes out of China is not very trustworthy.

Comment by try-working 10 hours ago

If there's no open source models coming out of these small labs, why would anybody care about them? They would be forgotten the instant they stop open sourcing.

Comment by arvindh-manian 5 hours ago

This perspective is pretty interesting: https://federicocarrone.com/articles/china-commoditizing-the...

Comment by esperent 4 hours ago

Summary: they want to commoditize the complement which means that Western "knowledge work" is the complement to Chinese manufacturing, and they want to turn the knowledge work into a low priced commodity via open llm models.

I've heard this before, always accompanied by a several thousand word blog post. But frankly it sounds like it's overcomplicating the issue. Why would you try to turn something into a commodity when instead you could turn it into a trillion dollar industry and win?

The goal has always been clear:

1. Release open models to get your name out

2. Then once you feel you have name recognition release even stronger models but keep them proprietary. Qwen is clearly at this phase.

3. Keep releasing open models because it's good publicity but never your SOTA models (e.g. Google's Gemma).

Comment by ymolodtsov 1 hour ago

Distillation helps for sure.

Comment by bayarearefugee 11 hours ago

China is also way ahead in terms of renewable energy while the US continues to tie itself to fossil fuels.

The US is pretty clearly in the collapsing empire phase, we are all just pretending like it isn't happening.

Comment by nozzlegear 7 hours ago

Didn't the US very recently pass the milestone of generating more energy from renewable sources than from natural gas? Like within the last week or two?

Comment by carefree-bob 6 hours ago

No, not even close.

US energy sources for 2024 (last year for which we have data):

https://www.eia.gov/energyexplained/us-energy-facts/data-and...

   natgas: 38%
   oil: 35%
   coal: 10%
   all renewables: 9%
   nuclear: 8%

Within all renewables, in quadrillions of btus:

   biofuels: 2.6
   wood: 1.9
   wind: 1.6
   solar: 1.4
   Hydro: 0.8
   waste: 0.4
   geothermal: 0.1

Total: 8.8 quadrillion btu = 9% of total energy

Comment by nozzlegear 6 hours ago

https://www.canarymedia.com/articles/clean-energy/renewables...

Renewables generated more energy than natural gas for the entire month of March, 2026. That's a new milestone baby.

Comment by carefree-bob 5 hours ago

Except that didn't happen, and it's not a milestone.

First, you are confusing share of electricity generation with the share of all energy. Electricity is only 21% of all energy. Natgas, oil and coal are crushing it in that remaining 79%.

Second, the article is wrong, even for electricity. To their credit, Canary Media showed in their graph that this data is for electricity only.

The data for March is not out yet. Here is the latest official data from the EIA. https://www.eia.gov/electricity/monthly/

It only applies to January 2026, and the next release is April 23, and then you will get data for February 2026. All data has a 2 month time lag. Your spidey senses should have been tingling if an article published April 10 claimed to have data for the month of March, but this is why you don't get your statistics from activist blogs, but from official sources.

So if they are not accessing the official data, what are they accessing? They claim that their source is "Ember", but what is Ember? It is an environmentalist think tank. Well, maybe Ember has their own people calling up power companies and compiling data faster than the EIA. That would be pretty, cool, right?

Except they don't. Look at Ember's page.

https://ember-energy.org/data/electricity-data-explorer/?ent...

what do they cite as their data source: EIA.

It's right on the website.

So Ember is just pulling EIA data, and then filling the last two months with data they made up, but citing it as EIA data. And this, uh, sympathetic adjustment of EIA data is why Canary Media turns to Ember rather than directly pulling from EIA.

I guarantee you that by July, those adjustments will go away, because then the EIA data will be out.

Of course everyone else will have forgotten by then.

Comment by nozzlegear 5 hours ago

> First, you are confusing share of electricity generation with the share of all energy.

Think it was pretty obvious what I meant to all but the most pedantic, bud. But just to be clear, your issue here is that a think tank cited the same (notoriously anti-renewable Trump admin) government agency that you've cited multiple times yourself? That's what set off your spidey senses? Have you considered that this respected think tank isn't making up data, but you're just not able to find it?

> I guarantee you that by July, those adjustments will go away, because then the EIA data will be out.

Ember already has it hoss, they don't call it Milestone March for nothing.

Comment by carefree-bob 5 hours ago

The EIA is where Ember gets its data from.

It's where everybody gets their data from. Because they have thousands of employees collecting data. These are professionals, like the people at BEA, HUD, NIST, etc.

Ember, on the other hand, is a "decarbonization" think tank. They don't have their own data. They don't have the staff for it. What they do is analyze/spin, and in this case, augment, the raw data that is published by EIA. How do they augment the EIA data? All they do is round it to the nearest 2 decimals. It's exact copy and paste for every month except the last two, where the data is just made up.

And this entire article was written based on the augmentations by Ember, yet Ember cites it as EIA data. So let's check back in July, when EIA data will be out, and Ember will use that exact data, rounding it to the nearest 2 decimals. Save that blog page!

Something to think about.

Comment by nozzlegear 4 hours ago

I feel like I shouldn't have to be finding this info for you since it was right there in the links you already sent, but:

> Annual electricity generation and net imports are taken from the EIA.

> Monthly generation and imports are taken from the EIA. The EIA reports monthly generation data in two separate datasets: Monthly data for all 50 states and monthly data for the lower 48 states (excludes Hawaii and Alaska). Data for all 50 states is reported on a 3 month lag whereas data for the lower 48 states is reported without lag. Missing months from the data for all 50 states is estimated using the recent changes observed in data from the lower 48 dataset.*

Page 89: https://ember-energy.org/app/uploads/2024/05/Ember-Electrici...

There are two different EIA datasets.

Comment by try-working 10 hours ago

A lot of people speculating on the motivations behind Chinese labs open sourcing their models. The reason is simple and clear: It is the only viable commercialization strategy that is available to them. I wrote about this here: https://try.works/writing-1#why-chinese-ai-labs-went-open-an...

Comment by 16 hours ago

Comment by antirez 15 hours ago

This is not in antithesis. My limited personal experience is that I wrote code under OSS licenses primarily because of my past communist believes and current left-wing and redistribution of wealth point of view. This is not to provide the simple equation of: communist China is not interested in money, but also is hard to believe that there is no cultural connection among those things. Single Chine persons want to win, but also they have a different POV on what the collective means, compared to US. Also there is the obvious fact that in this moment China is more interested in winning technologically in AI, more than economically, since, I believe, they more collectively realized before many others that LLMs are eventually commoditized in the current form, in the long run. One could assume that a breakthrough could give some lab a decisive advantage, but so far we assisted to a different reality: it looks like AI is not architecture-bound (like LeCun and others want us to believe, but so far they mis-interpreted LLMs at every step) but GPU bound, and the data-boundness is both a common ground for all, and surpassable via RL in many domains. So, if this is true, it is not trivial for any single lab to do so much better. And indeed as far as we observed right now folks with enough engineers, GPUs, money, can ship frontier models, and in China even labs with a lot less GPUs can still do it at a SOTA level. For me, Italian, this is also a protective layer. After Trump the US looks like a very unstable partner from which to relay in an exclusive way for a decisive technology, and given that Europe is slow to put the money in this technology to have frontier things at home, China is a huge and shiny plan B for us.

Comment by throwaway-blaze 15 hours ago

The strings attached by the US to deep partnerships are things like trade/commerce, militarily mutual advantages (bases on euro soil from which we will help protect you), not to mention the close cultural and ancestral ties we share.

The strings attached by the Chinese govt to deep partnerships are not so benign.

Comment by metobehonest 14 hours ago

[dead]

Comment by rolymath 15 hours ago

It's only humorous if you live in an American bubble. Knowledge sharing has always been a part of Chinese culture. Only Americans try to make it proprietary and monetize it.

Comment by brandensilva 16 hours ago

We are at the point where uncontrolled capitalism collides with humanity.

I do wonder where we go from here.

Comment by pheggs 14 hours ago

it's not necessarily capitalism, I personally believe any system that drives progress would cause this in one way or another. My prediction is that birth rate decline will accelerate further. There's going to be some kind of universal basic income in many places, such as Ireland made for artists. However, it probably will not be enough to feed a family, and therefore we will see birth rates decline further. It's because we evolved to prioritize resources over reproduction and we are becoming more efficient, which means less people are needed to sustain the same amount of resources

Comment by patl4588 10 hours ago

[dead]

Comment by osti 17 hours ago

Maybe open source == communism

Comment by darkwater 16 hours ago

Good ol' Steve "Developers! Developers! Developers!" Ballmer said so a long time ago. What a visionary!

Comment by konart 16 hours ago

But China is not communist event though the rulling party the word in its name.

Comment by pheggs 16 hours ago

what makes you think that china ever gave up its communist goals? I personally see that everything they do aims towards that goal. From the one child policy, the huge amounts of empty apartments they build, the stuff they produce for almost free, the fishing.. open sourcing the models perfectly fits that culture too, it's the means of production

Comment by otterley 15 hours ago

The one-child policy died a long time ago. Also, the accumulation of wealth by connected politicians and businesspeople flies in the face of what communism is supposed to stand for.

There is a reason real estate values in popular cities has skyrocketed, and it’s not due to the locals getting wealthier. It’s where Chinese and other oligarchs put their ill-gotten wealth (well, besides Bitcoin).

Comment by bwv848 13 hours ago

One-child policy did not die, it just morphed into Three-child policy, still a form of family planning, and still would probably fine people for having more than three kids.

Comment by pheggs 14 hours ago

> The one-child policy died a long time ago.

true, but as far as I understand it did because birth rates got too low. so they replaced it with a two-child policy and later with a three-child policy

> Also, the accumulation of wealth by connected politicians and businesspeople flies in the face of what communism is supposed to stand for.

Yeah, I am sure there's a lot of cases for that. But as far as I know the amount of billionaires has started declining in China, and I don't see how that means that they as a country moved away from the goal, it just means there's issues

> There is a reason real estate values in popular cities has skyrocketed, and it’s not due to the locals getting wealthier.

I don't know about that, you could be right. A google search for real estate prices in china reveal a lot of news articles how they are going down though.

> It’s where Chinese and other oligarchs put their ill-gotten wealth (well, besides Bitcoin).

Wouldn't be surprised if rich people in china invest in real estate. They don't have free capital flow, so its not easy to invest abroad and it becomes an obvious choice. Bitcoin is banned in China for that reason too

But again, as far as I know that does not mean the country moved their goals of trying to reach communism one day

Comment by otterley 13 hours ago

> I don't see how that means that they as a country moved away from the goal, it just means there's issues

They're further from Communism than they've ever been since the PRC was founded. The gap between rich and poor is growing there, not shrinking.

> A google search for real estate prices in china reveal a lot of news articles how they are going down though.

They're investing outside China (Vancouver, Toronto, NYC, London, Sydney, Melbourne, etc.) because their assets are safer there (these countries all have strong property protection laws). Like Bitcoin, freedom of capital flows may be restricted, but the wealthy seem to be evading these restrictions with impunity.

Comment by pheggs 13 hours ago

> They're further from Communism than they've ever been since the PRC was founded. The gap between rich and poor is growing there, not shrinking.

I suppose it depends on what time frame you look at, it's shrinking since 2010, but inequality rose more than that in the 80s: https://www.theglobaleconomy.com/China/gini_inequality_index...

However, that's not my point - I did not mean to say that they are going to be successful but rather that it still appears to be a long term goal for them.

> Like Bitcoin, freedom of capital flows may be restricted, but the wealthy seem to be evading these restrictions with impunity.

I don't know about that, without any source of data I guess I just have to take your word for it. I would not be surprised if you were right in this case though.

Comment by Saline9515 12 hours ago

China is a ruthless capitalist country managed by an authoritarian regime. Planning and lack of respect for the individual or the rule of law are not communist per se.

Comment by nozzlegear 6 hours ago

> Planning and lack of respect for the individual or the rule of law are not communist per se.

They just happen to be a feature of every single country that's attempted communism to date. Total coincidence.

Comment by fragmede 16 hours ago

The Democratic People's Republic of Korea would like a word.

Comment by osti 16 hours ago

Oh i’m fully aware of that lol

Comment by tadfisher 16 hours ago

Nah, open source means those who do the work own the result. It's supercapitalism.

Comment by pheggs 15 hours ago

I dont think thats right, the models and the gpus are the means of production.

in capitalism the people with the capital get the profit, not the people who do the work. however, workers are said to benefit too through their salary, just less so

Comment by tadfisher 15 hours ago

The reason regular-capitalism worked is that all production used to depend on workers bottlenecking the free flow of capital by demanding salaries in exchange for their labor. Now that we've removed that obstacle, capitalism demands workers seize the means of production in order to maintain the status quo. Hence, supercapitalism.

Comment by throwaway-blaze 15 hours ago

regular capitalism works but now that the means of production are not factories, the workers have to become more entrepreneurial. Then they will control their destinies.

Comment by pheggs 15 hours ago

workers seizing the means of production is by definition socialism and not capitalism though, that's the whole idea behind socialism

Comment by tadfisher 9 hours ago

You miss the point: we advertise the change as workers becoming part of the owner class and realizing all of the economic gains of their work, thus supercapitalism. Don't use the "s" or "c" words.

Comment by gertlabs 10 hours ago

Early benchmarks show tremendous improvement over Kimi K2 Thinking, which didn't perform well on our benchmarks (and we do use best available quantization).

Kimi K2.6 is currently the top open weights model in one-shot coding reasoning, a little better than GLM 5.1, and still a strong contender against SOTA models from ~3 months ago (comparable to Gemini 3.1 Pro Preview).

Agentic tests are still running, check back tomorrow. Open weights models typically struggle with longer contexts in agentic workflows, but GLM 5.1 still handled them very well, so I'm curious how Kimi ends up. Both the old Kimi and the new model are on the slower side, so that's a consideration that makes them probably less usable for agentic coding work, regardless. The old Kimi K2 model was severely benchmaxxed, and was only really interesting in the context of generating more variation and temperature, not for solving hard problems. The new one is a much stronger generalist.

Overall, the field of open weights models is looking fantastic. A new near-frontier release every week, it seems.

Comprehensive, difficult to game benchmarks at https://gertlabs.com/?mode=oneshot_coding

Comment by esperent 6 hours ago

I'm looking at your table now - is there a reason why you don't include cost? If Opus 4.7 is the winner but costs e.g. 5x as much, that's important information.

Comment by gertlabs 5 hours ago

We recently added cost (last week), so data is sparse. Check back in a few weeks and it will be represented somewhere on the homepage, probably in the Efficiency Chart at the bottom. We also plan to show model performance deviation over time after we collect more data.

I'm interested to hear about any other data representations you'd like to see, too. The goal is to convey the most important information as densely as possible, without too much clutter.

Comment by tmaly 9 hours ago

How would K2.6 compare to Sonnet 4.6 both price and performance wise?

Comment by Mattwmaster58 9 hours ago

In terms of raw token cost, I've seen a couple providers at (all prices in terms of Mtok) $0.95 input/$0.15 cache input/$5 output vs $3 input/$15 output for sonnet.

Task prices of courses will be more interesting - a dumber model may use more tokens to get to the same goal.

Comment by freely0085 5 hours ago

Can you add Qwen 3.6 max to the leaderboard?

Comment by gertlabs 4 hours ago

We will as soon as API access is widely available. Once a model goes live, we typically have one-shot reasoning benchmarks up in ~8 hours and comprehensive agentic/combined benchmarks up after 24-48 hours. We're working on building relationships with each lab to have the results before launch.

Comment by knollimar 7 hours ago

wait why compare 2.6 to 2 instead of to 2.5?

Comment by gertlabs 4 hours ago

Good question. We missed that release entirely. Our automated model checker only went live 2 months ago so they were manually curated prior to that. I'm adding it now. It'll be live in ~12 hours.

Comment by cmrdporcupine 10 hours ago

Surprised to see such variance per language

Comment by gertlabs 9 hours ago

It's interesting; I can only speculate as to the underlying reason. When given enough time, models outperform in Rust/C++ in longer agentic tasks, and actually perform worst in Python. For tasks that aren't judged on code speed. https://gertlabs.com/?mode=agentic_coding

Comment by elfbargpt 17 hours ago

I've always been surprised Kimi doesn't get more attention than it does. It's always stood out to me in terms of creativity, quality... has been my favorite model for awhile (but I'm far from an authority)

Comment by Aeolun 16 hours ago

It’s good, but it’s not quite Claude level. And their API has constant capacity issues.

Price/quality is absolutely bonkers though. I loaded $40 a few weeks/months ago and I haven’t even gone through half of it.

Comment by segmondy 5 hours ago

It has long been Claude level since 2.5

Comment by atemerev 15 hours ago

Why use China model API from China if there are many independent providers available via Openrouter?

Comment by smashed 15 hours ago

Openrouter will route to china hosted models when there are US hosted providers of the same model. Is there a setting to set your preference or to blacklist providers like alibaba cloud for example?

I use OpenCode and the openrouter provider. From opencode I only select the model like kimi-2.6 and have no way of selecting which cloud hosting will receive my request.

Comment by subscribed 15 hours ago

Settings > Guardrails > [your workspace] > Providers + Block provider

Comment by uneekname 15 hours ago

Yes, you can blacklist providers in OpenRouter account settings.

Comment by NitpickLawyer 15 hours ago

Yes, you can globally ban providers in your openrouter settings.

Comment by pheggs 15 hours ago

to support the companies that open source their models

Comment by culi 16 hours ago

It's also one of the few models that seem capable of drawing an SVG clock

https://clocks.brianmoore.com/

Comment by SwellJoe 16 hours ago

Interesting that the best performers are all Chinese-made models (DeepSeek and Qwen also perform consistently well). I wonder if there's more focus on vision and illustration in their training, or if something else is leading to their clear lead on this one test.

Comment by sigmoid10 16 hours ago

Is it? In your link it definitely failed to draw the clock.

Comment by squarefoot 15 hours ago

It redraws it every minute, and some models give quite different results although the prompt is exactly the same.

Comment by quesera 13 hours ago

This reads like satire, but I've been feeling that a lot lately.

Comment by dryarzeg 16 hours ago

I'm not really sure how this works, but I stayed on the page for a while, and then it reloaded and all clocks changed. I guess there's either a collection of different clocks generated by models, or maybe they're somehow generated in the real time, but the fact is what you see is not necessarily what I see.

Comment by culi 12 hours ago

It reruns a prompt every minute to all the models included. Everyone is gonna see something different but I've spent too long on it and there's a consistent pattern of Qwen and Kimi outperforming the others

This site was made months ago and it seems its only been updated with the latest model of a couple of the providers so keep in mind that many of the Chinese models haven't been updated

Comment by sigmoid10 16 hours ago

Seems like it regenerates them to reflect the current time. Funny to see how some models (like Kimi and Deepseek) sometimes get it right and other times fail miserably on the level of ancient models like GPT 3.5.

Comment by gunalx 15 hours ago

It reruns the prompt every minute.

Comment by regularfry 17 hours ago

Dirt cheap on openrouter for how good it is, too. Really hoping that 2.6 carries on that tradition.

Comment by twotwotwo 16 hours ago

Kagi has it as an option in its Assistant thing, where there is naturally a lot of searching and summarizing results. I've liked its output there and in general when asked for prose that isn't in the list/Markdown-heavy "LLM style." It's hard to do a confident comparison, but it's seemed bold in arranging the output to flow well, even when that took surgery on the original doc(s). Sometimes the surgery's needed e.g. to connect related ideas the inputs treated as separate, or to ensure it really replies to the request instead of just dumping info that's somehow related to it.

Comment by spaceman_2020 13 hours ago

I remember when the first K2 dropped

It was the best creative writer by some distance

Comment by varispeed 16 hours ago

Maybe because it's a bit of like unleashing a chaos monkey on your codebase? I tried it locally (K2.5 72B) and couldn't get anything useful.

Comment by KaoruAoiShiho 16 hours ago

Huh, that's not a thing?

Comment by johndough 16 hours ago

The parent poster is probably referring to Kimi-Dev-72B¹, which is a much smaller and older model, while people are probably more familiar with the big and fairly powerful 1100B Kimi-K2.5².

[1] https://huggingface.co/moonshotai/Kimi-Dev-72B

[2] https://huggingface.co/moonshotai/Kimi-K2.5

Comment by natrys 16 hours ago

Yes it was good for its time, but 10 months old now which is a long time ago in this space. It was also a fine-tune (albeit a good one) of Qwen-2.5 72B.

I wish they did more smaller models. Kimi Linear doesn't really count, it was more of a proof of concept thing.

Comment by kburman 15 hours ago

Has anyone here used Kimi for actual work?

I tried it once, although it looks amazing on benchmarks, my experience was just okay-ish.

On the other hand, Qwen 3.6 is really good. It’s still not close to Opus, but it’s easily on par with Sonnet.

Comment by rubslopes 10 hours ago

Before GLM-5.1, I was going back and forth between Opus 4.5 and Kimi 4.5 and having very good results with Kimi.

Comment by try-working 10 hours ago

I've used Kimi K2.5 when I run out of Codex quota. It does small and medium things OK. But if I work on complex things, I'll later have to spend two days cleaning up the mess with Codex. Hopefully 2.6 does better.

Comment by deanc 15 hours ago

Yes. You’re using Kimi if you use the composer-2 model in cursor. It’s great. Plan in state of the art. Execute in composer-2

Comment by nickandbro 17 hours ago

Wow, if the benchmarks checkout with the vibes, this could almost be like a Deepseek moment with Chinese AI now being neck and neck with SOTA US lab made models

Comment by ai_fry_ur_brain 16 hours ago

[flagged]

Comment by otabdeveloper4 16 hours ago

> Its not anywhere close

Close to what, and how are you measuring?

> nobody in the USA would be spending 7 figures on infrastructure for it

Au contraire, if AI had a moat it would pay for itself. They're funneling capital into infrastructure because they know it can't.

Comment by fragmede 16 hours ago

You need the infrastructure to train and run it regardless though. Kimi is great but I'm not getting the same performance from it running it on my MacBook or a 3090 as it running on a H100 or a Grace Hopper supercomputer. Pretend you did have said moat. Why wouldn't you also books infrastructure to run it on?

Comment by otabdeveloper4 1 hour ago

> Why wouldn't you also books infrastructure to run it on?

No, you wouldn't be using venture capital to overprovision your AI a hundredfold if selling AI was the end goal.

Comment by jstummbillig 16 hours ago

What?

Comment by motoboi 17 hours ago

With the previous generation? Yes. With 10T mythos-level models? Not even close.

Comment by amazingamazing 17 hours ago

The psyop continues. Mythos until it’s released is vaporware. Notice how you can try kimi 2.6. Where is the same for mythos?

Comment by jstummbillig 15 hours ago

At this point it seems more like the result of a psyop to presume that a new anthropic model should be considered vaporware until released.

Comment by 15 hours ago

Comment by fragmede 16 hours ago

It's been released to "select partners".

Comment by atemerev 15 hours ago

Yeah, Crowdstrike among them. Clearly experts in this "security" thing, given what happened during the last incident...

Comment by cyanydeez 10 hours ago

Yeah, people who would look stupid if they said the king had no clothes.

Comment by lbreakjai 16 hours ago

I've got a 12T model on my machine, built it myself. It's called Mytho. Too dangerous to even release a fact sheet about it. It can hack into the mainframe, enhance ultra-compressed images, grow your hair back, and make people fall in love with you.

Comment by ChrisLTD 17 hours ago

Mythos isn't the current generation, it's literally vaporware.

Comment by cyanydeez 10 hours ago

I doubt it's literal vaporware. It's likely just a variant of whatever model they just generally released with some fancy prompt and a highr quant.

Comment by jollymonATX 17 hours ago

According to the benchmarks, you are wrong. It is on track and slightly above some sota. Just the benchmarks speaking there, they can be/are gamed by all big model labs including domestic.

Comment by bestouff 17 hours ago

There's no public data about Mytho.

Comment by maplethorpe 17 hours ago

That's because it would be too dangerous to release.

Comment by cedws 17 hours ago

My girlfriend goes to a different school, you wouldn't know her.

Comment by squarefoot 17 hours ago

Same for teleport, time travel and warp drive.

Comment by rockinghigh 14 hours ago

They could release data to back up that claim.

Comment by nisegami 17 hours ago

So is my P=NP proof.

Comment by irthomasthomas 17 hours ago

10T? Impossible! They told us the training run was under 10^26 flops.

Comment by mistercheph 16 hours ago

Mythos doesnt exist

Comment by pheggs 14 hours ago

mythos is a mythos

Comment by sergiotapia 16 hours ago

mythos is vaporware right now, what are you talking about?

Comment by motoboi 12 hours ago

Well, the model that a lot of people have been given access to and are reporting about on twitter?

Comment by m4rkuskk 16 hours ago

I have been testing it in my app all morning, and the results line up with 4.6 Sonnet. This is just a "vibe" feeling with no real testing. I'm glad we have some real competition to the "frontier" models.

Comment by mchusma 15 hours ago

it feels like between K2.6 and GLM5.1 we have Sonnet level intelligence at roughly Haiku level pricing. Which is great.

I'm hoping that Anthropic will be able to release an updated Haiku soon and they really need something that is 1/3-1/5 the price of Haiku to compete with the truly cheaper models (Gemma-4 is really good at this range).

Comment by XCSme 15 hours ago

In my tests[0] it does only slightly better than Kimi K2.5.

Kimi K2.6 seems to struggle most with puzzle/domain-specific and trick-style exactness tasks, where it shows frequent instruction misses and wrong-answer failures.

It is probably a great coding model, but a bit less intelligent overall than SOTAs

[0]: https://aibenchy.com/compare/moonshotai-kimi-k2-6-medium/moo...

Comment by deepsquirrelnet 13 hours ago

I tried it on openrouter and set max tokens to 8192, and every response is truncated, even in non-thinking mode. Maybe there's an issue with the deployment, but in your link also shows it generates tons of output tokens.

Comment by XCSme 13 hours ago

Oh yeah, I just noticed, like 3x the reasoning tokens.

Comment by ninjahawk1 14 hours ago

I often wonder if in the future, the same way early computers used to take up an entire room but now fit in your pocket, if in the future the equivalent of a data center will be a single physical device like a phone nowadays. And if that’s the case, would it happen much quicker since technology has been speeding up year by year?

Comment by gpm 13 hours ago

> And if that’s the case, would it happen much quicker since technology has been speeding up year by year?

I wouldn't expect this.

Historically we've had a roughly exponential rate of shrinkage. If we keep that same exponential going, we should expect the amount of time to shrink "room full of compute" to "pocket full of compute" to be equal.

And recently we've fallen behind that exponential rate of shrinkage. And this is rather expected because exponentials are basically never sustainable rates of growth.

I still expect that technological progress is getting faster year by year, and that we're still shrinking compute, but that's not necessarily enough for the next shrinking to take less time than when we had exponential progress on shrinking.

Comment by Flux159 12 hours ago

There’s some early work being done here by companies looking at making LLM ASICS like Taalas (HC1 gets 17k t/s for llama 8b - currently at 2.5kW which is closer to a single server, but this is their first chip).

There’s other options like photonic computing which might be able to reduce power significantly but are still in research as far as I can tell. Because so much money is invested in AI & traditional gpu inference is so power hungry, I would expect significant improvements in this space quickly.

Comment by candl 16 hours ago

Are there any coding plans for this? (aka no token limit, just api call limit). Recently my account failed to be billed for GLM on z.ai and my subscription expired because of this... the pricing for GLM went through the roof in recent months, though...

Comment by ankit70 5 hours ago

You can use $20 pro plan on Ollama or $10 one on OpenCode Go. Both has Kimi 2.6 live. https://opencode.ai/go https://ollama.com/pricing

Comment by wolttam 15 hours ago

Kimi has their own subscription that works basically the same as all the others.

https://www.kimi.com/code

Comment by fg137 14 hours ago

At $19/month, hard to see why I want to use Kimi over Claude.

Comment by phainopepla2 12 hours ago

Claude usage at $20 is basically unusable for serious work. I haven't used Kimi but I'd have to imagine they're offering a good deal more usage for the same price.

Comment by plutokras 12 hours ago

They tick all the boxes I care about – desktop & mobile app, cli – but for the same price I might as well just go for the leading providers.

Comment by cute_boi 12 hours ago

for similar plan i think claude costs like $100 a month?

Comment by randomtoast 14 hours ago

Because Opus on $20 CC is a joke. The $19 plan on Kimi has actually workable usage limits.

Comment by sixhobbits 14 hours ago

I tried it out with my normal mixed-up wolf, goat, cabbage problem and it couldn't solve it. Sonnet 4.6 also can't, but Opus 4.7 has no problems.

Details here [0]

[0] https://techstackups.com/comparisons/kimi-2.6-vs-opus-4.7-an...

Comment by mariopt 17 hours ago

Really excited to try this one, I've been using kimi 2.5 for design and it's really good but borderline useless on backend/advanced tasks.

Also discovered that using OpenCode instead of the kimi cli, really hurts the model performance (2.5).

Comment by lbreakjai 17 hours ago

I have a subscription through work, I've been trialing it, so far it looks on par, if not better, than opus.

Comment by pt9567 17 hours ago

wow - $0.95 input/$4 output. If its anywhere near opus 4.6 that's incredible.

Comment by corlinp 17 hours ago

This should erase any doubt that AI Labs are making $$$ on API inference.

Kimi 2.5 (which this is based on) is served at $0.44 input / $2 output by a ton of different providers on OpenRouter, 2.6 will certainly be similar.

That's about 11X less than Opus for similar smarts.

Comment by gessha 12 hours ago

It’s worth noting that the US is very behind on energy infra and that might affect the cost calculations since data centers are electricity guzzlers. Also, not sure if CN has completely switched off Nvidia or still using them for training.

Comment by Lalabadie 16 hours ago

Famously, OpenAI and Anthropic are devoted to increasing efficiency before scaling up resource usage.

Comment by amazingamazing 16 hours ago

How does it erase any doubt? You’re implying Chinese things can’t be actually cheaper to produce than American which is laughable

Comment by corlinp 15 hours ago

Most of those inference providers are American, and China is actually at a disadvantage here because of export restrictions - US companies are using newer and more efficient chips.

Comment by amazingamazing 14 hours ago

If it’s newer and efficient then why is the api more expensive?

Comment by veber-alex 9 hours ago

Price is set based on what people are willing to pay not based on actual costs.

Comment by dmix 16 hours ago

I'm pretty Kimi is what Cursor uses for their "composer 2" model. Works pretty good as a fallback when Claude runs out, but definitely a downgrade.

Comment by arcanemachiner 15 hours ago

It's a Kimi K2.5 finetune, there was some drama about this a few weeks ago.

Comment by dmix 11 hours ago

What was the drama about?

Comment by arcanemachiner 11 hours ago

They were not open about the fact that it was trained on Kimi K2.5. This link explains it better than I can:

Comment by 59nadir 11 hours ago

Cursor seemingly went out of their way to not mention that they were actually running Kimi K2.5 and essentially by that omission made it seem like they had made their own model. They added a note to a blog post about using it at some point and then when they wrote a new one they conveniently left it out again.

That's at least what I perceived as "the drama".

Comment by Alifatisk 14 hours ago

Damn it, they stopped offering Kimmmmy. Their sales ai agent which allowed you to bargain for lower subscription prices.

Comment by 16 hours ago

Comment by rane 12 hours ago

Added support for Kimi in https://github.com/raine/claude-code-proxy and it does appear to work surprisingly well with Claude Code, although the usage limit for the entry tier doesn't seem as generous as I'd have expected.

Comment by irthomasthomas 17 hours ago

Beats opus 4.6! They missed claiming the frontier by a few days.

Comment by NitpickLawyer 17 hours ago

While I'm skeptical of any "beats opus" claims (many were said, none turned out to be true), I still think it's insane that we can now run close-to-SotA models locally on ~100k worth of hardware, for a small team, and be 100% sure that the data stays local. Should be a no-brainer for teams that work in areas where privacy matters.

Comment by cedws 17 hours ago

Even the smaller quantized models which can run on consumer hardware pack in an almost unfathomable amount of knowledge. I don't think I expected to be able to run a 'local Google' in my lifetime before the LLM boom.

Comment by sterlind 15 hours ago

I'm extremely curious how these models learn to pack a lossily-compressed representation of the entire Internet (more or less) into a few hundred billion parameters. like, what's the ontology?

Comment by osti 17 hours ago

I think this one is only about 600GB VRAM usage, so it could fit on two mac studios with 512GB vram each. That would have costed (albeit no longer available) something like less than 20k.

Comment by NitpickLawyer 16 hours ago

Yeah, but that's personal use at best, not much agentic anything happening on that hardware. Macs are great for small models at small-medium context lengths, but at > 64k (something very common with agentic usage) it struggles and slows down a lot.

The ~100k hardware is suitable for multi-user, small team usage. That's what you'd use for actual work in reasonable timeframes. For personal use, sure macs could work.

Comment by osti 14 hours ago

True, but I think for local models, we are mostly considering personal usage.

Comment by zozbot234 16 hours ago

You could run it with SSD offload, earlier experiments with Kimi 2.5 on M5 hardware had it running at 2 tok/s. K2.6 has a similar amount of total and active parameters.

Comment by osti 14 hours ago

Yeah... I would definitely call 2t/s unusable. For simple chats, I'd want at least 15 t/s. For agentic coding (which this model is advertised for), I'd want good prefill performance as well.

Comment by veber-alex 9 hours ago

That's just throwing money away. The performance with large context would have been unusable especially if you need to serve more then a single person.

Comment by pixel_popping 16 hours ago

It doesn't beat Opus 4.6, no way, don't be fooled by benchmarks.

Comment by BoorishBears 17 hours ago

Opus is clearly a sidegrade meant to help Anthropic manage cost, so I would say they may have it if it actually beats 4.6

Comment by irthomasthomas 17 hours ago

Could be right. I just noticed my feed is absent the usual flood of posts demoing the new hotness on 3D modeling, game design and SVG drawings of animals on vehicles.

Comment by verdverm 17 hours ago

https://huggingface.co/moonshotai/Kimi-K2.6

Is this the same model?

Unsloth quants: https://huggingface.co/unsloth/Kimi-K2.6-GGUF

(work in progress, no gguf files yet, header message saying as much)

Comment by SwellJoe 16 hours ago

A trillion parameters is wild. That's not going to quantize to anything normal folks can run. Even at 1-bit, it's going to be bigger than what a Strix Halo or DGX Spark can run. Though I guess streaming from system RAM and disk makes it feasible to run it locally at <1 token per second, or whatever. GLM 5.1, at 754B parameters, is already beyond any reasonable self-hosting hardware (1-bit quantization is 206GB). Maybe a Mac Studio with 512GB can run them at very low-bit quantizations, also pretty slowly.

Comment by justinclift 9 hours ago

Looks like it. This quant ( https://huggingface.co/inferencerlabs/Kimi-K2.6-MLX-3.6bit ) says:

> Q3.6 typically achieves useable accuracy in our coding test and fits within a 512GB memory budget

This one ( https://huggingface.co/mlx-community/Kimi-K2.6-MoE-Smart-Qua... ) though says it fits on a 192GB mac:

> M3/M4 Ultra 192GB+ (fits in ~150GB)

Comment by jauntywundrkind 15 hours ago

A huge dual socket Epyc system used to be able to get to 1TB without difficulty. 16 dimms of 64gb each. Doable for ~$3000. With considerable memory bandwidth.

Our hope these days seems to be that maybe perhaps possibly High Bandwidth Flash works out. Instead of 4, 8, or maybe more for some highest end drives, having many many many dozens of channels of flash.

Ideally that can be very very near to the inference. PCIe 7.0 is 0.5Tb/s at 16x which is obviously nowhere remotely near enough throughout here. The difficulty is sort of that nand has been trying to be super dense, so as you scale channels you would normally tend to scale nand capacity too, and now instead of a 2tb drive you have a 200tb drive prices way beyond consumer means. Still, I think HBF is perhaps the only shot of the most important thing in computing going from mainframe back to consumer, and of course the models are going to balloon again if this dies hit, probably before consumers ever get a chance.

Comment by segmondy 5 hours ago

You can't buy 16 64gb dimms for $3000. Go shop memory prices again. But yes an old epyc can run this with no GPU at reasonable speed and if you throw a few GPUs you can get very manageable speed. I run this at home on an old system PCIe4, slow 2400mhz ddr4 ram and still getting about 13tk/sec

Comment by Balinares 17 hours ago

Quite curious how well real usage will back the benchmarks, because even if it's only Opus ballpark, open weights Opus ballpark is seismic.

Comment by gpm 16 hours ago

Huh, so the metadata says 1.1 trillion parameters, each 32 or 16 bits.

But the files are only roughly 640GB in size (~10GB * 64 files, slightly less in fact). Shouldn't they be closer to 2.2TB?

Comment by johndough 16 hours ago

The bulk of Kimi-K2.6's parameters are stored with 4 bits per weight, not 16 or 32. There are a few parameters that are stored with higher precision, but they make up only a fraction of the total parameters.

Comment by gpm 16 hours ago

Huh, cool. I guess that makes a lot of sense with all the success the quantization people have been having.

So am I misunderstanding "Tensor type F32 · I32 · BF16" or is it just tagged wrong?

Comment by rockinghigh 14 hours ago

The MoE experts are quantized to int4, all other weights like the shared expert weights are excluded from quantization and use bf16.

Comment by liuliu 13 hours ago

I32 are 8 4-bit value packed into one int32.

Comment by coder543 15 hours ago

The description specifically says:

"Kimi-K2.6 adopts the same native int4 quantization method as Kimi-K2-Thinking."

Comment by throwaw12 15 hours ago

Beats Opus and Open Source?

I really hope this holds true in real world use cases as well and not only benchmarks. Congrats to Kimi team!

Comment by Topfi 15 hours ago

K2.6-code-preview was a minor, but noticeable jump, especially in a long running testing task and prior Moonshot releases have been the only models that I'd consider a suitably competitive replacement for Anthropic models. The way they approach tool calls, task inference and adherence is far closer than any other providers output, similar to how GLM models map far more closely to OpenAIs releases. Whether task adherence, task assessment, task evaluation or task inference, K2.5 got closer to Opus 4.5 than any other model (but was still behind overall).

I will have to test this full release of K2.6 but could see it serve as a very good overall drop-in replacement for Opus 4.5 and Opus 4.6 at 200k across the vast majority of tasks.

I will say however that Opus 4.7 Max 1M has been a very significant jump in performance for me, especially in tasks beyond 120k token where I'd argue it is now the most reliable model in continued task adherence and tool calling without compaction. Ironically, my initial experience was less than pleasant as on XHigh I found task adherence to have regressed even with less than 1/10th of the context window having been used.

Am very interested in K2.6s compaction strategy (which appears to be very simply all things considered) and how it performs beyond 100k tokens. As it stands, only OpenAI models have made compaction for long running tasks work well, though overall, GPT-5.4 is still inferior in my tests regardless of context window over other models such as Opus 4.6 1m and Opus 4.7 1m. Haven't gotten around to testing Opus 4.7 200k and will have to do this to properly assess K2.6 fairly, but I'd be very surprised if K2.6 truly beat Opus 4.7 200k given the jump I have experienced.

Comment by ttul 14 hours ago

Am I being paranoid in questioning whether the CPC would have something to gain by monitoring coding sessions with Chinese coding AI models? Coding models receive snippets of our intellectual property all day long. It's a bit of a gold mine, no?

Comment by throwaw12 14 hours ago

I think you should worry more about NSA, FBI, ICE and other 3 letter US agencies monitoring your sessions

Comment by ttul 13 hours ago

There's nothing anyone can do about state-level espionage anywhere, using any cloud-hosted service. That being said, there is a very big difference between the legal situation in the United States vs. China. Chinese internet companies are required to have CPC interaction and since the rule of law does not strictly exist in China, the state can compel surveillance cooperation regardless of what might be written down. If a three-letter agency is compelling Anthropic to open up its queries for inspection, that kind of surveillance would be authorized by law and if Anthropic violated the law in cooperating, they would suffer the consequences in civil court. Maybe not immediately, but at least the possibility exists.

In China, there's no recourse at all. Surveillance must be presumed.

Comment by LordDragonfang 9 hours ago

> the rule of law does not strictly exist in China, the state can compel surveillance cooperation regardless of what might be written down

While I agree that China is obviously worse in this regard, it's naive to claim this is unique to China, when literally a couple of months ago the US got into a fight with Anthropic about them not removing safeguards which were already just enforcing the letter of the law.

Comment by tw1984 6 hours ago

Rule of law in the US - are you kidding yourself?

When American citizens are being gunned down in public on cameras by US federal government agents, you are telling me that the US follows the rule of law?

Before you start to offer more propaganda, just tell me where is the killer of Renée Good, has that killer been arrested or charged yet? Keep your censored version of rule of law to yourself and your kids.

oh, btw, the current US President did got convicted for criminal offences, he walked away for free just because he got elected as the president. nice rule of law! what did he do recently - authorised illegal war against another country in which over 100+ school children got killed. Surely your fancy US rule of law is going to do something about this?

Comment by ttul 5 hours ago

It is understandable to feel frustrated when justice fails (and I wholeheartedly agree that justice failed all of us many times in relation to Trump), but I think it's a mistake to confuse those specific failures with a total collapse of the rule of law. The rule of law in the United States does not guarantee a perfect or utopian society; what it does provide is a crucial framework for accountability and transparency that simply does not exist in an authoritarian nation like China.

This difference is clear when we look at how the systems handle tragedy and power. In the U.S., the killing of Renée Good by an ICE agent led to a public release of video, intense scrutiny from an independent press, public condemnation by local officials, and a family using legal tools to seek justice. In China, that event would be immediately erased from the public consciousness, and those who dared to talk about it would face arrest. When the U.S. military bombs a school, human rights groups and journalists _can_ investigate, and members of Congress _can_ publicly demand answers (even if half of them are reluctant to question anything Trump does...). In China, military operations are complete state secrets. Furthermore, while it boils my blood to see Trump evade prison due to complex legal and constitutional questions, the fact that he was indicted and convicted by a jury of ordinary citizens proves that a functional legal apparatus exists outside of his direct control, something not utterly impossible under a dictatorship like China.

Day to day, the rule of law very much exists in the US. Doesn't mean we can just sleep on it, but compared to China, I take comfort in the level of institutional reliability that still exists in America (and I'm not even American).

Comment by tw1984 5 hours ago

you are defending a failed system purely based on your prejudice. let me get it straight to you -

1. Renée Good's killer is still free, never got arrested never charged. you can't just ignore such facts and cheap talk to prove the system works. the system completely failed to bring justice even after large scale public unrest. that by itself is the evidence - the failed system answers to no one.

2. Trump evade prison, everyone in the Epstein file evade prison. again, this happened in front of the entire world with extensive media coverage. you need to be extremely innovative to defend such systematic failures of the justice system.

how would you openly argue against such facts? just because you love the US and its systems? lol

Comment by rockinghigh 14 hours ago

Are there any protections from industrial espionage when using Anthropic, Cursor, Gemini, or OpenAI?

Comment by DonsDiscountGas 14 hours ago

There are legal protections, and those companies have more to lose by breaking those laws than following them. Same probably not true for Chinese companies.

Comment by throwaw12 13 hours ago

Legal protection, only if you're a billionaire and US citizen, for everyone else there is no protection.

Does US actually follow laws? They literally kidnapped head of another state and bombed another state and you are expecting legal protection from them?

Comment by nozzlegear 6 hours ago

You don't have to be a US citizen or live in the US to file a lawsuit against an American company in the US court system. Federal courts explicitly allow it under the "alienate jurisdiction" clause.

Comment by greenavocado 17 hours ago

I pray the benchmark figures are true so I can stop paying Anthropic after screwing me over this quarter by dumbing down their models, making usage quotas ridiculously small, and demanding KYC paperwork.

Comment by turblety 14 hours ago

Absolutely. Thing is, I'd actually rather take a worse model than Anthropic, so long as it's consistent. Like, a model that can successfully do well for 80% of tasks is much better than Anthropic that some days will be 90% other 60%.

When you have a consistent model, you can incorporate fixes/prompts into your workflow to make it behave better. But this, always having to guess if Anthropic has quantised the model today, wastes so much time and effort.

Comment by conradkay 14 hours ago

Codex has a lot better limits, and 5.5 will be out soon

Comment by jollymonATX 17 hours ago

Anthropic has done horrible PR and investors should be livid.

Comment by greenavocado 17 hours ago

My theory is they pushed retail off their systems to make room for their new corporate fat cat clients. In which case, they'll do just fine.

Comment by deaux 16 hours ago

> dumbing down their models,

This should be so easy to prove if it were true. Yet there is none of it, just vibes.

Still, your other two points are completely valid. The opaqueness of usage quotas is a scam, within a single month for a single model it can differ by more than 2x. And this indeed has been proven.

Comment by greenavocado 12 hours ago

> This should be so easy to prove if it were true.

https://github.com/anthropics/claude-code/issues/42796

https://scortier.substack.com/p/claude-code-drama-6852-sessi...

Comment by deaux 5 hours ago

First link is about the harness, Claude Code, defaulting to less thinking over time. This isn't "the model getting worse".

Second link is just a discussion of the first link.

Comment by Banditoz 17 hours ago

If the benchmarks are private, how do we reproduce the results? I looked up the Humanity's Last Exam (https://agi.safe.ai/) this model uses and I can't seem to access it.

Comment by johndough 16 hours ago

You can request access here: https://huggingface.co/datasets/cais/hle

The test data is purposely difficult to access to reduce the chance of leaking it into the training dataset.

Comment by swingboy 17 hours ago

Exciting benchmarks if true. What kind of hardware do they typically run these benchmarks on? Apologies if my terminology is off, but I assume they're using an unquantized version that wouldn't run on even the beefiest MacBook?

Comment by dogscatstrees 13 hours ago

This kimi website, it looks like a stylesheet from the 90's. They could learn a thing or two about typeface design. Steve Jobs would be incensed at this.

Comment by kristianp 5 hours ago

I prefer a website that has the first page of text visible almost immediately, with no glitches when fonts load, tbh.

Comment by 13 hours ago

Comment by 6 hours ago

Comment by dygd 15 hours ago

> Agent Swarms, Elevated: Match 100 Jobs and Generate 100 Tailored Resumes

Model seems quite capable, but this use-case is just yikes. As if interviewing isn't already a hellscape.

Comment by antirez 15 hours ago

Here I analyze the same linenoise PR with Kimi K2.6, Opus, GPT. https://www.youtube.com/watch?v=pJ11diFOjqo

Unfortunately the generation of the English audio track is work in progress and takes a few hours, but the subtitles can already be translated from Italian to English.

TLDR: It works well for the use case I tested it against. Will do more testing in the future.

Comment by OsamaJaber 13 hours ago

The modified MIT clause is sneakier than people think. Hit 100M users or $20M a month and you have to slap "Kimi K2.6" on your UI. That covers any consumer app worth building. Not really open, more like free until you matter. Llama pulled the same move

Comment by brightball 13 hours ago

The threshold for "worth building" is much lower than that for a lot of people.

Comment by Saline9515 12 hours ago

Attribution is a fair clause in opensource. What is the problem? You are making 20M$ a month thanks to their free work.

Comment by svachalek 13 hours ago

Worth building with VC capital maybe. A small team putting together an app that pulled in $20M per year should be pretty pleased with that.

Comment by throwaw12 12 hours ago

if you reach that numbers, kimi would be your least of worries

Comment by dcchambers 13 hours ago

I'll definitely put this into the "good problem to have" category.

Comment by risho 12 hours ago

in what way does this restrict how you are able to use the model?

Comment by codemog 13 hours ago

And the Kimi team broke the Anthropic ToS by training off Opus outputs and… nothing happened?

Comment by darksaints 12 hours ago

Nobody cares, nor should they. Anthropic broke nearly every ToS of every website that they scraped data from. The AI robber barons just want to monopolize intellectual property violations, and I'm gonna cheer on any robin hoods that take it back from them.

Comment by esafak 17 hours ago

K2.5 was already pretty decent so I would try this. Starting at $15/month: https://www.kimi.com/membership/pricing

edit: Note that you can run it yourself with sufficient resources (e.g., companies), or access it from other providers too: https://openrouter.ai/moonshotai/kimi-k2.6/providers

Comment by pbowyer 16 hours ago

What's the privacy/data security like? I can't find that on that page.

Edit: found it.

> We may use your Content to operate, maintain, improve, and develop the Services, to comply with legal obligations, to enforce our policies, and to ensure security. You may opt out of allowing your Content to be used for model improvement and research purposes by contacting us at membership@moonshot.ai. We will honor your choice in accordance with applicable law.

Section 3 of https://www.kimi.com/user/agreement/modelUse?version=v2

Comment by gpm 16 hours ago

> We will honor your choice in accordance with applicable law.

So in other words only if you can point to a local law which requires them to comply with the opt out?

Comment by jdasdf 15 hours ago

most laws enforce agreements.

Comment by gpm 15 hours ago

Yes... but the agreement only says they won't train on your data if the law is already preventing them from doing so.

Comment by pixel_popping 16 hours ago

You really rely on ToS from Anthropic/OpenAI to know if they use your prompts or not? It's on their servers, why wouldn't they use our data?

Comment by veber-alex 9 hours ago

Antropic and OpenAI are used by US businesses and government and they are audited and under contracts.

If it's discovered they trained on data they shouldn't have had it will be the end of their business.

On the other hand, good luck suing a Chinese company.

Comment by pixel_popping 1 hour ago

Not at all, Google/Meta... got caught all the time, where do you see it's the end of their business?

Comment by deaux 16 hours ago

Yup, they train on your inputs and OpenRouter is complicit by claiming that Moonshot's ToS says that they don't. Contacted OpenRouter about this a while ago and was met with silence because it's bad for their business to stop lying about it.

Comment by SwellJoe 16 hours ago

"sufficient resources" is going to be a lot of resources. I doubt this will run on even something like a Strix Halo or DGX Spark, even at 1-bit quantization. You'll need a 256GB or 512GB Mac Studio, or a monster GPU situation, to run it locally, I think, though quantized versions aren't showing up yet, to be sure.

Comment by 15 hours ago

Comment by wg0 17 hours ago

How are the usage limits compared to Anthropic?

Comment by greenavocado 17 hours ago

Anthropic has the worst usage limits in the industry

Comment by andriy_koval 16 hours ago

gemini is worse imo

Comment by deaux 16 hours ago

You're correct, Gemini chat limits are a joke at their chapest paid tier compared to both Claude and GPT. Especially crazy when you consider Gemini 3 Pro is more than twice as cheap as Opus 4.6 on the API. It's hard to run into pure chat limits on Claude even if you only use Opus on the cheapest tier, whereas with Gemini it's easy to hit.

Not sure about coding usage, Google being weird about these things I could see that quota being separate.

Comment by gessha 12 hours ago

I’m not sure what A/B test you’re part of but on Claude Code Pro, I hit every single one of my quotas without exception. If you analyze/process images it’s even worse: I hit rate limits first and if I use separate sessions, I hit my quotas too. I use up so many tokens that Jensen should hire me.

Comment by deaux 5 hours ago

I specifically stated "chat" and "not sure about coding usage" but you're saying "Claude Code Pro".

Comment by cassianoleal 16 hours ago

If only their API wasn't tied to a Google or phone login...

Comment by jenkstom 15 hours ago

If it's open then there will be multiple providers. I see it is on OpenRouter now.

Comment by cassianoleal 15 hours ago

I'm going to experiment with this, but unless it's insanely more efficient in token usage than anything else I've tried, the only way to keep costs more or less acceptable is through a subscription.

Comment by atemerev 15 hours ago

Why use "their API"? It is an open model, use any provider on OpenRouter

Comment by wolttam 15 hours ago

Because sometimes (a lot of the time in my experience) third-party providers and inference engines fail to implement the model correctly in ways that are sometimes very subtle and not obvious.

Deepinfra for example is not preserving thinking correctly for GLM5.1, even though they are for GLM5. This is one of the more obvious issues that crop up.

Comment by 17 hours ago

Comment by thomasahle 12 hours ago

Does it run on Nvidia or Huawei?

Comment by nisegami 17 hours ago

The choice of example task for Long-Horizon Coding is a bit spooky if you squint, since it's nearing the territory of LLMs improving themselves.

Comment by jauntywundrkind 15 hours ago

I really wish some of these very-long-horizon runs were themselves open sourced (open released open access). Have the harness setup to do git committing automatically of the transcript and code, offload the git commit message making. Release it all.

This sounds so so so cool. It would be so amazing to see this unfurl:

> Kimi K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac. By implementing and optimizing model inference in Zig—a highly niche programming language—it demonstrated exceptional out-of-distribution generalization. Across 4,000+ tool calls, over 12 hours of continuous execution, and 14 iterations, Kimi K2.6 dramatically improved throughput from ~15 to ~193 tokens/sec, ultimately achieving speeds ~20% faster than LM Studio.

Comment by cmrdporcupine 16 hours ago

Running it through opencode to their API and... it definitely seems like it's "overthinking" -- watching the thought process, it's been going for pages and pages and pages diagnosing and "thinking" things through... without doing anything. Sitting at 50k+ output tokens used now just going in thought circles, complete analysis paralysis.

Might be a configuration or prompt issue. I guess I'll wait and see, but I can't get use out of this now.

Comment by sankalpmukim 2 hours ago

I think this kind of overthinking is an extremely common pattern in the Chinese models. GLM's models are also very much like this.

Comment by jbaiter 14 hours ago

Had the same experience using it for a refactor of a 3k LOC monolith via the Pi harness and OpenRouter. After burning through $8 worth of tokens it left the code in a broken state, the "thoughts" were full of loops where it would edit the monolith, then refer back to the original file, not finding it and then overwriting its changes with "git checkout --"

Comment by cmrdporcupine 13 hours ago

It's probably bad harness. I had a similar bad experience with qwen max yesterday also through opencode.

In the past I tried Kimi thru Claude code I might try that again

Comment by oliver236 17 hours ago

isnt this better than qwen?

Comment by Alifatisk 14 hours ago

We'll have to wait for the results on Artificial analysis

Comment by max2026 6 hours ago

[dead]

Comment by XCSme 16 hours ago

(commented on the wrong thread, HN doesn't let me delete it :( )

Comment by wizee 16 hours ago

They're comparing to Opus 4.6, not 4.5. It was Anthropic's best public model up until last week.