Trinity Large: An open 400B sparse MoE model

Posted by linolevan 22 hours ago


Comments

Comment by mynti 15 hours ago

They trained it in 33 days for ~$20M (which apparently includes not only the infrastructure but also salaries over a six-month period). And the model comes close to Qwen and DeepSeek. Pretty impressive.

Comment by zamadatix 1 hour ago

The price of training another same-class model keeps dropping through the floor, but training models that score much better seems to be hitting a brick wall.

E.g. gemini-3-pro tops the lmarena text chart today at 1488 vs 1346 for gpt-4o-2024-05-13. That's a win rate of 70% (where 50% is equal chance of winning) over 1.5 years. Meanwhile, even the open weights stuff OpenAI gave away last summer scores between the two.
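
Just for reference, that ~70% comes from the standard Elo expected-score formula; here's a quick sketch with the two ratings from the chart plugged in (nothing lmarena-specific about it):

    def elo_win_rate(rating_a: float, rating_b: float) -> float:
        """Expected score of A vs B under the standard Elo model."""
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

    # 1488 (gemini-3-pro) vs 1346 (gpt-4o-2024-05-13)
    print(round(elo_win_rate(1488, 1346), 3))  # ~0.694, i.e. roughly a 70% win rate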

The exception seems to be net new benchmarks/benchmark versions. These start out low and then either quickly get saturated or hit a similar wall after a while.

Comment by gwern 6 minutes ago

> E.g. gemini-3-pro tops the lmarena text chart today at 1488 vs 1346 for gpt-4o-2024-05-13. That's a win rate of 70% (where 50% is equal chance of winning) over 1.5 years. Meanwhile, even the open weights stuff OpenAI gave away last summer scores between the two.

Why do you care about LM Arena? It has so many problems, and the fact that no one would suggest using GPT-4o for doing math or coding right now, or much of anything, should tell you that a 'win rate of 70%' does not mean what it looks like it means. (Does GPT-4o solve roughly as many Erdos questions as gemini-3-pro...? Can it write roughly as good poetry?)

Comment by linolevan 22 hours ago

I'm particularly excited to see a "true base" model to do research off of (https://huggingface.co/arcee-ai/Trinity-Large-TrueBase).

Comment by mwcampbell 1 hour ago

Given that it's a 400B-parameter model, but it's a sparse MoE model with 13B active parameters per token, would it run well on an NVIDIA DGX Spark with 128 GB of unified RAM, or do you practically need to hold the full model in RAM even with sparse MoE?

Comment by timschmidt 1 hour ago

Even with MoE, holding the model in RAM while individual experts are evaluated in VRAM is a bit of a compromise. Experts can be swapped in and out of VRAM for each token. So RAM <-> VRAM bandwidth becomes important. With a model larger than RAM, that bandwidth bottleneck gets pushed to the SSD interface. At least it's read-only, and not read-write, but even the fastest of SSDs will be significantly slower than RAM.

That said, there are folks out there doing it. https://github.com/lyogavin/airllm is one example.
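
To put rough numbers on that bottleneck, here's a back-of-the-envelope sketch; the figures are my own assumptions (13B active parameters, 4-bit weights, no expert reuse between tokens, ballpark link speeds), not measurements of this model:

    # Worst-case streaming cost per token if every active expert has to be moved.
    ACTIVE_PARAMS = 13e9      # ~13B active parameters per token
    BYTES_PER_PARAM = 0.5     # 4-bit quantization
    bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM  # ~6.5 GB

    links_gb_per_s = {
        "PCIe 4.0 x16 (RAM -> VRAM)": 32.0,  # theoretical peak; real-world is lower
        "Fast NVMe SSD": 7.0,
    }

    for name, gb_per_s in links_gb_per_s.items():
        tokens_per_s = (gb_per_s * 1e9) / bytes_per_token
        print(f"{name}: ~{tokens_per_s:.1f} tokens/s (worst case, zero expert reuse)")

In practice expert caching and overlap help a lot, but the link speed still sets the ceiling.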

Comment by antirez 1 hour ago

It can run with mmap(), but it is slower. At 4-bit quantization the ratio between the model size and the RAM is decent, so with a fast SSD one could try it and see how it goes. However, when a model is 4-bit quantized there is often the doubt that it is no better than an 8-bit quantized model of 200B parameters; it depends on the model, on the use case, and so on. Unfortunately the road to local inference of SOTA models is being blocked by RAM prices and the GPU demand from the big companies, leaving us with little. Probably the best bet today is to buy Mac Studio systems and run distributed inference (MLX supports this, for instance), or a 512 GB Mac Studio M3 Ultra that costs around $13k.
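
A weights-only sizing sketch of that 4-bit-400B vs 8-bit-200B trade-off (my arithmetic; it ignores KV cache and runtime overhead):

    def weights_gb(params_billion: float, bits: float) -> float:
        # Bytes for the weights alone: params * bits / 8
        return params_billion * 1e9 * bits / 8 / 1e9

    print(f"400B @ 4-bit: ~{weights_gb(400, 4):.0f} GB")
    print(f"400B @ 8-bit: ~{weights_gb(400, 8):.0f} GB")
    print(f"200B @ 8-bit: ~{weights_gb(200, 8):.0f} GB")  # same footprint as 400B @ 4-bit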

Comment by notpublic 27 minutes ago

Speaking of RAM prices, you can still get a Framework Desktop with the Max+ 395 and 128 GB of RAM for ~$2,459 USD. They haven't raised the price on it yet.

https://frame.work/products/desktop-diy-amd-aimax300/configu...

Comment by Scipio_Afri 13 minutes ago

Pretty sure those used to be $1,999... but not entirely sure.

Comment by syntaxing 16 minutes ago

So refreshing to see open-source models like this coming from the US. I would love a ~100B-class one that can compete against gpt-oss-120b and GLM-4.5 Air.

Comment by Alifatisk 32 minutes ago

What did they do to make the loss drop so much in phase 3?

Also, why are they comparing with Llama 4 Maverick? Wasn’t it a flop?

Comment by QuadmasterXLII 13 minutes ago

You can't directly compare losses because they changed the data distribution for each phase (I think; it's 100% guaranteed they change the data distribution after the 10-trillion-token mark, since that's when they start adding instruction-following data, but I don't know for sure whether the other phase changes also involve data distribution changes).

Comment by frogperson 1 hour ago

What exactly does "open" mean in this case? Is it weights and data or just weights?

Comment by someotherperson 1 hour ago

It's always open weights.

Comment by jetpackjoe 43 minutes ago

It's never open data

Comment by jacquesm 38 minutes ago

Well, it is; it's your data to begin with, after all, but admitting that would create some problems.

Comment by linolevan 28 minutes ago

This model is sort of interesting since it seems to be using a lot of synthetic training data – but your point stands

Comment by greggh 1 hour ago

The only thing I question is the use of Maverick in their comparison charts. That's like comparing a pile of rocks to an LLM.

Comment by observationist 6 hours ago

This is a wonderful release.