Ternary Bonsai: Top Intelligence at 1.58 Bits
Posted by nnx 3 days ago
Comments
Comment by freakynit 6 hours ago
https://uklkyvetsjf7qt-80.proxy.runpod.net
./build/bin/llama-server \
-m ../Ternary-Bonsai-8B-Q2_0.gguf \
-ngl 999 \
--flash-attn on \
--host 0.0.0.0 \
--port 80 \
--ctx-size 65500 \
--batch-size 512 \
--ubatch-size 512 \
--parallel 5 \
--cont-batching \
--threads 8 \
--threads-batch 8 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--log-colors on
# llama.cpp is a fork: https://github.com/PrismML-Eng/llama.cpp.git
# The server can serve 5 parallel requests, with each request capped at around `13K` tokens...
# A few benchmarks I ran:
# 1. Input: 1001 tokens, TTFT: 0.3 seconds, output: 1618 tokens at ~140 t/s
# 2. Input: 9708 tokens, TTFT: 2.4 seconds, output: 2562 tokens at ~106 t/s
# VRAM usage was consistently at ~7 GiB.
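A quick sanity check on that ~13K per-request cap: it falls out of the flags above if the shared context is split evenly across the parallel slots (which is my assumption about how llama-server divides `--ctx-size`, not something verified against the fork):

```python
# --ctx-size 65500 shared across --parallel 5 slots
ctx_size = 65500
parallel = 5

# Per-slot context if the total is divided evenly between slots.
per_slot = ctx_size // parallel
print(per_slot)  # matches the "around 13K tokens" cap per request
```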
> https://huggingface.co/prism-ml/Ternary-Bonsai-8B-gguf/resol...
Comment by haellsigh 1 hour ago
Comment by freakynit 1 minute ago
Comment by opem 2 hours ago
Comment by sigmoid10 2 hours ago
Some more interesting tidbits from my go-to tests:
* Fails the car wash test (basic logic seems to be weak in general)
* Fails simple watch face generation in html/css.
* Fails the "how many Rs in raspberry" test (not enough cross-token training data), but will funnily assume you may be talking about Indian Rupees and tell you a lot about raspberry prices in India without being asked. Possible Indian training data imbalance?
* Flat out refuses to talk about Tiananmen Square when pushed directly - despite being from a US company. Again, perhaps it was exposed to some censored training data? Anyway, when slowly built up along the conversation by asking about locations and histories, it will eventually tell you about the massacre, so the censorship bias seems weak in general. Also has no problem immediately talking about anything Gaza/Israel/US or other sensitive topics.
* Happily tells you how to synthesize RDX, with a list of ingredients and the chemical process step by step. At least it warns you that it is highly dangerous and legally controlled in the US.
Comment by yorwba 1 hour ago
Comment by sigmoid10 1 hour ago
Comment by yorwba 1 hour ago
Comment by armanj 8 hours ago
In my results, Ternary-Bonsai-8B is on par with Qwen3.5-4B accuracy-wise. But in accuracy-per-byte, Bonsai is the clear winner:
=> Ternary-Bonsai-1.7B achieved 65.1% from 462 MiB, beating Qwen3.5-0.8B by 12 points while being ~5% smaller on disk.
=> Ternary-Bonsai-4B is the accuracy-per-byte winner above 1 GiB: 83.0% from only 1.1 GiB, within 2 points of Qwen3.5-4B at 40% of the weight size.
They show strong promise on edge devices and wherever disk space is limited. I think this lab is worth watching.
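To make the accuracy-per-byte framing concrete, here is the arithmetic on the figures quoted above ("points per GiB" is an ad-hoc metric for illustration, not one the lab reports):

```python
# Accuracy points per GiB of weights, using the numbers quoted above.
models = {
    "Ternary-Bonsai-1.7B": (65.1, 462 / 1024),  # 462 MiB converted to GiB
    "Ternary-Bonsai-4B":   (83.0, 1.1),
}
for name, (acc_pct, size_gib) in models.items():
    print(f"{name}: {acc_pct / size_gib:.1f} points/GiB")
```

The smaller model comes out roughly twice as dense in accuracy per byte, which is the usual pattern: small models spend a larger fraction of their capacity on broadly useful structure.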
Comment by philipp-gayret 1 hour ago
On my single Nvidia Spark I get 173.3 tokens/s with the baseline config and 372.4 tokens/s with added tuning/parallel options. Most notably, time to first token is incredibly low: similar models take ~6000 ms, while Bonsai was 70 ms (almost 100x reduction) with flash attention.
Having said all that, gemma4-e4b-q4km did much better for me, specifically for tool use and running agents, and I can reach 70% of the tokens/s with it on the same machine.
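For what it's worth, the speedups implied by those numbers work out as follows (my arithmetic on the figures above, not an independent benchmark):

```python
# Throughput gain from tuning/parallel options on the same hardware.
baseline_tps, tuned_tps = 173.3, 372.4
print(f"tuning gain: {tuned_tps / baseline_tps:.2f}x")

# TTFT compared with similar models ("almost 100x" is ~86x exactly).
ttft_other_ms, ttft_bonsai_ms = 6000, 70
print(f"TTFT reduction: {ttft_other_ms / ttft_bonsai_ms:.0f}x")
```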
Comment by swiftcoder 1 hour ago
Comment by mungoman2 21 minutes ago
If this model also beats the others in a fair comparison, that lends it even more credibility.
Comment by usernametaken29 6 hours ago
Comment by freakynit 6 hours ago
Comment by sbierwagen 5 hours ago
Comment by sally_glance 2 hours ago
Comment by freakynit 3 hours ago
I believe the answer lies in how "quickly" (and how?) we are able to learn, and then generalize those learnings as well. As of now, these models need millions of examples (at least) to learn, and are still not capable of generalizing what they learn to other domains. Human brains need hardly a few, and then generalize them pretty well.
Comment by londons_explore 2 hours ago
Modern LLMs similarly beat the human brain at lots of tasks in energy efficiency - mostly because the LLM can produce the answer in 1 second while the brain has to spend half an hour researching and drafting something.
Comment by eru 3 hours ago
Only when you look at stuff that the brain is specifically good at.
You can surpass the brain with even simple mechanical adders or an abacus in certain subdomains.
Comment by freakynit 3 hours ago
Comment by Animats 8 hours ago
(I've been reading the MMLU-Redux questions for electrical engineering. They're very funny. Fifty years ago they might have been relevant. The references to the Intel 8085 date this to the mid-1970s. Moving coil meters were still a big thing back then. Ward-Leonard drives still drove some elevators and naval guns. This is supposed to be the hand-curated version of the questions. Where do they get this stuff? Old exams?)
[1] https://github.com/aryopg/mmlu-redux/blob/main/outputs/multi...
Comment by yodon 9 hours ago
Comment by Animats 8 hours ago
Comment by londons_explore 2 hours ago
Hardware engineers realise that a compiler will almost always find some combination of gates which is smaller/faster than the contents of any table.
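A toy illustration of that point: a 1-bit full adder is a handful of gates, and the gate form reproduces its 8-entry truth table exactly; synthesis tools make this trade constantly. (Python standing in for HDL here, purely as a sketch.)

```python
# 1-bit full adder expressed as gates vs. an explicit lookup table.
def full_adder_gates(a, b, cin):
    s = a ^ b ^ cin                    # sum bit: XOR chain
    cout = (a & b) | (cin & (a ^ b))   # carry-out: majority via AND/OR
    return s, cout

# The same function as a brute-force 8-entry truth table.
table = {(a, b, c): ((a + b + c) & 1, (a + b + c) >> 1)
         for a in (0, 1) for b in (0, 1) for c in (0, 1)}

# The five-gate form covers every table entry.
assert all(full_adder_gates(*k) == v for k, v in table.items())
print("gates match table for all 8 inputs")
```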
Comment by AlotOfReading 6 hours ago
Comment by Taniwha 6 hours ago
Comment by zkmon 2 hours ago
Comment by mchusma 9 hours ago
I also have yet to see any of these at a larger scale. For example, can you try one of these at 100 billion parameters?
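For scale: ternary weights carry log2(3) ≈ 1.585 bits each, so a 100B-parameter ternary model would be on the order of 20 GB of raw weights (rough estimate, ignoring embeddings, packing overhead, and non-quantized layers):

```python
import math

params = 100e9
bits_per_weight = math.log2(3)  # ternary {-1, 0, +1}, hence "1.58 bits"
size_gb = params * bits_per_weight / 8 / 1e9
print(f"~{size_gb:.1f} GB of raw ternary weights")
```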
Comment by londons_explore 2 hours ago
That'll be the real game changer.
Comment by sigmoid10 1 hour ago
Comment by ericb 8 hours ago
If you got that into a couple gigs--what could you stuff into 20 gigs?
Comment by WatchDog 7 hours ago
Why aren't they comparing to 2/3/4 bit quants?
Comment by himata4113 7 hours ago
Comment by mstr_anderson 3 hours ago
Comment by syntex 3 hours ago
Comment by wmf 9 hours ago
Comment by Dumbledumb 8 hours ago
Comment by SwellJoe 7 hours ago
Nonetheless, the Prism Bonsai models are impressive for their size. Where it falls apart is with knowledge. It has good prose/logic for a tiny model, and it's fast even on modest hardware, but it hallucinates a lot. Which makes sense. You can't fit the world's data in a couple of gigabytes. But, as a base model for fine-tuning for use cases where size matters, it's probably a great choice.
Comment by happygoose 6 hours ago
Comment by est 3 hours ago
Can it be run on browsers with WASM/WebGPU?
Comment by TimorousBestie 6 hours ago
>> What are some names like Llewelyn?
> Some names like Llewelyn are Llewelyn, Llewelyn, Llewelyn, (repeats several times), and Llewelyn.
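Degenerate loops like this are usually handled at sampling time. A minimal sketch of the classic repetition-penalty idea (divide positive logits of already-seen tokens by a penalty, multiply negative ones) - this is the generic technique, not necessarily what this model's sampler does:

```python
def apply_repeat_penalty(logits, seen_tokens, penalty=1.3):
    # Tokens already generated become less likely to be picked again:
    # positive logits shrink, negative logits grow more negative.
    out = list(logits)
    for t in set(seen_tokens):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

logits = [2.0, 0.5, -1.0]   # toy 3-token vocabulary
print(apply_repeat_penalty(logits, seen_tokens=[0, 2]))
```

With the penalty applied, token 0 drops from 2.0 toward 1.54 and token 2 sinks further, so a looping token like "Llewelyn" gradually loses the argmax.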
Comment by gbgarbeb 6 hours ago
Comment by goofy_lemur 7 hours ago
Wow, if this is true, I am extremely impressed and excited!
I wonder how much better the KV cache usage is as well!