Launch HN: General Instinct (YC P26) – Frontier models on edge devices

Posted by guanming0717 4 days ago

Hey HN, Guanming and Bill here from General Instinct (https://general-instinct.com/).

After years of working in robotics, we kept running into the same problem: the best models never fit the hardware we actually had available.

The models that performed best were usually designed around datacenter assumptions: large GPUs, lots of memory bandwidth, and reliable network access. But most physical systems have the opposite constraints.

That led us down the path of figuring out how much of a frontier model could be preserved while still making it practical to run on edge hardware.

As part of that work, we recently open-sourced InstinctRazor (https://github.com/General-Instinct/InstinctRazor).

One result we're excited about is compressing Qwen3.5-122B-A10B, a roughly 245 GB BF16 MoE model, into a 48 GiB GGUF. The resulting model is actually smaller than Gemma-4-26B-A4B while outperforming it on benchmarks such as MMLU-Pro and GPQA-D. We preserve the parts that are always active (router, norms, Gated-DeltaNet/SSM layers, vision pathway, etc.) and quantize the routed experts much more aggressively. We then use on-policy distillation to recover capability lost during quantization.
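As a rough sanity check on those sizes, the sketch below converts them into an average number of bits per weight. It assumes the "122B" in the model name means roughly 122e9 total parameters and ignores GGUF metadata overhead; the function names are made up for illustration and are not part of InstinctRazor.

```python
# Back-of-the-envelope check of the sizes quoted in the post.
# Assumption: "122B" in the model name means ~122e9 total parameters.
TOTAL_PARAMS = 122e9
BF16_BYTES_PER_PARAM = 2
GIB = 2**30  # bytes in one gibibyte


def bf16_size_gb(params: float) -> float:
    """Size in decimal gigabytes of a model stored entirely in BF16."""
    return params * BF16_BYTES_PER_PARAM / 1e9


def average_bits_per_weight(file_size_gib: float, params: float) -> float:
    """Average bits per parameter implied by a file size given in GiB."""
    return file_size_gib * GIB * 8 / params


if __name__ == "__main__":
    print(f"BF16 size: {bf16_size_gb(TOTAL_PARAMS):.0f} GB")
    print(f"Average bits/weight at 48 GiB: "
          f"{average_bits_per_weight(48, TOTAL_PARAMS):.2f}")
```

Under these assumptions the BF16 size comes out at about 244 GB and the 48 GiB file at about 3.4 bits per weight on average. Because the always-active parts are preserved, the routed experts would have to sit below that average.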

The model can also run in a "small GPU" configuration where experts are streamed from system RAM. With an 8k context window, peak VRAM usage is around 7.6–8 GB.
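One common way to keep VRAM bounded while experts live in system RAM is a small least-recently-used cache of resident experts. The sketch below shows that idea in plain Python. It is a hypothetical illustration, not InstinctRazor's actual streaming mechanism: `ExpertCache`, `load_from_ram`, and the capacity are all invented names and values, and a dictionary stands in for host memory.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class ExpertCache:
    """LRU cache holding at most `capacity` experts in a fast tier.

    The fast tier stands in for VRAM. `load_from_ram` is a hypothetical
    callback that fetches an expert's weights from system RAM.
    """

    def __init__(self, capacity: int,
                 load_from_ram: Callable[[Hashable], object]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_from_ram = load_from_ram
        self._resident: OrderedDict = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: Hashable) -> object:
        """Return an expert's weights, loading and evicting as needed."""
        if expert_id in self._resident:
            # Mark as most recently used.
            self._resident.move_to_end(expert_id)
            self.hits += 1
            return self._resident[expert_id]
        self.misses += 1
        weights = self.load_from_ram(expert_id)
        self._resident[expert_id] = weights
        if len(self._resident) > self.capacity:
            # Evict the least recently used expert.
            self._resident.popitem(last=False)
        return weights


if __name__ == "__main__":
    host_ram = {i: f"weights-{i}" for i in range(8)}  # stand-in for RAM
    cache = ExpertCache(capacity=2, load_from_ram=host_ram.__getitem__)
    for expert_id in [0, 1, 0, 2, 1]:
        cache.get(expert_id)
    print(cache.hits, cache.misses)  # prints: 1 4
```

The resident-set size is what trades VRAM against transfer traffic: a larger capacity means fewer loads from RAM but a higher peak footprint.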

If you're interested in the technical details, we wrote up the approach here (https://general-instinct.com/blog/frontier-moe-sub-4-bit).

We're especially interested in hearing from people deploying models onto robots or other edge devices. What models are you trying to run locally today? What has been the biggest bottleneck in getting them into production?

Comments

Comment by BoorishBears 4 days ago

I like the technique described here around distillation to recover from quantization, but I don't understand why we keep performing lossy compression on LLMs then using benchmarks that were nearly saturated before post-training to measure the effects.

You could erase the gains from literally half the compute going into some of these recent models and barely make a dent in MMLU-Pro and GPQA-D.

Comment by debo_ 4 days ago

As an aside, General Instinct sounds like the name I'd give a megacorp in one of my cyberpunk ttrpg campaigns.

Comment by Terretta 4 days ago

In cyberpunk ad world, after the campaign the whole town is littered with GI tracts.

Comment by XenophileJKO 4 days ago

I'm still kind of surprised that people are targeting edge deployment of MoE models. By definition they optimize for computation cost at the expense of memory efficiency. We generally need the opposite on the edge.

I'm hoping to see more work in the other direction with cyclic/looped transformers and other memory dense approaches.
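The compute-versus-memory trade-off raised above can be put in rough numbers. The sketch below assumes that "122B-A10B" means about 122e9 total and 10e9 active parameters per token, uses the common approximation of 2 FLOPs per active parameter per generated token, and compares against a hypothetical dense 10B model at the same bit width. None of these figures come from the thread beyond the model name.

```python
GIB = 2**30  # bytes in one gibibyte


def weight_memory_gib(total_params: float, bits_per_weight: float) -> float:
    """Memory needed to hold all weights; scales with TOTAL parameters."""
    return total_params * bits_per_weight / 8 / GIB


def flops_per_token(active_params: float) -> float:
    """Rough forward-pass cost per token; scales with ACTIVE parameters.

    Assumption: about 2 FLOPs per active parameter per token.
    """
    return 2 * active_params


if __name__ == "__main__":
    bits = 4.0                      # assumed bit width for both models
    moe_total, moe_active = 122e9, 10e9
    dense_total = dense_active = 10e9  # hypothetical dense model

    mem_ratio = (weight_memory_gib(moe_total, bits)
                 / weight_memory_gib(dense_total, bits))
    flop_ratio = flops_per_token(moe_active) / flops_per_token(dense_active)
    print(f"memory ratio (MoE / dense): {mem_ratio:.1f}x")
    print(f"compute ratio (MoE / dense): {flop_ratio:.1f}x")
```

Under these assumptions the MoE costs about the same compute per token as the dense model but needs roughly 12 times the weight memory, which is the mismatch with edge constraints that the comment points at.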

Comment by a_t48 4 days ago

Hi Guanming/Bill. Would love to chat about what you're doing for actually running the models. I'm in a similar space, speeding up the `docker pull` component of inference deployment on edge devices (among other things!). If you're interested, shoot me an email at kyle@clipper.dev

Comment by gesai 4 days ago

Sorry if this is somewhat off-topic:

By my estimate, based on Bonsai's parameters/GB ratio, a model with that ratio and Gemma4:12b's size would have the nice number of 54.125b parameters (and could run on 16GB of RAM). Is any organization attempting something of this kind?

Comment by ilaksh 4 days ago

Yes, Google. They just released their Gemma 4 12b quant.

Comment by rdksu 4 days ago

Have you run ablations on how much on-policy distillation actually contributes to the performance? Just curious, as Unsloth-based mixed quantisation methods for MoE models are widely used and have a great reputation in the community.

Comment by VikRubenfeld 4 days ago

You've likely heard about this - he'd probably like to talk to you and might potentially give you some good PR.

https://www.youtube.com/watch?v=rAzT5lcezPs&t=467s

Comment by smokel 4 days ago

For those too lazy to watch someone talk on video for ages to make a point:

The link is to a famous YouTuber called PewDiePie, who uses a local LLM to parse his email and save time. They have an autoreply system and get notified about urgent matters.

Comment by guanming0717 4 days ago

Thanks for sharing! I'd love to chat with him. Would you be open to introducing us? :)

Comment by ilaksh 4 days ago

I assume PewDiePie runs something like DeepSeek 4 Flash on that rig.

Comment by rohansood15 4 days ago

Have you benchmarked against other 3-bit dynamic quants like Unsloth? I'm sorry, but this framing against a full-precision, newer, smaller MoE just seems misleading. Also, Gemma-4-26B-A4B is not the SOTA for edge. Even at launch, that would be the 31B.

Comment by guanming0717 4 days ago

Yes, I did, against other SOTA quant methods like HQQ, AWQ, etc. You can find more info in our blog :) https://general-instinct.com/blog/frontier-moe-sub-4-bit

Comment by rohansood15 4 days ago

I can't find it. Can you state your performance versus comparable 3-bit quantization from Unsloth/Bartowski? Edit: I appreciate that you seem to have open-sourced the quantization pipeline. This is not to question your work, but to understand where the outputs stand relative to the SoTA for quantization.

Comment by officialchicken 3 days ago

How many watts? How does it affect the power envelope?
