I have written gemma3 inference in pure C

Posted by robitec97 2 days ago


Comments

Comment by austinvhuang 3 hours ago

My first implementation of gemma.cpp was kind of like this.

There's such a massive performance differential vs. SIMD though that I learned to appreciate SIMD (via Highway) as one sweet spot of low-dependency portability that sits between C loops and the messy world of GPUs + their fat tree of dependencies.

If anyone wants to learn the basics, whip out your favorite LLM pair programmer and ask it to help you study the kernels in the ops/ library of gemma.cpp:

https://github.com/google/gemma.cpp/tree/main/ops
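
For reference, the "C loops" baseline being compared against is essentially a scalar matrix-vector product like the sketch below. This is my own illustration, not code from gemma.cpp; the real kernels in ops/ are more involved and use Highway's portable vector types.

    #include <stddef.h>

    /* Scalar baseline: y[i] = sum_j W[i*cols + j] * x[j], row-major W.
     * Loops like this dominate transformer inference time; SIMD kernels
     * replace the inner loop with wide vector multiply-accumulates. */
    void matvec_scalar(const float *W, const float *x, float *y,
                       size_t rows, size_t cols) {
        for (size_t i = 0; i < rows; ++i) {
            float acc = 0.0f;
            for (size_t j = 0; j < cols; ++j) {
                acc += W[i * cols + j] * x[j];
            }
            y[i] = acc;
        }
    }

A Highway version of that inner loop processes a full vector of lanes per iteration while staying portable across instruction sets, which is where most of the performance differential comes from.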

Comment by janwas 3 hours ago

:D Your code was nicely written and it was a pleasure to port to SIMD because it was already very data-parallel.

Comment by w4yai 3 hours ago

> It proves that modern LLMs can run without Python, PyTorch, or GPUs.

Did we need any proof of that?

Comment by jdefr89 3 hours ago

Python and PyTorch both call out to C libraries… I don’t get what he means by “proving LLMs can run without Python and PyTorch” at all. Seems like they don’t understand the fundamentals here…

Comment by jasonjmcghee 3 hours ago

I guess llama.cpp isn't quite as popular as I had assumed.

Comment by christianqchung 1 hour ago

A bizarre claim like that would be what happens when you let an LLM write the README without reading it first.

Comment by skybrian 3 hours ago

Knowing the performance is interesting. Apparently it's 1-3 tokens/second.

Comment by kgeist 3 hours ago

ik_llama.cpp is a fork of llama.cpp that specializes in CPU inference; some benchmarks from a year ago: https://github.com/ikawrakow/ik_llama.cpp/discussions/164

Comment by tolerance 3 hours ago

I imagine so regarding GPUs, right? If this is a legitimate project, then doesn’t it provide a proof of concept for the performance constraints that relate to them? Couldn’t the environmentally concerned take this as an indicator that the technology can progress without relying on as much energy as is potentially spent now? Shouldn’t researchers in the industry be thinking of ways to prevent the future capabilities of the technology from outrunning the capacity of the infrastructure?

I know very little about AI but these are things that come to mind here for me.

Comment by yorwba 3 hours ago

GPUs are more efficient than CPUs for LLM inference, using less energy per token and being cheaper overall. Yes, a single data center GPU draws a lot of power and costs a fortune, but it can also serve a lot more people in the time your CPU or consumer GPU needs to respond to a single prompt.
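
A rough way to see the batching argument is energy per token rather than peak power. All numbers in the sketch below are made-up placeholders (assumptions, not measurements), just to show the shape of the arithmetic:

    #include <stdio.h>

    int main(void) {
        /* Hypothetical data-center GPU: high power draw, but batched
         * inference gives very high aggregate throughput.
         * Illustrative numbers only. */
        double gpu_watts = 700.0, gpu_tokens_per_sec = 5000.0;

        /* Hypothetical CPU serving a single stream, as in this project. */
        double cpu_watts = 100.0, cpu_tokens_per_sec = 2.0;

        printf("GPU: %.3f J/token\n", gpu_watts / gpu_tokens_per_sec); /* ~0.14 */
        printf("CPU: %.1f J/token\n", cpu_watts / cpu_tokens_per_sec); /* ~50   */
        return 0;
    }

The absolute figures don't matter; the point is that aggregate throughput, not peak power draw, determines joules per token.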

Comment by tolerance 3 hours ago

I got you, thanks!


Comment by behnamoh 3 hours ago

but why tho? next gemma is coming and no one uses gemma 3 in prod anyway.

Comment by uncognic 3 hours ago

I think using /* */ for single-line comments is a pretty good indication.

Comment by NitpickLawyer 3 hours ago

> no one uses gemma 3 in prod anyway.

Umm, we do. It's still one of the best for support/help chatbot use cases in EU countries. It's got good (best?) multilingual support out of the box, it's very "safe" (won't swear, won't display Chinese characters, etc.), and it's pretty fast.

Comment by gunalx 2 hours ago

Yep. Before Gemma 3 we were struggling with multilinguality on smaller European languages, and it is still one of the better ones in that regard (even large open or closed models struggle with this to some extent). Gemma 3 is also still pretty decent multimodal-wise.

Comment by behnamoh 2 hours ago

but it lacks system prompt support.
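
For anyone hitting this: Gemma's chat template has no dedicated system role, so the usual workaround is to prepend the system instructions to the first user turn. A minimal sketch of that prompt assembly under that assumption (my own illustration, not code from this project):

    #include <stdio.h>

    int main(void) {
        /* Hypothetical strings for illustration. */
        const char *system_text = "You are a terse assistant.";
        const char *user_text = "Summarize the README.";

        /* Gemma-style turn markers: the "system prompt" is simply folded
         * into the first user turn before generation starts. */
        printf("<start_of_turn>user\n%s\n\n%s<end_of_turn>\n"
               "<start_of_turn>model\n",
               system_text, user_text);
        return 0;
    }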

Comment by data-ottawa 2 hours ago

Gemma 3 is probably the best-supported fine-tunable model.