My first impressions on ROCm and Strix Halo
Posted by random_ 2 days ago
Comments
Comment by spoaceman7777 2 days ago
Anyway. I wouldn't recommend following the steps posted in there. Poke around Google, or ask your friendly neighborhood LLM for advice on how to set up your Strix Halo laptop/desktop for the tasks described. A good resource to start with would probably be the Unsloth page for whichever model you are trying to run. (There are a few quantization groups competing for top place with GGUFs, and Unsloth is regularly at the top, with incredible documentation on inference, training, etc.)
Anyway, sorry to be harsh. I understand that this is just a blog for jotting down stuff you're doing, which is a great thing to do. I'm mostly just commenting on the fact that this is on the front page of hn for some reason.
Comment by pierrekin 2 days ago
Comment by rpdillon 2 days ago
Comment by fwipsy 2 days ago
Comment by fluoridation 2 days ago
Comment by adrian_b 2 days ago
For instance, the majority of desktops with DDR5 now have 4 channels, not 2, but the channels are narrower, so the total width of the memory interface is the same as before.
To avoid ambiguities, one should always write the width of the memory interface.
Most desktop computers and laptop computers have 128-bit memory interfaces.
The cheapest desktop computers and laptop computers, e.g. those with Intel Alder Lake N/Twin Lake CPUs, and also many smartphones and Arm-based SBCs, have 64-bit memory interfaces.
Cheaper smartphones and Arm-based SBCs have 32-bit memory interfaces.
Strix Halo and many older workstations and many cheaper servers have 256-bit memory interfaces.
High-end servers and workstations have 768-bit or 512-bit memory interfaces.
It is expected that future high-end servers will have 1024-bit memory interfaces per socket.
GPUs with private memory usually have memory interfaces between 192-bit and 1024-bit, but newer consumer GPUs usually have narrower memory interfaces than older consumer GPUs, to reduce cost. The narrower interface is compensated by faster memory, so the available bandwidth in consumer GPUs has increased much more slowly than the increase in GDDR memory speed would have allowed.
Comment by fluoridation 1 day ago
Source? I just looked up two random X870E boards from Gigabyte and both are dual channel.
>To avoid ambiguities, one should always write the width of the memory interface.
They're incomparable quantities. More channels support more parallel operations, while a wider bus at a constant frequency supports higher throughput.
The bus width is not even that useful of a metric. It's more useful to talk about bits per second, which is the product of bus width and frequency.
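As rough arithmetic (a sketch; DDR5-5600 on a 128-bit desktop bus is an assumed example, not a figure from the thread):

```python
# Peak DRAM bandwidth is bus width (bits) / 8 x transfer rate (MT/s),
# no matter how the bus is sliced into channels.
def peak_bandwidth_gbps(bus_width_bits: int, mt_per_s: int) -> float:
    """Peak transfer rate in GB/s (decimal)."""
    return bus_width_bits / 8 * mt_per_s / 1000

# A DDR5-5600 desktop with a 128-bit interface gets the same peak figure
# whether you describe it as 2x64-bit or 4x32-bit channels:
print(peak_bandwidth_gbps(128, 5600))  # 89.6 GB/s
```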
Comment by sliken 21 hours ago
For some workloads the extra channels help, despite having the same bandwidth. This is one of the reasons that it's possible for a DDR5 system to be slightly faster than a DDR4 system, even if the memory runs at the same speed.
Comment by fluoridation 19 hours ago
Uh, surely that depends on how the motherboard is wired. Just because each DIMM has half the pins on one channel and the other half on another, doesn't mean 2 DIMM = 4 channels. It could just be that the top pins over all the DIMMs are on one channel and the bottom ones are on another.
Comment by sliken 9 hours ago
I don't think you can wire up DDR5 DIMMs willy-nilly as if they were 2 separate 32-bit DIMMs.
Comment by fluoridation 7 hours ago
Comment by sliken 7 hours ago
So with 2 DIMMs or 4 you still get 128-bit wide memory. With DDR4 that means 2 channels x 64 bits each. With DDR5 that means 4 channels x 32 bits each.
Keep in mind that the memory controller is in the CPU; the motherboard's job is just to connect the right pins on the DIMMs to the right pins on the CPU socket. The days of an off-chip memory controller/north bridge are long gone.
So if you look at an AM5 CPU it clearly states:
* Memory Type: DDR5-only (no DDR4 compatibility).
* Channels: 2 Channel (Dual-Channel).
* Memory Width: 2x32-bit sub-channels (128-bit total for 2 sticks).
Comment by fluoridation 6 hours ago
>In computer manufacture speak dual channel = 2 x 64 bit = 128 bits wide.
Yes, because AMD64 has 64-bit words. You can't satisfy a 64-bit load or store with just 32 bits (unless you take twice as long, of course). That you get 4 32-bit subchannels doesn't mean you can execute 4 simultaneous independent 32-bit memory operations. A 64-bit channel capable of a full operation still needs to be assembled out of multiple 32-bit subchannels. If you install a single stick you don't get any parallelism with your memory operations; i.e. the system runs in single channel mode, the single stick fulfilling only a single request at a time.
Comment by sliken 4 hours ago
However, motherboard vendors annoyingly hide that from you by calling DDR4 dual channel (2 x 64 bit, which means two outstanding cache misses, one per channel) and then glossing over the difference by calling DDR5 dual channel too (4 x 32 bit, which means 4 outstanding cache misses).
> Yes, because AMD64 has 64-bit words.
It's a bit more complicated than that. First you have 3 levels of cache, the last of which on a miss triggers a cache line load, which is 64 bytes (not 64 bits). That goes to one of the 4 channels, and there's a long latency before the first 64 bits arrive. Then there are the complications of opening a row, which makes its columns available and can speed things up if you need more than one access to the same row. But the general idea is that you get at most one cache line per channel after waiting out the memory latency.
So DDR4 on a 128-bit system can have 2 cache lines in flight: 128 bytes per memory latency. On a DDR5 system you can have 4 cache lines in flight per memory latency. Sure, 32-bit channels have half the bandwidth per clock, but the trick is that the memory bus spends most of its time waiting on memory to start a transfer. So waiting 50 ns and then getting 32 bits @ 8000 MT/s isn't that different from waiting 50 ns and getting 64 bits @ 8000 MT/s.
Each 32 bit subchannel can handle a unique address, which is turned into a row/column, and a separate transfer when done. So a normal DDR5 system can look up 4 addresses in parallel, wait for the memory latency and return a cache line of 64 bytes.
Even better when you have something like Strix Halo, which actually has a 256-bit wide memory system (twice any normal tablet, laptop, or desktop), but also has 16 channels x 16 bits, so it can handle 16 cache misses in flight. I suspect this is mostly to keep its aggressive iGPU fed.
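The cache-misses-in-flight argument can be sketched numerically (a toy model; the 50 ns latency is an assumed figure, and real controllers overlap more misses per channel via banking):

```python
CACHE_LINE_BYTES = 64

# Toy model: for latency-bound (pointer-chasing) workloads, throughput is
# capped by independent cache-line fetches in flight, one per channel,
# rather than by raw bus width.
def random_access_gbps(channels: int, latency_ns: float) -> float:
    lines_per_second = channels / (latency_ns * 1e-9)
    return lines_per_second * CACHE_LINE_BYTES / 1e9

ddr4_desktop = random_access_gbps(channels=2, latency_ns=50.0)   # 2x64-bit
ddr5_desktop = random_access_gbps(channels=4, latency_ns=50.0)   # 4x32-bit
strix_halo   = random_access_gbps(channels=16, latency_ns=50.0)  # 16x16-bit
print(ddr4_desktop, ddr5_desktop, strix_halo)  # roughly 2.56, 5.12, 20.48 GB/s
```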
Comment by sliken 21 hours ago
Yes, but tablets, laptops, and normal (non-HEDT) desktops have 4 channels: 4 x 32 bit = 128 bits wide. Modern DDR5 puts two 32-bit channels on a 64-bit DIMM; the previous-gen DDR4 put one 64-bit channel on a 64-bit DIMM.
So Strix Halo (on laptops, tablets, and desktops) allows for a 256-bit wide memory system, providing twice the memory bandwidth of any Ryzen or Intel i3/i5/i7/i9. The Apple Pro (256 bit), Max (512 bit), and Ultra (1024 bit) lines of Apple silicon also have greater-than-128-bit memory systems. On the AMD side it's just Threadripper (256 bit) and Threadripper Pro (512 bit), but those are typically in workstations that are physically large, expensive, and need substantial cooling.
So Strix Halo is pretty unique (outside of Apple) in providing twice the memory bandwidth of anything else that fits in the tablet, laptop, or small-desktop category.
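The width arithmetic behind that claim, assuming LPDDR5X-8000 for both systems (a plausible rate, used here only to isolate the effect of bus width):

```python
def gbps(width_bits: int, mt_per_s: int) -> float:
    # bytes per transfer x transfers per second, in GB/s
    return width_bits / 8 * mt_per_s / 1000

mainstream = gbps(128, 8000)   # typical Ryzen/Core laptop or desktop: 128.0
strix_halo = gbps(256, 8000)   # Strix Halo: 256.0, exactly double
```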
Comment by suprjami 2 days ago
They are higher quality than the quants you can make yourself due to their imatrix datasets and selective quantisation of different parts of the model.
For Qwen 3.5 Unsloth did 9 terabytes of quants to benchmark the effects of this:
Comment by moffkalast 2 days ago
In short, even lower quants leave some layers at original precision, and llama.cpp in its endless wisdom does not do any conversion when loading weights and seeing what your card supports, so every time you run inference it gets surprised and hits a brick wall when there's no bf16 acceleration. Then it has to convert to fp16 on the fly, or something else, which can literally drop token generation by half or even more. I've seen fp16 models literally run faster than Q8 on Arc despite being twice the size with the same bandwidth, and it's expectedly similar [0] on AMD.
Models used to be released as fp16, which was fine; then Gemma did native bf16, and Bartowski initially came up with a compatibility thing where they converted bf16 to fp32, then fp16, and used that for quants. Most models are released as bf16 these days, though, and Bartowski's given up on doing that (while Unsloth never did it to begin with). So if you do want max speed, you kinda have to do static quants yourself and follow the same multi-step process to remove all the stupid bf16 weights from the model. I don't get why this can't be done once at model load, ffs, but this is what we've got.
[0] https://old.reddit.com/r/LocalLLaMA/comments/1r0b7p8/free_st...
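For what it's worth, the bf16 -> fp32 -> fp16 recast mentioned above can be sketched in NumPy, simulating bf16 as the top 16 bits of an fp32 (my illustration, not llama.cpp's or Bartowski's actual pipeline; the truncating round is a simplification):

```python
import numpy as np

def fp32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Lossy fp32 -> bf16, keeping the top 16 bits (truncating round)."""
    return (np.asarray(x, dtype=np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_fp32(bits: np.ndarray) -> np.ndarray:
    """Exact bf16 -> fp32: shift the 16 bf16 bits back into the high half."""
    return (bits.astype(np.uint32) << 16).view(np.float32)

w = np.array([0.15625, -2.5, 1e-6], dtype=np.float32)
bf16 = fp32_to_bf16_bits(w)
# The bf16 -> fp32 step is lossless; the final fp32 -> fp16 cast is where
# range differences bite (fp16 has a much smaller exponent range than bf16).
fp16 = bf16_bits_to_fp32(bf16).astype(np.float16)
```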
Comment by stebalien 1 day ago
https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/discussi...
I assume they're applying the same technique going forward, but I have no idea how to determine if this is the case.
Comment by adrian_b 2 days ago
I do not know about its GPU, which might have only FP16.
So it is likely that the right inference strategy would be to run any BF16 computations on the Strix Halo CPU, while running the quantized computations on its GPU.
Comment by tssge 2 days ago
Comment by moffkalast 1 day ago
Comment by seemaze 2 days ago
[0] https://www.amd.com/en/developer/resources/technical-article...
Comment by data-ottawa 1 day ago
Comment by seemaze 1 day ago
Comment by sidkshatriya 2 days ago
Call me traditional, but I find it a bit scary for my BIOS to be connecting to WiFi and doing the downloading. Makes me wonder if the new BIOS blob would be secure, i.e. did the BIOS connect securely over https? Did it check for the appropriate hash/signature, etc.? I would suppose all this is more difficult to do in the BIOS. I would expect better security if this was done in user space in the OS.
I'd much prefer if the OS did the actual downloading, followed by the BIOS just doing the installation of the update.
Comment by ZiiS 2 days ago
Comment by adrian_b 2 days ago
BIOSes recent enough to be able to connect to the Internet normally also have the option to update from a USB memory stick from inside the BIOS setup.
Some motherboards can update the BIOS from a USB memory even without a CPU in the socket.
Comment by bityard 2 days ago
Comment by imp0cat 2 days ago
Comment by sidkshatriya 2 days ago
If something is "standard" nowadays, does that mean it is the right way to go?
One of my main issues is that this means your BIOS has to have a WiFi software stack in it, have a TLS stack in it etc. Basically millions of lines of extra code. Most of it in a blob never to be seen by more than a few engineers.
Though in another way, allowing the BIOS to perform self-updates is good, because it doesn't matter whether you've installed FreeBSD, OpenBSD, Linux, Windows, or any other OS: you will be able to update your BIOS.
Comment by ethbr1 1 day ago
Next thing you'll be telling me that you have a problem piping internet hosted install scripts directly into shell!
Comment by trvz 2 days ago
Comment by anko 2 days ago
Comment by rdslw 2 days ago
- gemma4-31b normal q8 -> 5.1 tok/s
- gemma4-31b normal q16 -> 3.7 t/s
- gemma4-31b distil q16 -> 3.6 t/s
- gemma4-31b distil q8 -> 5.7 tok/s (!)
- gemma4-26b-a4b ud q8kxl -> 38 t/s (!)
- gemma4-26b-a4b ud q16 -> 12 t/s
- gemma4-26b-a4b cl q8 -> 42 t/s (!)
- gemma4-26b-a4b cl q16 -> 12 t/s
- qwen3.5-35b-a3b-UD@q6_k -> 52 t/s (!)
- qwen3.5-35b-a3b-uncensored-hauhaucs-aggressive@q8_0 -> 34 tok/s (!)
- qwen3.5-35b-a3b-uncensored-hauhaucs-aggressive@bf16 -> 11 tok/s
- qwen3.5-27b-claude-4.6-opus-reasoning-distilled-v2 q8 -> 8 tok/s
- qwen3.5 122B A10B MXFP4 Mo qwen3.5-122b-a10b (q4) -> 11 tok/s
- qwen3.5-122b-a10b-uncensored-hauhaucs-aggressive (q6) -> 10 tok/s
Comment by SwellJoe 2 days ago
prompt eval time = 315.66 ms / 221 tokens ( 1.43 ms per token, 700.13 tokens per second)
eval time = 1431.96 ms / 58 tokens ( 24.69 ms per token, 40.50 tokens per second)
total time = 1747.62 ms / 279 tokens
With reasoning enabled, it's about a quarter or a fifth of that performance, quite a lot slower, but still reasonably comfortable to use interactively. The dense model is even slower. For some reason, Gemma 4 is pretty slow on the Strix Halo with reasoning enabled, compared to other similar models. It reasons really hard, I guess. I don't understand what makes models of similar sizes slower or faster; it surprised me.
Qwen 3.5 and 3.6 in the similar sized MoE versions at 8-bit quantization are notably faster on this hardware. If I were using Gemma 4 31B with reasoning interactively, I'd use a smaller 6-bit or even 5-bit quantization, to speed it up to something sort of comfortable to use. Because it is dog slow at 8-bit quantization, but shockingly smart and effective for such a tiny model.
Edit: Here's some benchmarks which feel right, based on my own experiences. https://kyuz0.github.io/amd-strix-halo-toolboxes/
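A rough way to see why the MoE versions are so much faster (my back-of-the-envelope, not the commenter's numbers; ~256 GB/s is the usual Strix Halo figure, and active-parameter counts are approximate):

```python
# Decode speed on a memory-bound machine is roughly bandwidth divided by the
# bytes that must be read per token -- for an MoE model, only the *active*
# parameters count, not the total parameter count.
def est_tokens_per_sec(active_params_billions: float,
                       bytes_per_param: float,
                       bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / (active_params_billions * bytes_per_param)

dense_31b_q8 = est_tokens_per_sec(31, 1.0, 256)  # ~8 t/s ceiling for a dense model
moe_3b_active_q8 = est_tokens_per_sec(3, 1.0, 256)  # ~85 t/s ceiling for a 3B-active MoE
```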
Comment by bityard 2 days ago
All you really need is podman, toolbx, and the Strix Halo toolbox images from https://github.com/kyuz0/amd-strix-halo-toolboxes. Then you just download your ggufs and hand them to llama-server.
Yes, there are other solutions that are a bit more hand-holdy, but if you already know how to use docker/podman and just want to get something working in an evening, this works too.
Comment by thr3at-surfac3 1 day ago
Comment by everlier 2 days ago
It looks like context is set to 32k which is the bare minimum needed for OpenCode with its ~10k initial system prompt. So overall, something like Unsloth's UD q8 XL or q6 XL quants free up a lot of memory and bandwidth moving into the next tier of usefulness.
Comment by data-ottawa 1 day ago
It is quite slow, but if you want to compute embeddings in the background it’s fine.
I didn’t find it more energy efficient than just using the GPU for time insensitive tasks though.
Comment by IamTC 2 days ago
Comment by roenxi 2 days ago
The industry looks like it's started to move towards Vulkan. If AMD cards have figured out how to reliably run compute shaders without locking up (never a given in my experience, but that was some time ago), then there shouldn't be a reason to use specialty APIs or software written by AMD outside of drivers.
ROCm was always a bit problematic, but the issue was that if AMD cards weren't good enough for AMD engineers to reliably support tensor multiplication, then there was no way anyone else was going to be able to do it. It isn't like anyone is confused about multiplying matrices together; it isn't for everyone, but the naive algorithm is a core undergrad topic, and the advanced algorithms surely aren't that crazy to implement. It was never a library problem.
Comment by SwellJoe 2 days ago
You can use Vulkan instead of ROCm on Radeon GPUs, including on the Strix Halo (and for a while, Vulkan was more likely to work on the Strix Halo, as ROCm support was slow to arrive and stabilize), but you need something that talks to the GPU.
Current ROCm, 7.2.1, works quite well on the Strix Halo. Vulkan does, too. ROCm tends to be a little faster, though; not always, but mostly. People used to benchmark to figure out which was best for a given model/workload, but now I think most folks just assume ROCm is the better choice and use it exclusively. That's what I do, though I did find Gemma 4 wouldn't work on ROCm for a little bit after release (I think that was a llama.cpp issue, though).
Comment by anaisbetts 2 days ago
Comment by SwellJoe 1 day ago
And, I think the first response time of ROCm is pretty consistently faster than Vulkan, even if Vulkan has a slightly higher token rate. Though I don't see that big of a difference in token rates, either. Honestly, though, I haven't done enough real testing to know for sure. The benchmarks Donato Capitella posts (https://kyuz0.github.io/amd-strix-halo-toolboxes/) have been my guide on what to run in what way, and the performance of most things that can run on the Strix Halo is Fast Enough(tm), such that I don't agonize about performance. When Vulkan was all that worked with llama.cpp, that's what I used. Now that ROCm is reliable, I'm using ROCm. ROCm feels faster, maybe just because it processes prompts faster and starts typing the answer sooner (at a rate faster than I can read, so when it starts answering is the more important metric, even if a faster token rate would lead to it finishing sooner).
In short: If ever I'm doing something that will take many hours to complete, and I need to optimize it, I'll do some tests first to be sure I'm using the optimal path. Otherwise, as long as ROCm is working, I'll probably just keep using it.
Comment by roenxi 2 days ago
But we already have software that talks to the GPU: mesa3d and the ecosystem around it. It has existed for decades. My understanding was that the main reason not to use it was that memory management was too complicated, and CUDA solved that problem.
If memory gets unified, what is the value proposition of ROCm supposed to be over mesa3d? Why does AMD need to invent some new way to communicate with GPUs? Why would it be faster?
Comment by SwellJoe 2 days ago
CUDA is a proprietary Nvidia product. CUDA solved the problem for Nvidia chips.
On AMD GPUs, you use ROCm. On Intel, you use OpenVINO. On Apple silicon you use MLX. All work fine with all the common AI tasks you'd want to do on self-hosted hardware. CUDA was there first and so it has a more mature ecosystem, but, so far, I've found 0 models or tasks I haven't been able to use with ROCm. llama.cpp works fine. ComfyUI works fine. Transformers library works fine. LM Studio works fine.
Unless you believe Nvidia having a monopoly on inference or training AI models is good for the world, you can't oppose all the other GPU makers having a way for their chips to be used for those purposes. CUDA is a proprietary vendor-specific solution.
Edit: But, also, Vulkan works fine on the Strix Halo. It is reliable and usually not that much slower than ROCm (and occasionally faster, somehow). Here's some benchmarks: https://kyuz0.github.io/amd-strix-halo-toolboxes/
Comment by roenxi 2 days ago
That has been one of the big themes in GPU hardware since around the 2010 era, when AMD committed to ATI. Nvidia tried to solve the memory management problem in the software layer; AMD committed to doing it in hardware. Software was a better bet by around a trillion dollars so far, but if the hardware solutions have finally come to fruition, then why the focus on ROCm?
Comment by SwellJoe 2 days ago
Comment by sabedevops 2 days ago
Comment by SwellJoe 2 days ago
Comment by dragontamer 2 days ago
And the memory barriers? How do you sync up the L1/L2 cache of a CPU core with the GPU's cache?
Exactly. With a ROCm memory barrier, ensuring parallelism between CPU + GPU, while also providing a mechanism for synchronization.
GPU and CPU can share memory, but they do not share caches. You need programming effort to make ANY of this work.
Comment by timmy777 2 days ago
I'll give a specific example in my feedback. You said:
``` so far, so good, I was able to play with PyTorch and run Qwen3.6 on llama.cpp with a large context window ```
But there are no numbers, results, or pasted output. No performance figures or timings.
Anyone with enough RAM can run these models; it will just be impracticably slow. The Strix Halo is for decent performance, so your sharing numbers would be valuable here.
Do you mind sharing these? Thanks!
Comment by gessha 2 days ago
Comment by l33tfr4gg3r 2 days ago
Comment by JSR_FDED 2 days ago