My first impressions on ROCm and Strix Halo

Posted by random_ 2 days ago


Comments

Comment by spoaceman7777 2 days ago

I'm somewhat confused as to why this is on the front page. It doesn't go into any real detail, and the advice it gives is... not good. You should definitely not be quantizing your own GGUFs using an old method like that HF script. There are lots of ways to run LLMs via podman (some even officially recommended by the project!). The chip has been out for almost a year now, and its most notable (and relevant-to-AI) feature is not mentioned in this article: it's the only x86_64 chip below workstation/server grade that has quad-channel RAM, and inference is generally RAM constrained. I'm also quite puzzled by the bit about running PyTorch via uv.

Anyway. I wouldn't recommend following the steps posted in there. Poke around Google, or ask your friendly neighborhood LLM for advice on how to set up your Strix Halo laptop/desktop for the tasks described. A good resource to start with would probably be the Unsloth page for whichever model you are trying to run. (There are a few quantization groups competing for top place with GGUFs, and Unsloth is regularly at the top, with incredible documentation on inference, training, etc.)

Anyway, sorry to be harsh. I understand that this is just a blog for jotting down stuff you're doing, which is a great thing to do. I'm mostly just commenting on the fact that this is on the front page of hn for some reason.

Comment by pierrekin 2 days ago

Thanks for writing this comment. I think seeing someone's "first impressions" and then seeing someone else's response to those thoughts is more interesting, and feels more socially connected, than just reading a "correct" guide or similar, especially when it's something I'm curious about but wouldn't necessarily be motivated enough to actually try out myself.

Comment by rpdillon 2 days ago

Agreed. Been running a Strix Halo box since mid-2025. Lemonade builds of llama.cpp with Unsloth or Bartowski quants have proven to be excellent.

Comment by fwipsy 2 days ago

Quad-channel RAM is common on consumer desktops. Strix Halo has *8* channels, and also very fast RAM (soldered RAM can be faster than DIMMs because the traces are shorter).

Comment by fluoridation 2 days ago

Quad channel memory is not common on consumer desktops, it's a strictly HEDT and above feature. The vast majority of consumer desktops have 2 channels or fewer.

Comment by adrian_b 2 days ago

One should no longer use the word "channel" because the width of a channel differs between various kinds of memories, even among those that can be used with the same CPU (e.g. between DDR and LPDDR or between DDR4 and DDR5).

For instance, now the majority of desktops with DDR5 have 4 channels, not 2 channels, but the channels are narrower, so the width of the memory interface is the same as before.

To avoid ambiguities, one should always write the width of the memory interface.

Most desktop computers and laptop computers have 128-bit memory interfaces.

The cheapest desktop computers and laptop computers, e.g. those with Intel Alder Lake N/Twin Lake CPUs, and also many smartphones and Arm-based SBCs, have 64-bit memory interfaces.

Cheaper smartphones and Arm-based SBCs have 32-bit memory interfaces.

Strix Halo and many older workstations and many cheaper servers have 256-bit memory interfaces.

High-end servers and workstations have 768-bit or 512-bit memory interfaces.

It is expected that future high-end servers will have 1024-bit memory interfaces per socket.

GPUs with private memory usually have memory interfaces between 192-bit and 1024-bit, but newer consumer GPUs usually have narrower memory interfaces than older ones, to reduce cost. The narrower interface is compensated by faster memory, so available bandwidth in consumer GPUs has increased much more slowly than the increase in GDDR memory speed would have allowed.
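To make those widths concrete: peak theoretical bandwidth is just interface width times transfer rate. A rough sketch (the MT/s figures are illustrative round numbers I've assumed, not official specs for any particular part):

```python
# Peak theoretical bandwidth = (bus width in bits / 8) * transfer rate.
# MT/s values are illustrative round numbers, not official specs.
def peak_bandwidth_gbs(bus_width_bits: int, mts: int) -> float:
    return bus_width_bits / 8 * mts / 1000  # GB/s

for label, width, mts in [
    ("typical desktop/laptop", 128, 6400),
    ("Strix Halo",             256, 8000),
    ("high-end server",        768, 6400),
]:
    print(f"{label:24s} {width:4d}-bit -> {peak_bandwidth_gbs(width, mts):6.1f} GB/s")
```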

Comment by fluoridation 1 day ago

>now the majority of desktops with DDR5 have 4 channels, not 2 channels

Source? I just looked up two random X870E boards from Gigabyte and both are dual channel.

>To avoid ambiguities, one should always write the width of the memory interface.

They're incomparable quantities. More channels support more parallel operations, while a wider bus at a constant frequency supports higher throughput.

The bus width is not even that useful of a metric. It's more useful to talk about bits per second, which is the product of bus width and frequency.

Comment by sliken 21 hours ago

Sadly motherboards, tech journalists, and many other sources confuse DIMMs and channels. The trick is that in the DDR4 generation they were the same, 64 bits wide. However, a standard DDR5 DIMM is not 1x64 bit, it's actually 2x32 bit. Thus 2 DDR5 DIMMs = 4 channels.

For some workloads the extra channels help, despite having the same bandwidth. This is one of the reasons that it's possible for a DDR5 system to be slightly faster than a DDR4 system, even if the memory runs at the same speed.

Comment by fluoridation 19 hours ago

>However a standard DDR5 dimm is not 1x64 bit, it's actually 2x32 bit. Thus 2 DDR5 dimms = 4 channels.

Uh, surely that depends on how the motherboard is wired. Just because each DIMM has half the pins on one channel and the other half on another, doesn't mean 2 DIMM = 4 channels. It could just be that the top pins over all the DIMMs are on one channel and the bottom ones are on another.

Comment by sliken 9 hours ago

I think there's a standard wiring for the DIMM and some parts are shared. Each normal DDR5 DIMM has 2 subchannels that are 32 bits each; the new HUDIMM specification will enable only 1 subchannel and have only half the bandwidth.

I don't think you can wire up DDR5 DIMMs willy-nilly as if they were 2 separate 32-bit DIMMs.

Comment by fluoridation 7 hours ago

Well, I don't know what to tell you. I'm not a computer engineer, but I assume Gigabyte has at least a few of those, and they're labeling the X870E boards with 4 DIMMs as "dual channel". I feel like if they were actually quad channel they'd jump at the chance to put a bigger number, so I'm compelled to trust the specs.

Comment by sliken 7 hours ago

In computer-manufacturer speak, dual channel = 2 x 64 bit = 128 bits wide.

So with 2 DIMMs or 4 you still get 128-bit-wide memory. With DDR4 that means 2 channels x 64 bits each. With DDR5 that means 4 channels x 32 bits each.

Keep in mind that the memory controller is in the CPU; the motherboard's job is just to connect the right pins on the DIMMs to the right pins on the CPU socket. The days of an off-chip memory controller/north bridge are long gone.

So if you look at an AM5 CPU's spec, it clearly states:

* Memory Type: DDR5-only (no DDR4 compatibility).

* Channels: 2 Channel (Dual-Channel).

* Memory Width: 2x32-bit sub-channels (128-bit total for 2 sticks).

Comment by fluoridation 6 hours ago

Why are you quoting something that contradicts you? It clearly states it's a dual-channel memory architecture with 32-bit subchannels. The fact that both words are used means they mean different things.

>In computer manufacture speak dual channel = 2 x 64 bit = 128 bits wide.

Yes, because AMD64 has 64-bit words. You can't satisfy a 64-bit load or store with just 32 bits (unless you take twice as long, of course). That you get 4 32-bit subchannels doesn't mean you can execute 4 simultaneous independent 32-bit memory operations. A 64-bit channel capable of a full operation still needs to be assembled out of multiple 32-bit subchannels. If you install a single stick you don't get any parallelism with your memory operations; i.e. the system runs in single channel mode, the single stick fulfilling only a single request at a time.

Comment by sliken 4 hours ago

AM5 is the AMD standard, and it's accurate; it seems rather pedantic to differentiate between "2 subchannels per DIMM" and "4 x 32-bit channels for a total of 128 bits".

However, the motherboard vendors annoyingly hide that from you by calling DDR4 dual channel (2 x 64 bit, which means two outstanding cache misses, one per channel) and glossing over the difference by also calling DDR5 dual channel (4 x 32 bit, which means 4 outstanding cache misses).

> Yes, because AMD64 has 64-bit words.

It's a bit more complicated than that. First you have 3 levels of cache, the last of which triggers a cache-line load, which is 64 bytes (not 64 bits). That goes to one of the 4 channels, and there's a long latency before the first 64 bits arrive. Then there are the complications of opening the row, which makes its columns available and can speed things up if you need more than one column from that row. But the general idea is that you get at most one cache line per channel after waiting out the memory latency.

So DDR4 on a 128-bit system can have 2 cache lines in flight, i.e. 128 bytes per memory latency. On a DDR5 system you can have 4 cache lines in flight per memory latency. Sure, 32-bit channels have half the bandwidth per clock, but the trick is that the memory bus spends most of its time waiting on memory to start a transfer. So waiting 50 ns and then getting 32 bits @ 8000 MT/s isn't that different from waiting 50 ns and getting 64 bits @ 8000 MT/s.

Each 32-bit subchannel can handle a unique address, which is turned into a row/column and a separate transfer when done. So a normal DDR5 system can look up 4 addresses in parallel, wait out the memory latency, and return a cache line of 64 bytes for each.

Even better when you have something like Strix Halo, which has a 256-bit-wide memory system (twice any normal tablet, laptop, or desktop) arranged as 16 channels x 16 bits, so it can handle 16 cache misses in flight. I suspect this is mostly to keep its aggressive iGPU fed.
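A back-of-envelope sketch of that channels-in-flight argument (the 50 ns latency is an assumed round number, and this ignores row-buffer hits, prefetching, and bank effects):

```python
CACHE_LINE_BYTES = 64
LATENCY_NS = 50  # assumed round-number DRAM latency

# Latency-bound random access: one cache line per channel per latency.
# bytes / ns is numerically equal to GB/s.
for label, channels in [
    ("DDR4, 2 x 64-bit",        2),
    ("DDR5, 4 x 32-bit",        4),
    ("Strix Halo, 16 x 16-bit", 16),
]:
    gbs = channels * CACHE_LINE_BYTES / LATENCY_NS
    print(f"{label:24s} ~{gbs:5.2f} GB/s when purely latency-bound")
```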

Comment by sliken 21 hours ago

> Quad-channel RAM is common on consumer desktops

Yes, but only in the sense that tablets, laptops, and normal (non-HEDT) desktops have 4 channels of 4x32 bit = 128 bits wide. DDR5 puts two 32-bit channels on one 64-bit DIMM; the previous-gen DDR4 had one 64-bit channel per 64-bit DIMM.

So Strix Halo (in laptops, tablets, and desktops) has a 256-bit-wide memory system, providing twice the memory bandwidth of any other Ryzen or Intel i3/i5/i7/i9. The Apple Pro (256 bit), Max (512 bit), and Ultra (1024 bit) lines of Apple silicon also have wider-than-128-bit memory systems. On the AMD side it's just Threadripper (256 bit) and Threadripper Pro (512 bit), but those typically go into workstations that are physically large, expensive, and in need of substantial cooling.

So Strix Halo is pretty much unique (outside of Apple) in providing twice the memory bandwidth of anything else in the tablet, laptop, or small-desktop category.

Comment by phonon 2 days ago

4 DIMMS =/= 4 channels

Comment by fwipsy 1 day ago

I knew that, but I still thought most desktops with 4 DIMM slots supported quad-channel memory. I guess I was wrong.

Comment by suprjami 2 days ago

If you are using quants below Q8 then get them from Unsloth or Bartowski.

They are higher quality than the quants you can make yourself due to their imatrix datasets and selective quantisation of different parts of the model.

For Qwen 3.5 Unsloth did 9 terabytes of quants to benchmark the effects of this:

https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
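If you go the pre-made route, here is a minimal sketch of fetching one quant with the huggingface_hub library. The repo id is the one linked in this thread, but the filename is an assumed example, so browse the repo for the exact file you want:

```python
from huggingface_hub import hf_hub_download

# Repo id is the one discussed in this thread; the filename is a
# hypothetical example -- check the repo's file list for real names.
path = hf_hub_download(
    repo_id="unsloth/Qwen3.5-35B-A3B-GGUF",
    filename="Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf",
)
print(path)  # local cache path, ready to hand to llama.cpp
```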

Comment by moffkalast 2 days ago

That used to be a good suggestion, and it still most likely is if you're using a recent Nvidia dGPU, but absolutely not for iGPUs like the Halo/Point or Arc LPG. The problem is bf16.

In short, even lower quants leave some layers at original precision, and llama.cpp in its endless wisdom does no conversion at load time based on what your card supports, so every inference run hits a brick wall when there's no bf16 acceleration. It then has to convert to fp16 (or something else) on the fly, which can literally drop tg by half or more. I've seen fp16 models literally run faster than Q8 on Arc despite being twice the size with the same bandwidth, and it's expectedly similar [0] on AMD.

Models used to be released as fp16, which was fine; then Gemma shipped native bf16, and Bartowski initially came up with a compatibility workaround of converting bf16 to fp32, then to fp16, and using that for quants. Most models are released as bf16 these days, though, and Bartowski has given up on doing that (while Unsloth never did it to begin with). So if you do want max speed, you kinda have to make static quants yourself and follow the same multi-step process to remove all the stupid bf16 weights from the model. I don't get why this can't be done once at model load, ffs, but this is what we've got.

[0] https://old.reddit.com/r/LocalLLaMA/comments/1r0b7p8/free_st...
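For anyone attempting that multi-step process, here is a minimal sketch of the bf16-stripping pass using PyTorch and safetensors. The filenames are placeholders (real checkpoints usually span multiple shards), and note that any weight outside fp16's representable range would overflow, which is rare for weights but worth checking:

```python
import torch
from safetensors.torch import load_file, save_file

# Offline one-time pass: cast bf16 tensors to fp16 so the GGUF
# conversion never embeds bf16 weights the iGPU can't accelerate.
# "model.safetensors" is a placeholder for your checkpoint shard.
sd = load_file("model.safetensors")
sd = {name: (t.to(torch.float16) if t.dtype == torch.bfloat16 else t)
      for name, t in sd.items()}
save_file(sd, "model-fp16.safetensors")
```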

Comment by stebalien 1 day ago

At least for qwen3.5, it looks like unsloth has updated their quantization algorithms to avoid bf16. See the march 5th update:

https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/discussi...

I assume they're applying the same technique going forward, but I have no idea how to determine if this is the case.

Comment by adrian_b 2 days ago

The CPU of Strix Halo has good BF16 acceleration, like any other Zen 4/Zen 5 CPU (the future Zen 6 will add FP16 acceleration).

I do not know about its GPU, which might have only FP16.

So it is likely that the right inference strategy would be to run any BF16 computations on the Strix Halo CPU, while running the quantized computations on its GPU.

Comment by tssge 2 days ago

The GPU has INT4, INT8, BF16, and FP16; notably, no FP8 or FP4. The official GPTQ-Int4 release from Qwen is a great quant for this, but custom kernels are still rare for this hardware.
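A quick way to probe what the ROCm PyTorch build reports for this chip; what the runtime claims and what kernels actually accelerate can differ, so treat this as a sanity check rather than ground truth:

```python
import torch

# On ROCm builds, the HIP backend is exposed via the torch.cuda namespace.
print(torch.cuda.get_device_name(0))
print("bf16 supported:", torch.cuda.is_bf16_supported())
# fp8 dtypes exist in recent PyTorch, but without hardware FP8 they
# would be emulated/converted rather than accelerated:
print("fp8 dtype in torch:", hasattr(torch, "float8_e4m3fn"))
```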

Comment by moffkalast 1 day ago

Must be a case of the hardware being there and the software not actually supporting it then.

Comment by seemaze 2 days ago

Check out the officially supported project Lemonade[0] by AMD. It has gfx1151 specific builds of vLLM, llama.cpp, comfy-ui, and even a PR to merge a Strix Halo port of Apple’s MLX[1] with a quick and easy install.

[0] https://www.amd.com/en/developer/resources/technical-article...

[1] https://github.com/lemonade-sdk/lemonade/issues/1642

Comment by data-ottawa 1 day ago

I don’t think lemonade includes a comfyui wrapper, it does have stable diffusion support built in though.

Comment by seemaze 1 day ago

I think you are correct. I’ve mostly been working with plain llama.cpp, but recently started looking into lemonade for the baked-in NPU support.

Comment by sidkshatriya 2 days ago

> It seems that things wouldn't work without a BIOS update: PyTorch was unable to find the GPU. This was easily done on the BIOS settings: it was able to connect to my Wifi network and download it automatically.

Call me traditional, but I find it a bit scary for my BIOS to be connecting to WiFi and doing the downloading. It makes me wonder whether the new BIOS blob would be secure, i.e., did the BIOS connect securely over HTTPS? Did it check the appropriate hash/signature, etc.? I would suppose all this is harder to do in the BIOS, and I would expect better security if it were done in user space in the OS.

I'd much prefer it if the OS did the actual downloading, followed by the BIOS just installing the update.

Comment by ZiiS 2 days ago

I have never seen a BIOS that didn't allow offline updates. However, SSL is much less processing than a WPA2 WiFi stack; I would certainly expect this to be fully secure and would boycott a manufacturer who failed at it. Conversely, updating your BIOS without worrying about whether your OS is rooted is nice.

Comment by adrian_b 2 days ago

Updating your BIOS without worrying whether your OS is rooted can be done easily, and more securely, from a USB stick.

BIOSes recent enough to connect to the Internet normally also have the option to update from a USB stick from inside the BIOS setup.

Some motherboards can update the BIOS from a USB stick even without a CPU in the socket.

Comment by bityard 2 days ago

You don't HAVE to update the BIOS over WiFi; fwupd is perfectly able to do it as well.

Comment by imp0cat 2 days ago

Isn't this pretty much standard in this day and age? HP for example also has this option in BIOS for their laptops (but you still can either download the BIOS blob manually in Linux or use the automatic updater in Windows if you want).

Comment by sidkshatriya 2 days ago

> Isn't this pretty much standard in this day and age?

If something is "standard" nowadays, does it mean it is the right way to go?

One of my main issues is that this means your BIOS has to have a WiFi software stack, a TLS stack, etc. in it: basically millions of lines of extra code, most of it in a blob never to be seen by more than a few engineers.

Though in another way, allowing the BIOS to perform self-updates is good, because no matter whether you've installed FreeBSD, OpenBSD, Linux, Windows, or any other OS, you will be able to update your BIOS.

Comment by ethbr1 1 day ago

> If something is "standard" nowadays, does it mean it is the right way to go?

Next thing you'll be telling me that you have a problem piping internet hosted install scripts directly into shell!

Comment by trvz 2 days ago

I fully expect any BIOS to have millions of unnecessary lines of code already though. May as well have a bit more for user convenience.

Comment by anko 2 days ago

I would be interested to know what speeds you can get from Gemma 4 26B + 31B on this machine, and also how ROCm compares to Triton.

Comment by rdslw 2 days ago

## performance data for token generation using lmstudio

- gemma4-31b normal q8 -> 5.1 t/s

- gemma4-31b normal q16 -> 3.7 t/s

- gemma4-31b distil q16 -> 3.6 t/s

- gemma4-31b distil q8 -> 5.7 t/s (!)

- gemma4-26b-a4b ud q8kxl -> 38 t/s (!)

- gemma4-26b-a4b ud q16 -> 12 t/s

- gemma4-26b-a4b cl q8 -> 42 t/s (!)

- gemma4-26b-a4b cl q16 -> 12 t/s

- qwen3.5-35b-a3b-UD@q6_k -> 52 t/s (!)

- qwen3.5-35b-a3b-uncensored-hauhaucs-aggressive@q8_0 -> 34 t/s (!)

- qwen3.5-35b-a3b-uncensored-hauhaucs-aggressive@bf16 -> 11 t/s

- qwen3.5-27b-claude-4.6-opus-reasoning-distilled-v2 q8 -> 8 t/s

- qwen3.5-122b-a10b MXFP4 MoE (q4) -> 11 t/s

- qwen3.5-122b-a10b-uncensored-hauhaucs-aggressive (q6) -> 10 t/s

Comment by SwellJoe 2 days ago

Currently running Gemma 4 26B A4B 8-bit quantization, reasoning off; the most recent job performed as follows (which seems about average, though these are short-running tasks, <2 seconds for each prompt):

prompt eval time = 315.66 ms / 221 tokens ( 1.43 ms per token, 700.13 tokens per second)

eval time = 1431.96 ms / 58 tokens ( 24.69 ms per token, 40.50 tokens per second)

total time = 1747.62 ms / 279 tokens
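For reference, the rates in those timing lines are just tokens divided by elapsed seconds:

```python
# The llama-server timings above are simply tokens / seconds:
prompt_tps = 221 / 0.31566   # prompt eval: ~700.1 t/s
gen_tps = 58 / 1.43196       # generation:   ~40.5 t/s
print(f"{prompt_tps:.1f} t/s prompt, {gen_tps:.1f} t/s generation")
```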

With reasoning enabled, it's about a quarter or a fifth of that performance, quite a lot slower, but still reasonably comfortable to use interactively. The dense model is even slower. For some reason, Gemma 4 is pretty slow on the Strix Halo with reasoning enabled, compared to other similar models. It reasons really hard, I guess. I don't understand what makes models of similar sizes slower or faster; it surprised me.

Qwen 3.5 and 3.6 in the similar sized MoE versions at 8-bit quantization are notably faster on this hardware. If I were using Gemma 4 31B with reasoning interactively, I'd use a smaller 6-bit or even 5-bit quantization, to speed it up to something sort of comfortable to use. Because it is dog slow at 8-bit quantization, but shockingly smart and effective for such a tiny model.

Edit: Here's some benchmarks which feel right, based on my own experiences. https://kyuz0.github.io/amd-strix-halo-toolboxes/

Comment by bityard 2 days ago

If you just want to run models, most of TFA is taking the scenic route.

All you really need is podman, toolbx, and the Strix Halo toolbox images from https://github.com/kyuz0/amd-strix-halo-toolboxes. Then you just download your GGUFs and hand them to llama-server.

Yes, there are other solutions that are a bit more hand-holdy, but if you already know how to use docker/podman and just want to get something working in an evening, this works too.
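For that last step, a minimal launcher sketch; the model path is a placeholder and the flags are the common llama.cpp ones, so check your build's --help:

```python
import subprocess

# Hand a downloaded GGUF to llama-server from inside the toolbox.
subprocess.run([
    "llama-server",
    "-m", "/models/your-model.gguf",  # placeholder path
    "-ngl", "99",           # offload all layers to the iGPU
    "-c", "32768",          # context window
    "--host", "0.0.0.0",
    "--port", "8080",
], check=True)
```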

Comment by thr3at-surfac3 1 day ago

The unified memory architecture is what makes Strix Halo interesting for inference workloads: no PCIe bottleneck moving weights between CPU and GPU memory.

For anyone getting started, the Unsloth UD quants are the way to go; their imatrix calibration makes a real difference in output quality at Q6/Q8 compared to naive quantization.

Curious about the ROCm vs Vulkan situation, though. Has anyone benchmarked the prompt-processing speed difference? For agentic workflows where you're constantly feeding new context, first-token latency matters more than raw t/s.
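One way to measure that: llama.cpp ships llama-bench, which reports prompt processing (pp) and token generation (tg) separately, so you can run the same model under a ROCm build and a Vulkan build and compare. A sketch, with the model path as a placeholder:

```python
import subprocess

# Run once from a ROCm build and once from a Vulkan build of llama.cpp;
# the pp number is the one that dominates first-token latency.
subprocess.run([
    "llama-bench",
    "-m", "/models/your-model.gguf",
    "-p", "2048",   # prompt-processing batch size to time (pp)
    "-n", "128",    # tokens to generate (tg)
], check=True)
```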

Comment by everlier 2 days ago

Owning the GGUF conversion step is good in some circumstances, but running in fp16 is suboptimal for this hardware due to its low-ish bandwidth.

It looks like the context is set to 32k, which is the bare minimum needed for OpenCode with its ~10k initial system prompt. So overall, something like Unsloth's UD Q8 XL or Q6 XL quants frees up a lot of memory and bandwidth, moving you into the next tier of usefulness.
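To see why the context setting matters for memory, here is a rough fp16 KV-cache estimate; the model shape numbers are assumptions for illustration, not any specific model's real config:

```python
# KV cache bytes ~= 2 (K and V) * layers * kv_heads * head_dim
#                   * context_length * bytes_per_element
def kv_cache_gib(layers, kv_heads, head_dim, n_ctx, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * n_ctx * bytes_per / 2**30

# Hypothetical 48-layer model with 8 KV heads of dimension 128:
for ctx in (32_768, 131_072):
    print(f"{ctx:7d} ctx -> ~{kv_cache_gib(48, 8, 128, ctx):.1f} GiB KV cache")
```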

Comment by data-ottawa 1 day ago

Linux kernel 7 enables the NPU. You can use fastflowLM with Lemonade now.

It is quite slow, but if you want to compute embeddings in the background it’s fine.

I didn’t find it more energy efficient than just using the GPU for time insensitive tasks though.

Comment by IamTC 2 days ago

Nice. Thanks for the writeup. My Strix Halo machine is arriving next week. This is handy and helpful.

Comment by roenxi 2 days ago

I thought the point of something like Strix Halo was to avoid ROCm altogether? AMD's strategy seems to have been to unify GPU/CPU memory and then let people write their own libraries.

The industry looks like it has started to move towards Vulkan. If AMD cards have figured out how to reliably run compute shaders without locking up (never a given in my experience, but that was some time ago), then there shouldn't be a reason to use specialty APIs or software written by AMD outside of drivers.

ROCm was always a bit problematic, but the issue was that if AMD cards weren't good enough for AMD engineers to reliably support tensor multiplication, then there was no way anyone else was going to be able to do it. It isn't like anyone is confused about multiplying matrices together; it isn't for everyone, but the naive algorithm is a core undergrad topic, and the advanced algorithms surely aren't that crazy to implement. It was never a library problem.

Comment by SwellJoe 2 days ago

You misunderstand the point, and ROCm. The GPU and CPU share memory; that doesn't mean you don't need to interact with the GPU anymore.

You can use Vulkan instead of ROCm on Radeon GPUs, including on the Strix Halo (and for a while, Vulkan was more likely to work on the Strix Halo, as ROCm support was slow to arrive and stabilize), but you need something that talks to the GPU.

Current ROCm, 7.2.1, works quite well on the Strix Halo. Vulkan does, too. ROCm tends to be a little faster, though. Not always, but mostly. People used to benchmark to figure out which was best for a given model/workload, but now I think most folks just assume ROCm is the better choice and use it exclusively. That's what I do, though I did find Gemma 4 wouldn't work on ROCm for a little while after release (I think that was a llama.cpp issue, though).

Comment by anaisbetts 2 days ago

This hasn't been my experience, ROCm is usually not only a bit slower for me (~32 t/s vs ~43 t/s on the main model I use), it is way less reliable; any upgrade in kernel version or AMD driver and suddenly everything is broken

Comment by SwellJoe 1 day ago

It can be tricky to get/keep ROCm working, but around 7.2 it became reliable and as fast as or faster than ROCm 6.4.

And I think the first-response time of ROCm is pretty consistently faster than Vulkan, even if Vulkan has a slightly higher token rate. Though I don't see that big of a difference in token rates, either. Honestly, though, I haven't done enough real testing to know for sure. The benchmarks Donato Capitella posts (https://kyuz0.github.io/amd-strix-halo-toolboxes/) have been my guide on what to run in what way, and the performance of most things that can run on the Strix Halo is Fast Enough(tm), such that I don't agonize over performance. When Vulkan was all that worked with llama.cpp, that's what I used. Now that ROCm is reliable, I'm using ROCm. ROCm feels faster, maybe just because it processes prompts faster and starts typing the answer sooner (at a rate faster than I can read, so when it starts answering is the more important metric, even if a faster token rate would lead to it finishing sooner).

In short: If ever I'm doing something that will take many hours to complete, and I need to optimize it, I'll do some tests first to be sure I'm using the optimal path. Otherwise, as long as ROCm is working, I'll probably just keep using it.

Comment by roenxi 2 days ago

> The GPU and CPU share memory, that doesn't mean you don't need to interact with the GPU, anymore.

But we already have software that talks to the GPU: mesa3d and the ecosystem around it. It has existed for decades. My understanding was that the main reason not to use it was that memory management was too complicated, and CUDA solved that problem.

If memory gets unified, what is the value proposition of ROCm supposed to be over mesa3d? Why does AMD need to invent some new way to communicate with GPUs? Why would it be faster?

Comment by SwellJoe 2 days ago

> CUDA solved that problem.

CUDA is a proprietary Nvidia product. CUDA solved the problem for Nvidia chips.

On AMD GPUs, you use ROCm. On Intel, you use OpenVINO. On Apple silicon you use MLX. All work fine with all the common AI tasks you'd want to do on self-hosted hardware. CUDA was there first and so it has a more mature ecosystem, but, so far, I've found 0 models or tasks I haven't been able to use with ROCm. llama.cpp works fine. ComfyUI works fine. Transformers library works fine. LM Studio works fine.

Unless you believe Nvidia having a monopoly on inference or training AI models is good for the world, you can't oppose all the other GPU makers having a way for their chips to be used for those purposes. CUDA is a proprietary vendor-specific solution.

Edit: But, also, Vulkan works fine on the Strix Halo. It is reliable and usually not that much slower than ROCm (and occasionally faster, somehow). Here's some benchmarks: https://kyuz0.github.io/amd-strix-halo-toolboxes/

Comment by roenxi 2 days ago

Why? What is the point of focusing on something that seems to be a memory management solution when the memory management problem theoretically just went away?

That has been one of the big themes in GPU hardware since around the 2010 era, when AMD committed to ATI. Nvidia tried to solve the memory management problem in the software layer; AMD committed to doing it in hardware. Software was a better bet by around a trillion dollars so far, but if the hardware solutions have finally come to fruition, then why the focus on ROCm?

Comment by SwellJoe 2 days ago

I dunno. GPU programming and performance is above my pay grade. I assume the reason every GPU maker is investing in software is because they understand the problems to be solved and feel it's worth the investment to solve them. I like AMD because their Linux drivers are open source. I like Intel because all their stuff is Open Source. I like Nvidia notably less because none of their stuff is Open Source, not even the Linux drivers.

Comment by sabedevops 2 days ago

The problem with ROCm, unlike CUDA, is that it doesn't run on much of AMD's own hardware, most notably their iGPUs.

Comment by SwellJoe 2 days ago

Yeah, that kinda sucks, but all their new-generation onboard GPUs are supported by ROCm, e.g. the Ryzen AI 395 and 400 series, which will be found in mid-to-high-end laptops, desktops, and motherboards. They seem to have realized that the reason Nvidia is kicking their ass is that people can develop with CUDA on all sorts of hardware, including their personal laptop or desktop.

Comment by dragontamer 2 days ago

> If memory gets unified, what is the value proposition of ROCm supposed to be over mesa3d? Why does AMD need to invent some new way to communicate with GPUs? Why would it be faster?

And the memory barriers? How do you sync up the L1/L2 cache of a CPU core with the GPU's cache?

Exactly: with a ROCm memory barrier, which enables parallelism between CPU + GPU while also providing a mechanism for synchronization.

GPU and CPU can share memory, but they do not share caches. You need programming effort to make ANY of this work.
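This is visible even from Python: GPU kernel launches are asynchronous, and the CPU side still needs an explicit sync point before consuming results, unified memory or not. A minimal PyTorch sketch (ROCm builds expose HIP through the torch.cuda namespace):

```python
import torch

assert torch.cuda.is_available()  # true on a working ROCm install

x = torch.randn(4096, 4096, device="cuda")
y = x @ x                    # launched asynchronously on the GPU
torch.cuda.synchronize()     # barrier: GPU work done, results visible
print(y.sum().item())        # now safe to consume on the CPU side
```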

Comment by timmy777 2 days ago

Thanks for sharing. However, this missed being a good writeup due to lack of numbers and data.

I'll give a specific example in my feedback. You said:

``` so far, so good, I was able to play with PyTorch and run Qwen3.6 on llama.cpp with a large context window ```

But there are no numbers, results or output paste. Performance, or timings.

Anyone with RAM can run these models; it will just be impracticably slow. The Strix Halo is for decent performance, so sharing numbers would be valuable here.

Do you mind sharing these? Thanks!

Comment by gessha 2 days ago

This is more of a “succeeding to get anywhere close to messing around” rather than “it works so now I can run some benchmarks” type of article.

Comment by l33tfr4gg3r 2 days ago

To give the benefit of the doubt, the author does state multiple times (including in the title) that these were "first impressions", so perhaps they should have mentioned something like "...in the next post, we'll explore performance and numbers" to avoid a cliffhanger, or labeled it part 1 (assuming the intention was to follow up with a part 2).

Comment by JSR_FDED 2 days ago

Perfect. No fluff, just the minimum needed to get things working.

Comment by aappleby 2 days ago

No benchmarks?

Comment by politelemon 2 days ago

First impressions