Zero-Copy GPU Inference from WebAssembly on Apple Silicon
Posted by agambrahma 2 days ago
Comments
Comment by fulafel 2 days ago
Beware the reality distortion field: This is of course how it's worked on most x86 machines for a long time. And also on most Macs when they were using Intel chips.
Comment by littlecranky67 2 days ago
Comment by ben-schaaf 2 days ago
dGPUs bring their own VRAM because it's a different type of memory, allowing them to get higher performance than they could with DDR. The M4 Max requires 128GB of LPDDR5X to reach its ~500GB/s bandwidth. The RX Vega 64 had that same bandwidth in 2017 with just 8GB of HBM2.
Comment by fc417fc802 2 days ago
Of course the APIs have allowed you to make direct use of pointers to CPU memory for something like a decade. However that requires maintaining two separate code paths because doing so while running on a dGPU is _extremely_ expensive.
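To make the "two code paths" point concrete, here is a hedged Swift/Metal sketch (illustrative only, not from the article): `MTLDevice.hasUnifiedMemory` separates the zero-copy host-pointer path from the staging-plus-blit path a dGPU needs.

```swift
import Metal

// Sketch of the two code paths: on a UMA device the GPU can alias host
// memory in place; on a dGPU you must stage and blit into private VRAM.
// Assumes hostPtr is page-aligned and length is a page-size multiple,
// as bytesNoCopy requires.
func uploadWeights(device: MTLDevice, queue: MTLCommandQueue,
                   hostPtr: UnsafeMutableRawPointer, length: Int) -> MTLBuffer? {
    if device.hasUnifiedMemory {
        // Zero-copy: GPU reads the host allocation directly.
        return device.makeBuffer(bytesNoCopy: hostPtr, length: length,
                                 options: .storageModeShared, deallocator: nil)
    }
    // dGPU path: copy into a shared staging buffer, then blit to VRAM.
    guard let staging = device.makeBuffer(bytes: hostPtr, length: length,
                                          options: .storageModeShared),
          let privateBuf = device.makeBuffer(length: length,
                                             options: .storageModePrivate),
          let cmd = queue.makeCommandBuffer(),
          let blit = cmd.makeBlitCommandEncoder() else { return nil }
    blit.copy(from: staging, sourceOffset: 0,
              to: privateBuf, destinationOffset: 0, size: length)
    blit.endEncoding()
    cmd.commit()
    cmd.waitUntilCompleted()
    return privateBuf
}
```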
Comment by kimixa 1 day ago
The "reserved" memory is more about the guaranteed minimum to allow the thing to actually light up, and sometimes specific hardware blocks had more limited requirements (e.g. the display block might require contiguous physical addresses, or the MMU data/page tables themselves) so we would reserve a chunk to ensure they can actually be allocated with those requirements. But they tended to be a small proportion of the total "GPU Memory used".
Sure, sharing the virtual address space is less well supported, but the total amount of memory the GPU can use is flexible at runtime.
Comment by fulafel 2 days ago
Comment by littlecranky67 1 day ago
Comment by fulafel 1 day ago
Comment by agambrahma 1 day ago
What is different, then, is the combination of:
1. UMA memory (and yes, iGPUs had this pre-M1)
2. enough bandwidth / GPU throughput for local inference
3. a straightforward `makeBuffer(bytesNoCopy:)` path
So, the novelty isn't the shared memory itself, but the whole chain lining up to make the Wasm linear memory -> Metal-buffer approach practical + performant enough.
(and not saying there's some Apple Silicon magic here either ... it'd work anywhere there was UMA and no-copy host-pointer path)
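For the curious, a minimal sketch of that chain, assuming a Wasm runtime that exposes the base pointer and size of an instance's linear memory (the function shape here is hypothetical, not the post's actual code):

```swift
import Metal
import Darwin

// Sketch: alias a region of Wasm linear memory as a Metal buffer.
// bytesNoCopy requires a page-aligned pointer and page-multiple length;
// Wasm linear memory grows in 64 KiB pages, a multiple of Apple
// Silicon's 16 KiB host pages, so a page-aligned allocation suffices.
func metalBuffer(over linearMemory: UnsafeMutableRawPointer,
                 byteCount: Int,
                 device: MTLDevice) -> MTLBuffer? {
    let page = Int(getpagesize())
    guard Int(bitPattern: linearMemory) % page == 0,
          byteCount % page == 0 else { return nil }
    // deallocator: nil — the Wasm runtime keeps ownership of the pages.
    return device.makeBuffer(bytesNoCopy: linearMemory,
                             length: byteCount,
                             options: .storageModeShared,
                             deallocator: nil)
}
```

The GPU then reads tensors straight out of the sandboxed linear memory; the main thing to guard (my caveat, not the post's) is that the Wasm side doesn't grow or remap the memory while a command buffer is in flight.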
Comment by saagarjha 2 days ago
Comment by jsomedon 2 days ago
Comment by agambrahma 1 day ago
The value would be in actor processes, where you can delegate inference without paying the 'copy tax' for crossing the sandbox boundary.
So, less "inference engine" and more "Tmux for AI agents"
Think pausing, moving, resuming, swapping model backend.
I scoped the post to memory architecture, since it was the least obvious part ... will follow up with one about the actor model aspect.
Comment by saagarjha 1 day ago
Comment by swiftcoder 2 days ago
Comment by saagarjha 2 days ago
Comment by swiftcoder 1 day ago
Comment by nl 2 days ago
The whole Apple Silicon thing is (in this case) just added details that don't actually matter.
[1] https://github.com/WebAssembly/memory-control/blob/main/prop...
Comment by eis 2 days ago
Comment by fho 2 days ago
The difference being that modern integrated GPU are just that much faster and can run inference at tolerable speeds.
(Plus NPUs being a thing now, but that also started much earlier. The 10th-gen Intel Core architecture already had instructions to deal with "AI" workloads... just very preliminary ones.)
Comment by mirekrusin 2 days ago
Comment by fc417fc802 2 days ago
If you make use of host pointers and run on an iGPU, no copy will take place.
Comment by fho 2 days ago
I am pretty sure that my old 10th gen CPU/GPU combo has the ability to use the "unified"/zero-copy access mode for the GPU.
Comment by eis 2 days ago
The article is talking about one particular optimization that one can implement on Apple Silicon, and I at least wasn't aware that it is now possible to do so from WebAssembly. So to completely dismiss it as if it had nothing to do with Apple Silicon is, imho, not fair.
Comment by pjmlp 1 day ago
And yes things like the Amiga Blitter, arcade or console graphics units were already baby GPUs.
Comment by nl 1 day ago
That's the same no matter the physical memory system architecture.
Comment by trueno 2 days ago
enhance
> no copies, no serialization, no intermediate buffers
Would it kill people to write their own stuff? Why are we doing this? Out of all the things people immediately cede to AI, they cede their human ability to communicate and convey/share ideas. This timeline is bonkers.
Comment by Aurornis 2 days ago
I’ve wasted so much time looking at interesting repos this year before discovering that one of the main claims was a hallucination, or that when I got to the specific part of the codebase it just had a big note from the LLM that it’s a placeholder until it can figure out how to do the requested thing.
The people who have AI write their articles don’t care if it works or if it’s correct. They’re trying to get jobs and want something quick and interesting that will appeal to a lazy hiring manager. We’re just taking the bait too.
Comment by trueno 2 days ago
I'd build on this: the people who have AI write their articles very likely don't know how their thing works or whether it's correct. High chance they'll stumble when they're expected to speak about whatever it is they're presenting with some authority and demonstration of knowledge. Human to human, not being able to do that obliterates trust. It places the work somewhere near the realm of misinformation, which virtually no one has any interest in consuming.
Good luck to people who want to fluff expertise and present as more capable for job prospects. The world is shit and I know there are more people who need income than there are jobs that provide for our basic human needs, but this level of AI crutching is just going to bode poorly for those who think it will get them where they need to go.
Comment by rvz 2 days ago
"Here is X - it makes Y"
"That's not X, it's Y."
"...no this, no that, no X, no Y."
Another way of telling via code is deducing the experience of the author: did they become an expert in a different language since... yesterday? There will come a time when this is problematic for those who over-rely on AI; they will struggle in on-site interviews with whiteboard tests.
Comment by bensyverson 2 days ago
Comment by jhayward 2 days ago
It doesn't have a great success record.
I personally would rather they exhibited expert skills in using tools, and expressing their design insight as a part of that skillset.
Comment by JSR_FDED 2 days ago
All other things that could be LLM-mediated have no more signal.
Comment by andsoitis 2 days ago
Some ideas to help you: ask the candidate something underspecified and watch what they do first. Do they ask clarifying questions and make their assumptions explicit? After they answer, ask what would change their mind and where the answer breaks down. Pick a topic they know and ask them to explain it to a smart non-engineer. Make them estimate something they can't look up (this forces them to decompose, bound, and calibrate). Once they've proposed a solution, change the constraints to see whether they can adapt or whether they're stuck.
What you want to evaluate is dynamic reasoning, adaptability.
Comment by z0r 2 days ago
Comment by notepad0x90 2 days ago
Humans have been using tools to communicate since prehistory. Language itself is a communication tool invented to supersede body language, grunts, and noises. The thought and idea are theirs, and it was communicated. Would it be that much different if they used a spellchecker extensively to edit their work?
I get why you're annoyed but is it really such a big deal? random people aren't to blame for whatever other annoyances "AI slop" has created.
Comment by trueno 2 days ago
Would it kill anyone at all to add a preamble that is forthcoming about using AI to write something? A chance to say these are my ideas and I've used claude to help me state it eloquently because <english is not my first language / i dont write well / claude said it better than i ever could> etc ? Not doing that, presenting as more capable/knowing than one probably is, is what destroys trust immediately the moment it's sniffed out that AI was used to write something.
It's irresponsible, a self-nerf, and it's annoying. Venn diagram there is basically a circle. We're all familiar with how vibe coding appears to weaken your ability to write code, like skipping the gym and expecting good muscle density. All I'm saying is people shouldn't be skipping the gym for literally communicating with each other because there's gonna be a lot of times in life where you're not gonna be able to whip out chat jippity to continue a real conversation with another person. Ceding that turf means you're willingly trading your ability to deal with real life scenarios with other human beings for short term gain. It's funny how the universe tends to find balance. Yeah, being well read and expressing ideas well is a skill, it takes work.
Comment by porridgeraisin 2 days ago
Why does it matter if it's their thought or not? If you currently care about GPU inference from WebAssembly on Apple Silicon, you can use this article. That's really about it.
Now if you care about GPU inference from Wasm on Apple Silicon and you found problems with this article's content, then great, comment about it. If you say the problem with the content is the usual surface-level slop LLMs belt out, then great, complain about LLMs. But your comment didn't say anything about GPU inference from Wasm on Apple Silicon.
Comment by trueno 1 day ago
Comment by rdedev 2 days ago
That's a pretty utilitarian view of language. How would it feel if everyone spoke and wrote like a PR representative? This is what an article written by an LLM is starting to sound like.
I'm even willing to argue that the way in which you convey your ideas is as important as the idea itself. Like we could all be eating soylent for our daily nutritional requirements but we don't. The taste of the food we eat is important. It's the same with writing for me
Comment by ben-schaaf 2 days ago
Are they? I don't know how much they used AI; the entire article could have been written from a one-sentence prompt, in which case I'd argue the thoughts and ideas are not their own.
This isn't like using a spell checker, it's like using a ghost writer.
Comment by nullsanity 2 days ago
Comment by wmf 2 days ago
Comment by itamos 2 days ago
Comment by pjmlp 2 days ago
Also, these folks should be amazed by 8- and 16-bit games development, or games consoles in general.
Comment by jedisct1 1 day ago
Comment by tancop 1 day ago
Comment by agambrahma 1 day ago
It's less about browsers, and more about server/edge/local-agent runtimes.
Wasm lets you have
- sandboxing (untrusted actor code)
- clean snapshot/restore
- portability of actor across machines
If you don’t need those properties, then yes ... native is obviously the better choice
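To make the snapshot/restore point concrete, a toy sketch (not from the post; a real snapshot also needs globals, tables, and any open host resources, and `memoryBase`/`memorySize` stand in for whatever accessors the runtime actually exposes). The appeal is that an instance's mutable state is, to a first approximation, just its linear memory bytes:

```swift
import Foundation

// Toy snapshot of a Wasm actor: copy out its linear memory, restore it
// later into another (same-sized) instance. Simplified on purpose.
struct ActorSnapshot {
    let linearMemory: Data
}

func snapshot(memoryBase: UnsafeRawPointer, memorySize: Int) -> ActorSnapshot {
    ActorSnapshot(linearMemory: Data(bytes: memoryBase, count: memorySize))
}

func restore(_ snap: ActorSnapshot, into memoryBase: UnsafeMutableRawPointer) {
    snap.linearMemory.withUnsafeBytes { src in
        memoryBase.copyMemory(from: src.baseAddress!, byteCount: src.count)
    }
}
```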
Comment by adamsilvacons 2 days ago
Comment by EthanFrostHI 2 days ago