Nvidia is proposing a beast of a CPU system for Windows PCs
Posted by tosh 3 days ago
Comments
Comment by stego-tech 3 days ago
The reality is even cutting edge games and consumer workloads don’t actually make full use of the PCIe bandwidth of the GPU or the bandwidth of its GDDR memory. Even local AI use cases don’t substantially or meaningfully benefit from faster memory, at least for average consumers.
A unified memory pool does two things:
1) Lets systems optimize utilization based on need, rather than be confined to specific pools
2) Reduce overall memory cost, by letting system builders purchase a single type of memory in bulk instead of having to figure out GDDR vs DDR memory placement (important for SFF/portable machines)
So at a time when memory is expensive, unified pools make more sense. Even when memory becomes cheap and plentiful again, it’s just practical at this point to allocate a larger overall pool instead of managing discrete sets.
The one big drawback is security. A shared memory pool means side-channel attacks against memory from the GPU or CPU could potentially compromise the other as well, meaning memory-safe designs are going to be critical to security going forward (which is good for Rust adherents, I figure).
Comment by AnthonyMouse 3 days ago
The trouble with this is that the different types of memory have different characteristics. Latency for ordinary system memory is actually better than it is for GDDR, because GDDR is optimized for bandwidth. RTX 5090 has 1.8TB/s of memory bandwidth with a 512-bit memory bus. The same bus width for DDR5-9600 would have better latency but only a third of the bandwidth.
CPU workloads are generally bounded by latency and GPU workloads are generally bounded by bandwidth, which is why they use two different types.
> Reduce overall memory cost, by letting system builders purchase a single type of memory in bulk instead of having to figure out GDDR vs DDR memory placement (important for SFF/portable machines)
The trouble with this is cost. In principle you could get the same 1.8TB/s of memory bandwidth as the RTX 5090 has, with the better latency of DDR5, by using DDR5 with a 1536-bit bus. This is indeed what multi-socket servers do, with 768 bits of memory channels in each of two sockets, but now check how much those system boards cost.
But the remaining alternatives are both worse. If you use GDDR for the unified memory then GDDR costs more than DDR and you're going to have significantly worse latency for the CPU. If you use DDR without a 3-4 times wider bus than the already-wide GPU then the GPU gets starved for bandwidth.
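As a rough check on the figures above, here is a minimal sketch of the peak-bandwidth arithmetic (bus width in bytes times transfer rate). The function name is made up for illustration, and the result is a theoretical peak that ignores real-world efficiency.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, transfer_rate_mt_s: float) -> float:
    """Theoretical peak bandwidth in GB/s for a memory bus."""
    bytes_per_transfer = bus_width_bits / 8
    # MT/s * bytes per transfer = MB/s; divide by 1000 to get GB/s.
    return bytes_per_transfer * transfer_rate_mt_s / 1000


RTX_5090_GB_S = 1800  # the 1.8TB/s figure quoted in the comment

ddr5_512 = peak_bandwidth_gb_s(512, 9600)    # same bus width as the 5090
ddr5_1536 = peak_bandwidth_gb_s(1536, 9600)  # three times wider

print(f"512-bit DDR5-9600:  {ddr5_512:.1f} GB/s "
      f"({ddr5_512 / RTX_5090_GB_S:.0%} of the 5090)")
print(f"1536-bit DDR5-9600: {ddr5_1536:.1f} GB/s")
```

Under these assumptions the 512-bit case comes out at roughly a third of the GPU's bandwidth, and the 1536-bit case lands at about the same 1.8TB/s.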
Comment by Melatonic 2 days ago
It also has way better throughput because it's physically surrounding the chip itself and wired in a way that maximises this.
The real problem is interconnect speed and latency. We have made tons of progress elsewhere but AI is exposing that the interconnect in many systems is just not great. Even future PCIE 6.0 is fairly bandwidth constrained compared to 8 channels of DDR memory or the way we solder GDDR next to the chip.
We moved on from AGP and older formats to PCI-E and I think it's time to do that again. And maybe even "slot" based implementations in general for both RAM (system and graphics) and GPUs.
In summary, consumer machines and workstations need to use pin-based modules like LPCAMM RAM. And the interconnect on the motherboard itself needs to be both wider (more bandwidth) and lower latency. This might require moving on from the motherboard being two-dimensional only (a flat board) to something like an L shape to gain more physical board space.
Comment by ssivark 2 days ago
Comment by marcosdumay 2 days ago
Since CPUs are highly optimized, both increasing the latency of the main memory and increasing the size of L3 will probably lead to larger L3 latency.
Comment by trumpdong 2 days ago
Comment by marcosdumay 2 days ago
And yes, an L4 cache can be one way out of that problem. Another way is making the L3 cache lines wider and working the hell out of improving your management algorithm.
It's not a theoretically impossible problem. It's also not something you can solve automatically with a bit more money or some simple decisions. It's possible this is the best architecture available, but it's not certain by any means.
Comment by trumpdong 2 days ago
Comment by saagarjha 2 days ago
Comment by marcosdumay 1 day ago
Comment by Melatonic 2 days ago
Comment by stego-tech 2 days ago
Everything is ultimately a compromise of some sort, and modern Unified Memory feels like one of the better compromises out there given the current plateauing of hardware scaling, the growing costs associated with memory and NAND, and the shifting complexity from hardware (more instruction sets, more accelerators, more cores) to software (more abstraction layers, more machine learning).
Comment by fc417fc802 3 days ago
Transitioning over to wild speculation here, I think that most likely this will be treated as part of an absurdly large L3 (ala 3D V-Cache) or as an additional L4. In either case I expect the latency and power tradeoffs introduced to be tolerated as "good enough" even for the highest end consumer gear. (Actually I wonder if some sort of special case cache would be feasible, with memory addresses flagged by the graphics driver and regular CPU related stuff skipping over it entirely. But by then we've squarely entered the territory of vaguely unhinged rambling on my part.)
Alternatively if the performance caveats are deemed to be important enough to justify the added complexity it wouldn't surprise me to see the HBM treated as an independent memory pool analogous to that of a dGPU. That wouldn't change the current status quo with respect to the GPU APIs but it would significantly ameliorate the memory bandwidth bottleneck for inference workloads and from a software perspective is a drop in replacement. You'd still write the code targeting the dGPU with explicit swapping to RAM but when run on an appropriate APU it would get a massive speedup for free instead of suddenly being starved for bandwidth while also performing unnecessary copy operations.
Comment by maccard 3 days ago
Game dev here. For anyone reading this - it’s not because we’re lazy, it’s because _it’s really hard to do_.
One of the biggest differences between the current generation consoles and the current gen PCs is unified memory.
Comment by stego-tech 3 days ago
A unified pool of memory suddenly makes that both easier and far more flexible, which frees up developer time and bandwidth to focus on other, more important tasks.
Comment by BuyMyBitcoins 3 days ago
Comment by maccard 3 days ago
The problem is that when you need something in the GPU you have to go through RAM first (unless you have DMA, which is a more recent addition). That doesn’t just add latency, it also adds an extra step of cache invalidation, so you have to plan for that from the highest level of gameplay. If you need to prepare for a GPU memory miss _and_ a CPU memory miss as a worst case all the time, it’s very hard to make good use of the bandwidth in the best case.
Comment by keyringlight 3 days ago
I'm not a game developer, but there would also seem to be a link between resource usage by the engine and whatever content the production side is making. For all the commentary about how brilliant the id software engines are, if you examine the levels you pass through they're also very efficient with what they demand out of the engine - it's like an orchestra playing well together, not one instrument that means you can do anything.
Comment by nearbuy 3 days ago
Comment by maccard 3 days ago
That spec is also a throughput measured per second whereas our frame rates are much higher than 1/s. At 60hz, that’s now between 140 and 800 textures a frame. If you miss _one_ you don’t get that back.
A single main character in a game can be 2-5 regular textures, plus all of the extra mapping textures we have these days. Now do landscapes, environments, props, background videos, and it all adds up. 4k textures are pretty universally used. If you look at a tiny object up close we need a higher res texture to be able to show it neatly.
You also have memory pressure - raytracing makes heavy use of VRAM so you have to make the tradeoff of how much do you want to allocate to caching lighting, vs how much you want to keep textures and geo around.
Lastly, as you say, actually keeping up with 360GB/s from the CPU side is tough. If you require any transformation or CPU operations that’s just not going to happen. If you need to pull from disk, even on an NVMe drive reading synchronously, the max throughput is < 10% of that, and that assumes you are actually reading 360GB from disk. If you pause to do anything else, you’ll significantly slow that down. Players also generally don’t like it if we thrash their NVMe disks :)
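To make the per-second versus per-frame point concrete, here is a small sketch that divides a bandwidth spec by the frame rate. The 360GB/s and 1.8TB/s figures are the ones used in this thread; the 120 fps case and the function name are only illustrative assumptions.

```python
def per_frame_budget_gb(bandwidth_gb_s: float, fps: float) -> float:
    """Upper bound on data that can cross the memory bus in one frame."""
    return bandwidth_gb_s / fps


# Every combination of the two bandwidth figures and two frame rates.
budgets = {
    (bw, fps): per_frame_budget_gb(bw, fps)
    for bw in (360, 1800)
    for fps in (60, 120)
}

for (bw, fps), gb in budgets.items():
    print(f"{bw} GB/s at {fps} fps: at most {gb:.1f} GB per frame")
```

This is a best-case ceiling: every read and write in every render pass comes out of the same per-frame budget, and anything missed in one frame is not refunded in the next.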
Comment by nearbuy 2 days ago
Absolutely an RTX 3060 is a more normal gamer GPU than the 5090, but you're also not playing in 4k without DLSS on a 3060. Drop to the most common resolution on Steam (1080p), and turn on DLSS and you've basically cancelled out that 6x factor in bandwidth. Even if the 3060 had more bandwidth, it doesn't have enough processing power for native 4k gaming in typical games. So 360 GB/s is still a lot of bandwidth for the resolution most 3060 gamers are using.
Comment by maccard 2 days ago
DLSS isn’t just a magic on switch for free perfect up scaling. If you rendered at 720p and DLSS’ed up to 1080 it’s still going to look pretty rubbish. It’s always surprising to me just how many people have 1080p monitors though, given we’ve had more than that for two generations of consoles.
And lastly - all the same points still apply about frame rate (which can be more than 60) and memory bandwidth per frame and cache invalidation etc at 360GB/s, as they do at 1.8TB/s.
Comment by nearbuy 1 day ago
That greatly reduces your GPU memory bandwidth though. Sampling a subset of the texture only transfers that subset. Reading from higher mip levels uses less bandwidth. If your textures are high enough resolution to appear sharp at both resolutions (at least one texel per pixel), you need 4x more bandwidth to sample your material textures at 4k screen resolution for the same scene.
More importantly, material texture sampling is not most of your bandwidth to begin with. At 4k, most of your bandwidth is going to your full screen render passes. Especially with deferred rendering.
> DLSS isn’t just a magic on switch for free perfect up scaling. If you rendered at 720p and DLSS’ed up to 1080 it’s still going to look pretty rubbish.
I don't find this true at all. DLSS 4 Balanced looks excellent and renders at less than 720p for 1080p output.
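The resolution arithmetic behind both points can be sketched as follows. The 0.58 per-axis render scale for DLSS Balanced is an assumption (the commonly cited value, not something stated in this thread), and the names are made up for illustration.

```python
def pixel_count(width: int, height: int) -> int:
    return width * height


UHD = (3840, 2160)  # "4k"
FHD = (1920, 1080)  # "1080p"

# With at least one texel per pixel, sampling cost scales with pixel count.
pixel_ratio = pixel_count(*UHD) / pixel_count(*FHD)

BALANCED_SCALE = 0.58  # assumed per-axis render scale for DLSS Balanced
render_w = round(FHD[0] * BALANCED_SCALE)
render_h = round(FHD[1] * BALANCED_SCALE)

print(f"4k has {pixel_ratio:.0f}x the pixels of 1080p")
print(f"Balanced at 1080p output renders about {render_w}x{render_h}")
print(f"Below 720p: {render_h < 720}")
```

If the assumed scale is right, the internal render height is a bit over 600 lines, which is where the "less than 720p" claim comes from.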
Comment by gmueckl 3 days ago
Comment by nearbuy 3 days ago
Plus, DLSS can greatly reduce the bandwidth requirements for 4K gaming.
Comment by gmueckl 3 days ago
Comment by nearbuy 2 days ago
Comment by gmueckl 2 days ago
Let me put it this way: what I care about is how quickly data arrives after a bunch of shader threads request it. Throughput is one way for hardware to reduce that time. The other way is to hide the latency (GPUs do a lot to keep themselves busy while waiting for memory), but those strategies can only do so much.
Lower memory throughput almost always leads to a longer runtime of GPU calls in practice, and thus lower update rates.
Comment by nearbuy 1 day ago
Comment by Stevvo 3 days ago
Comment by maccard 3 days ago
I promise you want our games to look as good as you want them to look.
Comment by Stevvo 2 days ago
Comment by cm2187 3 days ago
Comment by to11mtm 3 days ago
LPCAMM2/SOCAMM2 exist, heck I think Framework is using LPCAMM2 in one of their new laptops.
Heck, I'm willing to bet that a lot of manufacturers would rather go that route than soldered in, if for no other reason than the relative cost of warranty work between the two.
However, people probably need to stop being obsessed with ultrathin laptops for that to happen.
Comment by fc417fc802 3 days ago
I've never been able to understand this. Once we made it down to ~20 mm (which for the record still accommodates dual-stacked SO-DIMMs, a 2.5 inch bay, and a user replaceable battery but not an RJ45 jack) I don't understand what the practical impact of any further reduction is supposed to be. Regardless of how thin you make it the thing will still be a massive rectangle that you can't flex or press on.
Comment by wtallis 3 days ago
There's very wide variation between laptops in how noticeably they'll flex or yield or creak when pressed. Laptops with a build quality that actually feels solid are far from being ubiquitous or even a majority.
Doubling the thickness of my MacBook Air would probably make it regress on that solid feeling, unless the weight was also significantly increased.
And regardless of whether current laptop form factors could accommodate a 2.5" drive, there's no use in doing so. That drive form factor is entirely obsolete for laptops and is just a waste of space and materials, and has been for about a decade.
Comment by fc417fc802 3 days ago
I'm not sure why you seem to think that making something thicker would reduce the stiffness or strength. It's generally the opposite - see the concept of a torsion box. Anyway that wasn't the point. The point was that regardless of how thin you make the thing it will forever remain a cumbersome and delicate item that you have to treat with care when packing so what meaningful positive impact does shaving off those last few mm have? It's never made any sense to me.
Comment by bloqs 3 days ago
Comment by ForOldHack 3 days ago
I would much prefer two SODIMM sockets with the option to go to 32MB shared video memory, or DDR4/DDR5. Give me OPTIONS!
Comment by stego-tech 3 days ago
Unified memory doesn't have to be soldered on or unserviceable. That's a choice Apple made because it fit their product vision, but it's not mandatory in the slightest.
Comment by Melatonic 2 days ago
CPUs don't slot in for a reason
Comment by arka2147483647 3 days ago
So, it does not have to be soldered.
Comment by sroussey 3 days ago
I don't know how linear or sensitive CPU and GPU benchmarks are to such a 20% slowdown, but I don't think Apple wants to pay it. And it looks like the next generation will be even closer to the SOC.
Comment by Melatonic 2 days ago
We're also hitting the limit of DDR5 here (before moving to multiplexed)
I would guess if you had LPCAMM2 located physically around the CPU (one or two on each of the 4 CPU edges) you could also reduce that latency.
Comment by Lplololopo 3 days ago
Comment by bpavuk 3 days ago
Comment by GeekyBear 3 days ago
If you wanted to get sleep right and improve battery life, that was the trade off.
Comment by nottorp 3 days ago
Thought getting sleep right was something that happened before MS decided they need to be able to wake your PC any time they want and not hardware related much.
Comment by GeekyBear 2 days ago
Comment by pjmlp 2 days ago
The vertical integration many associate with Apple was the common approach for most 8- and 16-bit home computers.
Naturally after all these years, many PC vendors want their margins back, and thus the phenomenon of everyone going back to vertical integration, especially in form factors that are ideal for such, like laptops, tablets and phones.
So the option boils down to classical desktops, or being picky on which laptops to buy.
Comment by MBCook 3 days ago
Comment by cm2187 3 days ago
Comment by MBCook 3 days ago
It’s possible if you’re willing to go with much slower RAM than GPUs like, the kind CPUs often use. That’s what integrated graphics laptops have done for a long time, right?
But can you get high end CPU and GPU performance with unified memory and maintain user upgradable memory in a reasonable way? That’s what I don’t know.
Comment by wtallis 3 days ago
LPCAMM and similar solutions exist, but have never been demonstrated running at speeds that match what the leading soldered memory systems are using; there's always been some speed penalty. I'm not sure we've ever seen a system demonstrated using LPCAMM or similar for a 512-bit bus to match Apple's Max tier SoCs, so it's somewhat of an open question whether those solutions can offer upgradability at the high end of the market for unified memory systems.
Comment by AnthonyMouse 3 days ago
LPCAMM2 supports up to 9600MT/s, which appears to be the same speed Apple is using.
> I'm not sure we've ever seen a system demonstrated using LPCAMM or similar for a 512-bit bus
Servers commonly use a 768-bit DDR5 memory bus per socket even without LPCAMM and LPCAMM allows shorter traces than traditional DIMMs. It's basically down to most existing DDR5 system boards/sockets having been designed before anyone was trying to run LLMs on consumer hardware, e.g. AM5 has a 128-bit memory bus and you're not changing that without a new socket. But every memory generation gets a new socket anyway, and the existing Threadripper Pro socket has a 512-bit memory bus as well.
Moreover, making the bus wider is "easy" -- the main problem with it is that it adds cost. Apple's least expensive machines use the same 128-bit memory bus as most PCs and the ones with the 512-bit bus cost as much as Threadripper if not more.
Comment by wtallis 3 days ago
The difference here is in what the standard defines on paper vs what is actually shipping in products and readily available off the shelf. Who's selling a whole system with LPCAMM2 certified for 9600MT/s? Intel's current-gen Panther Lake top of the line laptop chips are rated for 9600MT/s when using soldered LPDDR5x but only 7467MT/s when using LPCAMM2, according to their current datasheet: https://www.intel.com/content/www/us/en/content-details/8721...
That puts the current Intel-with-LPCAMM2 supported memory speed at 1.5 years and counting lag behind Apple's shipping memory speeds. Intel's own shipping memory speed moved past 7467MT/s a few months earlier than even Apple's.
> Servers commonly use a 768-bit DDR5 memory bus per socket even without LPCAMM and LPCAMM allows shorter traces than traditional DIMMs.
> Moreover, making the bus wider is "easy"
Citations needed. Servers aren't anywhere close to 9600MT/s yet; Intel and AMD are at 6400MT/s. The trace length advantages offered by LPCAMM2 don't necessarily mean the traces for the sixth or eighth channel would be short enough for 9600MT/s (which again, is not yet available even in a 128-bit configuration in shipping hardware). Adding more channels to even a LPCAMM2 configuration means adding more trace length, because only two modules can actually be adjacent to the CPU socket. (Maybe you could get to 512-bit with modules on the front and back of the board while maintaining trace lengths short enough to reach meaningfully higher speeds than regular DDR5, but so far nobody is doing that or even talking about it.)
Comment by AnthonyMouse 2 days ago
The 9600MT/s modules are new and will probably be found at some point this year. Framework already sells LPCAMM2 at 8533MT/s with full validation:
https://knowledgebase.frame.work/what-drammemory-is-supporte...
> That puts the current Intel-with-LPCAMM2 supported memory speed at 1.5 years and counting lag behind Apple's shipping memory speeds.
It turns out Apple isn't getting 9600MT/s either. I assumed that soldering would be getting them at least what LPCAMM2 is rated for, but if you actually do the math, they're getting ~8500MT/s for their most expensive systems and ~7500MT/s for the others.
> Servers aren't anywhere close to 9600MT/s yet; Intel and AMD are at 6400MT/s.
Servers use conservative timings. EXPO memory kits above 6400MT/s are available for Threadripper with 8 channels. And again, these are using traditional DIMMs with longer traces rather than CAMM, but they're still managing an extremely wide bus with close to the same performance.
> The trace length advantages offered by LPCAMM2 don't necessarily mean the traces for the sixth or eighth channel would be short enough for 9600MT/s
CAMM modules use a compression fitting to attach the chips to the system board using approximately the same amount of space as the solder pads would for soldered chips. If you get to the point of having so many channels that the chips are in the way of the other chips then the soldered ones have the same problem.
> (which again, is not yet available even in a 128-bit configuration in shipping hardware).
A single LPCAMM2 module is a 128-bit bus. Every system that uses it has at least that.
> Maybe you could get to 512-bit with modules on the front and back of the board while maintaining trace lengths short enough to reach meaningfully higher speeds than regular DDR5, but so far nobody is doing that or even talking about it.
Nobody is really using a bus that wide with soldered memory either though, outside of the couple of Macs that start at ~$3500 and are getting the same speed Framework does with LPCAMM2.
Comment by wtallis 2 days ago
From your link:
> Framework Laptop 13 Pro (Intel® Core™ Ultra Series 3) supports one slot of LPCAMM2 memory up to 96GB at the native 7467 MT/s speed. It is compatible with LPCAMM2 modules with memory speed rated above 7467 MT/s, but the speed will be capped at 7467 MT/s because of the platform limitation.
The modules in question can only theoretically operate at 8533MT/s. Framework has yet to sell a system where the modules actually operate at more than 7467MT/s.
> It turns out Apple isn't getting 9600MT/s either. I assumed that soldering would be getting them at least what LPCAMM2 is rated for, but if you actually do the math, they're getting ~8500MT/s for their most expensive systems and ~7500MT/s for the others.
You're either doing the math wrong, or just plain looking at the wrong systems. Try looking at the M5 generation.
> CAMM modules use a compression fitting to attach the chips to the system board using approximately the same amount of space as the solder pads would for soldered chips. If you get to the point of having so many channels that the chips are in the way of the other chips then the soldered ones have the same problem.
Yes, that's a problem, and Apple has solved it by moving the DRAM on-package. Datacenter GPUs have also solved it that way by putting the DRAM on a silicon interposer to allow even wider bus widths. Soldering standard DRAM packages on the motherboard is not the limit of how memory can be soldered down.
> A single LPCAMM2 module is a 128-bit bus. Every system that uses it has at least that.
Yes, 128 bits at lower speeds. Did you forget that the whole point I'm making here is that the speeds are not the same?
> Nobody is really using a bus that wide with soldered memory either though, outside of the couple of Macs that start at ~$3500 and are getting the same speed Framework does with LPCAMM2.
The Mac Studio with the M3 Ultra is actually running the DRAM at a lower frequency than what Framework and other Intel-based systems could, but more than making up for it in bus width, to provide far more total memory bandwidth than any plausible LPCAMM2-based system that could be built today.
Comment by AnthonyMouse 2 days ago
The M5 generation isn't "1.5 years old" and even those aren't all that speed. The M5 Max with the 32-core GPU is ~7200MT/s, while the one with the 40-core GPU is over $4000.
> Yes, that's a problem, and Apple has solved it by moving the DRAM on-package.
There is no "package" here. Apple's processors are soldered to the logic board, as are Intel's in laptops. The DRAM Apple uses is standard LPDDR5 from the normal OEMs. Have a look at the LPCAMM2 module. It has four standard DRAM chips on the top and a connector on the bottom. DDR5 channels are really 32-bits, so the 128-bit module has four channels, four chips. The module is barely any larger than the chips themselves. It's not saving significant space by soldering them, it's just an alternative means of attaching them to the system board in the same place.
> Yes, 128 bits at lower speeds.
At the same speeds Apple was shipping a few months ago. Apple being the first to ship LPDDR5-9600 when it was that recent doesn't imply that it needs to be soldered, it implies that they're a huge company that can pay for early access to the new thing whether it's soldered or not. 9600MT/s LPCAMM2 modules have already been announced -- it's not a technical problem, it's an "Apple and OpenAI are buying out the fastest DRAM right now" problem.
> The Mac Studio with the M3 Ultra is actually running the DRAM at a lower frequency than what Framework and other Intel-based systems could, but more than making up for it in bus width, to provide far more total memory bandwidth than any plausible LPCAMM2-based system that could be built today.
By this logic the thing to beat it is the 8S Xeon servers from almost a decade ago with 48 channels of DDR4-2666. Or existing 2S servers with 24 channels of DDR5-6400.
Comment by wtallis 1 day ago
Ok, so the problem is you doing the math wrong. Note that the MacBook Pro configuration you're talking about has a DRAM capacity of 36GB, compared to 48+ GB for the ones with all the cores enabled and the full memory bandwidth. That 32-core config isn't running the DRAM slower, it's running with a narrower bus and fewer DRAM chips: https://theapplewiki.com/wiki/MacBook_Pro_(16-inch,_M5_Max)
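The disagreement comes down to which bus width you divide by. A minimal sketch, with loud assumptions: the 614GB/s figure is the M5 Max number quoted elsewhere in this thread, while the 460GB/s figure for the cut-down part and the 384-bit width (guessed from the 36GB vs 48GB capacity ratio) are illustrative values, not confirmed here.

```python
def implied_rate_mt_s(bandwidth_gb_s: float, bus_width_bits: int) -> float:
    """Transfer rate implied by a bandwidth figure and an assumed bus width."""
    bytes_per_transfer = bus_width_bits / 8
    # GB/s * 1000 = MB/s; divide by bytes per transfer to get MT/s.
    return bandwidth_gb_s * 1000 / bytes_per_transfer


full_part = implied_rate_mt_s(614, 512)       # full chip, 512-bit bus
cut_part_as_512 = implied_rate_mt_s(460, 512)  # assumes the bus is unchanged
cut_part_as_384 = implied_rate_mt_s(460, 384)  # assumes a narrower bus

print(f"614 GB/s over 512 bits: ~{full_part:.0f} MT/s")
print(f"460 GB/s over 512 bits: ~{cut_part_as_512:.0f} MT/s")
print(f"460 GB/s over 384 bits: ~{cut_part_as_384:.0f} MT/s")
```

Under these assumptions the same lower bandwidth figure reads as either roughly 7200MT/s or roughly 9600MT/s, depending entirely on whether the bus is taken to be 512 or 384 bits wide.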
> There is no "package" here. Apple's processors are soldered to the logic board, as are Intel's in laptops.
Denying the difference between putting the RAM on-package vs on the motherboard doesn't make that difference stop being real.
> Apple being the first to ship LPDDR5-9600 when it was that recent doesn't imply that it needs to be soldered
Apple wasn't even close to being the first to ship LPDDR5-9600. Android phones using DRAM at that speed started shipping at the end of 2023, and moved on to 10700MT/s starting in 2024. The situation here is not anywhere close to being one of Apple paying a premium to get faster DRAM chips that other laptop manufacturers can't afford. Rather, for most of the past several years, laptop manufacturers (especially on the x86 side) have been unable to buy DRAM chips with a rating slow enough to match what their processors are capable of running at. It's become quite common to see on a Thinkpad spec sheet that eg. the DRAM parts are rated for 7467MT/s but will only operate at 6400MT/s due to processor limitations, then the next year see that the DRAM parts are rated for 8533MT/s but run at 7467MT/s, and so on. LPDDR speed increases have been driven primarily by flagship smartphones, and even the leftover slower-binned parts are faster than what most laptops can handle.
Comment by Melatonic 2 days ago
But for throughput, servers with 12 channels have pretty high theoretical numbers even with slower memory.
Comment by lelanthran 3 days ago
Does it need to be leading, though? Being median is just fine for what high-RAM systems are intended to be used for.
Comment by ForOldHack 3 days ago
"Abdul Jabar, couldn't have made these prices, with a sky hook."
Comment by QQ00 3 days ago
Comment by ForOldHack 3 days ago
Actually the opposite is true. Socketed RAM can be overclocked and have its timings adjusted, while soldered RAM can't. Take two Lenovos: one soldered (Carbon X1) and one T590 with a single slot holding a Crucial 16GB, 260-pin SODIMM, DDR4 PC4-19200. Exact same processor, but the X1 has soldered DDR3 at 532.0 MHz (PC3-1066), while the T590 has DDR4 PC4-19200 at 1200MHz.
Both have a Core i7 8665U... and the T590 is much faster, with socketed ram.
Comment by lmz 3 days ago
Comment by nine_k 3 days ago
Comment by ValentineC 3 days ago
Comment by wtallis 3 days ago
If you look at eg. an Intel laptop chip, you'll see they design and build a memory PHY that can interface with either DDR5 or LPDDR5x. They don't support splitting it to have one controller operating with DDR5 and the other with LPDDR5x, for fairly obvious reasons: more complex hardware, harder for software/operating systems to manage optimally, and not a lot of benefits to drive demand and justify the expenses. The speed difference between LPDDR5x and DDR5 isn't really large enough to use LPDDR5x as an L4 cache; it would be more like two different NUMA nodes, with complications for laptop power management.
If you want somebody to build a chip with more than the usual 128-bit bus and make some of the memory controllers use LPDDR and some DDR5, then you're asking for a significant increase in chip cost due to the extra memory PHYs and pin count. That cost is only justified if almost all products using the bigger chips are going to actually take advantage of the full complement of memory controllers.
Comment by Onavo 3 days ago
What happened to PCIe 8 and CXL?
Comment by to11mtm 3 days ago
PCIe6 is a much larger change than 'just bump up the transfer rate', the encoding changed too (on top of the new code length, it's no longer NRZ,) so everyone needed to design and validate both the new encoding block, negotiation, etc etc.
That said, I'm guessing PCIe7 will be a 'smoother' transition from PCIe6, i.e. we might see 7.0 products in 2027. That will theoretically get you ~240GB/sec on an x16 link, or a little less than the theoretical max of a current Strix Halo. (I'm guessing, however, that PCIe protocol overhead will make the difference larger.)
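For a rough sense of that comparison, here is a sketch of the raw link arithmetic. The per-lane rates (64 GT/s for PCIe 6.0, 128 GT/s for 7.0) and the Strix Halo memory configuration (256-bit LPDDR5X-8000) are assumptions brought in from outside this thread, and the PCIe numbers are raw per-direction figures before any encoding or protocol overhead.

```python
# Raw signalling rate per lane, per direction, in GT/s (assumed values).
PCIE_GT_PER_LANE = {"6.0": 64, "7.0": 128}


def pcie_raw_gb_s(generation: str, lanes: int = 16) -> float:
    """Raw one-direction link bandwidth in GB/s, before overhead."""
    return PCIE_GT_PER_LANE[generation] * lanes / 8


def memory_peak_gb_s(bus_width_bits: int, rate_mt_s: float) -> float:
    """Theoretical peak memory bandwidth in GB/s."""
    return bus_width_bits / 8 * rate_mt_s / 1000


pcie6_x16 = pcie_raw_gb_s("6.0")
pcie7_x16 = pcie_raw_gb_s("7.0")
strix_halo = memory_peak_gb_s(256, 8000)  # assumed memory configuration

print(f"PCIe 6.0 x16 raw: {pcie6_x16:.0f} GB/s per direction")
print(f"PCIe 7.0 x16 raw: {pcie7_x16:.0f} GB/s per direction")
print(f"Strix Halo memory peak: {strix_halo:.0f} GB/s")
```

With these inputs the raw PCIe 7.0 x16 figure and the memory peak come out the same, so any overhead on the link puts usable PCIe throughput somewhat below the memory bus, in line with the ~240GB/sec estimate.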
Comment by tjoff 3 days ago
Most systems barely need more gpu memory than what is required for video, browsing etc.
Just because we found a new usecase doesn't flip that on its head.
Besides, I want to keep doing what I'm doing today. So if I need 128GB today and my local AI needs 128 GB then I'd need 256 GB to keep doing the same work.
The argument rather seems to be that we shouldn't use such expensive memory on the GPU. Which might be true if you only want to do inference on it.
Comment by Joel_Mckay 3 days ago
It is ambitious, and absurd... like all CEOs that eventually go loopy. =3
Comment by jmyeet 3 days ago
The 5090 ($2k MSRP but realistically $3-3.5k) is almost the same as the RTX 6000 Pro (~$10k). Same memory bandwidth (1800GB/s). Slightly different CUDA cores (21k vs 24k). Big difference? VRAM (32GB vs 96GB).
NVidia ultimately doesn't want to upset this segmentation so the RTX Spark will never undermine their other offerings. This is why I think Apple has a real market opportunity if they choose to embrace it.
Comment by Salgat 3 days ago
Comment by zozbot234 3 days ago
They seem to? Intel Arc is the cheapest option by far for a discrete card with 32GB VRAM.
Comment by Auracle 3 days ago
It’s like they both want to rely on market segmentation for VRAM too but fail to realize that it’s their only potential inroad right now.
Comment by zozbot234 3 days ago
Comment by schubidubiduba 2 days ago
Comment by to11mtm 3 days ago
(I still kinda want to get one tho.)
Comment by htrp 3 days ago
Needs 320 GB Vram
Comment by ActorNightly 3 days ago
The biggest advantage with NVIDIA is CUDA.
Comment by overfeed 3 days ago
AMD is selling every MI card it makes, and the market wants more of them.
Comment by ActorNightly 2 days ago
Comment by dahart 3 days ago
Comment by jmyeet 3 days ago
BUT you just can't compete with NVidia performance for LLM workloads (mostly inference) for two reasons:
1. The memory bandwidth just can't compete with a 5090 (1800GB/s). The best current Mac is ~900GB/s. That directly caps tokens/sec and might be manageable but there's another problem; and
2. The raw FLOPS just can't compete with even a 5090. It probably needs to natively support FP4/FP8 to at least maintain a number format parity with NVidia. But beside that, NVidia just has more raw FLOPS.
According to Google, an M5 Max does ~70 FP16 TFLOPS while a 5090 does 380. If Apple can close that gap to at least be competitive and also hold larger models in shared VRAM, that would be a competitive advantage and it would directly attack NVidia's market segmentation.
The Mac Studio last came out March last year. So we may get an update in Q3. Many are pinning their hopes on this. But it might not happen until next year. When it was released the M4 was the state of the art and it came with either the M4 Max or M3 Ultra (which, as I understand it, is basically 2 M3s stuck together, kind of). What people are hoping for is an M5 Ultra with >1000GB/s of memory bandwidth, ideally 200+ FP16 TFLOPS and hopefully FP4/FP8 support.
You can chain Mac Studios together into a cluster with TB5 too.
But it's reasonably likely that the next Mac Studio will be only incrementally better than the last generation.
Comment by JohnBooty 3 days ago
Quick background: doing AI inference requires three things. Lots of memory, lots of memory bandwidth, and of course plenty of compute that has access to that memory.
Quick reference: nVidia 5090 has 1,792 GB/sec bandwidth. 3090 gets about 1000 GB/sec. DGX Spark and AMD 395 whatever get about 275 GB/sec.
Apple M1 Max gets 400GB/sec, M5 Max gets 614GB/sec. Ultra variants get 2x that bandwidth, base variants get 1/2 that bandwidth. However... their compute is rather weak.
Right now, Apple's offerings are juuuuuust fast enough to run dense 27B models at usable speeds at like, 10% of the performance/watt of nVidia. They're world-leading general purpose CPUs but not killer GPUs.
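As a rough sketch of why those bandwidth figures matter: for a dense model every weight has to be read once per generated token, so decode speed can't exceed bandwidth divided by model size in bytes. The 1-byte-per-parameter (8-bit) quantization below is an assumption for illustration, and the estimate ignores KV cache traffic, compute limits, and whether the model fits in memory at all.

```python
def decode_ceiling_tokens_per_sec(bandwidth_gb_s, params_billions, bytes_per_param=1.0):
    """Upper bound on dense-model decode speed when memory bandwidth is the only limit."""
    # All weights are read once per generated token.
    model_size_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / model_size_gb


# Bandwidth figures as quoted in this thread (GB/s).
BANDWIDTH_GB_S = {
    "RTX 5090": 1792,
    "M5 Max": 614,
    "DGX Spark": 275,
}

if __name__ == "__main__":
    for name, bandwidth in BANDWIDTH_GB_S.items():
        # 27B dense model at an assumed 1 byte per parameter.
        ceiling = decode_ceiling_tokens_per_sec(bandwidth, 27)
        print(f"{name}: at most {ceiling:.1f} tokens/sec")
```

Real-world numbers land below these ceilings; the point is only that halving the bandwidth halves the best case.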
By all accounts, these Windows PCs nVidia is touting seem to have DGX Spark like performance, which is less than impressive. Same with the upcoming AMD AI-oriented consumer stuff.
The other context here is that running your own AI at home is just starting to become feasible in terms of open model availability and the ability to run it at usable speeds. Many are interested in it for reasons of privacy, security, and cost certainty vs. buying tokens.
> Since Apple already sells unified memory systems, what is the market opportunity you envision?
nVidia and AMD can't make their consumer offerings too good at AI, because that risks interfering with their higher-margin data center sales. (And, let's face it: even if nVidia did release a 6090 with 64-128GB of memory for an affordable price, consumers wouldn't get their hands on them anyway because people would just start filling data centers with them.)
So.
Now you see Apple's opportunity, right? No data center sales to interfere with. No relationship with nVidia or AMD to worry about.
They could choose to make an absolute beast of a home AI machine. The M5 Ultra, if announced, might be that. It's admittedly a niche market, but people are already buying 64GB+ Macs faster than Apple can make them and they're fetching high prices on the used market as well.
The only real questions are if this market is even something Apple would find time to care about, and if they could secure enough DRAM to make a go at it. They are enormous obviously but they're feeling the RAM pinch just like everybody.
Comment by zozbot234 3 days ago
Comment by robotresearcher 3 days ago
Comment by zozbot234 3 days ago
Comment by robotresearcher 3 days ago
Comment by zozbot234 3 days ago
Comment by robotresearcher 3 days ago
If there's an M5 Ultra it'll be interesting to see what they've optimized it for.
Comment by MBCook 3 days ago
Even if a Mac isn’t the fastest in raw numbers it may be faster if it can load the whole model in its ram (went up to 512 GB before shortages) than a couple 32 GB cards could with the data having to be constantly loaded over PCI-E. Because unified memory means the Apple GPUs can access all 512 GB at full speed.
My understanding is this is the advantage that’s pushing huge Mac Studio demand. Because it was the only way to give GPUs so much memory at price points anywhere near.
Yeah you can do way better once you’re in the 5 digits. But below that Apple had a specific advantage for some.
Comment by JohnBooty 3 days ago
Yes, a Mac with 128GB+ will let you load some pretty big models.
However, you're still not going to be able to run them at usable speeds. Here are some M5 Max benchmarks on a Qwen 27B model w/ 290K context.... 12 tokens/sec output.
https://www.reddit.com/r/oMLX/comments/1swztoh/m5_max_128gb_...
And that's a 27B model. So yes, a M5 Max 128GB will let you load some pretty big models - can probably fit 120B in there with room left over for context. But the M5 Max still doesn't have the compute to make it practical, at least from an interactive usage standpoint - 120B dense model is going to be like an order of magnitude slower than 27B. You have to understand the computation going on here. LLMs are basically a huge many-to-many operation, and those operations themselves are pretty heavy.
So back to my previous post... you need three things. You need fast memory, you need a lot of it, and you need GPU compute with direct access to that fast memory. The M5 Max has like, 1.5 of the 3.
The M5 Ultra (if it ever exists) could kinda hit all 3, although actually getting your hands on one will be quite the lottery ticket.
My understanding is this is the advantage that’s pushing huge Mac Studio demand.
This is true, but also, people who made this investment found that they're still not very usable for those HUGE models. Don't take my word for it though. Lots of benchmarks out there. r/localllama is pretty active too.
Comment by zozbot234 3 days ago
Comment by zozbot234 3 days ago
Comment by Melatonic 2 days ago
Comment by woodson 3 days ago
These days, more like >$4.1K (at least in the US).
Comment by simonebrunozzi 3 days ago
Comment by Nevermark 3 days ago
Increase RDMA cross-bar linking from 4x to 8x = a lotta ports, a switch, or a stacking interface.
Regular RAM size/speed scaling: 512GB -> 1TB Mac Studios. Wider RAM and RDMA paths * clocks.
Given the low power envelope of today's Mac Studios, and bandwidth limits, lots of room to scale up, if Apple chooses. My fantasy: 2x cores, 2x RAM sizes, 2x RDMA devices, 2-4x RAM & RDMA bandwidth.
Comment by david-gpu 3 days ago
Comment by jayd16 3 days ago
I'm honestly a little confused by what you mean here. Why would we want to maximize those things? Games are about consistent output under the frame deadline, not full saturation of the hardware.
Why would anyone try to saturate a 5090 with their game? The addressable market is tiny and you'd have to hope their full spec runs as well as or better than your test rig or they'll still not hit framerate.
Comment by simonbw 3 days ago
Comment by Rohansi 3 days ago
You're also likely not going to maximize all of bandwidth, compute, etc. because one of them will likely be your bottleneck. And it might be different depending on the GPU, too.
Comment by rustystump 3 days ago
Comment by Rohansi 2 days ago
Comment by jayd16 2 days ago
Certainly not more from main memory, and maybe not more from the vram either depending on how the pipeline goes.
It's not a linear slider.
Comment by Retr0id 3 days ago
Comment by bobmcnamara 3 days ago
Comment by Retr0id 3 days ago
Comment by RiverCrochet 3 days ago
Comment by seemaze 3 days ago
The question, I suppose, is the ultimate shape of knowledge compression and bandwidth optimization at which we arrive.
Comment by canyp 3 days ago
More details: https://rocm.docs.amd.com/en/docs-7.2.0/how-to/system-optimi...
Comment by seemaze 3 days ago
Comment by electroglyph 3 days ago
Comment by Salgat 3 days ago
Comment by zozbot234 3 days ago
Comment by Melatonic 2 days ago
Comment by ForOldHack 3 days ago
Comment by pbalau 3 days ago
Shared memory has existed since the first CPU with an embedded GPU came to market, and you could set in the BIOS how much memory goes to which component.
I do have an opinion about how unified memory could be different, but I want a proper explanation.
Comment by saltcured 3 days ago
In unified memory, all the memory is host memory and data can go from program to GPU with zero copy movements. The addresses of buffers can be shared via appropriate MMU translation support, so that the application and graphics subsystem are communicating effectively through the basic RAM cache coherency protocols over the same buffers.
Edit to add: Aside from the zero copy transfer potential, it also means dynamic allocation strategies can shift the balance between host and graphics allocations on the fly. Individual image and message buffers can be allocated on the fly instead of setting a static split between the two worlds.
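A CPU-only analogy for the zero-copy point, not a GPU API: the sketch below contrasts handing a consumer its own copy of a buffer (the discrete-memory style) with handing it a view onto the same storage (the unified style). The names `host_buffer`, `copied`, and `shared` are made up for illustration.

```python
# The producer's buffer, standing in for data the application prepared.
host_buffer = bytearray(b"frame-0")

# Discrete-memory style: the consumer receives a separate copy of the bytes.
copied = bytes(host_buffer)

# Unified-memory style: the consumer receives a view onto the same storage.
shared = memoryview(host_buffer)

# The producer updates the last byte in place.
host_buffer[6] = ord("1")

# The copy is stale; the shared view sees the update with no transfer step.
print(copied)         # the old contents
print(bytes(shared))  # the updated contents
```

The copy has to be refreshed explicitly after every update, which is the work a unified pool avoids.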
Comment by johnny22 3 days ago
Comment by stego-tech 3 days ago
Comment by pbalau 3 days ago
Comment by surajrmal 3 days ago
Comment by ImprobableTruth 3 days ago
Comment by cthalupa 3 days ago
Comment by Rohansi 3 days ago
Comment by MBCook 3 days ago
Unified memory is what Apple is doing, other phones do, and many low end built in GPUs have done in PCs for ages. There is only one physical memory pool. Both the CPU and GPU can access it at full speed.
This means no copying between pools of memory. No speed penalty accessing the CPU memory from GPU or vice versa. If the GPU only needs 2 GB to draw the desktop it only uses 2 GB of the pool. Or it can use 45 GB if it needs it and the CPU doesn’t. But all memory has to be the same speed, and that ain’t cheap given how fast GPUs like things. I don’t know if expandable memory is possible, and since they use the same bus, they compete for bandwidth. Seems theoretically easier to program for to me.
The opposite is what’s been common in graphics cards since the 2D era. CPU and GPU have their own memory and can talk over PCI/AGP/PCI-E. This is what I think they mean by shared memory, if it’s not what’s the point in touting unified?
In this model if the GPU uses 2 GB of its 12 GB total, the other 10 isn’t available to the OS at full speed and I’m not aware of any operating systems that would use it for programs/cache by default. If the GPU needs 45 GB… too bad. You have to page things in and out of GPU memory over the much slower system bus. Starting a game means loading assets into main memory then transferring them to the GPU (newer tech can accelerate this). But the CPU can have slower memory than the GPU, saving money. Memory expansion on the CPU side is easy. And the CPU saturating its memory bus has no effect on the speed of the GPU memory bus because it’s physically separate. More complicated memory model but it’s the one everyone is used to.
Which is better is a matter of opinion and workload needs.
Comment by Rohansi 3 days ago
> I don’t know if expandable memory is possible
It technically is. These new systems (mostly) get their high bandwidth by using more channels (wider bus) of normal RAM modules. A system that has LPCAMM2 sockets should allow using the same LPDDR5X memory but you'd need a socket per two channels. A typical PC only supports two channels so having four (two sockets) would double the bandwidth.
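The channel arithmetic can be sketched as follows: theoretical peak bandwidth is the bus width in bytes times the transfer rate. The 64-bits-per-channel figure and the DDR5-5600 speed are assumptions picked for illustration, not specs of any particular machine.

```python
def peak_bandwidth_gb_s(bus_width_bits, mega_transfers_per_sec):
    """Theoretical peak bandwidth: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    # MT/s times bytes gives MB/s; divide by 1000 for GB/s.
    return bytes_per_transfer * mega_transfers_per_sec / 1000


if __name__ == "__main__":
    # Assumed dual-channel desktop: 2 x 64-bit channels at 5600 MT/s.
    print(peak_bandwidth_gb_s(128, 5600))
    # Four channels at the same speed: twice the width, twice the bandwidth.
    print(peak_bandwidth_gb_s(256, 5600))
    # A 512-bit bus at 9600 MT/s.
    print(peak_bandwidth_gb_s(512, 9600))
```

The last case works out to 614.4 GB/s, the same as the M5 Max figure quoted upthread, though the bus width and speed used here are assumptions.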
Comment by MBCook 3 days ago
Comment by Gareth321 3 days ago
Comment by Lplololopo 3 days ago
What do you mean by this? Memory bandwidth is fundamental to the speed of a local AI model
Comment by nalekberov 3 days ago
However, I couldn’t care less about faster CPU when:
1. It limits my ability to upgrade my system
2. Windows gets increasingly bloated and slower
Comment by merb 3 days ago
Comment by NikolaNovak 3 days ago
Comment by GTP 3 days ago
Comment by wren6991 3 days ago
I'm not sure what you mean by this. Memory bandwidth is the main bottleneck for single-user decode. The bottleneck is actually more severe for end-user inference than cloud inference, because end users don't have the option to increase arithmetic intensity by computing tokens for multiple clients in the same pass.
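A minimal roofline-style sketch of that argument, under simplifying assumptions: each weight is read once per decode step and used for one multiply-add (2 FLOPs) per sequence in the batch, and attention/KV cache traffic is ignored. The hardware numbers are the RTX 5090 figures quoted elsewhere in this thread and are not verified here.

```python
def arithmetic_intensity(batch_size, bytes_per_weight):
    """FLOPs performed per byte of weights read during one decode step."""
    # One multiply-add (2 FLOPs) per weight for every sequence in the batch.
    return 2.0 * batch_size / bytes_per_weight


def machine_balance(peak_flops, bandwidth_bytes_per_sec):
    """FLOPs the hardware can execute in the time it takes to read one byte."""
    return peak_flops / bandwidth_bytes_per_sec


def is_bandwidth_bound(batch_size, bytes_per_weight, peak_flops, bandwidth_bytes_per_sec):
    """True when the workload does fewer FLOPs per byte than the hardware can absorb."""
    intensity = arithmetic_intensity(batch_size, bytes_per_weight)
    return intensity < machine_balance(peak_flops, bandwidth_bytes_per_sec)


if __name__ == "__main__":
    # Figures quoted in the thread: ~380 FP16 TFLOPS, ~1792 GB/s.
    peak, bandwidth = 380e12, 1792e9
    for batch in (1, 32, 256):
        # FP16 weights: 2 bytes each.
        print(batch, is_bandwidth_bound(batch, 2, peak, bandwidth))
```

With these numbers a single user (batch size 1) sits far below the balance point, while a server batching a couple hundred requests crosses into compute-bound territory.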
One thing we've learned from Apple is the viability of spamming more LPDDR5X channels (up to 1024-bit total bus width on M3U) as a means of achieving high bandwidth while keeping the cost/capacity reasonable.
Comment by jorvi 3 days ago
GDDR tries to push out as much bandwidth as possible, because that really matters for (traditional) GPU workloads. A constant but insignificant (= correctable) error rate is considered completely fine for GDDR, because that sacrifice allows the memory to be pushed much farther.
Meanwhile most (traditional) SDRAM workloads don't give a hoot about bandwidth but really care about latency. And ideally you want no errors, hence ECC RAM being so venerated.
If you unify memory, you're gonna have to choose to sacrifice one of those workloads or go suboptimal for both.
Weirdly enough this mostly matters for non-gaming workloads. The Apple M-series are absolute monsters in gaming, completely crushing the RTX XX90 editions in performance-per-watt, but as soon as memory bandwidth becomes paramount the M-series falls heavily behind.
Comment by aabdi 3 days ago
Comment by cthalupa 3 days ago
My M5 Max 128GB MBP decodes faster than one of my Sparks, but the Spark's prefill is so much faster it can often answer the same query before the Mac's prefill is finished. If you have large prompts, low cacheability, etc., a Spark might be a very good option.
Not to mention you can get two Sparks and the MBP will be 85%+ of the cost at half the RAM.
I'm kind of tempted to pick one up. Leave running big models to my dual dgx setup, and all the misc. random stuff on an rtx.
Comment by zozbot234 3 days ago
Comment by aabdi 3 days ago
Seems niche to be both uncacheable and long context?
Comment by cthalupa 2 days ago
Anywhere where you might have a large backlog of data to work with can end up in this sort of situation.
Comment by Izikiel43 3 days ago
The ps4 was the prime example of this, and how it could run so many great games.
Comment by pjjpo 3 days ago
Comment by Asmod4n 3 days ago
Comment by stego-tech 3 days ago
Most consumers will never really care about, let alone see, the difference in PCIe or memory bandwidth impacts from such a shift to unified memory pools. We might (being, at least in my case, a huge nerd), but I’m increasingly of the opinion that if modern blockbuster games are built for upscaling/reconstruction anyhow, then suddenly such sacrifices to performance seem acceptable relative to the gains in efficiency.
Comment by jayd16 3 days ago
No copy unified memory will help with that but you do pay the read speed costs.
Comment by BoredPositron 3 days ago
Comment by up2isomorphism 3 days ago
Comment by testing22321 3 days ago
M1 knocking from 2020.
Game changed, past tense, six years ago. This is catch-up.
Comment by jandrese 3 days ago
Comment by zdw 3 days ago
Most other SGIs had single or low double-digit megabytes of texture memory, whereas the O2 could host one gigabyte of unified memory and use a huge chunk of that for textures.
Comment by wmf 3 days ago
Comment by JMiao 3 days ago
Comment by p_l 3 days ago
That was because unlike other GPUs at the time, O2's didn't have dedicated memory but shared the memory with CPU - way slower, but zero copies and bigger.
Arguably early home computers and workstations also used "unified memory" :D
Comment by Rohansi 3 days ago
Comment by throwaway27448 3 days ago
Comment by zokier 3 days ago
Comment by throwaway27448 3 days ago
Comment by Rohansi 3 days ago
Comment by throwaway27448 3 days ago
Sorry, I meant before the M1 came out. And you and I both know that "unified memory" doesn't refer to allocating ram to the gpu for zero-swap sharing.
Comment by Rohansi 1 day ago
Comment by Rohansi 3 days ago
Vega series is 2017.
https://rocm.docs.amd.com/projects/HIP/en/docs-6.3.0/how-to/...
Comment by p_l 3 days ago
Comment by throwaway27448 3 days ago
Comment by ac29 3 days ago
Comment by throwaway27448 2 days ago
Comment by p_l 2 days ago
Comment by Rohansi 2 days ago
Comment by Rohansi 3 days ago
I don't think the M1 specifically focused on inference. Their goal was to replace Intel/AMD/Nvidia with their own chips, and since the previous Macs shipped discrete GPUs, they had to match or beat those so they don't ship something slower.
Comment by bombcar 3 days ago
Comment by zokier 3 days ago
Comment by supertroop 3 days ago
Funny that it is getting credit only now.
Comment by p_l 3 days ago
O2 was popular in systems where large textures or dynamically generated textures (like mapping external video input to a texture) were important.
Comment by vlovich123 3 days ago
As a Rust adherent, please do not put words in our mouths or set up unrealistic expectations for other people by linking together concepts at a very shallow level.
Language level memory safety has no answer for hardware security flaws which is what side channel attacks are. No programming language can provide memory privacy if another chip in your machine can read your memory. Just like no programming language can protect your application from a kernel vulnerability of the kernel it’s running on.
Comment by stego-tech 3 days ago
Comment by b112 3 days ago
Comment by infecto 3 days ago
I don't know who will be the winner but with some of the recent releases from gemma it seems more probable that you may run some models locally if only from a cost perspective, not even considering business security. Not sure how this type of architecture would make for good gaming though, puts into question the whole statement.
"Ranked in the top 2% of scientists globally (Stanford/Elsevier 2025) and among GitHub's top 1000 developers" - side note but this guy puts this everywhere, gives me probably the inverse of what he is marketing for.
Comment by root-parent 3 days ago
This is the 2026 edition of Ken Olsen: "There is no reason anyone would want a computer in their home"
Comment by throw0101a 3 days ago
Digging into this:
> In conclusion, there is evidence that Ken Olsen did doubt the need for computers in the home, but the evidence is based primarily on the testimony of David Ahl who was perturbed when the personal computer project he championed at DEC was not supported by Olsen in 1974.
> Olsen’s resistance may have been similar to that expressed by another DEC executive, Gordon Bell. In 1980 Bell thought home terminals would act as gateways to remote computers which would provide appropriate services.
* https://quoteinvestigator.com/2017/09/14/home-computer/
It was supposedly said in 1977: most computers at that time were not small, and so it would not be surprising that people would not expect the general public to desire a large, power-hungry, noisy apparatus in their house.
Comment by wccrawford 3 days ago
And, like the overly large machines of 1977, models are getting faster, leaner, and better. It's happening a lot quicker, though.
Comment by Silhouette 3 days ago
Comment by api 3 days ago
You already can if you’re willing to spend many thousands of dollars on a beast of a machine. I’m talking about middle tier desktops and laptops here. Maybe eventually even phones.
The only way hosted stays strongly competitive in that world is if they can keep pushing the frontier or by playing the classic social media and SaaS games of network effect building and integrations.
Many people might still use hosted, of course, but what I really mean is that their multiples won’t be justified and they will have little to no moat. AI will become commoditized, like a sophisticated next generation form of an encyclopedia with search.
Comment by throw0101a 2 days ago
Just because you can do more and more things at home (thanks Moore and Dennard), doesn't preclude needing things also done remotely. The number of at-home systems seems to have fed a growing number of remote systems (especially once always-on connectivity became ubiquitous).
It's basically the angle Apple is going for: do as much locally (for the sake of privacy), and then offload when it becomes "too much".
Comment by Silhouette 1 day ago
Comment by kristov 3 days ago
Comment by supermatt 3 days ago
Comment by parineum 3 days ago
People take these quotes out of context all the time. Said in a business context, there was no need, at that time, for someone to have a personal computer.
There's no business justification in 1977 for a personal computer department at a business. It's similar to the gates quote about RAM (I think it was 64KB?).
These statements aren't meant to be forever quotes. They're business plan quotes.
Comment by michaelcampbell 3 days ago
640, and Bill Gates said he either never said that, or at least never remembered having said it. I think there is no evidence anywhere that he did.
https://www.computerworld.com/article/1563853/the-640k-quote...
Comment by Shorel 3 days ago
Comment by glimshe 3 days ago
Comment by shermantanktop 3 days ago
The early popularity of Minitel, the continued popularity of ssh/tmux, and the web browser itself indicate that bespoke client applications are not the only way. He wasn’t directionally wrong.
Comment by wslh 3 days ago
Comment by dakolli 3 days ago
Comment by Gigachad 3 days ago
Even for a lot of LLM type tasks, 128gb is likely more than enough to control a lot of PC configuration and automation with natural language.
Comment by joering2 3 days ago
Comment by shermantanktop 3 days ago
Nobody ever said that, at least not as an assertion or prediction. The actual instances of similar language are from multiple people describing their earlier thoughts before they learned it wasn’t true.
Comment by throw1234567891 3 days ago
Comment by DonHopkins 3 days ago
Comment by fg137 3 days ago
Comment by AaronAPU 3 days ago
Local models aren’t deterministically equivalent in capabilities to foundation models. Home computers are Turing complete, just like a mainframe. They are just slower. Often not slower enough to matter.
Comment by sandworm101 3 days ago
Comment by Pxtl 3 days ago
You could run a pretty good home server on $50 of gear and yet we never saw any real adoption of OwnCloud/NextCloud style products as an alternative to Google Drive/Photos or Apple Cloud.
Why should LLM/Transformers be any different? Especially when you need a proper expensive GPU to run them instead of a Raspberry Pi?
Comment by thewebguyd 3 days ago
On-device AI is going to be important, I think. It doesn't have to take the form of a chatbot UI to be useful.
Comment by com2kid 3 days ago
Comment by robotresearcher 3 days ago
I think there’s a sweet spot currently with munging your data blindly on the server so that your client device battery still lasts all day.
Meanwhile Apple and others push on with making client side models more efficient so that eventually the server costs and complexities go away.
Comment by fg137 2 days ago
If asked to choose between photo editing done within 3s using cloud provider vs an average of 30s using local compute, most consumers will choose the former without hesitation.
Most users' usage is also going to fall nicely in the free tier of a typical freemium pricing model, like ChatGPT today.
People who talk endlessly about local inference have no idea about user workflows and usability.
Comment by dominotw 3 days ago
Comment by wolvesechoes 2 days ago
Comment by parineum 3 days ago
Maybe if you ask them that question, but if you show them two products, they'll definitely prefer the faster one. 30 seconds is a long time to watch a progress bar.
Comment by spwa4 3 days ago
People definitely aren't going to accept more expensive + slower ...
Comment by sandworm101 3 days ago
Comment by parineum 3 days ago
You don't think the commercials for Google's AI photo features are going to have an impact on Apple users if their phones can only do a worse version of that feature and it takes longer?
Comment by jb1991 3 days ago
Comment by flatline 3 days ago
Comment by smcleod 3 days ago
Comment by zozbot234 3 days ago
Very significant improvements may be viable for unattended inference via large-scale batches, which can reuse sparse experts and thereby mask some of the latency involved - this is quite unique to DeepSeek, again due to its efficient KV cache.
Comment by greenavocado 3 days ago
Comment by epolanski 3 days ago
2. Qwen is much more demanding and borderline unusable on consumer hardware because it's a dense model. The 27B parameters are active all the time for each token. It's not a MoE architecture where a router activates only some of them.
3. Qwen doesn't like quantization at all.
Comment by kgeist 3 days ago
Settings: RTX 5090, 5-bit weights (Unsloth), FP8 KV cache.
Last time I tried running large MoEs on this PC, they had inferior quality at 2-3 bits compared to much smaller dense models at 5-6 bits, and were slower anyway.
Comment by zozbot234 3 days ago
Comment by kgeist 3 days ago
Comment by ColonelPhantom 3 days ago
Qwen 27B is also small enough to completely fit in a high-end consumer or mid-end pro GPU, like an RTX 5090 or Radeon PRO R9700. I found results claiming 30 tokens per second generation for 27B(-Q4_K_XL) on an R9700. I doubt you get more than 5 tokens per second doing SSD MoE streaming.
Even for relatively short contexts, I honestly already find the ~30B class MoE models to be only borderline acceptable in terms of speed on my laptop (Ryzen 7 7840U, 64 GB LPDDR5-6400), though I use Gemma 4 26B-A4B more than Qwen3.6 35B-A3B.
Comment by zozbot234 3 days ago
If you have reasonable amounts of RAM to cache the most likely experts, that's not true at all. Qwen 27B is marginally faster on a nearly empty context, then falls behind as context length increases due to the different attention mechanisms. Prefill for Qwen is much faster, but you're still comparing vastly different model sizes and capabilities. DeepSeek Flash is the best deal overall.
> completely fit in a high-end consumer or mid-end pro GPU
Or you could fit the dense portion of a much more capable model and still take advantage of that hardware.
Comment by ColonelPhantom 3 days ago
Is that how MoEs work? I thought that an important constraint for MoEs is that experts need to be uniformly used to make sure they can be used effectively. If there is a 'common subset', that, if anything, sounds like a symptom of undertraining (i.e. the same trick will not work as well for Deepseek V4.1).
Also, even if your MoE hitrate is 90%, you still spend half your time waiting for the SSD, giving similar total speed to a 27B model!
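The 90% hit-rate point can be checked with a small sketch: loading time is proportional to bytes divided by bandwidth, so a small share of misses on a slow device can still dominate. The RAM and SSD bandwidths below are hypothetical round numbers, and the model assumes SSD reads are not overlapped with anything else.

```python
def ssd_time_fraction(hit_rate, ram_gb_s, ssd_gb_s):
    """Fraction of expert-loading time spent waiting on SSD reads.

    hit_rate is the share of expert bytes served from RAM; the rest
    comes from the SSD. Time for each share is bytes / bandwidth.
    """
    ram_time = hit_rate / ram_gb_s
    ssd_time = (1.0 - hit_rate) / ssd_gb_s
    return ssd_time / (ram_time + ssd_time)


if __name__ == "__main__":
    # Hypothetical: 90 GB/s RAM, 10 GB/s SSD, 90% of expert bytes cached.
    print(f"{ssd_time_fraction(0.9, 90, 10):.0%} of loading time is SSD wait")
    # Raising the hit rate to 99% shrinks the SSD share considerably.
    print(f"{ssd_time_fraction(0.99, 90, 10):.0%} of loading time is SSD wait")
```

Under these assumptions the "half your time" figure holds whenever RAM is about nine times faster than the SSD; a faster SSD or a higher hit rate shifts it.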
Finally, it looks like Deepseek V4 is pretty much only runnable with antirez's ds4, and SSD streaming only works with Metal; but I would like to try what you say with llama.cpp which uses mmap to also potentially do SSD streaming. (I can maybe try the large Qwen3.5 MoEs?)
> as context length increases
What kind of context length do you consider reasonable, though? From what I know, all models (even frontier ones) start degrading once you pass a few hundred thousand tokens. So realistically, limiting context size might even improve quality, especially if you use token-efficient harnesses.
> Or you could fit the dense portion of a much more capable model and still take advantage of that hardware.
Your point about consumer hardware was that it would be "borderline unusable" when running Qwen 3.6 27B. However, you need much less hardware to run a 27B than DSv4 Flash. In addition, you can do the same 'trick' with low-end GPUs and small MoEs: my desktop with 32 GB DDR4-3200 and an RTX 2070 8GB can run the ~30B class MoEs at 20-30 tokens per second and similar speeds to my laptop.
Comment by zozbot234 3 days ago
For any given workload/session? Empirically, yes, that's what has been found across different models. There's quite a bit of predictability that makes caching helpful.
> Also, even if your MoE hitrate is 90%, you still spend half your time waiting for the SSD, giving similar total speed to a 27B model!
There are ways of masking some of that latency, though it requires some architecture-specific cleverness which is less directly applicable to a generic engine like llama.cpp.
> Finally, it looks like Deepseek V4 is pretty much only runnable with antirez's ds4, and SSD streaming only works with Metal
The llama.cpp folks are working on adding support, and the ds4 project is working on CUDA support for streaming inference, targeting the DGX Spark.
> From what I know, all models (even frontier ones) start degrading once you pass a few hundred thousand tokens.
DeepSeek V4 seems to do quite well on recall tasks even with large context. That's one plausible benefit of its compressed attention mechanism, compared to earlier models. Some degradation will likely still be there, but it's not necessarily obvious.
As for why people are calling Qwen 27B "borderline unusable", that may have to do with it being a dense model, which makes for an increased compute intensity and pushes users towards discrete GPU platforms, since those tend to have the most compute overall as far as consumer hardware is concerned. I might agree that Qwen 27B is quite ideally tailored towards these platforms, but that does come with some limitations.
Comment by trollbridge 3 days ago
Comment by Pxtl 3 days ago
But yeah, the Qwen line is pretty impressive on commodity hardware.
Comment by derefr 3 days ago
To me, LLMs are for asking research questions + exploring design spaces + pointing at codebases to investigate bugs. And those all benefit from the model being as "smart" (in terms of both fluid intelligence and burned-in knowledge) as possible.
I'm guessing there exist problems where "intelligence past a certain point" doesn't matter, so these medium-sized models can match the performance of the bigger models. But what problems might those be?
Comment by Pxtl 3 days ago
"Go add a gh action to compile and deploy this thing and run its tests" is one I've found it's good at. Yes I know how to make a gh pipeline but it's always a hassle to remember what goes where.
Cranking out unit tests is okay. It's good at summarizing things so it's not half bad at writing jsdoc/xmldoc comments.
Comment by epolanski 3 days ago
Comment by unmole 3 days ago
I have a hard time believing running a model on a laptop will be cheaper than running it in a datacenter. Why wouldn't economies of scale apply here as with every other computation?
Comment by TylerE 3 days ago
Comment by dofm 3 days ago
Comment by TylerE 3 days ago
Comment by wazdra 3 days ago
Comment by hungryhobbit 3 days ago
Local may or may not be cheaper than remote now, depending on the details, but the factors you describe won't affect the math nearly as much as they will once that subsidization ends.
Comment by dannyw 3 days ago
Qwen3.6 is practically indistinguishable from Sonnet 4.6, at least in my personal experience. And Sonnet 4.6 is not that cheap.
Comment by wjnc 3 days ago
Comment by zozbot234 3 days ago
Comment by dgellow 3 days ago
The vision NVIDIA is selling is pure marketing IMHO
Comment by lrae 3 days ago
Comment by itishappy 3 days ago
Comment by jerf 3 days ago
You're going to need to analyze the problem much more deeply because it sounds like the standards you are implicitly applying would result in "economically, everything should be centrally hosted" but that is clearly not the result that obtains. Even a modern mid-grade cell phone is no slouch; you may not be running a current-gen frontier AI on it but you certainly can do a lot of other rather intense things locally that would have been laughable 10 years ago, like surprisingly high powered games.
Comment by strictnein 3 days ago
Comment by latch 3 days ago
Comment by strictnein 1 day ago
Comment by bespokedevelopr 3 days ago
But they also want to taste the sweet fruit of AI so the only way to do this that a CISO will approve is on local air gapped hardware. It’s a niche but still a billion dollar niche.
Comment by thewebguyd 3 days ago
Comment by unstatusthequo 3 days ago
Comment by Pxtl 3 days ago
Comment by JMiao 3 days ago
Comment by sandworm101 3 days ago
Comment by Zetaphor 3 days ago
Not everything I want to use an LLM for requires "PhD level intelligence", and increasingly I'm finding more uses that involve sharing my personal data.
Yesterday my local model helped me when looking for a doctor who is in-network for my insurance. I threw it a screenshot from the providers search results and it looked up reviews for all of them.
Comment by sandworm101 3 days ago
Comment by eszed 3 days ago
Comment by sandworm101 3 days ago
I own the DVDs so I'm OK upscaling/editing my own copies for my own use. But if I ran the task on an ai service I would no doubt trigger copyright issues.
Comment by bredren 3 days ago
I suspect personal privacy and need to run AI workflows to handle the litany of administration tasks of a household will be what result in regular need for local AI.
Apple is already out front with this on a personal, individual level, but they are not obviously headed toward multiuser/family-level ~biz admin with a persistent server running local LLM.
Comment by epolanski 3 days ago
Especially on Dwarfstar.
Comment by voidfunc 3 days ago
This made me laugh. I can only imagine how insufferable this person is to deal with.
Comment by falsemyrmidon 3 days ago
Do you think he's in mensa too?
Comment by cyanydeez 3 days ago
Anyone who's addicted to token throughput is losing the operational knowledge and offline capabilities.
If you aren't moving to the AMD 395 or Macs then you're hitching a ride on the expensive calorie ride.
Comment by throw1234567891 3 days ago
Comment by cyanydeez 3 days ago
But watching everyone flounder because claude goes down or forcing you on API costs.
I'm programming things that'd take me days with a PC that, without OpenAI's VRAM shenanigans, would cost you $2k.
It's more than just 'this is what I could do' it's definitely about 'this is what anyone could do with a new PC purchase'.
Comment by throw1234567891 3 days ago
Comment by cyanydeez 3 days ago
You're doing what the IT industry has been addicted to for decades: number goes up.
Comment by throw1234567891 3 days ago
No, I have hands-on experience with bigger models, and understand the advantages of using them.
Comment by cyanydeez 3 days ago
You also probably believe you need to 'escape the permanent underclass'
Comment by throw1234567891 3 days ago
You assume a lot. Sometimes it’s good to simply ask a question.
Comment by speed_spread 3 days ago
Comment by throw1234567891 3 days ago
Comment by GeekyBear 3 days ago
Where you will need games to be rewritten for ARM to get full performance, just like on Apple's M series chips.
Comment by iLoveOncall 3 days ago
Lol yeah seriously, that stinks "I ask AI to generate a huge amount of bullshit and upload it to pad irrelevant stats".
Absolute loser.
Comment by nkurz 3 days ago
As to why he now has this on his blog? I also cringe when I read it. I presume someone told him he should self-promote more, and this is his lame attempt to do so. He's almost certainly the most cited person in his department, but it's entirely possible that none of his colleagues actually know this. Cut him some slack. Self-promotion is not his strength. He's a nerd's nerd, and not a marketer. I'll mention to him that his attempt here might be backfiring when I'm next in contact with him.
Comment by infecto 3 days ago
Comment by hgoel 3 days ago
Comment by iLoveOncall 3 days ago
He doesn't just have it on his blog, he has it EVERYWHERE. Sometimes 2 or 3 times on the same page.
Comment by dgacmu 3 days ago
It sounds like he's gotten bad advice about how to market himself /or/ this is being marketed to people who have bigger checks to write and whom he believes will be responsive to this kind of marketing. As an academic, it rubs me very wrong - I think it's detrimental to the field when we get into h-index stacking contests or citation count comparisons. But I don't know what incentives he's responding to, which seems important for putting this stuff in context.
(as an aside, it turns out that polars + fastexcel is about 10x faster than pandas + openpyxl for searching that dataset, if anyone else is curious what he was actually talking about. :)
Comment by netsharc 3 days ago
Being the top x% is what OnlyFans girls brag about, professor...
And it's not exactly brain surgery, is it? https://www.youtube.com/watch?v=THNPmhBl-8I
Comment by SkiFire13 3 days ago
Comment by jayd16 3 days ago
Comment by SwtCyber 3 days ago
Comment by dagmx 3 days ago
1. Yes it has the same number of cores as a 5070 mobile. It’s also running at a shared peak of 2/3 the bandwidth and a shared peak of 2/3 the TDP. The GPU by itself will likely perform at half the dedicated unit’s performance.
2. Apple may not have SVE2 but they do have the AMX (private) and SME. I don’t see why he thinks the SVE2 will give him more performance than the SME.
3. He mentions a single core type but doesn’t mention the total makeup. We already have known for a year how the DGX Spark compares to Apple chips. For CPU it’s roughly equivalent to an M3 Pro and for GPU compute (not rasterization) it’s between an M4 Pro and M4 Max without considering bandwidth.
The real advantage to these is that they run CUDA. That’s it. Otherwise when they launch they’ll be 2-3 generations behind where Apple is and 1 gen behind AMD.
The other super power of the DGX Spark was the NIC for pairing them together. But that’s been removed here too.
Comment by storus 3 days ago
You are likely thinking about token generation which is dependent on memory bandwidth where Apple has an edge. Spark's GPU compute is way higher than even M5 Max (17 FP32 TFlops), around 2x FP32 TFlops... It's literally 6144 CUDA cores like desktop 5070, slowed down by slow memory and lower TDP (29.7 vs 31 FP32 TFlops on 5070).
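A back-of-the-envelope check of those peak figures, as a rough sketch. It assumes the usual convention that each CUDA core retires one fused multiply-add (2 FLOPs) per clock. The function names are made up for illustration, and the clocks are only what the quoted TFLOPS numbers imply, not spec-sheet values:

```python
def peak_fp32_tflops(cuda_cores: int, clock_ghz: float, flops_per_clock: int = 2) -> float:
    """Theoretical peak FP32 throughput in TFLOPS.

    Assumes each core does one FMA (2 FLOPs) per clock.
    cores * FLOPs/clock * GHz gives GFLOPS, so divide by 1000 for TFLOPS.
    """
    return cuda_cores * flops_per_clock * clock_ghz / 1000.0


def implied_clock_ghz(tflops: float, cuda_cores: int, flops_per_clock: int = 2) -> float:
    """Clock (GHz) that a quoted peak TFLOPS figure implies for a given core count."""
    return tflops * 1000.0 / (cuda_cores * flops_per_clock)


CORES = 6144  # core count quoted in the comment above

spark_clock = implied_clock_ghz(29.7, CORES)    # lower-TDP part
desktop_clock = implied_clock_ghz(31.0, CORES)  # desktop 5070

print(f"implied clock at 29.7 TFLOPS: {spark_clock:.3f} GHz")
print(f"implied clock at 31.0 TFLOPS: {desktop_clock:.3f} GHz")
```

Same core count, so the gap between the two peak numbers comes down to clock speed alone; sustained throughput under a lower TDP and slower memory is a separate question.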
Comment by dagmx 3 days ago
I’d also mention that you’re comparing peaks which the RTX Spark won’t be hitting. The top TDP is less than that of the DGX Spark.
I just think anyone calling this a beast and a game changer is conflating/extrapolating from different form factors and constraints
Comment by well_ackshually 2 days ago
cool story, but nobody cares about mobile GPUs for blender. A 4080 eats an M5 Max alive for breakfast. The 5080 in my machine that cost me 1500€ runs circles around an M5 Max that would cost me over 6000€. And when in 5 years the 5080 isn't enough, I can upgrade it to a 7080 or whatever, which will remain compatible.
If you're a professional, soldered products like the RTX Spark or Apple's offering are a dead end. They are literally never worth it.
Comment by dagmx 2 days ago
It’s not going to be the primary place of creation but there’s a lot of usefulness in having a portable workstation or that entire segment of the laptop workspace wouldn’t exist.
In either case, it’s beside the point because the point is talking about the compute levels of a GPU in the same form factor.
Comment by Foobar8568 2 days ago
Comment by well_ackshually 2 days ago
Game dev & asset work is probably happy with a 5080 and that's what most rendering/dev machines would have.
The addressable market of "i have 6000 to blow and i need meh performance on anything related to 3D rendering" is small, and benchmarks make it look bigger than it really is.
Comment by dagmx 2 days ago
Disney’s Hyperion is CPU based and RenderMan XPU is just exiting beta after over a decade.
But while they do stack their workstations with higher end GPUs for artist throughput in viewports it’s mostly just for the higher memory to fit unoptimized scenes in. None of the studios or major films I’ve worked on have had their on desk artists be raster rate gated but just memory gated.
But again, beside the point, because it’s still valuable as a metric to compare with when comparing perf between similar chipsets.
There are already more creatives using their consumer grade hardware to make stuff. And even the studios you mentioned do actually use laptops on the go for parts of their creation pipelines for various things like virtual production scouting etc.
Comment by cthalupa 3 days ago
Same model, same quant, same query, as close to as matched settings as I can get from vllm, and for workloads with large prompts + low cacheability, one of my sparks will often be done responding before the mbp is done with prefill.
Comment by llm_nerd 3 days ago
Guy suddenly became aware of a chip that the rest of the industry long knew about, seems completely unaware of the competitors, and posts about how it's a BEAST and will be a GAME CHANGER.
Like the DGX Spark was a game changer? Eh, it has mostly been a massive disappointment. An overpriced nvidia laptop isn't going to change the equation an iota.
Comment by trympet 3 days ago
Comment by wmf 3 days ago
Comment by oofbey 3 days ago
Comment by well_ackshually 2 days ago
Comment by modeless 3 days ago
Comment by arjie 3 days ago
Qualcomm is like AMD was for GPUs for like decades. Lots of announcements and people on the Internet are huge fans based on web pages they’ve read but if you try to make it work it’s a nightmare.
Snapdragon X Elite doesn’t work on Linux so it’s a pointless platform. Enthusiasts have made M1 work better. Literally have old Macs running rather than use Qualcomm.
Comment by someguyornotidk 3 days ago
Whether this is true or not, it's pretty safe to assume anything based on their stuff is not for me.
Comment by zozbot234 3 days ago
Comment by danslo 3 days ago
Comment by KeplerBoy 3 days ago
Comment by davemp 3 days ago
Comment by satvikpendem 3 days ago
Comment by TiredOfLife 3 days ago
Comment by ksec 3 days ago
But perhaps more importantly, Nvidia seems to be doing a lot better with its ecosystem. Nvidia has much better distribution channels and partners building on top of their PC gaming GPUs. It also has game developer relations that are unmatched by any in the industry.
Qualcomm has so far failed to execute this, both in PC and on their server CPU side.
Comment by dismalaf 3 days ago
Comment by Danox 3 days ago
Comment by hypercube33 3 days ago
Comment by sedatk 3 days ago
What's lousy about it? I use it daily and have zero problems.
Comment by criticalfault 3 days ago
Comment by embedding-shape 3 days ago
Comment by jeroenhd 3 days ago
Some distros still need extracting Qualcomm firmware from Windows to get Linux to work properly. Audio remains a challenge, like x86 Linux decades ago. Apparently camera stuff works these days but produces images of subpar quality.
These issues also occur on normal Linux. My experience with my Lenovo+Intel laptop was that it took three months after release for the firmware to work properly (and the Nvidia drivers took much longer, but that's my fault for buying something containing Nvidia hardware). Intel managed to do what Qualcomm did in months rather than years.
I hope Qualcomm finally sorts this shit out, I really do, but with the prices of computers these days, I'm going to need to see quite the discount before I'll consider buying anything with a Snapdragon.
Comment by ChocolateGod 3 days ago
This is a problem with Linux on ARM generally (Android has had it since inception), it's not a Qualcomm problem.
Comment by criticalfault 3 days ago
they seem to have dealt with this for the server hardware
Comment by criticalfault 3 days ago
Comment by justincormack 3 days ago
Comment by Elixir6419 3 days ago
My experience (wanted to use x13s as daily driver) is that there was good progress for about a year, while jhovold was leading the charge, but something expired and Qualcomm, as far as I can tell, forgot that some progress should happen on x1 and x8c as well as x2.
Comment by gsnedders 3 days ago
And I know a lot of that lies on the vendors, but it does feel unfortunate (from a standardisation/conformance/certification point of view) that Windows requiring it doesn’t make it easy to boot other OSes!
Comment by stefan_ 3 days ago
Comment by reactordev 3 days ago
They could have had a 128core arm chip by now.
Comment by adabyron 3 days ago
There's also the whole giant trillion dollar company doesn't want to invest and let small ideas grow. They only focus on things that move the needle, which isn't much at that size.
Had Microsoft executed and invested, they could have made a comeback imo in search, mobile & hardware. Unfortunately there's a major lack of leadership, or they just don't want those areas.
Comment by reactordev 3 days ago
Comment by dylan604 3 days ago
Comment by reactordev 1 day ago
Comment by bradfa 3 days ago
Qualcomm are trying harder now it seems. But it will take time to repair their reputation in the PC market.
Comment by thewebguyd 3 days ago
Tuxedo computers tried and didn't succeed either.
I will never buy Qualcomm again. I avoid them on phones as well by just buying Apple. They do not support their hardware beyond the release.
Comment by jeroenhd 3 days ago
To each their own, but I don't recall Apple ever mainlining any of their drivers on Linux. You're rightfully angry on the laptop side of things, but Apple is much worse than Qualcomm when it comes to open source support for their phones.
Qualcomm probably shouldn't have promised Linux support in the first place. Everyone seems to love Apple's hardware even though you're practically stuck with macOS. Had Qualcomm just stuck to Windows-only, they would've probably received a much better reception by the tech press.
Comment by mlinhares 3 days ago
Comment by izacus 3 days ago
Comment by dismalaf 3 days ago
Comment by derefr 3 days ago
Comment by darkwater 3 days ago
Comment by modeless 3 days ago
Comment by diabllicseagull 3 days ago
https://discourse.ubuntu.com/t/ubuntu-concept-snapdragon-x-e...
Comment by izacus 3 days ago
Comment by alt227 3 days ago
Comment by Remnant44 3 days ago
Outside of anything else, Amdahl's law means that as the parallel performance grows, we become _more_ limited by the inherently serial code, and thus single core performance, not less.
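To put numbers on that, here is a small sketch of Amdahl's law. The 90% parallel fraction is an arbitrary illustrative value, not a measurement of any real workload:

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's law: overall speedup when only `parallel_fraction` of the work scales with cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)


def serial_share_of_runtime(parallel_fraction: float, cores: int) -> float:
    """Fraction of the wall-clock time spent in the serial part when running on `cores` cores."""
    serial = 1.0 - parallel_fraction
    return serial / (serial + parallel_fraction / cores)


P = 0.9  # assumed: 90% of the work parallelizes perfectly

for cores in (1, 4, 16, 64):
    print(
        f"{cores:>3} cores: speedup {amdahl_speedup(P, cores):5.2f}x, "
        f"serial part is {serial_share_of_runtime(P, cores):.0%} of runtime"
    )
```

With these assumptions the speedup can never reach 10x no matter how many cores are added, and the serial part grows from a tenth of the runtime to most of it, which is why single core speed keeps mattering.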
Given that single core performance is "harder" (can't just throw more cores/sockets at the problem), it's also critically important.
Comment by dagmx 3 days ago
Strix Halo is 16 cores. Intel Core Ultra 9 285HX is 24. Apple is 18. Qualcomm is something similar too but I can’t recall. NVIDIA is 20.
Until you get to Threadripper/EPYC or Xeon territory (completely different form factors and TDPs) the ARM chips are ahead of x86 on both power and perf. And even when you get to those areas, ARM is equivalent or outperforms them, as can be seen in the recent neoverse x3 and Vera benchmarks.
Comment by wmf 3 days ago
Comment by hulitu 3 days ago
Because that's the only part where this chip excels.
People have been comparing apples with oranges for ages.
Comment by modeless 3 days ago
Comment by pjmlp 2 days ago
Comment by pjmlp 2 days ago
Unless we're talking about RTOS, threads are always interrupted, thus how fast a single one can race is irrelevant in the whole picture.
Comment by SecretDreams 3 days ago
Comment by Razengan 3 days ago
I'll wait for the 365 AI Ultimate Professional Enterprise Edition: Origins version
Comment by re-thc 3 days ago
Technically speaking, Qualcomm acquired Nuvia, which is where this came from, and that company came from ex-Apple engineers wanting to do what Apple said no to for their chips.
So it's almost the same CPU design (origins).
Comment by hulitu 3 days ago
Is there a desktop version ? For real work ?
Comment by Innittech 3 days ago
Comment by dofm 3 days ago
https://nvidianews.nvidia.com/news/nvidia-microsoft-windows-...
I have been somewhat surprised at the lack of commentators observing that this is Microsoft and above all NVIDIA launching a device that is fundamentally at odds with the metered cloud model of AI.
When you look at the other announcements and murmurings (better offline BYOK for Copilot, talk of an unmetered AI future) I think it’s clear that these two firms understand that cloud-only AI is not sustainable or inherently in their interests. But their willingness to undermine OpenAI with a product like this is notable.
Comment by thewebguyd 3 days ago
Comment by dofm 3 days ago
Copilot just got proper "offline" BYOK support, didn't it? Presumably that was one of the things they were talking about. Though I imagine that has something to do with the fact that Zed has supported that properly for months.
Comment by Yokohiii 2 days ago
LLMs will get bigger and even with 128GB (that many won't saturate), you won't run future frontier models. For LLM vendors and integrators it's a handy thing to move lower quality inference to the consumers.
Also running local doesn't have to mean that the models have open weights. MS will likely start to distribute closed models at scale once the hardware is there.
Comment by GodelNumbering 3 days ago
Comment by bigyabai 3 days ago
Comment by moffkalast 3 days ago
AMD has the advantage that their x86 machines run everything, Apple maintains the whole MacOS stack, while Nvidia can barely scrape together one Ubuntu release per Jetson generation, it's beyond embarrassing. Maybe they ought to put those agents they keep droning about to some actual work on their OS support.
Comment by bigyabai 3 days ago
Why would they do more? It's an LTS distro, the Nvidia drivers are updated for as long as the hardware's compute capability is supported.
Nvidia's ARM drivers are updated constantly, and battle-tested as the backbone in hundreds of thousands of Grace ARM servers.
Comment by moffkalast 3 days ago
That's not even considering the lazy out-of-tree patchwork support Nvidia does for their products on top of that. Maybe it's different in this case for Windows since it forces a rolling release, but I seriously doubt they'll do it properly instead of forking some version and keeping it around for 10 years like absolute idiots.
Comment by bigyabai 2 days ago
For their ARM SOCs? Almost every single ARM OEM on the consumer market is begging you to use out-of-tree blobs for basic firmware support. Nvidia's stance isn't ideal but it's also not unique (or damning) to the rest of their ARM competitors.
Comment by GuestFAUniverse 3 days ago
It's just a personal computer. It normally runs multiple operating systems just fine.
Windows PC sounds like people talking about tech who are either paid by M$, or embed pictures into Word documents to send them.
Nobody has to kill the fun those OS-agnostic machines allow by artificially binding them to a shitty OS.
Comment by zdragnar 3 days ago
Even for personal use, I'd imagine the amount of people dual booting Windows and something else are a very tiny minority.
Saying "Windows PC" is a pretty reasonable way to distinguish between "made by Apple" and "made by someone else" because the market of PCs that aren't made by Apple and don't come with Windows is really, really tiny.
To be honest, this seems like a strange hill to take such an aggressive stance upon.
Comment by jeroenhd 3 days ago
For normal people, there are three computer operating systems: Windows, Apple, and ChromeOS. Nvidia isn't going with ChromeOS and Apple hates their guts, so Windows is the only normal operating system they can market.
Their marketing makes clear that these devices aren't the piddly Chromebooks that ruined the desktop experience for so many people (expensive Chromebooks were nice, but rare in practice).
Qualcomm promised Linux support, failed to deliver, and now anybody burnt by their promise won't want to buy their hardware again. If they promise a Windows PC, people won't have reason to complain when Linux or FreeBSD or SerenityOS won't boot on there. Given Qualcomm's failures here, Nvidia is probably doing the right thing.
Comment by dylan604 3 days ago
I did this for years. We ran Resolve color correction suites with external chassis to place multiple Nvidia GPUs in it at a fraction of the cost of the shitty TrashCanMac that was available. Lots of people continued to use the 2012 Cheese Grater MacPro with its older CPUs. The only way to get modern (at the time) compute in a Mac was to use a Hackintosh. Since it wasn't for personal use, not having things like AppStore, Messages, Music, etc wasn't a big deal, so building a Hackintosh was easier.
I built one for personal prosumer use around the time of the 1080s that allowed me more machine for the dollar than Apple offered. Once the M-series chips came out, the fact that they were capable of what the Hackintosh was doing for me put me off building anything newer.
Comment by kmac_ 3 days ago
So, the partnership is maybe natural, but not prospective. Also, note how Linux is getting popular among gamers. Of course, it's way behind Windows, but the direction of the change is clear.
I'm convinced that Nvidia is not primarily targeting the consumer market and that the ultimate goal for its CPUs is the server space. The company invests effort where the money is, and consumer products account for only a fraction of its total revenue. Maintaining a presence in the consumer market seems more like a way to avoid a complete pivot than a strategic priority.
Comment by pjmlp 2 days ago
Linux won't be popular for much longer among gamers if the Proton fountain dries out.
Comment by crazygringo 3 days ago
I'm assuming it's just clarifying this isn't about Macs.
The term "PC" is ambiguous, since it can either refer to all personal computers in its original meaning, or to the IBM PC lineage that is mainly contrasted with Macs. Remember the famous "I'm a Mac, I'm a PC" ads.
When you just say "PC", people today genuinely don't know which meaning you are referring to. And "IBM PC" is antiquated, and "IBM PC clone" is even worse. So "Windows PC" is a pretty decent name.
Do you have a better suggestion? Because "Non-Mac PC" doesn't exactly roll off the tongue. If you say "Windows PC", everyone knows what you mean.
And it's not an "anal fixation", there's no need to be gratuitously insulting.
Comment by Aperocky 3 days ago
I prefer Windows XP, or even Windows Vista, to Windows 11 with its copilot. And it's been a downhill race, even macs are more of your own personal machine than Windows today, which is saying a lot.
PC should be a PC, Windows is as they advertised, a Copilot PC.
Comment by alkonaut 3 days ago
I run it for work because we make Windows programs. We use drivers that don't exist on Win-for-ARM yet. So to most people a "Windows PC" is an x64 Windows PC still. The risk for MS if compat isn't good enough for Windows-Arm64 is that people might as well shift from Windows entirely if they need new software and hardware anyway.
Comment by bigyabai 3 days ago
Your x86 machines were, but these are ARM SOCs. Many of them don't even support UEFI, let alone the upstream Linux kernel.
Comment by rvba 3 days ago
Comment by speed_spread 3 days ago
Comment by pjmlp 2 days ago
Comment by jayd16 3 days ago
Comment by fg137 3 days ago
(HN reaction to Vision Pro back in 2024 is almost hilarious if not ridiculous, looking at it today. I knew it would be a flop and I was so right.)
Comment by monster_truck 3 days ago
Spark DGX also remains a nothingburger, I would be livid if I spent this kind of money and had to waste time chasing down power cap bugs or A/B/C testing each firmware version to find the one that is least slow and also does not fail https://dredyson.com/the-hidden-truth-about-dgx-spark-perfor...
Comment by SwtCyber 3 days ago
Comment by QQ00 3 days ago
Over the last decade, a lot of software (especially the popular and industry standard packages) shifted to GPU-accelerated design. That push started before NVIDIA even tried to capitalize on it.
Comment by ftchd 3 days ago
Comment by xpct 3 days ago
I dislike the cycle of propagating news and assuming that someone else double-checked it.
Comment by Someone 3 days ago
“News Summary:
- NVIDIA RTX Spark powers the world’s first Windows PCs purpose-built for personal agents, featuring 1 petaflop of AI performance, industry-leading power efficiency, full-stack NVIDIA AI and graphics technology, and up to 128GB of unified memory.
- NVIDIA and Microsoft collaborate to deliver a native Windows experience for personal agents, including new security primitives and NVIDIA OpenShell to run agents securely on primary devices.
- RTX Spark lets creators, AI developers and gamers render ultralarge 90GB+ 3D scenes, edit 12K 4:2:2 video, generate 4K AI videos, run 120B-parameter LLMs with up to 1 million tokens context using agents locally, and play AAA games at 1440p and over 100 frames per second.
- Adobe is rearchitecting Photoshop and Premiere from the ground up for RTX Spark to deliver 2x faster AI and graphics performance.
- RTX Spark-powered slim Windows laptops with all-day battery life and premium displays, as well as compact desktop PCs available this fall from ASUS, Dell, HP, Lenovo, Microsoft Surface and MSI, with models from Acer and GIGABYTE to follow.”
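As a sanity check on the "120B-parameter LLMs" in "up to 128GB of unified memory" claim, a rough weights-only estimate follows. It deliberately ignores KV cache, activations and whatever the OS needs, and the bit widths are generic quantization levels I picked for illustration, not anything stated in the press release:

```python
def weight_footprint_gb(params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 120e9          # 120B parameters, from the news summary
UNIFIED_MEMORY_GB = 128  # top configuration, from the news summary

for bits in (16, 8, 4):
    gb = weight_footprint_gb(PARAMS, bits)
    verdict = "fits" if gb < UNIFIED_MEMORY_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {gb:6.1f} GB -> {verdict} in {UNIFIED_MEMORY_GB} GB")
```

By this estimate only a quantized model leaves real headroom: 8-bit weights would technically fit but leave very little for a long context.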
Comment by kcb 3 days ago
Comment by 1970-01-01 2 days ago
Comment by wombat-man 2 days ago
Comment by iceflinger 2 days ago
Comment by tosh 3 days ago
Comment by infecto 3 days ago
Comment by tempodox 3 days ago
Comment by siliconc0w 3 days ago
Comment by zzzoom 3 days ago
Comment by maipen 3 days ago
Comment by cthalupa 2 days ago
Rising supply from China will impact prices even in countries where there are tariffs.
Comment by carefree-bob 2 days ago
Comment by mathgladiator 3 days ago
Comment by satvikpendem 3 days ago
Comment by mathgladiator 2 days ago
Comment by seanalltogether 3 days ago
Comment by flakiness 3 days ago
As a side note, Qualcomm chipsets on Android have been doing this for years (like Apple) so it's not a super unique thing. It's more like there was no need before.
[1] https://www.jeffgeerling.com/blog/2025/increasing-vram-alloc...
Comment by kimixa 3 days ago
The GPU can still happily use all the rest of the memory for other use cases - which tend to be the bulk of allocations anyway. Though there might be performance implications - for example "moving" buffer ownership to the GPU would need to evict CPU caches, and often 4k pages and tlb lookups can be a pretty inefficient situation for GPU-style accesses.
That's been pretty standard for any SoC for decades. And "differences" to apple's SoC are more implementation details.
Comment by Keyframe 3 days ago
This isn't the first time we have UMA on the PC, btw. When SGI did their PC workstations, their 320 and 540 PC workstations had what they called Cobalt graphics chipset and crossbar with their IVC architecture. They bypassed AGP at the time completely. It was quite unique to see strict UMA on a PC. Haven't seen it since until these new systems we're seeing now on PCs and Mac.
Comment by eigenspace 3 days ago
Some software assumes pre-defined set-aside pools of memory reserved for video purposes, but the chip does actually have access to the whole pool.
Comment by ApatheticCosmos 3 days ago
Comment by joe_mamba 3 days ago
IIRC that's to maintain BIOS and Windows (+games & apps) backwards compatibility, but memory access speeds are the same.
Comment by SwtCyber 3 days ago
Comment by glitchc 3 days ago
Comment by fc417fc802 3 days ago
That's an API issue not a hardware issue. Regardless, I believe the major APIs permit seamlessly sharing pointers at this point? (I have no experience doing that though.)
Comment by ankurdhama 3 days ago
Comment by gghh 3 days ago
The last time I checked an NVidia situation was for DGX Spark (the GB10 chip): it has regular LPDDR5X, which by JEDEC standard cannot go beyond ~270 GB/s, i.e. 8533 Mbit/s per pin on a 256-bit bus.
So yeah Lemire seems to go "OMG unified memory, they're following Apple's path..." ok, but Apple pulled off a much faster interconnect, 800 GB/s ballpark, and I'm trying to understand (not really, I'm asking you to try to understand, he he) how this laptop is faring in that regard.
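The arithmetic behind that ~270 GB/s figure, as a minimal sketch. It assumes 8533 Mbit/s is the per-pin transfer rate and that the bus is 256 bits wide; the function name is just for illustration:

```python
def peak_bandwidth_gb_s(per_pin_mbit_s: float, bus_width_bits: int) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes).

    per-pin rate (Mbit/s) * number of data pins = total Mbit/s,
    divide by 8 for MB/s, then by 1000 for GB/s.
    """
    return per_pin_mbit_s * bus_width_bits / 8 / 1000


lpddr5x = peak_bandwidth_gb_s(8533, 256)
print(f"LPDDR5X-8533 on a 256-bit bus: {lpddr5x:.1f} GB/s")
```

Doubling the bus width doubles the result, so a wider bus is the main lever for getting more bandwidth out of the same memory speed grade.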
Comment by comandillos 3 days ago
Comment by embedding-shape 3 days ago
A RTX Pro 6000 has ~24K 5th generation tensor cores, I'm guessing this would then be 1/4 of the count but 6th generation? Wasn't clear from the images.
Comment by gravypod 3 days ago
Comment by embedding-shape 3 days ago
> The memory is not as fast as dedicated GPU memory, but it is cheap enough while delivering enough bandwidth to run AI models locally.
Also "cheap while delivering enough" certainly sounds like someone is trying to temper expectations. It sounds like something sitting in-between GPU+VRAM inference and CPU+RAM one, not as a step above/besides GPU+VRAM.
Comment by gravypod 3 days ago
If these chips become popular I am sure you will see LLM architectures taking advantage of the parallelism.
Comment by cthalupa 3 days ago
Perhaps in theory, but for the GB10 stuff the memory is all on the CPU die and connected to the GPU die via NVLink-C2C
Comment by mohamedkoubaa 3 days ago
Comment by Melatonic 2 days ago
Basically what we need is a chip that also has pins or some type of attachment system on the top (physically) or maybe below where the chip itself connects to the motherboard.
Imagine a CPU you can just plug in a block of HBM memory on top of (or on "bottom" of). This would allow a much larger physical surface area for putting ram cache near the compute cores itself because you would not be limited to edge lengths.
Cooling the whole thing would be a methodology change (might need liquid coolers that sandwich in between the RAM cache and compute and cool both)
Comment by Schnitz 3 days ago
Comment by nine_k 3 days ago
Comment by jmyeet 3 days ago
The obvious comparison here is the M5 Max where you can buy a Macbook Pro with 128GB of also unified memory. Obviously CUDA cores are specific to NVidia so it's hard to directly compare but I've seen claims that the M5 Max is roughly equivalent to ~4000 CUDA cores. This obviously depends on workload and whether the CPU supports the precision you want to use (eg FP4).
The M5 Max has memory bandwidth of 819GB/s. The RTX Spark I believe is ~600. So it might be slightly better than the current generation of Macs but likely worse than the expected M5 Ultras of the new Mac Studios (likely Q3 2026).
For comparison, a 5090 has >20k CUDA cores and 1800GB/s memory bandwidth with 32GB of VRAM. The RTX 6000 Pro (at ~$10k) has 96GB of VRAM, same bandwidth and ~24k CUDA cores.
We have to see what RTX Spark systems sell for but the DGX Spark is in the Mac Studio price range (~$4k).
I do think Apple has a real opportunity here but their offerings aren't quite there yet. The M5 Ultras might be a really attractive option for local LLMs. I expect them to be in high demand.
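For a rough sense of what those bandwidth figures mean for token generation, here is a rule-of-thumb sketch. It assumes decoding is purely memory-bound and that every weight is read once per generated token (true for a dense model, pessimistic for MoE), and it ignores KV cache traffic. The 20 GB model size is an arbitrary illustrative value chosen so it fits on all three devices:

```python
def decode_tokens_per_s_upper_bound(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s if each token requires streaming all weights from memory once."""
    return bandwidth_gb_s / weights_gb


WEIGHTS_GB = 20.0  # assumed model size, for illustration only

# Bandwidth figures (GB/s) as quoted in the comment above.
devices = {
    "M5 Max": 819.0,
    "RTX Spark (approx.)": 600.0,
    "RTX 5090": 1800.0,
}

for name, bandwidth in devices.items():
    bound = decode_tokens_per_s_upper_bound(bandwidth, WEIGHTS_GB)
    print(f"{name:<20} <= {bound:5.1f} tokens/s")
```

Real numbers land below these bounds, and this says nothing about prompt processing, which is limited by compute and not by bandwidth.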
Comment by bigyabai 3 days ago
Who claimed that? The M5 is still a raster focused GPU, dedicated matmul blocks be damned. For some workloads that napkin math might work out, but for many others it's a wild overshoot. Time-to-first-token still favors CUDA, and real-world training workloads aren't getting anywhere near Apple Silicon.
All of the memory bandwidth in the world is useless if you spend 15 minutes processing 64k tokens worth of context prefill. This is where CUDA shines.
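Taking the 15 minutes for 64k tokens at face value, the implied prefill rate can be worked out as below. The 2,000 tokens/s comparison rate is a made-up placeholder to show the scale of the difference, not a benchmark of any particular GPU:

```python
def prefill_tokens_per_s(prompt_tokens: int, seconds: float) -> float:
    """Prompt-processing rate implied by a prompt size and the time it took."""
    return prompt_tokens / seconds


def prefill_seconds(prompt_tokens: int, tokens_per_s: float) -> float:
    """Time to process a prompt at a given prefill rate."""
    return prompt_tokens / tokens_per_s


PROMPT_TOKENS = 64_000  # "64k tokens" from the comment above

slow_rate = prefill_tokens_per_s(PROMPT_TOKENS, 15 * 60)
print(f"15 minutes for 64k tokens is about {slow_rate:.0f} tokens/s of prefill")

HYPOTHETICAL_FAST_RATE = 2_000.0  # assumed rate, for comparison only
fast_time = prefill_seconds(PROMPT_TOKENS, HYPOTHETICAL_FAST_RATE)
print(f"at {HYPOTHETICAL_FAST_RATE:.0f} tokens/s the same prompt takes {fast_time:.0f} s")
```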
Comment by dh2022 3 days ago
Comment by mrweasel 3 days ago
The idea that any hardware performance increase will be eaten up by terrible software is an evergreen. A computer that could serve as the single server for a medium size enterprise 20 years ago, is no longer able to serve as a desktop for a receptionist. I'm not even sure we're talking diminishing returns anymore, we're probably past the point of maximum yield and into the negative returns at this point.
Comment by Waterluvian 3 days ago
Comment by burnt-resistor 3 days ago
Comment by mariopt 3 days ago
Before we get local AI, we'll be using hybrid AI.
Running big models locally is unrealistic ($$$$$) but, if you imagine an Agentic Workflow where some bits run on the cloud and other smaller tasks locally, it's an amazing deal. You don't need Opus/Code/DeepSeek/Kimi/etc to do basic stuff that models like Gemma4:12b/Qwen-27b can do locally with much less latency.
Having a laptop where I can use a remote big model and combine it with 5 local domain-specific models is something I would love to do today. Imagine using OpenCode with a small model deciding which tasks run locally, then deciding whether you have a good local model for task XYZ or whether to use a cloud model.
My main concern is: is this hardware powerful enough to allow quick switching between local models? Unlikely, but I hope I'm wrong.
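The routing idea above could be sketched roughly like this. Everything here is hypothetical: the model names, the difficulty score (imagined as the output of a small triage model) and the threshold are placeholders, not OpenCode features or real APIs:

```python
from dataclasses import dataclass

# Hypothetical registry: task domain -> name of a small local model that handles it.
LOCAL_MODELS = {
    "summarize": "local-small-summarizer",
    "classify": "local-small-classifier",
}


@dataclass
class Task:
    domain: str
    difficulty: float  # 0.0 (trivial) to 1.0 (hard), as estimated by a small triage model


def route(task: Task, max_local_difficulty: float = 0.5) -> str:
    """Pick a local model when one exists for the domain and the task is easy enough,
    otherwise fall back to the cloud."""
    local_model = LOCAL_MODELS.get(task.domain)
    if local_model is not None and task.difficulty <= max_local_difficulty:
        return f"local:{local_model}"
    return "cloud"


print(route(Task("summarize", 0.2)))  # easy task with a matching local model
print(route(Task("summarize", 0.9)))  # too hard for the local model
print(route(Task("refactor", 0.1)))   # no local model for this domain
```

How fast the machine can swap between those local models is exactly the open question; the sketch only covers the decision, not the loading cost.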
Comment by Gareth321 3 days ago
Comment by zuzululu 3 days ago
Comment by BoredPositron 3 days ago
Comment by thewebguyd 3 days ago
Comment by burnt-resistor 3 days ago
Comment by ChrisArchitect 3 days ago
A powerful new chapter for Windows PCs, accelerated by Nvidia RTX Spark
https://news.ycombinator.com/item?id=48352693
Nvidia RTX Spark
Comment by AmazingTurtle 3 days ago
Nvidia's master plan may be making it the new normal to have "only" 400GB/s bandwidth, thus gatekeeping local model usage further behind "more memory but not as fast as the cloud can do it"
Comment by dangus 3 days ago
Nvidia just wants to sell stuff to everyone.
And I think for professionals doing local AI work, products like Strix Halo and Apple Silicon are a competitive threat.
A big part of maintaining the leading software ecosystem is ensuring you have competitive hardware for all your users.
I also think the RTX Spark product is relatively low effort for Nvidia. Grab a Mediatek CPU and slap an Nvidia GPU on the die. Sure, that’s oversimplifying it, but still.
Comment by empiree 2 days ago
Comment by VortexLain 3 days ago
Comment by QQ00 3 days ago
Comment by PedroBatista 3 days ago
It's an interesting "newcomer" and the more the better but calling this a "beast" and a "game changer" is ridiculous to say the least.
Then there is the price..
Comment by jqpabc123 3 days ago
I'd say this relates directly to the cost of running AI models remotely.
And we won't know what the actual cost will be until AI vendors recover the huge pile of cash they've dumped into development (plus interest).
Comment by chpatrick 3 days ago
Comment by dofm 3 days ago
The hardware for 50 tokens per second with a four bit quantisation of Gemma 4 26B or the sparse Qwen 3.6 is not really that expensive: it’s a secondhand M1 Max.
Beyond that, I agree. I think moving planning tasks to local is a now thing, not that it really has much impact on token spend. I also think many small coding tasks are fully within the grasp of the above two models.
The main issue right now is that the software landscape is rather confusing, but I reckon uncomplicated Gemma 4 26B QAT support with MTP is a few weeks away.
Comment by jqpabc123 3 days ago
But most businesses don't really care about most of the apple --- they only need their special bite out of it.
For example, doctors mainly care about medicine. Nvidia is attempting to provide the hardware needed for local, specialized models.
Comment by dofm 3 days ago
But I don’t know about specialised: this could run quite large models with MoE.
Comment by dgellow 3 days ago
Running local models will stay niche for a while, unless we see breakthroughs
Comment by jqpabc123 3 days ago
Most doctors don't care much about engineering or accounting or software development or 10000 other things that big vendor models address.
This area is yet to be really explored. Nvidia aims to provide the hardware to do so.
Comment by CamperBob2 3 days ago
I'm not sure anyone really understands why.
Comment by jqpabc123 2 days ago
Comment by CamperBob2 2 days ago
The author is probably confusing RAG with pretraining. You can RAG on PubMed but you can't arrive at a competitive model by pretraining solely on it.
Comment by Animats 3 days ago
It's not that the NVidia chip has that much RAM built in, after all. It's that it can address that much. RAM is sold separately.
Comment by cthalupa 3 days ago
So I would expect the mini PCs to come in at less than the Sparks. Laptops, I assume, will be close in price once you add all the other laptop stuff.
Comment by forrestthewoods 3 days ago
Comment by Animats 3 days ago
Comment by forrestthewoods 3 days ago
It is all integrated into one monolithic “superchip” package. The 128 GB of RAM isn’t going to be purchased separately or be upgradable, at least according to all indicators. That is what I was responding to.
Comment by wmf 3 days ago
Comment by amacbride 3 days ago
I've found it very useful for running big models, but it's not a screaming powerhouse in terms of raw compute.
Comment by adamnemecek 3 days ago
Comment by Melatonic 2 days ago
Comment by ozgrakkurt 3 days ago
Comment by neuroelectron 3 days ago
Comment by proxysna 3 days ago
Comment by vegabook 3 days ago
Comment by bigyabai 3 days ago
Comment by llm_nerd 3 days ago
Decent single-core performance (a long way from Apple level, but decent), and it makes up for it in core count to provide M5-level performance, CPU-wise. It is kind of starved for memory bandwidth, at 1/6th that of many GPUs.
They got Microsoft to customize Windows for the RTX Spark, and will likely have to brutally throttle it when running as a laptop (it's literally a 140W TDP chip), and that's neat. It's going to be a very expensive laptop.
Comment by SwtCyber 3 days ago
Comment by Apreche 3 days ago
Comment by MrBuddyCasino 3 days ago
Comment by dagmx 3 days ago
DGX Spark has a maximum of 273 GB/s bandwidth in ideal scenarios (hard to reach)
That puts it between an M5 (153) and M5 Pro (307)
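To put those bandwidth figures in context: for memory-bound token generation, a common rule of thumb is that each generated token has to read the active weights once, so bandwidth divided by weight size gives a rough upper bound on tokens per second. The sketch below uses the bandwidth numbers quoted above; the 16 GB weight footprint is an assumed example value, and the estimate ignores KV cache traffic and compute limits.

```python
def max_decode_tokens_per_s(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Rough ceiling on decode speed when every token reads all active weights once."""
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    return bandwidth_gb_s / active_weights_gb


# Bandwidth figures (GB/s) as quoted in the comment above.
BANDWIDTH_GB_S = {"M5": 153.0, "DGX Spark": 273.0, "M5 Pro": 307.0}

# Assumption: a model whose active weights occupy 16 GB after quantization.
ASSUMED_WEIGHTS_GB = 16.0

if __name__ == "__main__":
    for name, bw in BANDWIDTH_GB_S.items():
        ceiling = max_decode_tokens_per_s(bw, ASSUMED_WEIGHTS_GB)
        print(f"{name}: at most ~{ceiling:.1f} tokens/s")
```

Real throughput will land below this ceiling, and the peak bandwidth itself is described above as hard to reach.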
Comment by MrBuddyCasino 3 days ago
Mind you, that's not to/from memory, which indeed only has 273 GB/s.
Comment by dagmx 3 days ago
Comment by MrBuddyCasino 3 days ago
Perhaps a sobering rule of thumb: if it was actually useful, you couldn't buy them because someone would scoop them all up to shove them in a DC and make money with it.
Comment by MrBuddyCasino 3 days ago
Comment by alberth 3 days ago
Comment by trynumber9 3 days ago
Comment by PeterStuer 3 days ago
Clip me :). You are currently living through the final stages of unrestricted computing in the hands of the 'public'. Our regimes are going to pull up the drawbridge in the name of 'safety'. Download the open models ASAP and prepare for an airgapped computing environment. In the near future, that will be your frontier for AI that is not extremely neutered.
I am so hoping I'm completely wrong on this btw.
Comment by shevy-java 3 days ago
Comment by noveltyaccount 3 days ago
I expect computers with this chip will be about $4000. If Microsoft can deliver on local AI models that can orchestrate Windows and have solid real world intelligence, that will be an inexpensive business purchase compared to pay as you go tokens. I'm excited to see how this plays out.
Comment by 2OEH8eoCRo0 3 days ago
Comment by zamadatix 3 days ago
Comment by fc417fc802 3 days ago
Comment by zamadatix 3 days ago
Looking at it more, I believe the story repeats with the TSMC processes used for the CPU vs chips like GB200 as well.
Even if none of the above were the case, the question still isn't "why not make the enterprise GPU"; it's "why not make the higher margin per chip area product". If the NV1/GB10 take less die space and cost a lot, it's not immediately apparent whether the enterprise GPU actually nets Nvidia more $ per die. That's why it's relevant that these will be sold at a premium.
Comment by dofm 3 days ago
And maybe for NVIDIA and MS it is also about them quietly betting that local models are, in fact, going to be good enough for most tasks pretty soon.
Comment by thewebguyd 3 days ago
Comment by easygenes 2 days ago
Comment by wmf 3 days ago
Comment by einpoklum 3 days ago
Comment by JBiserkov 3 days ago
Comment by snvzz 3 days ago
We aren't so naive as to move from a locked IP ISA like x86 to another locked IP ISA such as ARM.
Right?
Comment by dcreater 3 days ago
Comment by emsign 3 days ago
Comment by alt227 3 days ago
Windows 11 can run just fine on 8 GB of memory; what can't is Google Chrome.
Comment by YasuoTanaka 3 days ago
Comment by adrian_b 3 days ago
While this NVIDIA system is inferior from the point of view of the memory capacity, its main advantage is that the top models will have a bigger GPU, i.e. with 6144 or 5120 FP32 execution units, compared to 2560 for the AMD GPU (compared to the NVIDIA CPU, the AMD CPU has a better multi-threaded performance for legacy programs, and a much better multi-threaded performance for the applications that use AVX-512).
However, these top models with big GPUs will also be much more expensive than the competing AMD system, while also being much more expensive than a laptop or mini-PC with an equivalent discrete NVIDIA GPU (which has the disadvantage of having direct access only to a much smaller, even if faster, memory).
Comment by christkv 3 days ago
Comment by adrian_b 2 days ago
The memory interface is a little faster, but the greatest improvement is +50% in the memory capacity, both over the old Strix Halo and over NVIDIA Spark.
However, even the Strix Halo CPU was better than the NVIDIA/Mediatek CPU.
NVIDIA's only advantage (in its more expensive variants) is a GPU equivalent to an RTX 5070.
It remains to be seen what the prices of the NVIDIA Spark models with big GPUs will be, but the rumors are that they start at around $3000 and go up, with the upper limit for 128 GB of DRAM and an uncut GPU not yet known.
It also remains to be seen whether the variants with the biggest GPU can use it effectively when having a rather low memory bandwidth for such a big GPU.
Comment by zamadatix 3 days ago
Comment by avocadoking 3 days ago
Comment by SwtCyber 3 days ago
Comment by zackify 3 days ago
Comment by yoyohello13 3 days ago
Comment by seabrookmx 3 days ago
Assuming all that stuff is upstreamed (and they aren't using oddball webcam/input devices etc) it should have much better support than Qualcomm.
Fingers crossed!
Comment by daft_pink 3 days ago
Comment by shadowpho 3 days ago
Comment by epolanski 3 days ago
Comment by Aperocky 3 days ago
Comment by shevy-java 3 days ago
Nvidia is milking the market now. We need more competition again. Currently we have a mafia controlling the prices, not just Nvidia but all the AI companies. The price increases should be paid by them, not by us. The "free market" is being manipulated by them here.
Comment by cyberziko 3 days ago
Comment by dgellow 3 days ago
Comment by 8note 3 days ago
Comment by crims0n 3 days ago
Comment by cryo32 3 days ago
Tech companies have strangled their own market.
Comment by thewebguyd 3 days ago
Comment by htk 3 days ago
Nothing new here, apart from being able to use CUDA on a less power hungry system.
Comment by bigyabai 3 days ago
> Nothing new here, apart from being able to use CUDA on a less power hungry system.
CUDA has been running on ARM SOCs since the Tegra K1, 12 years ago. Nvidia is not new to ARM, nor is CUDA.
Comment by oldnetguy 3 days ago
Comment by npn 3 days ago
Up to $5000 because why not?
With that money you can build a real PC with an RTX 5090!
Comment by thewebguyd 3 days ago
Comment by derefr 3 days ago
> The memory is not as fast as dedicated GPU memory, but it is cheap enough while delivering enough bandwidth to run AI models locally.
So, the reason "dedicated GPU memory" is fast, isn't because it's "dedicated"; it's because the types of memory built into GPU cards — GDDR and HBM — are designed for throughput over latency.
Which is to say, GDDR and HBM memory could be shared with the CPU in UMA while still being "fast" (for GPU use-cases.) In fact, the PS4/5 and Xbox 360 / One X / Series consoles have UMA architectures that use GDDR memory as their main memory, with no regular DDR memory to be found.
What I don't understand: why don't we see UMA architectures where there's both regular DDR and GDDR/HBM memory mapped into the address space of the CPU+GPU? That seems like the best of both worlds: you'd have some memory that's "tuned" for random-access CPU usage (regular DDR), and some memory that's "tuned" for streaming GPU usage (GDDR/HBM), but either type of memory can still be put to the use it wasn't "tuned" for, just with slightly-worse performance.
I guess you'd need to do a bit of software work:
1. a bit of work in the OS kernel / malloc library to get CPU workloads to "prefer" allocating DDR memory over the GDDR/HBM memory until they've exhausted DDR memory (or maybe not, if you just tell the kernel the GDDR/HBM memory is something like a zswap thinpool);
2. and a bit of work in supported ML frameworks, to teach them about a hybrid strategy between UMA "allocate anywhere, it's all the same" and NUMA "keep assets in VRAM if possible; if you spill assets to RAM, then they must stream into VRAM on access" (i.e. "at allocation time, allocate as if the system were NUMA, VRAM first then spilling to RAM; but at execution time, use the UMA codepaths, no need to copy RAM into VRAM.")
...but once that's done, it's done.
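The placement policy in the two steps above can be sketched as a toy allocator. This is only an illustration under stated assumptions: the `Pool` and `HybridAllocator` names are hypothetical, sizes are abstract units, and no real kernel or ML framework API is modeled. CPU allocations prefer DDR and spill to GDDR/HBM; GPU allocations prefer GDDR/HBM and spill to DDR, with no copy step because both pools share one address space.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """One memory type in the shared address space (sizes in abstract units)."""
    name: str
    capacity: int
    used: int = 0

    def try_alloc(self, size: int) -> bool:
        if size <= self.capacity - self.used:
            self.used += size
            return True
        return False


class HybridAllocator:
    """Hypothetical policy: pick the 'tuned' pool first, spill to the other one."""

    def __init__(self, ddr_size: int, gddr_size: int) -> None:
        self.ddr = Pool("DDR", ddr_size)
        self.gddr = Pool("GDDR/HBM", gddr_size)

    def alloc(self, size: int, workload: str) -> str:
        if workload == "cpu":
            order = (self.ddr, self.gddr)   # latency-tuned memory first
        elif workload == "gpu":
            order = (self.gddr, self.ddr)   # bandwidth-tuned memory first
        else:
            raise ValueError(f"unknown workload: {workload!r}")
        for pool in order:
            if pool.try_alloc(size):
                return pool.name            # placement only, nothing is copied
        raise MemoryError("both pools exhausted")


if __name__ == "__main__":
    allocator = HybridAllocator(ddr_size=8, gddr_size=4)
    print(allocator.alloc(3, "gpu"))  # fits in the GPU-tuned pool
    print(allocator.alloc(3, "gpu"))  # spills to DDR
```

The returned pool name only records where the allocation landed; at execution time either side would access it directly, which is the UMA half of the hybrid strategy.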
Comment by Rohansi 3 days ago
Comment by sherazp995 3 days ago
Nvidia going from GPU to CPU now?
Comment by wmf 3 days ago
Comment by userbinator 3 days ago
Comment by TiredOfLife 3 days ago
Comment by throwaway5752 3 days ago
Comment by thrance 3 days ago
Comment by buffer_overlord 3 days ago
Comment by danielovichdk 3 days ago
Must be a new business model.
....
Step into my office
Why ?
Because you are fucking fired
Comment by effnorwood 2 days ago
Comment by sometimelurker 3 days ago
Comment by easygenes 2 days ago
Comment by theturtle 3 days ago
Comment by sylware 2 days ago
Comment by sisve 3 days ago
Bill Gates had a quote some years ago...
People still have not learned how fast we improve our tech and how much cheaper things get, I guess :)
Comment by chaostheory 3 days ago
Comment by sisve 2 days ago
Comment by dgellow 3 days ago
Comment by sisve 3 days ago
Comment by dgellow 3 days ago