30 Years of HPC: many hardware advances, little adoption of new languages
Posted by matt_d 3 days ago
Comments
Comment by jandrewrogers 6 hours ago
How people imagine scalable parallelism works and how it actually works doesn’t have a lot of overlap. The code is often boringly single-threaded because that is optimal for performance.
The single biggest resource limit in most HPC code is memory bandwidth. If you are not addressing this then you are not addressing a real problem for most applications. For better or worse, C++ is really good at optimizing for memory bandwidth. Most of the suggested alternative languages are not.
It is that simple. The new languages address irrelevant problems. It is really difficult to design a language that is more friendly to memory bandwidth than C++. And that is the resource you desperately need to optimize for in most cases.
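As a rough illustration of what "optimizing for memory bandwidth" can mean in practice, here is a minimal C++ sketch of the array-of-structs versus struct-of-arrays layout choice. It is not taken from the comment above; the names `ParticleAoS`, `ParticlesSoA`, `sum_x_aos` and `sum_x_soa` are hypothetical, and the byte counts in the comments assume 8-byte doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structs: the fields of one particle sit together, so a loop that
// only reads x still pulls y, z and mass through the cache.
struct ParticleAoS {
    double x, y, z, mass;
};

// Struct-of-arrays: each field is stored contiguously, so a loop over x
// touches only the bytes it actually needs.
struct ParticlesSoA {
    std::vector<double> x, y, z, mass;
};

double sum_x_aos(const std::vector<ParticleAoS>& ps) {
    double total = 0.0;
    for (const ParticleAoS& p : ps) total += p.x;  // uses 8 of every 32 bytes streamed
    return total;
}

double sum_x_soa(const ParticlesSoA& ps) {
    double total = 0.0;
    for (double x : ps.x) total += x;  // every byte streamed is used
    return total;
}
```

Both functions return the same sum; the difference is only in how much memory traffic the loop generates, which is the kind of control over layout the comment is pointing at.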
Comment by bruce343434 6 hours ago
Comment by bayindirh 27 minutes ago
C++ is better than FORTRAN because, while FORTRAN is still being developed and is quite fast, doing anything other than what core FORTRAN is good at is hard. At the end of the day, it computes and works well with MPI. That's mostly all.
C++ is better than C, because it can accommodate C code inside and has many more convenience functions and libraries around, and modern C++ can be written more concisely than C, with minimal or no added overhead.
Also, all three languages are studied so well that advanced programmers can look at a piece of code and say, "I can fit that into the cache, that'll work, that's fine".
"More modern" programming languages really solve no urgent problems in HPC space and current code works quite well there.
Reported from another HPC datacenter somewhere in the universe.
Comment by lugu 4 hours ago
Comment by ozgrakkurt 47 minutes ago
There is a colossal ergonomics difference if you compare using clang vs rust to write a hashmap, for example.
C compilers just have everything you can think of because everything is first implemented there.
Using anything else just seems kind of pointless. I understand new languages do have benefits but I don't believe language matters that much really.
The person who writes that garbage pointer soup in C writes Arc<> + multi threaded + macro garbage soup in Rust.
Comment by _flux 3 hours ago
The rust compiler actually has similar things, but they're not available in stable builds. I suppose there are some issues of principle that keep them out of stable. E.g: https://doc.rust-lang.org/std/intrinsics/fn.prefetch_read_da...
Maybe some time in the future good, acceptable abstractions will be conceived for them. Perhaps just using nightly builds for HPC is not that far out, though.
Comment by ameliaquining 50 minutes ago
__builtin_assume is available on stable (though of course it's unsafe): https://doc.rust-lang.org/std/hint/fn.assert_unchecked.html
There's an open issue to stabilize the prefetch APIs: https://github.com/rust-lang/rust/issues/146941 As is usually the case when a minor standard-library feature remains unstable, the primary reason is that nobody has found the problem urgent enough to put in the required work to stabilize it. (There's an argument that this process is currently too inefficient, but that's a separate issue.) In the meantime, there are third-party libraries available that use inline assembly to offer this functionality, though this means they only support a couple of the most popular architectures.
Comment by m_mueller 28 minutes ago
Comment by _flux 10 minutes ago
As I understand it, the Fortran compiler just expects your code to respect the "restrictness", it doesn't enforce it.
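The same "trusted, not enforced" contract can be sketched in C++ with the `__restrict` qualifier. This is a compiler extension (GCC, Clang and MSVC accept it), not standard C++, and the function name `scale_add` is a hypothetical example rather than anything from the thread.

```cpp
#include <cassert>
#include <cstddef>

// __restrict is a promise from the programmer that dst and src do not
// overlap. The compiler does not check the promise; it only optimizes on it,
// much as a Fortran compiler assumes dummy arguments do not alias.
void scale_add(double* __restrict dst, const double* __restrict src,
               double factor, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] += factor * src[i];  // may be vectorized: no aliasing is assumed
    }
}
```

Calling `scale_add(a, a + 1, ...)` on overlapping ranges would still compile without complaint; the promise is simply broken and the result is no longer something you can rely on.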
Comment by j4k0bfr 6 hours ago
Comment by bayindirh 25 minutes ago
I managed to reach practical IPC limits of the hardware I was running on, and while I could theoretically make the prefetcher happier with some matrix reordering, looking back, I'm not sure how much performance it would have provided, since the FPU was already saturated at that point.
Comment by Narishma 5 hours ago
Comment by formerly_proven 3 hours ago
Comment by kmaitreys 2 hours ago
Comment by Joeboy 2 hours ago
Speaking authoritatively from my position as an incompetent C++ / Rust dev.
Comment by kmaitreys 37 minutes ago
Because passing pointers isn't as ergonomic in Rust, I do things in an arena-based way (for example setting up quadtrees or octrees). Is that part of the issue when it comes to memory bandwidth?
Comment by zozbot234 1 hour ago
Comment by kmaitreys 36 minutes ago
For now my plan is to write fairly similar style code as one may write in C++/Fortran through MPI bindings in Rust.
Comment by convolvatron 15 minutes ago
if you take that one step further and only use those objects on a single core, now your default model is lock-free non-shared objects. at large scale that becomes kind of mandatory. some large shared memory machines even forgo cache consistency because you really can't do it effectively at large scale anyways.
but all of this is highly platform dependent, and I wouldn't get too wrapped up around it to begin with. I would encourage you though to worry first about expressing your domain semantics, with the understanding that some refactoring for performance will likely be necessary.
if you have the patience, both personally and within the project, it can be a lot of fun to really get in there and think about the necessary dependencies and how they can be expressed on the hardware. there's a lot of cool tricks, for example trading off redundant computation to reduce the frequency of communication.
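A minimal C++ sketch of the lock-free, non-shared model described above: each worker owns one slot, nothing is shared while the work runs, and results are combined only after every worker has finished. The function name `parallel_sum`, the `Slot` type and the 64-byte cache-line figure are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Sum a vector with `workers` threads. No locks or atomics are needed,
// because each thread writes only to the slot it owns.
double parallel_sum(const std::vector<double>& data, std::size_t workers) {
    if (workers == 0) workers = 1;
    // Pad each slot to 64 bytes (assumed cache-line size) so neighbouring
    // workers do not falsely share a line.
    struct alignas(64) Slot { double value = 0.0; };
    std::vector<Slot> partial(workers);
    std::vector<std::thread> pool;
    const std::size_t chunk = (data.size() + workers - 1) / workers;

    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            const std::size_t begin = std::min(w * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            double local = 0.0;  // private to this thread
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[w].value = local;  // single write to this worker's own slot
        });
    }
    for (std::thread& t : pool) t.join();

    // The only "communication": one read per worker, after all have joined.
    double total = 0.0;
    for (const Slot& s : partial) total += s.value;
    return total;
}
```

On a distributed machine the final loop would become a reduction over the network, but the shape stays the same: private state during the work, one exchange at the end.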
Comment by Joel_Mckay 4 hours ago
Rust is typically slowest (often negligible <3%), C++ has better CUDA support, and C can be heavily optimized with inline assembly (very unforgiving to juniors.)
Also, heavily associated with coding style =3
https://en.wikipedia.org/wiki/The_Power_of_10:_Rules_for_Dev...
Comment by Joel_Mckay 5 hours ago
Even with HDL defined accelerators, that statement may not mean what people assume. =3
https://en.wikipedia.org/wiki/Latency_(engineering)
https://en.wikipedia.org/wiki/Clock_domain_crossing
https://en.wikipedia.org/wiki/Metastability_(electronics)
https://en.wikipedia.org/wiki/The_Power_of_10:_Rules_for_Dev...
Comment by j4k0bfr 1 hour ago
Comment by Joel_Mckay 1 hour ago
https://www.youtube.com/watch?v=FujoiUMhRdQ
https://github.com/Spyros-2501/Z-turn-Board-V2-Diary
https://www.youtube.com/@TheDevelopmentChannel/playlists
https://myirtech.com/list.asp?id=708
The Debian Linux example includes how to interface hardware ports.
Almost always better to go with a simpler mcu if one can get away with it. Best of luck =3
Comment by j4k0bfr 1 hour ago
Comment by suuuuuuuu 4 hours ago
Memory bandwidth is often the problem, yes. Language abstractions for performance aim to, e.g., automatically manage caches (that must be handled manually in performant GPU code, for instance) with optimized memory tiling and other strategies. Kernel fusion is another nontrivial example that improves effective bandwidth.
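To show why fusion improves effective bandwidth, here is a small CPU-side sketch in C++ (plain loops standing in for GPU kernels). The function names are hypothetical. The unfused version writes an intermediate array and reads it back; the fused version does the same arithmetic in one pass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two separate "kernels": tmp is written to memory and then read again.
std::vector<double> scale_then_shift_unfused(const std::vector<double>& in,
                                             double a, double b) {
    std::vector<double> tmp(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) tmp[i] = a * in[i];
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = tmp[i] + b;
    return out;
}

// Fused "kernel": one read of in and one write of out per element,
// with no intermediate array travelling through memory.
std::vector<double> scale_then_shift_fused(const std::vector<double>& in,
                                           double a, double b) {
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = a * in[i] + b;
    return out;
}
```

For a bandwidth-bound loop like this, the fused form moves roughly half the data. Deciding when such a transformation is legal and profitable across a whole program is the part that is hard to do by hand.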
Adding on the diversity of hardware that one needs to target (both within and among vendors), i.e., portability not just of function but of performance, makes the need for better tooling abundantly obvious. C++ isn't even an entrant in this space.
Comment by pjmlp 3 hours ago
NVidia designs CUDA hardware specifically for the C++ memory model, they went through the trouble to refactor their original hardware across several years, so that all new cards would follow this model, even if PTX was designed as polyglot target.
Additionally, ISO C++ papers like senders/receivers are driven by NVidia employees working on CUDA.
Comment by suuuuuuuu 49 minutes ago
Comment by pjmlp 42 minutes ago
You can program CUDA in standard C++20, with CUDA libraries hiding the language extensions.
I love when C and C++ dialects are C and C++ when it matters, and not when it doesn't help to sell the ideas being portrayed.
Comment by suuuuuuuu 29 minutes ago
I'm extremely dubious that such an opaque abstraction can actually solve the (true) problem. "Not having to write CUDA" is not enough - how do you tune performance? Parallelization strategies, memory prefetching and arrangement in on-chip caches, when to fuse kernels vs. not... I don't doubt the compiler can do these things, but I do doubt that it can know at compile time what variants of kernel transformations will optimize performance on any given hardware. That's the real problem: achieving an abstraction that still gives one enough control to achieve peak performance.
Edit: you tell me if I'm wrong, but it seems that std::par can't even use shared memory, let alone let one control its usage? If so, then my point stands: C++ is not remotely relevant. Again, avoiding writing CUDA (etc.) doesn't solve the real problem that high-performance language abstractions aim to address.
Comment by fcanesin 3 hours ago
Comment by suuuuuuuu 50 minutes ago
Comment by Joel_Mckay 6 hours ago
In general, most modern CPU thread-safe code is still a bodge in most languages. If folks are unfortunate enough to encounter inseparable overlapping state sub-problems, then there is no magic pixie dust to escape the computational cost. On average, attempting to parallelize this type of code can end up >30% slower on identical hardware, and a GPU memory copy exchange can make it even worse.
Sometimes even compared to a large multi-core CPU, a pinned-core higher clock-speed chip will win out for those types of problems.
Thus, the reason most people revert to batching k copies of a single-core-bound, non-parallel version of a program is that it reduces latency, stalls, cache thrashing, i/o saturation, and interprocess communication costs.
Exchange costs only balloon higher across networks, as however fast the cluster partition claims to be... the physics is still going to impose space-time constraints, as modern data-centers will spend >15% of energy cost just moving stuff around networks for lower efficiency code.
I like languages like Julia, as it implicitly abstracts the broadcast operator to handle which areas may be cleanly unrolled. However, much like Erlang/Elixir the multi-host parallelization is not cleanly implemented... yet...
The core problem with HPC software has always been that academics are best modeled like hermit-crabs with facilities. Once a lucky individual inherits a nice new shell, the pincers come out to all smaller entities who may approach with competing interests.
Best of luck, =3
"Crabs Trade Shells in the Strangest Way | BBC Earth"
Comment by Xcelerate 3 hours ago
The other issue is that to really get the value out of these machines, you sort of have to tailor your code to the machine itself to some degree. The DOE likes to fund projects that really show off the unique capabilities of supercomputers, and if your project could in principle be done on the cloud or a university cluster, it’s likely to be rejected at the proposal stage. So it’s sort of “all or nothing” in the sense that many codebases for HPC are one-off or even have machine-specific adaptations (e.g., see LAMMPS). No new general purpose language would really make this easier.
Comment by jpecar 6 hours ago
Comment by jltsiren 5 hours ago
Distributed computing never really took off in bioinformatics, because most tasks are conveniently small. For example, a human genome is small enough that you can run most tasks involving a single genome on an average cost-effective server in a reasonable time. And that was already true 10–15 years ago. And if you have a lot of data, it usually means that you have many independent tasks.
Which is nice from the perspective of a tool developer. You don't have to deal with the bureaucracy of distributed computing, as it's the user's responsibility.
C++ is popular for developing bioinformatics tools. Some core tools are written in C, but actual C developers are rare. And Rust has become popular with new projects — to the extent that I haven't really seen C++20 or newer in the field.
Comment by zozbot234 1 hour ago
Comment by calvinmorrison 15 minutes ago
Comment by bluedino 36 minutes ago
I haven't talked to anyone writing C++ code on an HPC cluster that I'm working on in a long, long time. And that's in industrial/chemical/automotive fields.
Comment by jpecar 6 hours ago
So from what I see, the actual programming language doesn't matter as much as how the work is organized. Anything helping people simplify this task is of immediate benefit to the science.
Comment by jkh1 5 hours ago
Comment by JohnWabwire 4 hours ago
Comment by riffraff 7 hours ago
I've never worked in HPC but it seems it should be relatively simple to find a C/C++ dev that can pick up OpenMP, or one that already knows it, compared to hiring people who know Chapel.
The "scaling down" factor (how easy or interesting it is to use tool X for small use) seems a disadvantage of HPC-only languages, which creates a barrier to entry and a reduction in available workforce.
Comment by kinow 6 hours ago
And even knowing OpenMP or MPI may not suffice if the site uses older versions or heterogeneous approaches with CUDA, FPGA, etc. Knowing the language and the shared/distributed mem libs helps, but if your project needs a new senior dev then it may be a bit hard to find one (although popularity of company/HPC, salary, and location also play a role).
Comment by physicsguy 5 hours ago
So, e.g., when I did HPC simulation codes in magnetics, there was little point focusing on some of these areas because our codes were dominated by the long-range interaction cost which limited compute scaling. All of our effort was tuning those algorithms to the absolute max. We tried heterogeneous CPU + GPU but had very mixed results, and at that time (2010s) the GPU memory wasn't large enough for the problems we cared about either.
I then moved to CFD in industry. The concerns there were totally different since everything is grid local. Partitioning over multi-GPU is simple since only the boundaries need to be exchanged on each iteration. The problems there were much more on the memory bandwidth and parallel file system performance side.
Basically, you have to learn to solve whatever challenges get thrown up by the specific domain problem.
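The "only the boundaries need to be exchanged" point for grid-local problems can be sketched in C++ with a 1D grid split into two partitions. This is a single-process stand-in, not real multi-GPU or MPI code: `Subdomain`, `exchange_halo` and `jacobi_step` are hypothetical names, and in a real code the copy inside `exchange_halo` would be a message (e.g. an `MPI_Sendrecv` call) or a device-to-device transfer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One partition of a 1D grid. cells.front() and cells.back() are ghost
// (halo) cells mirroring a neighbour's edge; the rest are owned cells.
struct Subdomain {
    std::vector<double> cells;
    explicit Subdomain(std::size_t interior) : cells(interior + 2, 0.0) {}
};

// Stand-in for the per-iteration communication step: one value per side
// crosses the partition boundary, regardless of how large the subdomains are.
void exchange_halo(Subdomain& left, Subdomain& right) {
    left.cells.back() = right.cells[1];                       // right's first owned cell
    right.cells.front() = left.cells[left.cells.size() - 2];  // left's last owned cell
}

// Purely local 3-point averaging stencil over the owned cells.
void jacobi_step(Subdomain& d) {
    std::vector<double> next = d.cells;
    for (std::size_t i = 1; i + 1 < d.cells.size(); ++i) {
        next[i] = (d.cells[i - 1] + d.cells[i] + d.cells[i + 1]) / 3.0;
    }
    d.cells = next;
}
```

Each iteration is then `exchange_halo` followed by an independent `jacobi_step` on every partition, so the communication volume scales with the boundary while the compute scales with the interior.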
> And even knowing OpenMP or MPI may not suffice if the site uses older versions
To be fair, you always have the option of compiling yourself, but most people I met in academia didn't have the background to do this. Spack and EasyBuild make this much much easier.
Comment by bluedino 34 minutes ago
Comment by KaiserPro 5 hours ago
There are a couple of big things that are difficult to get your head around:
1) when and where to dispatch and split jobs (i.e. what's the setup cost of spinning up n binaries on n machines vs threading on y machines)
2) data exchange primitives. Shared file systems have quirks, and they differ from system to system. But most of the time it's better/easier/faster to dump shit to a file system than to some fancy database/object store. Until it's not. Distributed queues are great, unless you're using them wrong. Most of the time you need to use them wrong. (Shared-memory RPC is a whole other beast that fortunately I've never had to work with directly.)
3) dealing with odd failures. As the number of parallel jobs increases, the chance of getting at least one failure approaches 1. You need to bake in failure modes at the very start.
4) loading/saving data is often a bottleneck. Lots of efficiency comes from being clever in what you load, and _where_ you load it. (i.e. you have data affinity, which might be location based, or topology based, and you don't often have control over where your stuff is placed.)
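The failure point in 3) can be made concrete with a back-of-the-envelope formula. Assuming (as a simplification not stated above) that each of the n parallel jobs fails independently with the same probability p:

```latex
P(\text{at least one failure}) = 1 - (1 - p)^{n},
\qquad
\lim_{n \to \infty} \bigl[ 1 - (1 - p)^{n} \bigr] = 1 \quad \text{for any } p > 0 .
```

With illustrative numbers, p = 0.001 and n = 1000 gives about 0.63, and n = 5000 gives about 0.993. Even a very reliable single job makes a failure somewhere in the run close to certain at scale.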
Comment by chatmasta 3 hours ago
Comment by nnevatie 3 hours ago
Comment by swiftcoder 6 hours ago
Comment by jacquesm 5 hours ago
Comment by Joel_Mckay 5 hours ago
And Erlang has already run many telecom infrastructures for decades. Surprising given how fragile the multi-host implementation has proven.
Erlang/Elixir are neat languages, and right next to Julia for fun. =3
Comment by jacquesm 1 hour ago
Comment by Joel_Mckay 37 minutes ago
Haven't personally deployed this version yet. ymmv =3
Comment by pklausler 2 hours ago
(What HPC does need, IMNSHO, is to disband or disregard WG5/J3, get people who know what they're doing to fix the features they've botched or neglected for thirty years, and then have new procurements include RFCs that demand the fixed portable Fortran from system integrators rather than the ISO "standard".)
Comment by ivell 2 hours ago
Comment by DamonHD 5 hours ago
Comment by shevy-java 2 hours ago
It may not be dead, but it seems much harder for languages to gain adoption.
I think there are several reasons; I also suspect AI contributes a bit to this.
People usually specialize in one or two languages, so the more languages exist, the less variety we may see with regards to people ACTUALLY using the language. If I were, say, 15 years old, I might pick up python and just stick with it rather than experiment and try out many languages. Or perhaps not even write software at all, if AI auto-writes most of it anyway.
Comment by RhysU 4 hours ago
The article neglects that all of C, C++, and Fortran have evolved over the last 30 years.
Also, you'll find significant advances in the HPC library ecosystem over the trailing years. Consider, for example, Trilinos (https://trilinos.github.io/index.html) or Dakota (https://dakota.sandia.gov/about-dakota/), both of which push a ton of domain-agnostic capabilities into a C++ library instead of bolting them into a bespoke language. Communities of users tend to coalesce around shared libraries, not around creating new languages.
Comment by pjmlp 3 hours ago
Comment by RhysU 1 hour ago
This section, https://chapel-lang.org/blog/posts/30years/#ok-then-why, does not mention libraries at all.
Comment by pjmlp 44 minutes ago
> A fact of life in HPC is that the community has many large, long-lived codes written in languages like Fortran, C, and C++ that remain important. Such codes keep those languages at the forefront of peoples’ minds and sometimes lead to the belief that we can’t adopt new languages.
> In large part because of the previous point, our programming notations tend to take a bottom-up approach. “What does this new hardware do, and how can we expose it to the programmer from C/C++?” The result is the mash-up of notations that we have today, like C++, MPI, OpenMP, and CUDA. While they allow us to program our systems, and are sufficient for doing so, they also leave a lot to be desired as compared to providing higher-level approaches that abstract away the specifics of the target hardware.
Nothing there suggests the languages don't improve; anyone who follows ISO in particular knows where many of the improvements to Fortran, C and C++ are coming from.
For example, C++26 is probably going to get BLAS into the standard library, senders/receivers is being sponsored by CUDA money.
Another thing you missed from the author's background is that Chapel is sponsored by HPE and Intel, and one of the main targets is HPE Cray EX/XC systems; they know pretty well what is happening.
Comment by crabbone 5 hours ago
That's not to say that new things don't happen there, it's just that I find a lot of old stuff that was shown to be bad decades ago still being in vogue in HPC. Probably because it's a relatively small field with a lot of people there being academics and not a lot of migration to/from other fields.
You've probably never heard of `module` (either Tcl or Lmod). This is a staple of the HPC world. What this thing does is source some shell variables and functions into (or try to remove them from) the shell used either interactively or by a batch job. This is a beyond atrocious way to handle your working environment. The information leaks, becomes stale, and you often end up loading the wrong thing into your environment. It's simply amazing how bad this thing is. And yet, it's just everywhere in HPC.
Another example: running anything in HPC, basically, means running Slurm batch jobs. There are alternatives, but those are even worse (eg. OpenPBS). When you dig into the configuration of these tools, you realize they've been written for pre-systemd Linux and are held together by a shoestring of shell scripting. They seldom if at all do the right thing when it comes to logging or general integration with the environment they run in. They can be simultaneously on the bleeding edge (eg. cgroup integration or accelerator driver integration) and be completely backwards when it comes to having a sensible service definition for systemd (eg. try to manage their service dependencies on their own instead of relying on systemd to do that for them).
In other words, imagine a steam-punk world, but now it's in software. That's sort of what HPC feels like after a decade or so in more popular programming fields.
Also, a lot of code written for HPC is written the way it is not because the writer chose the language or the environment. The typical setup is: university IT created a cluster with whatever tools they managed to put there eons ago, and you, the code writer, have to deal with... using CentOS6 by authenticating to university's AD... in your browser... through JupyterLab interface. And there's nothing you can do about it because the IT isn't there, is incompetent to the bone and as long as you can get your work done somehow, you'd prefer that over fighting to perfect your toolchain.
Bottom line, unless a language somehow becomes indispensable in this world, no matter its advantages, it's not going to be used because of the huge inertia and general unwillingness to do anything beyond the minimum.
Comment by pama 3 hours ago
Your centos6 references made me chuckle :-)
Comment by zozbot234 1 hour ago
Comment by anewhnaccount2 4 hours ago
Comment by hpcgroup 38 minutes ago
Comment by kevinten10 6 hours ago
Comment by chinabot 4 hours ago