Performance Hints (2023)
Posted by danlark1 1 day ago
Comments
Comment by xnx 1 day ago
L1 cache reference 2,000,000,000 ops/sec
L2 cache reference 333,333,333 ops/sec
Branch mispredict 200,000,000 ops/sec
Mutex lock/unlock (uncontended) 66,666,667 ops/sec
Main memory reference 20,000,000 ops/sec
Compress 1K bytes with Snappy 1,000,000 ops/sec
Read 4KB from SSD 50,000 ops/sec
Round trip within same datacenter 20,000 ops/sec
Read 1MB sequentially from memory 15,625 ops/sec
Read 1MB over 100 Gbps network 10,000 ops/sec
Read 1MB from SSD 1,000 ops/sec
Disk seek 200 ops/sec
Read 1MB sequentially from disk 100 ops/sec
Send packet CA->Netherlands->CA 7 ops/secComment by twotwotwo 1 day ago
If the reciprocal numbers are more intuitive for you you can still say an L1 cache reference takes 1/2,000,000,000 sec. It's "ops/sec" that makes it look like it's a throughput.
An interesting thing about the latency numbers is they mostly don't vary with scale, whereas something like the total throughput with your SSD or the Internet depends on the size of your storage or network setups, respectively. And aggregate CPU throughput varies with core count, for example.
I do think it's still interesting to think about throughputs (and other things like capacities) of a "reference deployment": that can affect architectural things like "can I do this in RAM?", "can I do this on one box?", "what optimizations do I need to fix potential bottlenecks in XYZ?", "is resource X or Y scarcer?" and so on. That was kind of done in "The Datacenter as a Computer" (https://pages.cs.wisc.edu/~shivaram/cs744-readings/dc-comput... and https://books.google.com/books?id=Td51DwAAQBAJ&pg=PA72#v=one... ) with a machine, rack, and cluster as the units. That diagram is about the storage hierarchy and doesn't mention compute, and a lot has improved since 2018, but an expanded table like that is still seems like an interesting tool for engineering a system.
Comment by zahlman 8 hours ago
The "Read 1MB from SSD" entry translates into a higher throughput (still not as high as you imply, but "SSD" is also a broad category ranging from SATA-connected devices though I think five generations of NVMe now); I assume the "Read 4KB" timing really describes a single, isolated page read which would be rather difficult to parallelize.
Comment by chrisweekly 20 hours ago
Comment by VorpalWay 1 day ago
For example, a modern CPU will be able to execute other instructions while waiting for a cache miss, and will also be able to have multiple cache loads in flight at once (especially for caches shared between cores).
Main memory is asynchronous too, so multiple loads might be in flight, per memory channel. Same goes for all the other layers here (multiple SSD transactions in flight at once, multiple network requests, etc)
Approximately everything in modern computers is async at the hardware level, often with multiple units handling the execution of the "thing". All the way from the network and SSD to the ALUs (arithmetic logic unit) in the CPU.
Modern CPUs are pipelined (and have been since the mid to late 90s), so they will be executing one instruction, decoding the next instruction and retiring (writing out the result of) the previous instruction all at once. But real pipelines have way more than the 3 basic stages I just listed. And they can reorder, do things in parallel, etc.
Comment by SeanSullivan86 1 day ago
I was interviewing recently and was asked about implementing a web crawler and then were discussing bottlenecks (network fetching the pages, writing the content to disk, CPU usage for stuff like parsing the responses) and parallelism, and I wanted to just say "well, i'd test it to figure out what I was bottlenecked on and then iterate on my solution".
Comment by moab 22 hours ago
Your question about what degree of parallelization is unfortunately too vague to really answer. SSDs offer some internal parallelism. Need more parallelism / IOPS? You can stick a lot more SSDs on your machine. Need many machines worth of SSDs? Disaggregate them, but now you need to think about your network bandwidth, NICs, cross-machine latency, and fault-tolerance.
The best engineers I've seen are usually excellent at napkin math.
Comment by jesse__ 1 day ago
Both ops/sec and sec/op vary on clock rate, and clock rate varies across machines, and along the execution time of your program.
AFAIK, Cycles (a la _rdtsc) is as close as you can get to a stable performance measurement for an operation. You can compare it on chips with different clock rates and architectures, and derive meaningful insight. The same cannot be said for op/sec or sec/op.
Comment by cogman10 1 day ago
Most modern CPUs are out of order executors. That means that while a floating point operation might take 4 cycles to complete, if you put a bunch of other instructions around it like adds, divides, and multiplies, those will all finish at roughly the same time.
That makes it somewhat hard to reason about exactly how long any given set of operations will be. A FloatMul could take 4 cycles on it's own, and if you have
FloatMul
ADD
MUL
DIV
That can also take 4 cycles to finish. It's simply not as simple as saying "Let's add up the cycles for these 4 ops to get the total cycle count".Realistically, what you'll actually be waiting on is cache and main memory. This fact is so reliable that it underpins SMT. It's why most modern CPUs will do that in some form.
Comment by jesse__ 1 day ago
---
Counter-nitpick .. your statement of "if you put a bunch of other instructions around it" assumes there are no data dependencies between instructions.
In the example you gave:
FloatMul
ADD
MUL
DIV
Sure .. if all of those are operating on independent data sources, they could conceivably retire on the same cycle, but in the context of the article (approximating the performance profile of a series of operations) we're assuming they have data dependencies on one another, and are going to be executed serially.Comment by yvdriess 1 day ago
Comment by CalChris 1 day ago
Comment by yvdriess 1 day ago
Comment by CalChris 1 day ago
Comment by yvdriess 1 day ago
Comment by barchar 21 hours ago
Usually the problem is an unfortunately placed spill, so the operation is actually l1d$ traffic, but still.
Comment by zahlman 8 hours ago
I don't know how to interpret this.
Comment by barfoure 1 day ago
Comment by svat 19 hours ago
Comment by zahlman 8 hours ago
Comment by svat 7 hours ago
Comment by danlark1 18 hours ago
Comment by justicehunter 16 hours ago
Comment by squirrellous 22 hours ago
I particularly like the “what to do for flat profiles” ad “protobuf tips” sections. Similar advice distilled to this level is difficult to find elsewhere.
Comment by jesse__ 1 day ago
Comment by canyp 21 hours ago
I think I'd rather be eaten by a giant crustacean than work on AI.
Comment by barfoure 1 day ago
Comment by simonask 1 day ago
A lot of code can be pessimized by golfing instruction counts, hurting instruction-level parallelism and microcode optimizations by introducing false data dependencies.
Compilers outperform humans here almost all the time.
Comment by Pannoniae 1 day ago
However, that doesn't mean that looking at the generated asm / even writing some is useless! Just because you can't globally outperform the compiler, doesn't mean you can't do it locally! If you know where the bottleneck is, and make those few functions great, that's a force multiplier for you and your program.
Comment by simonask 1 day ago
Comment by jesse__ 1 day ago
I'm going to be annoying and nerd-snipe you here. It's, generally, really easy to beat the compiler.
Comment by dardeaup 1 day ago
Can you explain what this phrase means?
Comment by simonask 1 day ago
It means that the shorter sequence of instructions is not necessarily faster, and can in fact make the CPU stall unnecessarily.
The fastest sequence of instructions is the one that makes the best use of the CPU’s resources.
Comment by sevensor 8 hours ago
Comment by barfoure 1 day ago
Compilers make mistakes too and they can output very erroneous code. But that’s a different topic.
Comment by jesse__ 1 day ago
"Compilers can do all these great transformations, but they can also be incredibly dumb"
-Mike Acton, CPPCON 2014