I've been writing ring buffers wrong all these years (2016)
Posted by flaghacker 3 days ago
Comments
Comment by RossBencina 1 day ago
The author says that non-power-of-two is not possible, but I'm pretty sure it is if you use a conditional instead of integer modulus.
I first learnt of this technique from Phil Burk, we've been using it in PortAudio forever. The technique is also widely known in FPGA/hardware circles, see:
"Simulation and Synthesis Techniques for Asynchronous FIFO Design", Clifford E. Cummings, Sunburst Design, Inc.
https://twins.ee.nctu.edu.tw/courses/ip_core_04/resource_pdf...
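A minimal sketch of the conditional-wrap idea for a capacity that is not a power of two, assuming both indices are kept in the range [0, 2*CAP) so that full and empty stay distinguishable. The names (ring6, CAP, ring_advance, ring_slot) are made up for illustration and are not taken from PortAudio or the article:

```c
#include <assert.h>
#include <stdbool.h>

#define CAP 6u  /* hypothetical capacity, deliberately not a power of two */

typedef struct {
    unsigned rd, wr;   /* each index lives in [0, 2*CAP) */
    int data[CAP];
} ring6;

/* Advance an index by one, wrapping at 2*CAP with a compare instead of '%'. */
static unsigned ring_advance(unsigned i) {
    i++;
    return (i == 2u * CAP) ? 0u : i;
}

/* Map an index in [0, 2*CAP) to a storage slot in [0, CAP). */
static unsigned ring_slot(unsigned i) {
    assert(i < 2u * CAP);
    return (i >= CAP) ? i - CAP : i;
}

/* Number of stored elements, again without a modulo. */
static unsigned ring_count(const ring6 *r) {
    return (r->wr >= r->rd) ? r->wr - r->rd : r->wr + 2u * CAP - r->rd;
}

static bool ring_push(ring6 *r, int v) {
    if (ring_count(r) == CAP) return false;   /* full */
    r->data[ring_slot(r->wr)] = v;
    r->wr = ring_advance(r->wr);
    return true;
}

static bool ring_pop(ring6 *r, int *out) {
    if (r->rd == r->wr) return false;         /* empty */
    *out = r->data[ring_slot(r->rd)];
    r->rd = ring_advance(r->rd);
    return true;
}
```

All CAP slots are usable here; the cost is a compare and a subtract on each access where the power-of-two version needs only a mask.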
Comment by hinkley 1 day ago
Intel is still 64 byte cache lines as they have been for quite a long time but they also do some shenanigans on the bus where they try to fetch two lines when you ask for one. So there’s ostensibly some benefit of aligning data particularly on linear scans to 128 byte alignment for cold cache access.
Comment by rcoveson 1 day ago
Also, there's another benefit downstream of that one: powers of two work as a Schelling point for allocations. Picking powers of two for resizable vectors maximizes "good luck" when you malloc/realloc in most allocators, in part because e.g. a buddy allocator is probably also implemented using power-of-two allocations for the above reason, but also for the plain reason that other users of the same allocator are more likely to have requested power-of-two allocations. Spontaneous coordination is a benefit all its own. Almost supernatural! :)
Comment by hinkley 17 hours ago
That has next to nothing to do with how much of your 128 GB of RAM should be dedicated to any one data structure, because working memory for a task is the sum of a bunch of different data structures that have to fit into both the caches and main memory, which used to be powers of two but now main memory is often 2^n x 3.
And as someone else pointed out, the optimal growth factor for resizable data structures is not 2 but the golden ratio, about 1.618. Most implementations use 1.5 (3/2), though.
Comment by loeg 1 day ago
Comment by kevin_thibedeau 1 day ago
Comment by KeplerBoy 22 hours ago
Comment by waffletower 16 hours ago
Comment by aidenn0 1 day ago
I first encountered this structure at a summer internship at a company making data switches.
Comment by tom_ 1 day ago
(People will probably moan at the idea of restarting the process periodically rather than fixing the issue properly, but when the period would be something like 50 years I don't think it's actually a problem.)
Comment by ale42 1 day ago
On a 64-bit platform, sure. When you're working on ring buffers with an 8-bit microcontroller, using 64-bit numbers would be such an overhead that nobody would even think of it.
Comment by thaumasiotes 1 day ago
I think you have that backwards. If something needs to be done every week, it will get done every week. That's not a problem.
If something needs to be done every fifty years, you'll be lucky if it happens once.
Comment by tom_ 1 day ago
Comment by gpderetta 22 hours ago
Comment by reincarnate0x14 1 day ago
Just as an example, the Voyager computers have been restarted, and that's been almost 50 years.
Comment by zephen 1 day ago
That may or may not be part of the actual definition of a ring buffer, but every ring buffer I have written had those goals in mind.
And the first method mentioned in the article fully satisfies this, except for the one missing element mentioned by the author. Which in practice, often is not only not a problem, but simplifies the logic so much that you make up for it in code space.
Or, for example, say you have a 256 character buffer. You really, really want to make sure you don't waste that one character. So you increase the size of your indices. Now they are 16 bits each instead of 8 bits, so you've gained the ability to store 256 bytes by having 260 bytes of data, rather than 255 bytes by having 258 bytes of data.
Obviously, if you have a 64 byte buffer, there is no such tradeoff, and the third example wins (but, whether you are doing the first or third example, you still have to mask the index data off at some point, whether it's on an increment or a read).
> The author says that non-power-of-two is not possible, but I'm pretty sure it is if you use a conditional instead of integer modulus.
There's "not possible" and then "not practical."
Sure, you could have a 50 byte buffer, but now, if your indices are ever >= 50, you're subtracting 50 before accessing the array, so this will increase the code space (and execution time).
> The [index size > array size] technique is also widely known in FPGA/hardware circles
Right, but in those hardware circles, power-of-two _definitely_ matters. You allocate exactly one extra bit for your pointers, and you never bother manually masking them or taking a modulo or anything like that -- they simply roll over.
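A software analogue of the "one extra bit" scheme, as a sketch only: 128 entries need 7 address bits, so 8-bit pointers carry exactly one extra bit and roll over on their own at 256. Unlike the hardware case, software still masks on the array access. The fifo128 type and function names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FIFO_SIZE 128u  /* 2^7 entries; uint8_t pointers carry one extra bit */

typedef struct {
    uint8_t rd, wr;            /* free-running, wrap naturally at 256 */
    uint8_t data[FIFO_SIZE];
} fifo128;

/* Occupancy is the modulo-256 difference of the two pointers. */
static uint8_t fifo_count(const fifo128 *f) {
    return (uint8_t)(f->wr - f->rd);
}

static bool fifo_empty(const fifo128 *f) { return f->rd == f->wr; }
static bool fifo_full(const fifo128 *f)  { return fifo_count(f) == FIFO_SIZE; }

static void fifo_push(fifo128 *f, uint8_t v) {
    assert(!fifo_full(f));
    f->data[f->wr % FIFO_SIZE] = v;   /* keeps only the low 7 bits */
    f->wr++;                          /* 255 + 1 rolls over to 0 */
}

static uint8_t fifo_pop(fifo128 *f) {
    assert(!fifo_empty(f));
    uint8_t v = f->data[f->rd % FIFO_SIZE];
    f->rd++;
    return v;
}
```

Because 256 is a multiple of 128, the pointer rollover and the slot wrap always line up, which is the property the non-power-of-two sizes lose.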
If you really, really need to construct something like a 6 entry FIFO in hardware, then you have techniques available to you that mere mortal programmers could not use efficiently at all. For example, you could construct a drop-through FIFO, where every element traverses every storage slot (with a concomitant increase in minimum latency to 6 clock cycles), or you could construct 4 bit indices that counted 0-1-2-3-4-5-8-9-10-11-12-13-0-1-2 etc.
Most ring buffers, hardware or software, are constructed as powers of two, and most ring buffers either (a) have so much storage that one more element wouldn't make any difference, or (b) have the ability to apply back pressure, so one more element wouldn't make any difference.
Comment by azemetre 1 day ago
Comment by RossBencina 23 hours ago
Comment by ErroneousBosh 1 day ago
I don't see why it wouldn't be, it's just computationally expensive to take the modulo value of the pointer rather than just masking off the appropriate number of bits.
Comment by myrmidon 21 hours ago
The problem is incrementing past the index integer type limit.
Consider a simple example with ring buffer size 9, and 16bit indices:
When you increment the write index from 0xffff to 0, your "masked index" jumps from 6 (0xffff % 9) to 0 (instead of 7).
There is no elegant fix that I'm aware of (using a very wide index type, like possibly a uint64, is extremely non-elegant).
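The jump is easy to reproduce. A small sketch, with slot_of and slot_after_increment as made-up helper names, comparing size 9 against size 8 with a free-running 16-bit index:

```c
#include <assert.h>
#include <stdint.h>

/* Storage slot for a free-running 16-bit index and a given buffer size. */
static unsigned slot_of(uint16_t index, unsigned size) {
    assert(size > 0u);
    return index % size;
}

/* Slot after one increment; the 16-bit index itself wraps modulo 65536. */
static unsigned slot_after_increment(uint16_t index, unsigned size) {
    uint16_t next = (uint16_t)(index + 1u);
    return slot_of(next, size);
}
```

With size 9, slot_of(0xffff, 9) is 6 and slot_after_increment(0xffff, 9) is 0, so slots 7 and 8 are skipped. With size 8 the same calls give 7 and then 0, which is the correct successor, because 65536 is a multiple of 8 but not of 9.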
Comment by ErroneousBosh 20 hours ago
There's probably no good reason to make your buffer sizes NOT a power of two, though. If memory's that tight, maybe look elsewhere first.
Comment by myrmidon 18 hours ago
If you swap bitmasking for modulo operations then that does work at first glance, but breaks down when the index wraps around. This forces you to abandon the simple "increment" operation for something more complex, too.
The requirement for a power-of-two size is more intrinsic to the approach than just the bitmasking operation itself.
Comment by ErroneousBosh 17 minutes ago
Comment by atq2119 1 day ago
Comment by msm_ 18 hours ago
Comment by cuno 1 day ago
For non-power of two, just checked our own very old circular byte buffer library code and using the notation from this article, it is:
entriesAllocated() { return ((wrPtr-rdPtr+2*bufSize) % (2*bufSize)); }
remainingSpace() { return bufSize - entriesAllocated(); }
isEmpty() { return (entriesAllocated()==0); }
isFull() { return (entriesAllocated()==bufSize); }
incWr(int n) { wrPtr = (wrPtr+n) % (2*bufSize); }
incRd(int n) { rdPtr = (rdPtr+n) % (2*bufSize); }
The 2*bufSize gives you an extra bit (beyond representing bufSize) that lets you disambiguate empty vs full. And if it is a constant power of two (e.g. via C++ template), then you can see how this just compiles into a bitmask instead, like the author's version. You read and write the buffer at (rdPtr%bufSize) and (wrPtr%bufSize) respectively.
Comment by Someone 1 day ago
It is, but, IMO, shouldn’t use the code for “an n-element ring buffer, with n set to 1”, similarly to how an array of booleans in many languages shouldn’t be implemented as “an array of Foos, with Foo set to bool”.
C++ has std::bitset and std::vector and Java similarly has BitSet and Array because using the generic code for arrays of bits is too wasteful.
Similarly, a one-element ring buffer is either full or it is empty. Why use two indexes to encode a single boolean?
Comment by cpgxiii 1 day ago
Rather infamously, C++ tried to be clever here and std::vector<bool> is not just a vector-of-bools but instead a totally different vector-ish type that lacks many of the important properties of every other instantiation of std::vector. Yes, a lot of the time you want the space efficiency of a dynamic bitset, rather than wasting an extra 7 bits per element. But also quite often you do want the behavior of a "real" std::vector for true/false values, and then you have to work around it manually (usually via std::vector<uint8_t> or similar) to get the expected behavior.
Comment by jsnell 1 day ago
Depending on the element width, you'd have space for different amounts of data in the inline buffer. Sometimes 1, sometimes a few more. Specializing for a one-element inline buffer would be quite complex with limited gains.
In retrospect trying to use that as a running gag for the blog post did not work well without actually giving the full context, but the full context would have been a distraction.
Comment by andrepd 1 day ago
Notably, this is not the case. C++ std::vector is specialised for bools to pack bits into words, causing an untold array (heh) of headaches.
And "wasteful" is doing a lot of lifting here. In terms of memory usage? Yes. In terms of CPU? The other way around.
Comment by mbel 1 day ago
That depends on your architecture and access pattern. In case of sequential access, packed bools may perform better due to arithmetic being usually way cheaper than memory operations.
Comment by ekropotin 1 day ago
It feels like 90% swe jobs these days are about writing CRUD wrappers.
Comment by avadodin 1 day ago
Mostly Type 1 and overflow is a diagnostic log at most. Losing all stale unprocessed data and leaving a ready empty buffer behind is often the desired outcome.
Type 3 is probably banned on most codebases because of the integer overflow.
Comment by zephen 1 day ago
Yeah, the Type 3 example could conceivably make it so that you intermix old and new data if you overflow, rather than just dumping a whole buffer.
Especially when your full() function checks for exact equality, like the one in the article does.
And if you remove the asserts, and then somehow underflow? God help you. You'll be pulling 4 billion entries you never actually stored out of the buffer, just repeating previously stored garbage over and over.
> Type 3 is probably banned on most codebases because of the integer overflow.
Not only this, but the purported code reduction benefits associated with type 3 are only superficial, and won't actually appear in any assembly listing.
Comment by Krssst 1 day ago
Signed integer overflow is definitely a problem however. Something as simple as incrementing a user-provided int can lead to UB (if the user provides INT_MAX).
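A short sketch of the difference, with hypothetical helper names: the signed increment has to be guarded before it happens, while the unsigned one is defined by the C standard to wrap.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Store value + 1 in *out and return true, or return false when the
 * increment would overflow (which would be undefined behaviour). */
static bool checked_increment(int value, int *out) {
    if (value == INT_MAX) return false;
    *out = value + 1;
    return true;
}

/* Unsigned arithmetic wraps modulo 2^N by definition, so no guard is needed. */
static unsigned wrapping_increment(unsigned value) {
    return value + 1u;
}
```

The check has to come before the addition; testing the result afterwards is too late, since the overflow has already occurred.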
Comment by RealityVoid 1 day ago
Comment by Neywiny 1 day ago
But in all honesty, look for more embedded jobs, then. We can certainly use the help.
Comment by nathan_douglas 1 day ago
Comment by Neywiny 21 hours ago
So I work with microcontrollers from various vendors, I do FPGA work with hard and soft processors, I recently got just past the smoke test with embedded Linux on an SoC, and I've written plenty of desktop code on Linux and Windows for interfacing. I get to work with a wide range of devices and a wide range of tasks for them. It might not pay as much, but my goodness is it fun.
Comment by ekropotin 1 day ago
Comment by Neywiny 1 day ago
Comment by IsTom 21 hours ago
Comment by RealityVoid 1 day ago
Comment by dang 1 day ago
I've been writing ring buffers wrong all these years - https://news.ycombinator.com/item?id=13175832 - Dec 2016 (167 comments)
Comment by codeworse 3 days ago
Comment by zephen 1 day ago
The first approach is lock-free, but as the author says, it wastes an element.
But here's the thing. If your element is a character, and your buffer size is, say, 256 bytes, and you are using 8-bit unsigned characters for indices, the one wasted byte is less than one percent of your buffer space, and also is compensated for by the simplicity and reduced code size.
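For reference, a sketch of that layout: a 256-byte buffer with 8-bit indices that are used directly as array positions and wrap on their own, at the cost of one unusable slot. The ring256 names are made up, and this single-threaded version ignores the atomics and memory ordering a real producer/consumer pair would need:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t head;        /* next slot to write */
    uint8_t tail;        /* next slot to read */
    uint8_t data[256];
} ring256;

static bool ring256_empty(const ring256 *r) {
    return r->head == r->tail;
}

/* Full when advancing head would make it equal to tail: 255 usable slots. */
static bool ring256_full(const ring256 *r) {
    return (uint8_t)(r->head + 1u) == r->tail;
}

static bool ring256_put(ring256 *r, uint8_t c) {
    if (ring256_full(r)) return false;
    r->data[r->head] = c;
    r->head++;           /* uint8_t wraps 255 -> 0, no mask needed */
    return true;
}

static bool ring256_get(ring256 *r, uint8_t *c) {
    if (ring256_empty(r)) return false;
    *c = r->data[r->tail];
    r->tail++;
    return true;
}
```

Only the writer touches head and only the reader touches tail, and there is no masking or modulo anywhere.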
Comment by fullstop 1 day ago
Comment by zephen 1 day ago
The article author claims that the "don't waste an element" code is also more efficient, but that claim seems to be based on a hard-on about the post-increment operator, rather than any kind of dive into the cyclomatic complexity, or even, y'know, just looking at the assembler output from the compiler.
Comment by mrcode007 1 day ago
There is one more mechanism that allows implementing ring buffers without having to compare head and tail buffers at all (and doesn’t rely on counters or empty/full flags etc) that piggybacks on the cache consistency protocol
Comment by dooglius 1 day ago
Comment by mrcode007 8 hours ago
I know there are academic definitions of wait-free and lock-free, but folks often use those terms incorrectly, as a slogan that something is magically better because it's "lock-free".
Imagine how _you_ would implement a read-modify-write atomic in the CPU and why E stands for exclusive (sort of like exclusive in a mutex)
Comment by wat10000 1 day ago
In this sense, the hardware locks used for atomic instructions don't really count, because they're implemented such that they can only be held for a brief, well defined time. There's no equivalent to suspending a thread while it holds a lock, causing all other threads to wait for an arbitrary amount of time.
Comment by spockz 1 day ago
Comment by mrcode007 8 hours ago
[1]https://www.microsoft.com/en-us/research/publication/concurr... [2]https://arxiv.org/pdf/1012.1824
Comment by nly 1 day ago
Makes the code trivial
Comment by thrtythreeforty 19 hours ago
Comment by Mikhail_Edoshin 1 day ago
(I think this was published in one of Llang's papers but in a rather obscure language.)
Comment by grumbelbart2 1 day ago
Comment by Mikhail_Edoshin 1 day ago
Comment by kybernetikos 1 day ago
Comment by z3t4 1 day ago
Comment by danhau 1 day ago
In the audio domain, the reader and writer are usually allowed to trample over each other. If you've ever gamed on a PC, you might have heard this. When a game freezes, sometimes you hear a short loop of audio playing until the game unfreezes. That's a ring buffer whose writer has stopped, but the async reader is still reading the entire buffer.
Zig's "There are too many ring buffer implementations in the standard library" might also be interesting:
Comment by aldonius 21 hours ago
Choose N to be a power of two >= the length of your filter.
Increment index i mod N, write the sample at buffer position x[i], output sum of x[i-k mod N] * a[k] where a[k] are your filter coefficients, repeat with next sample at next time step.
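A sketch of that recipe, assuming a 5-tap filter and N = 8; the names fir_state and fir_step and both sizes are illustrative only:

```c
#include <assert.h>

#define N 8u      /* hypothetical history length: power of two >= TAPS */
#define TAPS 5u   /* hypothetical filter length */

typedef struct {
    unsigned i;       /* position of the most recent sample */
    double x[N];      /* circular sample history, expected to start zeroed */
} fir_state;

/* Feed one sample; return the sum over k of x[(i - k) mod N] * a[k]. */
static double fir_step(fir_state *s, const double a[TAPS], double sample) {
    assert(s->i < N);
    s->i = (s->i + 1u) & (N - 1u);       /* increment i mod N */
    s->x[s->i] = sample;
    double acc = 0.0;
    for (unsigned k = 0; k < TAPS; k++) {
        /* Unsigned subtraction wraps, and the mask then reduces it mod N. */
        acc += s->x[(s->i - k) & (N - 1u)] * a[k];
    }
    return acc;
}
```

Feeding a single 1.0 followed by zeros reads the coefficients back out in order, which is a quick way to check the indexing. The (i - k) trick relies on N dividing 2^32, so it only works for power-of-two N.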
Comment by gpderetta 1 day ago
Comment by spacechild1 23 hours ago
Huh? Anytime you want to restrict the buffer to a specific size, you will have to support non-power-of-two capacities. There are cases where the capacity of the ring buffer determines the latency of the system, e.g. an audio ringbuffer or a network jitter buffer.