Bf-Tree: modern read-write-optimized concurrent larger-than-memory range index
Posted by SchwKatze 21 hours ago
Comments
Comment by algorithmsRcool 19 hours ago
Some of his other projects:
[0] https://github.com/microsoft/garnet
[1] https://github.com/microsoft/FASTER
[2] https://github.com/microsoft/Trill
Comment by reactordev 18 hours ago
Comment by the_arun 14 hours ago
Comment by koverstreet 16 hours ago
The paper also talks about the overhead of the mapping table for node lookups, and says "Bf-Tree by default pins the inner nodes in memory and uses direct pointer addresses to reference them. This allows a simpler inner node implementation, efficient node access, and reduced contention on the mapping table".
But you don't have to pin nodes in memory to use direct pointer lookups. Reserve a field in your key/value pair for a direct in-memory pointer, and after chasing it, check that you landed on the node you expect; fall back to the mapping table (i.e. the hash table of cached nodes) only if the pointer is uninitialized or the check fails.
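Concretely, something like this (a sketch with made-up names; it assumes nodes live in type-stable memory or under epoch protection, so chasing a stale pointer is safe):

    #include <stdint.h>
    #include <stddef.h>

    struct node {
        uint64_t id;            /* stable node id; also the mapping-table key */
        /* ... keys, children ... */
    };

    struct child_ref {
        uint64_t     node_id;   /* authoritative reference */
        struct node *cached;    /* direct pointer; may be NULL or stale */
    };

    /* the mapping table: hash of node_id -> cached node (defined elsewhere) */
    struct node *mapping_table_lookup(uint64_t node_id);

    static struct node *deref_child(struct child_ref *ref)
    {
        struct node *n = ref->cached;

        /* fast path: chase the pointer, then check we got the node
           we expected */
        if (n && n->id == ref->node_id)
            return n;

        /* slow path: pointer uninitialized, or the node was evicted
           and its memory reused */
        n = mapping_table_lookup(ref->node_id);
        ref->cached = n;        /* repair the cache for next time */
        return n;
    }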
"For write, conventional B-Tree performs the worst, as a single record update would incur a full page write, as evidenced by the highest disk IO per operation."
Only with a random distribution, but early on the paper talks about benchmarking with a Zipfian distribution. Err?
The benchmark does look like a purely random distribution, which is not terribly realistic for most use cases. The line about "a single record update incurring a full page write" also ignores the effect of cache size vs. working set size, which is a rather important detail. I can't say I trust the benchmark numbers.
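To be concrete about the difference: YCSB's default request distribution is Zipfian (its zipfian constant is 0.99), which concentrates most operations on a small hot set that stays cached, whereas a uniform generator touches cold pages constantly. A simplified sketch of the two generators (not YCSB's actual scrambled-Zipfian code):

    #include <stdlib.h>
    #include <math.h>

    #define NKEYS 1000000
    static double cdf[NKEYS];          /* cumulative Zipf distribution */

    void zipf_init(double theta)       /* theta = 0.99 in YCSB */
    {
        double sum = 0, acc = 0;
        for (int i = 0; i < NKEYS; i++)
            sum += 1.0 / pow(i + 1, theta);
        for (int i = 0; i < NKEYS; i++) {
            acc += 1.0 / pow(i + 1, theta);
            cdf[i] = acc / sum;
        }
    }

    int zipf_next(void)                /* skewed: rank 0 is hottest */
    {
        double u = (double)rand() / RAND_MAX;
        int lo = 0, hi = NKEYS - 1;
        while (lo < hi) {              /* binary search the CDF */
            int mid = (lo + hi) / 2;
            if (cdf[mid] < u) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    int uniform_next(void)             /* pure random: every key
                                          equally likely */
    {
        return rand() % NKEYS;
    }

With the Zipfian generator, most updates to a hot page are absorbed by the cache before the page gets written; with the uniform one, nearly every update forces its own page write - which is exactly the worst case being benchmarked.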
Prefix compression - nice to see this popping up.
"hybrid latching" - basically, they're doing what the Linux kernel calls seqlocks for interior nodes. This is smart, but given that their b-tree implementation doesn't use it, you shouldn't trust the b-tree benchmarks.
However, I found that approach problematic - it's basically software transactional memory, with all the complexity that implies, and it bleeds out into too much of the rest of your b-tree code. Using a different type of lock for interior nodes where read locks only use percpu counters gives the same performance (read locks touch no cachelines shared by other CPUs) for much lower complexity.
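For anyone unfamiliar, the seqlock read protocol looks roughly like this (a sketch; the kernel's seqlock.h is the real thing):

    #include <stdatomic.h>

    struct inner_node {
        _Atomic unsigned long seq;  /* odd = write in progress */
        /* ... keys, child pointers ... */
    };

    static unsigned long read_begin(struct inner_node *n)
    {
        unsigned long s;
        do {
            s = atomic_load_explicit(&n->seq, memory_order_acquire);
        } while (s & 1);            /* spin while a writer is active */
        return s;
    }

    static int read_retry(struct inner_node *n, unsigned long s)
    {
        /* order the optimistic data reads before re-reading seq */
        atomic_thread_fence(memory_order_acquire);
        return atomic_load_explicit(&n->seq, memory_order_relaxed) != s;
    }

    /* usage:
     *     unsigned long s;
     *     do {
     *         s = read_begin(node);
     *         ... optimistically copy out keys/children; the data
     *             may be torn mid-write, so validate before use ...
     *     } while (read_retry(node, s));
     * a writer increments seq to odd, modifies, increments to even.
     */

The "bleeds out" problem is that everything inside that retry loop has to tolerate reading torn data, which is the software-transactional-memory flavor mentioned above.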
Not entirely state of the art, and I see a lot of focus on optimizations that likely wouldn't survive in a larger system, but it does look like a real improvement over LSM trees.
Comment by fuzzybear3965 16 hours ago
Comment by koverstreet 16 hours ago
It is true that b-trees aren't ideal in that respect, and you will see some amount of write amplification, but not enough that it should be a major consideration, in my experience.
You really have to take into account working set size and cache size to make any judgements there; your b-tree writes should be driven by journal/WAL reclaim, which will buffer up updates.
A purely random update workload will kill a conventional b-tree on write amplification - like I mentioned, that's the absolute worst case scenario for a b-tree. But it just doesn't happen in the real world.
For the data I can give you, that would be bcachefs's hybrid b-tree - large btree nodes (256k, typically) which are internally log structured; I would consider it a minor variation on a classical b-tree. The log structuring means that we can incrementally write only the dirty keys in a node, at the cost of some compaction overhead (drastically less than a conventional LSM).
In actual real world usage, when I've looked at the numbers (not recently, so this may have changed) we're always able to do giant highly efficient b-tree writes - the journal and in-memory cache are batching things up as much as we want - which means write amplification is negligible.
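To make the log-structured node idea concrete, a sketch with made-up names (not actual bcachefs code): the node is a fixed on-disk region holding a sequence of sorted key sets, and a flush appends only the keys dirtied since the last flush instead of rewriting the whole node.

    #include <stdint.h>
    #include <string.h>

    #define NODE_BYTES (256 * 1024)    /* typical bcachefs node size */

    struct kv { uint64_t key, val; };

    struct bset_hdr { uint32_t nkeys; }; /* followed by nkeys struct kv */

    struct node {
        uint8_t  buf[NODE_BYTES];      /* bset 0 | bset 1 | ... */
        uint32_t used;                 /* bytes of buf in use */
    };

    /* Append the dirty keys as a new set; returns the number of bytes
       that need to hit the disk, or 0 if the node is full and must be
       compacted (merged into one sorted set) and rewritten. Only the
       returned tail of buf is written - that's the write-amplification
       win over rewriting a full page per update. */
    static uint32_t node_flush(struct node *n, const struct kv *dirty,
                               uint32_t ndirty)
    {
        uint32_t bytes = sizeof(struct bset_hdr)
                       + ndirty * sizeof(struct kv);

        if (n->used + bytes > NODE_BYTES)
            return 0;

        struct bset_hdr hdr = { .nkeys = ndirty };
        memcpy(n->buf + n->used, &hdr, sizeof(hdr));
        memcpy(n->buf + n->used + sizeof(hdr), dirty,
               ndirty * sizeof(*dirty));
        n->used += bytes;
        return bytes;                  /* write buf[used-bytes .. used) */
    }

Reads merge the sets (newest wins), which is the compaction overhead mentioned above - but it's bounded to one node, not a whole LSM level.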
Comment by namibj 4 hours ago
If you're worried about the last layer becoming a giant, unmanageably large B+-Tree, just shard it in key space the same way; then you don't need much free temporary working space on the SSD to stream the freshly compacted data into while the inputs to the compaction are still serving real-time queries.
Comment by fuzzybear3965 16 hours ago
You thinking about running some benchmarks in a bcachefs branch (:pray:)?
I want to see this data structure prototyped in PostgreSQL.
[1]: https://github.com/brianfrankcooper/YCSB/tree/master/workloa...
Comment by koverstreet 15 hours ago
They're ancient; I only have pure random and sequential benchmarks - no Zipf distribution, which really should be included.
Feel free to play around with them :) I could even dig up the driver code, if you want.
I've always been curious about PostgreSQL's core b-tree implementation. I ran into a PostgreSQL developer at a conference once, and exchanged a few words that as I recall were enough to get me intrigued, but never learned anything about it.
In a system as big, complex and well optimized as either bcachefs or postgres, the core index implementation is no longer the main consideration - there's layers and layers, and the stuff that's fun to optimize and write papers about eventually gets buried (and you start thinking a lot more about how to lay out your data structures and less about optimizing the data structures themselves).
But you know in something like that there's going to be some clever tricks that few people know about or even remember anymore :)
Comment by SchwKatze 14 hours ago
Comment by 0xdeafbeef 20 hours ago
Comment by heliumtera 19 hours ago
You managed to clone the repo and run your tests yourself; whatever the outcome, that's a plus over the standard model for scientific research.
Also, it's a breath of fresh air among all the other Show HN threads that use hundreds of adjectives to describe the "behavior" of a fully vibed system. I think this is a good model for presenting engineering projects.
Comment by SchwKatze 19 hours ago
That's so true, which is kinda funny since one of the cornerstones of scientific thinking is reproducibility.
IMHO they're using the best tool for this: Nix.
Comment by Retr0id 15 hours ago
> The ‘f’ in Bf-Tree stands for “faster”
Comment by yxhuvud 7 hours ago
Comment by pgwhalen 7 hours ago
I admit I’ll agree that that extra hop was a little confusing to me though. I guess people just like GitHub and don’t like PDFs.
Comment by SchwKatze 3 hours ago
Comment by dacapoday 12 hours ago
Comment by 0xdeafbeef 11 hours ago
Comment by tombert 19 hours ago
I haven't done any benchmarks on it yet, so I have no idea how well it performs, but I have a very strict "no explicit lock" policy: everything is handled with separate files plus a reconciliation service that uses a vector clock to determine ordering.
NOTE: A lot of this is AI-assisted, treat what I have written as more of a whiteboard proof of concept. I'm in the process of rewriting it by hand to do some more aggressive optimizations.
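For the curious, the ordering part is the classic vector-clock comparison. A minimal sketch (hypothetical names and a fixed process count; not the project's actual code):

    #include <stdint.h>

    #define NPROCS 8                   /* one slot per writer process */

    typedef struct { uint64_t c[NPROCS]; } vclock;

    enum order { BEFORE, AFTER, CONCURRENT, EQUAL };

    /* a happened-before b iff every slot of a is <= b's and at least
       one is strictly less; if each clock is ahead somewhere, the
       writes were concurrent and need reconciliation */
    enum order vclock_compare(const vclock *a, const vclock *b)
    {
        int a_less = 0, b_less = 0;
        for (int i = 0; i < NPROCS; i++) {
            if (a->c[i] < b->c[i]) a_less = 1;
            if (b->c[i] < a->c[i]) b_less = 1;
        }
        if (a_less && b_less) return CONCURRENT;  /* reconcile */
        if (a_less)           return BEFORE;
        if (b_less)           return AFTER;
        return EQUAL;
    }

    /* every local write by process p bumps its own slot */
    void vclock_tick(vclock *v, int p) { v->c[p]++; }

    /* absorbing a file written elsewhere: elementwise max, then tick */
    void vclock_merge(vclock *v, const vclock *other, int p)
    {
        for (int i = 0; i < NPROCS; i++)
            if (other->c[i] > v->c[i]) v->c[i] = other->c[i];
        v->c[p]++;
    }

The CONCURRENT result is the interesting case - that's where the reconciliation service has to apply a merge policy rather than just picking a winner.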