What Does a Database for SSDs Look Like?
Posted by charleshn 14 hours ago
Comments
Comment by cornholio 10 hours ago
> Commit-to-disk on a single system is both unnecessary (because we can replicate across storage on multiple systems) and inadequate (because we don’t want to lose writes even if a single system fails).
This is surely true for certain use cases, say financial applications which must guarantee 100% uptime, but I'd argue the vast, vast majority of applications are perfectly ok with local commit and rapid recovery from remote logs and replicas. The point is, the cloud won't give you that distributed consistency for free; you will pay for it both in money and in complexity that in practice will lock you in to a specific cloud vendor.
I.e., it makes cloud and hosting services impossible for database vendors to commoditize, which is exactly the point.
Comment by amluto 10 hours ago
- A modern high end SSD commits faster than the one-way time to anywhere much farther than a few miles away. (Do the math: a specified write latency of a few tens of microseconds is pretty common, NVDIMMs (a sadly dying technology) can do even better, and the speed of light is only so fast. See the rough calculation after this list.)
- Unfortunate local correlated failures happen. IMO it’s quite nice to be able to boot up your machine / rack / datacenters and have your data there.
- Not everyone runs something on the scale of S3 or EBS. Those systems are awesome, but they are (a) exceedingly complex and (b) really very slow compared to SSDs. If I’m going to run an active/standby or active/active system with, say, two locations, I will flush to disk in both locations.
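A rough back-of-the-envelope version of that math, as a minimal sketch; the fiber speed and the 30 µs commit latency are illustrative assumptions, not figures for any particular drive or link:

```python
# Back-of-the-envelope: one-way light time over fiber vs. a fast SSD commit.
# Illustrative numbers only; real drives and links vary.

C_FIBER_KM_PER_S = 200_000          # light in fiber is roughly 2/3 of c
SSD_WRITE_LATENCY_US = 30           # "a few tens of microseconds" spec'd write latency

def one_way_latency_us(distance_km: float) -> float:
    """One-way propagation delay over fiber, ignoring switching/queuing."""
    return distance_km / C_FIBER_KM_PER_S * 1_000_000

for km in (5, 50, 500):
    print(f"{km:>4} km one-way: {one_way_latency_us(km):6.1f} us "
          f"(SSD commit ~{SSD_WRITE_LATENCY_US} us)")
# 5 km is ~25 us, already comparable to the SSD; 500 km is ~2500 us, far slower.
```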
Comment by bee_rider 6 hours ago
Comment by amluto 30 minutes ago
Instead the industry seems to be moving toward CXL for fancy memory-like-but-not-quite-memory. CXL is based on PCIe, and it doesn’t have these weird interoperability and caching issues. Flushing writes all the way to PCIe has never been much of a problem, since basically every PCI or PCIe device ever requires a mechanism by which host software can communicate all the way to the device without the IO getting stalled in some buffer on the way.
Comment by rdtsc 8 hours ago
It is. Coordinated failures shouldn't be a surprise these days. It's kind of sad to hear that from an AWS engineer. The same data pattern fills the buffers and crashes multiple servers, while they were all "hoping" that the others fsynced the data, but it turns out they all filled up and crashed. That's just one case; there are others.
Comment by hawk_ 8 hours ago
Comment by amluto 25 minutes ago
AWS etc, in contrast, have really quite abysmal durability for local disk storage. If you want the performance and cost benefits of using local storage (as opposed to S3, EBS, etc), there are plenty of server failure scenarios where the probability that your data is still there hovers around 0%.
Comment by rdtsc 7 hours ago
Comment by zrm 5 hours ago
Suppose you have each server commit the data "to disk" but it's really a RAID controller with a battery-backed write cache or enterprise SSD with a DRAM cache and an internal capacitor to flush the cache on power failure. If they're all the same model and you find a usage pattern that will crash the firmware before it does the write, you lose the data. It's little different than having the storage node do it. If the code has a bug and they all run the same code then they all run the same bug.
Comment by rdtsc 5 hours ago
Comment by adgjlsfhk1 9 hours ago
Comment by ayende 9 hours ago
Comment by seastarer 8 hours ago
Comment by Palomides 8 hours ago
Comment by saltcured 7 hours ago
It is great as long as your actual workload fits, but misleading if a microbenchmark doesn't inform you of the knee in the curve where you exhaust the buffer and start observing the storage controller as it retires things from this buffer zone to the other long-term storage areas. There can also be far more variance in this state as it includes not just slower storage layers, but more bookkeeping or even garbage-collection functions.
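One rough way to see that knee for yourself is a sketch like the one below; the file path, chunk size, total size, and per-chunk fsync are all assumptions, and absolute numbers will depend heavily on the drive and filesystem:

```python
# Sketch: time per-chunk fsync'd writes and watch latency jump once the
# drive's fast buffer (SLC cache / DRAM) is exhausted. Rough illustration only.
import os, time

PATH = "/mnt/ssd/knee_test.bin"   # hypothetical path; point it at the SSD under test
CHUNK = 4 * 1024 * 1024           # 4 MiB per write
TOTAL_CHUNKS = 2048               # 8 GiB total; adjust to exceed the drive's cache

buf = os.urandom(CHUNK)
fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
try:
    for i in range(TOTAL_CHUNKS):
        t0 = time.perf_counter()
        os.write(fd, buf)
        os.fsync(fd)
        ms = (time.perf_counter() - t0) * 1000
        if i % 64 == 0:
            print(f"chunk {i:5d}: {ms:7.2f} ms")   # watch for the step change
finally:
    os.close(fd)
    os.unlink(PATH)
```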
Comment by hawk_ 8 hours ago
Comment by Palomides 8 hours ago
Comment by ncruces 5 hours ago
In many situations, fsync flushes everything, including totally uncorrelated stuff that might be running on your system.
Comment by packetlost 8 hours ago
Comment by ffsm8 8 hours ago
Comment by SigmundA 8 hours ago
You must have a single flushed write to disk to be durable, but it doesn't need a second write.
Comment by beAbU 4 hours ago
It's very likely that he was part of the team that invented EC2.
Comment by pragmatic 10 hours ago
Comment by wpietri 11 hours ago
Comment by saghm 10 hours ago
Comment by evanelias 9 hours ago
Comment by formerly_proven 8 hours ago
Comment by mrkeen 14 hours ago
Overall speed is irrelevant; what mattered was the relative speed difference between sequential and random access.
And since there's still a massive difference between sequential and random access with SSDs, I doubt the overall approach of using buffers needs to be reconsidered.
Comment by crazygringo 12 hours ago
Edit: thank you for all the answers -- very educational, TIL!
Comment by threeducks 12 hours ago
https://i.imgur.com/t5scCa3.png
https://ssd.userbenchmark.com/ (click on the orange double arrow to view additional columns)
That is a latency of about 50 µs for a random read, compared to 4-5 ms latency for HDDs.
Comment by mgerdts 8 hours ago
With SSDs, the write pattern is very important to read performance.
Datacenter and enterprise class drives tend to have a maximum transfer size of 128k, which is seemingly the NAND block size. A block is the thing that needs to be erased before rewriting.
Most drives seem to have an indirection unit size of 4k. If a write is not a multiple of the IU size or not aligned, the drive will have to do a read-modify-write. It is the IU size that is most relevant to filesystem block size.
If a small write happens atop a block that was fully written with one write, a read of that LBA range will lead to at least two NAND reads until garbage collection fixes it.
If all writes are done such that they are 128k aligned, sequential reads will be optimal and, with sufficient queue depth, random 128k reads may match sequential read speed. Depending on the drive, sequential reads may retain an edge due to the drive's read ahead. My own benchmarks of gen4 U.2 drives generally back up these statements.
At these speeds, the OS or app performing buffered reads may lead to reduced speed because cache management becomes relatively expensive. Testing should be done with direct IO using libaio or similar.
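For illustration, here is a minimal, Linux-only sketch of what "aligned direct IO" looks like in practice; the device path, the 128 KiB transfer size, and the use of the raw os API with O_DIRECT are assumptions, and serious testing would use fio or libaio/io_uring as suggested above:

```python
# Sketch: one aligned 128 KiB read with O_DIRECT, bypassing the page cache.
# Linux-only; buffer, offset and length must all be aligned for O_DIRECT.
import mmap, os

PATH = "/dev/nvme0n1"          # hypothetical device/file under test (needs privileges)
IO_SIZE = 128 * 1024           # matches the 128k transfer size discussed above
OFFSET = 0                     # must be a multiple of the logical block size

# mmap gives us a page-aligned buffer, which satisfies O_DIRECT's alignment rules.
buf = mmap.mmap(-1, IO_SIZE)

fd = os.open(PATH, os.O_RDONLY | os.O_DIRECT)
try:
    n = os.preadv(fd, [buf], OFFSET)
    print(f"read {n} bytes directly from {PATH}")
finally:
    os.close(fd)
```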
Comment by OptionOfT 10 hours ago
Comment by diroussel 9 hours ago
To me a 4k read seems anachronistic from a modern application perspective. But I gather 4kb pages are still common in many file systems. But that doesn’t mean the majority of reads are 4kb random in a real world scenario.
Comment by szundi 11 hours ago
Comment by yyyk 12 hours ago
Comment by loeg 6 hours ago
> Our extensive experiments discover that, unlike HDDs, the performance degradation of modern storage devices incurred by fragmentation mainly stems from request splitting, where a single I/O request is split into multiple ones.
Comment by formerly_proven 12 hours ago
- The access block size (LBA size). Either 512 bytes or 4096 bytes modulo DIF. Purely a logical abstraction.
- The programming page size. Something in the 4K-64K range. This is the granularity at which an erased block may be programmed with new data.
- The erase block size. Something in the 1-128 MiB range. This is the granularity at which data is erased from the flash chips.
SSDs always use some kind of journaled mapping to cope with the actual block size being roughly five orders of magnitude larger than the write API suggests. The FTL probably looks something like an LSM with some constant background compaction going on. If your writes are larger chunks, and your reads match those chunks, you would expect the FTL to perform better, because it can allocate writes contiguously and reads within the data structure have good locality as well. You can also expect for drives to further optimize sequential operations, just like the OS does.
(N.b. things are likely more complex, because controllers will likely stripe data with the FEC across NAND planes and chips for reliability, so the actual logical write size from the controller is probably not a single NAND page)
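A toy model of that indirection, grossly simplified and purely illustrative (real FTLs also handle wear leveling, power-loss recovery, striping across planes, and so on):

```python
# Toy FTL sketch: logical 4K pages are remapped on every write (no overwrite in
# place), and garbage collection later reclaims erase blocks full of stale pages.
# Purely illustrative; real FTLs are vastly more complex.

PAGES_PER_ERASE_BLOCK = 256   # e.g. 1 MiB erase block / 4 KiB page (illustrative)

class ToyFTL:
    def __init__(self, num_erase_blocks: int):
        self.l2p = {}                      # logical page -> (block, page) mapping
        self.free_blocks = list(range(num_erase_blocks))
        self.open_block = self.free_blocks.pop()
        self.next_page = 0
        self.live = {}                     # (block, page) -> logical page (for GC)

    def write(self, logical_page: int) -> None:
        # Invalidate the old location; the data there is now garbage.
        old = self.l2p.get(logical_page)
        if old is not None:
            self.live.pop(old, None)
        # Append to the currently open erase block (log-structured).
        if self.next_page == PAGES_PER_ERASE_BLOCK:
            self.open_block = self.free_blocks.pop()   # a real FTL would GC here if needed
            self.next_page = 0
        loc = (self.open_block, self.next_page)
        self.next_page += 1
        self.l2p[logical_page] = loc
        self.live[loc] = logical_page

ftl = ToyFTL(num_erase_blocks=16)
ftl.write(42)
ftl.write(42)                # rewrite: the old physical page becomes stale garbage
print(ftl.l2p[42])           # logical page 42 now lives at its second location
```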
Comment by PunchyHamster 12 hours ago
Comment by b112 12 hours ago
Comment by lazide 11 hours ago
Most filesystems read in 4K chunks (or sometimes even worse, 512 bytes), and internally the actual block is often multiple MB in size, so this internal read multiplication is a big factor in performance in those cases.
Note the only real difference between a random read and a sequential one is the size of the read in one sequence before it switches location - is it 4K? 16mb? 2G?
Comment by Lwerewolf 6 hours ago
Comment by vegabook 6 hours ago
Comment by __turbobrew__ 5 hours ago
Comment by zokier 13 hours ago
Comment by koverstreet 10 hours ago
Comment by loeg 6 hours ago
Comment by koverstreet 6 hours ago
LSM-trees do really badly at multithreaded update workloads, and compaction overhead is really problematic when there isn't much update locality.
On the other hand, having most of your index be constant lets you use better data structures. Binary search is really bad.
For pure in memory indexes, according to the numbers I've seen it's actually really hard to beat a pure (heavily optimized) b-tree; for in-memory you use a much smaller node size than on disk (I've seen 64 bytes, I'd try 256 if I was writing one).
For on disk, you need to use a bigger node size, and then binary search is a problem. And 4k-8k, as is still commonly used, is much too small; you can do a lockless or mostly lockless in-memory b-tree, but not if it's persistent, so locking overhead, cache lookups, and access time on cache miss all become painful for persistent b-trees at smaller node sizes.
So the reason bcachefs's (and bcache's) btree is so fast is that we use much bigger nodes, and we're actually a hybrid compacting data structure. So we get the benefits of LSM-trees (better data structures to avoid binary search for most of a lookup) without the downsides, and having the individual nodes be (small, simple) compacting data structures is what makes big btree nodes (avoiding locking overhead, access time on node traversal) practical.
B-epsilon btrees are dumb, that's just taking the downsides of both - updating interior nodes in fastpaths kills multithreaded performance.
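To illustrate why plain binary search over a big sorted node hurts, and one alternative layout, here is a small sketch of an Eytzinger (BFS-order) arrangement, which keeps the hot early comparisons clustered in the same few cache lines. This is only an illustration of the idea, not how bcachefs's nodes are actually laid out:

```python
# Sketch: Eytzinger (BFS-order) layout for searching within a large sorted node.
# The first few comparisons always touch the same few cache lines, unlike plain
# binary search over a big sorted array. Illustration only.
import bisect

def eytzinger(sorted_keys):
    """Rearrange sorted keys into BFS order of an implicit binary tree (1-indexed)."""
    out = [None] * (len(sorted_keys) + 1)
    it = iter(sorted_keys)
    def fill(i):
        if i < len(out):
            fill(2 * i)            # left subtree gets the smaller keys
            out[i] = next(it)
            fill(2 * i + 1)        # right subtree gets the larger keys
    fill(1)
    return out

def search(tree, key):
    """Return True if key is present, descending the implicit tree."""
    i = 1
    while i < len(tree):
        if tree[i] == key:
            return True
        i = 2 * i + (key > tree[i])
    return False

keys = list(range(0, 2000, 2))          # a "node" with 1000 sorted keys
tree = eytzinger(keys)
assert search(tree, 998) and not search(tree, 999)
# Plain binary search for comparison:
assert keys[bisect.bisect_left(keys, 998)] == 998
```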
Comment by evanelias 8 hours ago
Comment by koverstreet 6 hours ago
Comment by evanelias 5 hours ago
Production adoption at scale is always relevant as a measure of stability, as well as a reflection of whether a solution is applicable to general-purpose workloads.
There's more to the story than just raw performance anyway; for example Meta's migration to MyRocks was motivated by superior compression compared to other alternatives.
Comment by nextaccountic 11 hours ago
anything like this, but for postgres?
actually, is it even possible to write a new db engine for postgres? like mysql has innodb, myisam, etc
Comment by evanelias 9 hours ago
That said, there are a few alternative storage engines for Postgres, such as OrioleDB. However due to limitations in Postgres's storage engine API, you need to patch Postgres to be able to use OrioleDB.
MySQL instead focused on pluggable storage engines from the get-go. That has had major pros and cons over the years. On the one hand, MyISAM is awful, so pluggable engines (specifically InnoDB) are the only thing that "saved" MySQL as the web ecosystem matured. It also nicely forced logical replication to be an early design requirement, since with a multi-engine design you need a logical abstraction instead of a physical one.
But on the other hand, pluggable storage introduces a lot of extra internal complexity, which has arguably been quite detrimental to the software's evolution. For example: which layer implements transactions, foreign keys, partitioning, internal state (data dictionary, users/grants, replication state tracking, etc). Often the answer is that both the server layer and the storage engine layer would ideally need to care about these concerns, meaning a fully separated abstraction between layers isn't possible. Or think of things like transactional DDL, which is prohibitively complex in MySQL's design so it probably won't ever happen.
Comment by einpoklum 11 hours ago
Comment by lazide 11 hours ago
Comment by pmontra 11 hours ago
> Companies are global, businesses are 24/7
Only a few companies are global, so only a few of them should optimize for that kind of workload. However, maybe every startup in SV must aim at becoming global, so probably that's what most of them must optimize for, even the ones that eventually fail to get traction.
24/7 is different, because even the customers of local companies, even B2B ones, might feel like doing some work at midnight once in a while. They'll be disappointed to find the server down.
Comment by evanelias 9 hours ago
A massive number of companies have global customers, regardless of where the company itself has employees.
For example my b2b business is relatively tiny, yet my customer base spans four continents. Or six continents if you count free users!
Comment by rafabulsing 6 hours ago
Comment by ljosifov 13 hours ago
Comment by grogers 1 hour ago
Comment by londons_explore 14 hours ago
Due to the interface between SSD and host OS being block based, you are forced to write a full 4k page. Which means you really still benefit from a write ahead log to batch together all those changes, at least up to page size, if not larger.
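A minimal sketch of that idea: many small logical updates get appended to an in-memory log buffer and flushed as one page-sized (or larger) write, so one device write and one fsync cover many records. The record format, 4 KiB page size, and file path here are all assumptions:

```python
# Sketch: batch many small WAL records into one 4 KiB-aligned append + fsync.
# Record format, padding, and path are illustrative assumptions.
import os, struct

PAGE = 4096

class TinyWAL:
    def __init__(self, path="wal.log"):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.buf = bytearray()

    def append(self, payload: bytes) -> None:
        # Length-prefixed record; real WALs also carry checksums and LSNs.
        self.buf += struct.pack("<I", len(payload)) + payload

    def commit(self) -> None:
        # Pad to a page boundary so the device sees one full-page write,
        # then make it durable with a single fsync for the whole batch.
        pad = (-len(self.buf)) % PAGE
        os.write(self.fd, bytes(self.buf) + b"\x00" * pad)
        os.fsync(self.fd)
        self.buf.clear()

wal = TinyWAL()
for i in range(100):
    wal.append(f"update key{i}".encode())   # 100 tiny updates...
wal.commit()                                 # ...one page-aligned write + fsync
```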
Comment by Sesse__ 12 hours ago
If you want to get some sort of sub-block batching, you need a structure that isn't random in the first place, for instance an LSM (where you write all of your changes sequentially to a log and then do compaction later)—and then solve your durability in some other way.
Comment by throw0101a 12 hours ago
Why not both?
Comment by Sesse__ 12 hours ago
Comment by _bohm 11 hours ago
Comment by Tostino 10 hours ago
Comment by Sesse__ 6 hours ago
Postgres allows a group commit to try to combine multiple transactions to avoid the multiple fsyncs, but it adds delay and is off by default. And even so, it reduces fsyncs, not writes.
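For illustration, here is a sketch of the group-commit idea itself (not Postgres's implementation): concurrent committers queue their log records, and a single flusher thread absorbs whatever has accumulated and issues one fsync for the whole group.

```python
# Sketch of group commit: committers enqueue records and wait; one flusher
# thread writes and fsyncs everything that accumulated, then wakes them all.
# Illustrative only; not how Postgres implements commit_delay internally.
import os, queue, threading

log_fd = os.open("group_commit.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
pending: "queue.Queue[tuple[bytes, threading.Event]]" = queue.Queue()

def flusher():
    while True:
        first = pending.get()                    # block until there is work
        batch = [first]
        while True:                              # absorb everything already queued
            try:
                batch.append(pending.get_nowait())
            except queue.Empty:
                break
        os.write(log_fd, b"".join(rec for rec, _ in batch))
        os.fsync(log_fd)                         # one fsync covers the whole group
        for _, done in batch:
            done.set()                           # all committers become durable at once

threading.Thread(target=flusher, daemon=True).start()

def commit(record: bytes) -> None:
    done = threading.Event()
    pending.put((record + b"\n", done))
    done.wait()                                  # returns only after the group fsync

commit(b"txn 1: x=1")    # many threads calling commit() share fsyncs
```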
Comment by Tostino 5 hours ago
Comment by toolslive 10 hours ago
Comment by esperent 13 hours ago
Comment by digikata 13 hours ago
Comment by Sesse__ 12 hours ago
Comment by formerly_proven 12 hours ago
Comment by PunchyHamster 12 hours ago
And then a bug crashes your database cluster all at once, and now instead of missing seconds you miss minutes, because some smartass thought "surely if I send a request to 5 nodes, some of that will land on disk in the reasonably near future?"
I love how this industry invents best practices that are actually good then people just invent badly researched reasons to just... not do them.
Comment by dist1ll 12 hours ago
That would be asynchronous replication. But IIUC the author is instead advocating for a distributed log with synchronous quorum writes.
Comment by formerly_proven 12 hours ago
Comment by dist1ll 11 hours ago
> In Aurora, we have chosen a design point of tolerating (a) losing an entire AZ and one additional node (AZ+1) without losing data, and (b) losing an entire AZ without impacting the ability to write data. [..] With such a model, we can (a) lose a single AZ and one additional node (a failure of 3 nodes) without losing read availability, and (b) lose any two nodes, including a single AZ failure and maintain write availability.
As for why this can be considered durable enough, section 2.2 gives an argument based on their MTTR (mean time to repair) of storage segments
> We would need to see two such failures in the same 10 second window plus a failure of an AZ not containing either of these two independent failures to lose quorum. At our observed failure rates, that’s sufficiently unlikely, even for the number of databases we manage for our customers.
[0] https://pages.cs.wisc.edu/~yxy/cs764-f20/papers/aurora-sigmo...
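To make the arithmetic concrete: Aurora keeps 6 copies, 2 per AZ across 3 AZs, with a write quorum of 4/6 and a read quorum of 3/6. Losing a whole AZ (2 copies) still leaves a write quorum; losing an AZ plus one more node (3 copies) still leaves a read quorum. A quick sanity check of that arithmetic:

```python
# Sanity-check the Aurora quorum arithmetic quoted above: 6 copies, 2 per AZ,
# write quorum 4, read quorum 3.
COPIES, PER_AZ = 6, 2
WRITE_QUORUM, READ_QUORUM = 4, 3

def survives(copies_lost: int):
    remaining = COPIES - copies_lost
    return remaining >= WRITE_QUORUM, remaining >= READ_QUORUM

print(survives(PER_AZ))        # lose one AZ (2 copies): (True, True)  -> still writable
print(survives(PER_AZ + 1))    # AZ + 1 more node (3):   (False, True) -> still readable
print(survives(PER_AZ * 2))    # lose two AZs (4):       (False, False)
```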
Comment by PunchyHamster 9 hours ago
Comment by sreekanth850 12 hours ago
Comment by jandrewrogers 7 hours ago
Arbitrating differences in relative ordering across different observer clocks is what N-temporal databases are about. In databases we usually call the basic 2-temporal case “bitemporal”. The trivial 1-temporal case (which is a quasi-global clock) is what we call “time-series”.
The complexity is that N-temporality turns time into a true N-dimensional data type. These have different behavior than the N-dimensional spatial data types that everyone is familiar with, so you can’t use e.g. quadtrees as you would in the 2-spatial case and expect it to perform well.
There are no algorithms in literature for indexing N-temporal types at scale. It is a known open problem. That’s why we don’t do it in databases except at trivial scales where you can just brute-force the problem. (The theory problem is really interesting but once you start poking at it you quickly see why no one has made any progress on it. It hurts the brain just to think about it.)
Comment by PunchyHamster 9 hours ago
Also, even if not required, it makes reasoning about how systems work a hell of a lot easier. So for the vast majority that doesn't need massive throughput, sacrificing some speed for an easier-to-understand consistency model is a worthy tradeoff.
Comment by ayende 9 hours ago
Pretty much all financial transactions are settled with a given date, not instantly. Go sell some stocks; it takes 2 days to actually settle. (This may be hidden by your provider, but that's how it works.)
For that matter, the ultimate in BASE for financial transactions is the humble check.
That is a great example of "money out" that will only be settled at some time in the future.
There is a reason there is this notion of a "business day" and re-processing transactions that arrived out of order.
Comment by sreekanth850 8 hours ago
Comment by dotancohen 11 hours ago
Comment by lazide 12 hours ago
Frankly, it’s shocking anything works at all.
Comment by didgetmaster 5 hours ago
However; a different query (e.g. SELECT name, phone_number FROM table) might result in fewer seeks if the data is stored by column instead of by row.
The article only seems to address data structures with respect to indexes, and not for the actual table data itself.
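A toy illustration of why the layout matters for that kind of query, with in-memory lists standing in for on-disk pages and the column names taken from the example above:

```python
# Toy illustration: the same 3-column table stored row-wise vs column-wise.
# "SELECT name, phone_number" touches 2 of 3 columns; column storage lets you
# skip the third entirely, while row storage drags it along with every row.
rows = [
    ("alice", "555-0100", "alice@example.com"),
    ("bob",   "555-0101", "bob@example.com"),
]

# Row store: each "page" holds whole rows; reading name+phone reads email too.
row_pages = rows

# Column store: each column is stored (and read) separately.
columns = {
    "name":         [r[0] for r in rows],
    "phone_number": [r[1] for r in rows],
    "email":        [r[2] for r in rows],
}

# SELECT name, phone_number FROM table
from_rows    = [(name, phone) for name, phone, _ in row_pages]          # reads all 3 cols
from_columns = list(zip(columns["name"], columns["phone_number"]))      # reads only 2
assert from_rows == from_columns
```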
Comment by exabrial 10 hours ago
If you believe this, then what you want already exists. For example: MySQL has in memory tables, but also this design pretty much sounds like NDB.
I don’t think I’d build a database the way they are describing for anything serious. Maybe a social network or other unimportant app where the consequences of losing data aren’t really a big deal.
Comment by firesteelrain 9 hours ago
Comment by Havoc 10 hours ago
Comment by hyperman1 11 hours ago
Comment by taffer 11 hours ago
Ultimately, it's a trade-off: larger pages mean faster I/O, while smaller pages mean better CPU utilisation.
Comment by ksec 8 hours ago
Designing a database for SSDs would still go a very, very long way before what I think the author is suggesting, which is designing for the cloud or the datacenter.
Comment by danielfalbo 14 hours ago
[1] https://www.dr-josiah.com/2010/08/databases-on-ssds-initial-...
Comment by ritcgab 10 hours ago
Comment by gethly 11 hours ago
So, essentially just CQRS, which is usually handled in the application level with event sourcing and similar techniques.
Comment by adsharma 8 hours ago
This made sense for product catalogs, employee dept and e-commerce type of use cases.
But it's an extremely poor fit for storing a world model that LLMs are building in an opaque and probabilistic way.
Prediction: a new data model will take over in the next 5 years. It might use some principles from many decades of relational DBs, but will also be different in fundamental ways.
Comment by dbzero 10 hours ago
Comment by dist1ll 12 hours ago
Comment by raggi 13 hours ago
Comment by sscdotopen 12 hours ago
Comment by cmrdporcupine 8 hours ago
And CedarDB https://cedardb.com/ the more commercialized product that is following up on some of this research, including employing many of the key researchers.
Comment by ghqqwwee 10 hours ago
Comment by CraigJPerry 10 hours ago
I was familiar with Solarflare and Mellanox zero copy setups in a previous fintech role, but at that time it all relied on black boxes (specifically out-of-tree kernel modules, delivered as blobs without DKMS or equivalent support, a real headache to live with) that didn't always work perfectly. It was pretty frustrating overall because the customer paying the bill (rightfully) had less than zero tolerance for performance fluctuations. And fluctuations were annoyingly common, despite my best efforts (dedicating a core to IRQ handling, bringing up the kernel masked to another core, then pinning the user space workloads to specific cores, and stuff like that). It was quite an extreme setup: GPS-disciplined oscillator with millimetre-perfect antenna wiring for the NTP setup, etc. We built two identical setups, one in Hong Kong and one in New York. Ah, very good fun overall, but frustrating because of stack immaturity at that time.
Comment by sreekanth850 12 hours ago
Comment by nly 11 hours ago
The amount of performance you can extract from a modern CPU if you really start optimising cache access patterns is astounding
High performance networking is another area like this. High performance NICs still go to great lengths to provide a BSD socket experience to devs. You can still get 80-90% of the performance advantages of kernel bypass without abandoning that model.
Comment by gethly 11 hours ago
I think this was one, and I want to emphasise this, of the main points behind Odin programming language.
Comment by cmrdporcupine 8 hours ago
It turns out that btrees are still efficient for this work. At least until the hardware vendors deign to give us an interface to SSD that looks more like RAM.
Reading over https://www.cs.cit.tum.de/dis/research/leanstore/ and associated papers and follow up work is recommended.
In the meantime, with RAM prices skyrocketing, work and research in buffer & page management for greater-than-main-memory-sized DBs is set to be Hot Stuff again.
I like working in this area.
Comment by sreekanth850 8 hours ago
Comment by cmrdporcupine 8 hours ago
Comment by toolslive 10 hours ago
Comment by javaunsafe2019 5 hours ago
Comment by Rakshath_1 10 hours ago
Comment by Rakshath_1 10 hours ago