SSDs and Distributed Data Systems

Data systems have always been designed around the limitations of physical hardware. I think of the design of these systems as being a compromise between the external API you want to provide and the ingredients you have to build it with. In particular, a discouraging portion of the lines of code of a database or filesystem are there to try to mask the latency of disk drives. I think most people understand that SSDs are somehow faster, but I wanted to give a few thoughts on what can be done with the current batch of SSDs and what might be possible in the future if the trend in price and size for SSDs continues (and not everyone thinks it will).

Here are the facts you need to know about SSDs:

  1. They eliminate “seek time” and consequentially make random reads very, very fast. A commodity SSD can do around 40k random reads per second or 20k random writes which is about 200x what you would expect from a traditional magnetic hard drive.
  2. Linear read and write throughput is better too, but not that much better (say 200 MB/second compared to 50-100MB/sec).
  3. Random failures are less likely than on traditional hard drives as SSDs are not mechanical devices with moving parts. Estimates put this at about 20% of the failure rate, but these kinds of estimates are notoriously inaccurate.
  4. SSDs, like hard drives, are divided into blocks. On SSDs each block can be written only a small number of times before it becomes unusable. This limited number of writes is referred to as “write endurance” of the device.
  5. SSDs have a large block size, often 512KB. This means that if you write 1 byte, the SSD must erase and rewrite a full 512KB block just as if you had written 512KB (this is true of a normal drive, too, but the block size is much smaller). This phenomenon is called write amplification.
So clearly the big change here is the drop in latency for random request streams and the new weakness of limited write endurance.

How SSDs may impact data system design

A large part of system design is driven by the cost, latency and throughput ratios for different types of storage and networking technologies. This is not very different from how the laws of physics constrain the possible designs for a hammer or a kitchen sink, except that in the case of data systems the laws of physics change over time. One fun presentations of some of the trends in latency and throughput are given in the presentation "Latency Lags Bandwidth" by David Patterson. SSDs represent an area where latency suddenly gets many orders of magnitude better, invalidating a lot of design tradeoffs in existing systems. It is a pure speculation as to how this kind of change will effect the design of systems, but it is a good heuristic is to assume that people eventually move towards the design that provides the best price-performance tradeoff.

For distributed data systems the big change SSDs introduce is the relative latency of a random disk access versus a remote network hop. In the case of a traditional hard drive a single seek may have a latency cost easily 10 or 20x that of TCP request on a local network, which means a remote cache hit is much cheaper than a local cache miss. SSDs essentially erase this difference making them fairly close in terms of latency. The consequence should be favoring designs that store more data per machine and do fewer network requests.

The extreme version of this for databases is just abandoning partitioning altogether and storing all data on all machines. After all if the write load is small and data size isn’t going to go beyond 1TB then an unpartitioned, replicated mysql or postgres may be good enough. Since partitioning adds huge constraints on the richness of queries that can be implemented there may be something to be said for unpartitioned designs in cases where the data size isn’t going to grow without bound and the number of writes is tolerable (since in an unpartitioned design all writes must go to all nodes).

Eliminating partitioning is one way to reduce network hops. To take this line of thinking further one could actually co-locate the storage and the application it serves. This is a radical re-architecture, but could make a certain amount of sense. A network server often peaks at around 40,000 requests-per-second. This is not a limitation for a data system serving a random request stream for potentially disk resident data stored on traditional hard drives as each drive can only do a few hundred random accesses per second. If the drive were capable of hundreds of thousands of operations, the network request throughput limit might become a real limitation, and one might think about just co-locating the data and the application. The second reason this might make sense is that the data system is no longer a common integration layer for allowing multiple applications to share the same data. Implicit in the design of current RDBMS is that multiple applications will access the same set of tables (hence the need for so much structure and correctness checking in the database so it can be shared by multiple applications safely without validating the code in all the applications). In reality, the modern way of sharing data amongst applications is not though direct database access against a shared DB but via some kind of service API (REST, usually). There are still a lot of non-performance-related reasons to recommend keeping data systems in the client-server mode (for example allowing you to scale the CPU intensive part of the application separate from the IO intensive part, for example), but if the performance argument became strong enough these reasons might not be enough.

Less radically SSDs will likely change how caching is done. Many web companies have large memcached installations. Memcached is very good at serving high throughput at low latency on a small dataset, but since everything is in RAM it is actually rather expensive if you are space rather than CPU bound. If you place 32GB of cache per server, then 5TB of total cache space requires 160 servers. Having 5 servers each with 1TB of SSD space may be a huge win. Furthermore caching in RAM has a practical problem: restarts dump a full server worth of cache. This is an annoyance if you need to restart your cache servers frequently or if you need to bring up a new stack with completely cold cache as you may not actually be able to run your application without any caching (if you can, then why have caching, after all).

For real-time applications SSDs also enable access patterns that involve many seeks per request in a latency-sensitive application. For example light graph traversal against a disk-resident data set is feasible on SSDs but generally not on traditional hard drives. Graphs are generally not amenable to clean partitioning (they are, by nature, intertwined). As a result, a disk resident blob store running on SSDs may now be a pretty good way to implement social graph or “follower” functionality.

For offline applications, reduction in latency makes random access a possibility once again. MapReduce was designed, in large part, to work with only linear I/O patterns and eliminate random I/O. This is a huge performance win for batch processing on disk-resident data sets. But to accomplish this requires providing only a fairly limited programming model. This means Hadoop cannot easily implement the equivalent of a “hash join” or similar things except when the hash fits in memory. It is interesting to think how MapReduce might have been differently designed in a world where local random reads and writes were cheap.

Making SSDs Cheap Without Losing All Your Data

A move to SSDs also impacts the internals of data systems. Traditional B+Trees or hashes are no longer the most appropriate persistent data structure. This is not due to the drop in latency but due the the write endurance problem. Moving a database with a traditional storage engine to commodity SSDs will likely be quite fast but the SSDs may stop working after a few months!

A little background. SSDs currently come in two flavors: enterprise-grade (SLCs) and consumer-grade (MLCs). Enterprise-grade SSDs are quite expensive (comparable to memory prices) and so won’t be an option for many scale-out deployments. They are a good option for small deployments, though, where the high cost is less of an issue. SSDs have other differences in the sophistication of the firmware and whether they attach through the PCI bus or via SATA, but these things are less important. If you are used to mechanical hard drives, virtually any SSD will all be faster than you know what to do with. The network access time will probably eliminate any more nuanced performance differences between SSD types.

For example, when I started doing experiments with MLC drives for storage I was warned the consumer-grade devices would have periodic large latency spikes as they did various internal compaction operations. This is true, there were occational large spikes in 99th percentile time. But although the spikes are huge in relative terms, they are extremely minor in absolute terms (under 1 ms). In comparison to memory this kind of variance is terrible, but in comparison to disk access the SSD’s 99th percentile looks closer to the hard drive’s uncached 1st percentile access time.

The important difference between MLC and SLC is the number of writes they can handle. I will walk through the basic arithmetic on how to model this. SSDs are broken into blocks that are usually around 512KB and each write must first erase and then rewrite an entire 512KB block. Each block can only be erased some number of times before it starts corrupting and giving the wrong data. To avoid this the SSD manufactures seem to cap each block to a fixed number of program-erase cycles. This means that after a certain number of writes to a particular block that block will stop accepting writes. SLCs and MLCs both work the same in this respect, except that SLCs will take roughly an order of magnitude more writes per block before craping out.

Here is a table that compares prices and write-capacity per block for MLC, SLC, RAM, and SAS drives. These are taken at random off the internet, your milage would certainly vary, but this gives some idea of the pricing as of May 2012.

Cost/GBProgram-Erase Cycles
RAM $5-6 Unlimited
15k RPM SAS Hardrive $0.75 Unlimited
MLC SSD $1 5,000-10,000
SLC SSD $4-6 ~100,000

A few obvious conclusions are that SLC SSDs are priced roughly the same as memory. In a sense SLC SSDs are better than memory since they are persistent, but if keeping all your data in memory sounds expensive then so will SLCs. And in any case you can’t eliminate memory caching entirely as at least some part of the index likely needs to reside in memory even with the faster access times SSDs provide.

The MLCs, on the other hand, are actually very close in price to good hard disks. There is a slight price premium, but for most online data systems this is misleading. Since most real-time storage systems are bound by seek capacity, not data size or CPU, increasing the number of seeks available may well reduce overall capacity needs very significantly. With more seeks per machine each machine can handle more requests. For our uses we found we could comfortably take at least 5-10x more requests per machine before we started to hit limits around disk space and CPU. So the question is, is there a way to live with the low write-endurance of the MLC devices and still get the great performance and cost?

Here is where the complex dependency between the storage format and the SSD comes in. If your storage engine does large linear writes (say a full 512KB block or more) then calculating the number of writes you can do on one drive before it is all used up is easy. If the drive has 300GB and each block can be rewritten 5,000 times then each drive will allow writing 5000 x 300GB (about 1.4 petabytes). Let’s say you have 8 of these in a box with no RAID and that box takes 50MB/sec of writes 24 hours a day evenly balanced over the drives, then the drives will last around 7.8 years. This should be plenty of time, for most systems. But this lifetime is only realistic for large linear writes—the best possible case for SSD write-endurance.

The other case is that you are doing small random writes immediately sync’d to disk. If you are doing 100 byte random writes and the SSD’s internal firmware can’t manage to somehow coalesce these into larger physical writes then each 100 byte write will turn into a full program-erase cycle of a full 512KB block. In this case you would expect to be able to do only 5000*100 bytes = 500KB of writes per block before it died; so a 300GB drive with 300GB/512KB = 614,400 blocks would only take around 286GB of writes total before crapping out. Assuming, again, 8 of these drives with no RAID and 50MB/sec, you get a lifetime of only about half a day. This is the worst case for SSDs and is obviously totally unworkable.

It is worth noting that it doesn’t actually matter if the writes to the filesystem are large or small so long as the writes to the physical disk are large. If the writes are linear, that is written in order on disk, and are not sync’d to the physical drive by the OS until significant data has accumulated that will suffice. A number of small sequential writes to the filesystem will be coalesced into a single large write to the physical device by the operating system’s I/O scheduler provided there is no intervening call to fsync (or the equivalent).

To address this limitation SSDs attempt to implement some kind of internal write-ahead format to try to turn random I/O into a larger linear set of writes. However the efficacy of this seems to vary greatly by device, and I consider it a bit dangerous to bet your data on it working for all workloads. Likewise this can introduce a new read fragmentation problem since the updates that are meant to be co-located are actually scattered all over in different blocks.

A better option is just to use a storage format that naturally does linear writes. Traditional storage formats include the B+Tree and linear hashing. These formats group data into blocks by key, and hence writes are randomly scattered on the disk unless the write order happens to match the key ordering (which you can’t count on except in bulk load situations where you can chose the order in which records are updated). Buffering may help this a bit, but when the buffer is flushed it is likely that a random subset of blocks will be rewritten with small changes to each. Log-structured formats are the other alternative, they store data in the order in which it was written, and hence always do linear writes. Log-structured merge trees and Google’s SSTable variation are examples of this. Various hashing and tree formats can all be designed in a log-structured manner.

On traditional hard drives the trade-off between in-place and log-structured storage is a bit of a wash. Write performance for log-structured storage is vastly better, but most applications have more reads than writes. Read performance may be better or worse depending on the details of the implementation (a traditional log-structured merge tree is definitely worse for reads since it does multiple seeks for each uncached read, but hashing variants or SSTables that use bloom filters to avoid unnecessary lookups need not be). However the move to SSDs completely changes this dynamic. Since SSDs have fast seek performance grouping data by key is much less important. Using a log-structured storage format makes it possible to use cheap consumer grade MLC SSDs even under high write loads that would wear out the drive in a few months if run on an in-place storage format.

A particularly important factor in making this work is whether the storage engine requires an immediate fsync to disk with each write. Many systems do require this for data integrity. An immediate fsync will, of course, require a small write unless the record being written is itself large. On a single-node system avoiding fsync may mean that the last few records may be lost in a crash. On a properly designed distributed system, though, this need not be the case—replication to other nodes can take the place of the flush to disk. Replication and disk sync have different failure modes (i.e. in the case of a power outage where all nodes fail simultaneously replication won’t help, whereas in the case of disk corruption or total machine death flush generally doesn’t help). So an important question to ask about any data system design to determine if it is going to be SSD friendly is whether it requires immediate flushing of data to disk. In particular a good system should be able to give replication guarantees without waiting on the disk flush.

I did a pretty detailed evaluation of some SSDs about a year back in co-ordination with the Voldemort team at LinkedIn. One of the tests was replaying a production I/O trace at high speed to simulate the equivalent of 5 years of production load on BDB JE. This model of wear does indeed work exactly as modeled: buffered linear writes make it possible to get almost a decade of high-write usage out of a cheap MLC. The Voldemort guys wrote up a few more details on work they did to make SSDs work well with BDB JE.

If you are evaluating storage engines, here is a quick overview of which ones are in-place or log-structured. InnoDB, BDB, Toyko and Kyoto Cabnet as well as MongoDB all have in-place storage formats and are are not well-suited to cheap SSDs under high write load. LevelDB, BDB-JE, Krati, Cassandra's SSTable implementation and Bitcask are all log-structured. Riak and Voldemort both support pluggable storage engines and default to a log-structured format. The Cassandra folks at Datastax did a pretty good presentation on running Cassandra on SSDs.

An interesting question is whether cloud hosting providers will rent instances with SSDs any time soon. The write-endurance problem makes SSDs somewhat problematic for a shared hosting environment, so they may need to add a way to bill on a per block-erase basis. I have heard rumors that Amazon will offer them, but I have no idea in what form or how much they will cost (which is the critical detail).