What’s New In Kafka 0.8.1

We just released Apache Kafka 0.8.1. There are a couple of interesting things in this release.

The biggest new feature is log compaction. This allows de-duplicating the data in the partitions of a Kafka topic by primary key. This makes Kafka an excellent backend for really massive partitioned event sourcing, database data replication, and other log-centric architectures that model mutable data. This feature also acts as the basis for stateful stream processing in Samza. More details on how you can use this here.
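Conceptually, compaction keeps only the latest record for each key. Here is a toy sketch of that idea (an in-memory illustration only, not Kafka's actual implementation, which compacts log segments on disk):

```python
def compact(log):
    """Keep only the most recent value for each key.

    `log` is a sequence of (key, value) pairs in write order; a value of
    None models a delete marker ("tombstone"). Returns one entry per
    surviving key.
    """
    latest = {}
    for key, value in log:
        latest[key] = value  # later writes overwrite earlier ones
    # Keys whose last write was a tombstone are dropped entirely.
    return [(k, v) for k, v in latest.items() if v is not None]

log = [("user1", "a"), ("user2", "b"), ("user1", "c"), ("user2", None)]
print(compact(log))  # [('user1', 'c')]
```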

This release also included a big cleanup of the log layer. Most noticeably this does a much better job of managing fsyncs. For those who like smooth latency graphs under load with no performance tuning this is a big win.

We also improved a lot of operational activities that previously required bouncing the brokers. We added commands for adding partitions and deleting topics online. Documentation for using this is here.

We also made all per-topic configurations dynamically managed. This means you can change the retention on a topic, or change its segment file size, with a simple command and no bouncing brokers. These configs are documented here.

We added functionality to automatically balance leadership amongst brokers (previously you had to run a special command to make this happen). You can enable this by setting auto.leader.rebalance.enable=true. We also added code to have leaders proactively transfer leadership during intentional shutdowns; this is more graceful than the transfer that happens with a hard kill. You can enable this with controlled.shutdown.enable=true. We will be enabling both by default once we have a little more experience using them.
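For reference, both settings go in the broker's server.properties file (the setting names are as given above; they default to false in this release):

```properties
# server.properties -- broker settings discussed above
# Periodically move partition leadership back to the preferred replica.
auto.leader.rebalance.enable=true
# On intentional shutdown, migrate leadership away before exiting.
controlled.shutdown.enable=true
```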

There are also dozens of bug fixes and minor improvements.

This is an in-place, no-downtime release. You shouldn’t need to do much other than push out the updated code and do a rolling bounce on your servers. However you may want to glance over the new configs and tools first.

As usual we have been running pre-release versions of this code at LinkedIn for several months now so it should be pretty stable. But if you see anything unexpected please let us know.

So what’s next?

The Kafka committers have been working on a bunch of exciting new stuff. There will probably be a 0.8.2 release in the next month or so with improved consumer offset management (built on top of our new log compaction support) as well as a beta version of a completely rewritten Kafka producer. The final version of the producer and consumer will be in 0.9. If you have thoughts on the api or feature set for the producer or consumer we have been actively discussing them on the mailing list and would love to hear people’s thoughts.

Happy logging!

A Few Notes on Kafka and Jepsen

Kyle has a good write-up on replication and partitions in Kafka. I am a big fan of this methodology (empiricism FTW), though it is a bit uncomfortable to watch one’s own project strapped to the apparatus.

Kyle’s explanation is pretty clear, so I have only a few things to add.

First, as Kyle notes, we have unfortunately not yet beaten the CAP theorem, although to maintain parity with the rest of the distributed systems ecosystem we promise to start working on this as soon as possible.

It’s worth pointing out our application domain to understand our motivations. Kafka is meant as a data journal (like HDFS, but targeted at real-time streams) rather than a metadata journal (like Zookeeper). These often have slightly different design tradeoffs. For example the HDFS namenode handles metadata and now uses a majority vote quorum approach for HA (or just a single master for older versions), but the HDFS data nodes need to be a little bit more parsimonious with the actual data blocks (because people don’t want to have to store five copies of all their data to tolerate two failures). Kyle describes our motivation as handling more failures, which is correct and equivalent, but it is a little more intuitive to think of keeping fewer copies of data since your failure tolerance requirements are probably fixed and you add replicas to meet that. Essentially we want to make replication a practical default approach to handling high-volume real-time data problems.

At LinkedIn, for example, we are running Kafka with replication in production for all data and are doing north of 40 billion writes a day through this replicated log. So the number of writes required is a real and practical concern.

As Kyle points out there is no correct algorithm for guaranteeing consistency in the face of f failures with fewer than 2f+1 servers, but this turns out to not actually require 2f+1 copies of each piece of data. The trick to getting fewer copies of data while maintaining consistency is to treat data writes differently from configuration changes (which, for anyone who is not a log-replication nerd, means changes to the set of brokers replicating writes). This idea is by no means original to us. Cheap Paxos is one of Leslie Lamport’s Paxos variants that does something along these lines with the same goal, and PacificA is a system utilizing a similar technique. This split makes sense for a system designed to handle large data volume because the data will be much, much larger than the metadata about configuration (you might have tens or hundreds of data nodes, but metadata remains tiny).

A log replication algorithm typically needs to guarantee something along the lines of “committed writes are not lost”. In a leader-based log replication algorithm this usually means that the leader must have all the committed writes. To ensure that this property can hold even when a new leader is chosen there must be some overlap between the replicas who have the write and the set of nodes who participate in choosing the new leader (to ensure that the chosen leader has all the committed writes). This is the quorum property. Any overlap between these sets will work: for example you can require a majority vote for the write and a majority vote for leader election, or you can require only a single acknowledgement on write but require unanimous vote to elect a leader (not at all useful, but correct!).
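The overlap condition is simply that the write-quorum size w and election-quorum size e satisfy w + e > n for an n-node cluster. A brute-force check of the property (toy code, not anything from Kafka):

```python
from itertools import combinations

def quorums_overlap(n, write_size, elect_size):
    """True if every write quorum of `write_size` nodes intersects every
    election quorum of `elect_size` nodes in a cluster of `n` nodes."""
    nodes = range(n)
    return all(set(w) & set(e)
               for w in combinations(nodes, write_size)
               for e in combinations(nodes, elect_size))

# Majority writes + majority election: any 3 of 5 overlaps any 3 of 5.
print(quorums_overlap(5, 3, 3))  # True
# One ack on write + unanimous election: correct, but fragile election.
print(quorums_overlap(5, 1, 5))  # True
# One ack on write + majority election: no overlap guarantee.
print(quorums_overlap(5, 1, 3))  # False
```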

Kafka does something a little different: it maintains a dynamic set of in-sync replica brokers (the ISR) that grows and shrinks as failures occur. Each broker in this set must acknowledge each write for it to be committed; as a result, any broker in the ISR has all committed messages and is eligible for election. However, a failure to acknowledge will cause a broker to drop out of the in-sync set, reducing the set of nodes that must acknowledge. This is great, but pushes the consistency problem into maintaining the ISR. We hold the ISR in Zookeeper, which does a full majority quorum for writes. In order to rejoin the ISR a failed node must catch up on replicating the log to come back into sync with the leader.
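A toy model of that behavior (hypothetical names, not Kafka's implementation, and ignoring the Zookeeper bookkeeping): a write commits only when every current ISR member acknowledges it, and a member that fails to acknowledge is dropped from the set.

```python
class IsrModel:
    """Toy model of an in-sync replica set.

    A write commits only when all current ISR members acknowledge it;
    non-acknowledging members are dropped from the ISR."""

    def __init__(self, replicas):
        self.isr = set(replicas)

    def write(self, acked_by):
        # Members that did not acknowledge fall out of the in-sync set.
        self.isr &= set(acked_by)
        # The write commits as long as some in-sync replica has it.
        return bool(self.isr)

partition = IsrModel(["b1", "b2", "b3"])
partition.write(["b1", "b2", "b3"])  # all ack: committed, ISR unchanged
partition.write(["b1"])              # b2, b3 fail: ISR shrinks to {b1}
print(sorted(partition.isr))         # ['b1']
```

Note that once the ISR has shrunk to a single broker, writes keep committing, which is exactly why the scenarios later in this post matter.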

In this sense (as with Cheap Paxos and other variants) there is actually a caveat on the failure tolerance: we can tolerate N-1 Kafka node failures, but only a minority of Zookeeper failures (floor((N-1)/2) for an ensemble of N). This actually is sensible, though, as Kafka node count scales with data size but Zookeeper node count doesn’t. So we would commonly have five Zookeeper replicas but only replication factor 3 within Kafka (even if we have many, many Kafka servers—data in Kafka is partitioned so not all nodes are identical).
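The arithmetic behind that caveat, spelled out (toy illustration; a majority-vote ensemble of N nodes survives the loss of any minority, i.e. floor((N-1)/2) nodes):

```python
def kafka_tolerates(replication_factor):
    # Any single surviving ISR member can become leader.
    return replication_factor - 1

def zookeeper_tolerates(ensemble_size):
    # Majority quorum: a strict majority must stay up.
    return (ensemble_size - 1) // 2

print(kafka_tolerates(3))      # 2 broker failures per partition
print(zookeeper_tolerates(5))  # 2 Zookeeper failures
```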

This approach has pros and cons. The pro is basically fewer copies of data, which makes the kind of large data volume problems we target a lot more practical. The con is primarily having what amounts to two types of quorums (Kafka’s quorum and Zookeeper’s quorum) and more nuanced failure detection. The criteria for a node to be alive thus includes both replicating the leader and maintaining a Zookeeper connection. Failure detection can always be a bit finicky in practice so this is a real issue. But since Kafka already had a Zookeeper dependency for a variety of other uses this seemed an okay tradeoff to make.

The issue Kyle demonstrates makes for a good illustration. In this scenario Kyle kills off all but one node in the ISR, then writes to the remaining node (which is now the leader), then kills this node and brings back the other nodes. I actually think the issue here is what we call “unclean leader election” rather than our approach to quorums or anything specific to network partitions. Any type of failure executed in this pattern should work just as well as network partitions to reproduce this case.

An equivalent scenario can be constructed for a majority vote quorum. For example consider using a majority vote quorum with 3 nodes (so you can tolerate 1 failure). Now say that one of your nodes in this three node quorum is obliterated. If you accept a write with only two servers the failure of another server breaks the quorum property so you will no longer be able to elect a new master or guarantee consistency. I think this is the same as accepting a write in Kafka with only a single remaining server given our quorum properties—both cases are one server failure away from data loss.

The data loss that Kyle causes actually comes from our behavior in handling the case where all nodes die. This is an interesting thing to consider. For example, in the majority vote quorum, what do you do if you have only a single server remaining out of your three node cluster? To maintain your formal distributed system guarantee you need not do anything, since the precondition of your guarantee has been broken. At this point you can technically just printf(“Sucks to be you”) and exit(666). But a practical system likely needs to think this through. After all, you would argue, I still have almost all my data on my one remaining server. The crux is that if you use this data you potentially violate consistency and lose (or gain!) committed writes; if you don’t, you remain consistent, but if you can’t revive your other machines you’ve lost everything.

Is it better to be alive and wrong or right and dead?

As usual the right thing to do depends on the application. Perhaps you can recover the nodes and bring them back with data intact in which case waiting around with the system down may be the best thing. But it may also be that you are running a “system of record” for writes from online traffic in which case downtime equals data loss, and you had better bring back up whatever you’ve got right-the-fuck-now.

Currently we are more than a little aggressive about pursuing the latter approach—when all nodes are down we will elect the first node to come up as the leader. This can be dangerous for use cases which require strong consistency: we will do this, print a nasty warning in the logs about data loss, and continue merrily along.

For Kafka we actually do have both types of application. When used as a log directly by applications, Kafka is the system of record so downtime generally means data loss. In this case I think availability is key and our consistency guarantees are already more than good enough. But for processing downstream of this—say a Samza job that pulls data out of upstream Kafka topics, processes it, and produces new data to downstream topics—downtime need not mean data loss, just delay (provided you can eventually restore a formerly in-sync node).

Kyle’s recommendation of disabling “unclean election” and thus requiring a node in the “in sync set” to come back before accepting further reads and writes is a good one, and I think this provides more flexibility around handling this case. We have had a JIRA to add this for a while, but haven’t quite gotten around to it.

The other suggestion of requiring additional acks beyond the minimum is interesting. This does reduce the probability of a committed message being lost, though doing so reduces the availability for writes. But you could actually make the same argument for a majority quorum algorithm: add a setting that allows raising the minimum acknowledgement above a majority to avoid “risky” writes.

There is a lot more detail on Kafka’s approach to replication and the tradeoffs it involves here. There are a lot of other interesting bits there for fellow distributed systems nerds.

How testing could transform higher education

I think the online education movement is amazing. College classes, like youth, are wasted on the young. College students are usually getting their first chance to live away from their parents and figure out who they are, which, understandably, seems more immediate and alive than 19th century English lit. or linear algebra or whatever class material they happen to study. So for a lot of students college is just the last hoop in a series of hoops they have to jump through before they are allowed to start their own life. That is why I think classes full of people who actually want to learn the subject are a great idea.

The most interesting phenomenon in this transformation is that education is changing from a non-scalable medium like theater or opera performances to a scalable medium like film or mp3s. This has the usual effect: average quality rises dramatically, price drops, and demand rises. It is all well and good to debate whether the Stanford students who took the Machine Learning class in person got a better experience than those who took the class online, but very likely all but a handful of those who took the online class wouldn’t have been admitted as Stanford students at all.

I do think this will transform the university. With a non-scalable medium, taking a class from the third best algorithms professor in the world is a great opportunity, but with a scalable medium it isn’t clear why anyone wouldn’t just want the best. So expect huge pressure to improve the quality of teaching (which is completely lacking today). And expect these top professors to be treated much more like rock stars, and make a lot of money from teaching. (This isn’t surprising: the value created by a class of 200k people is just so much higher than from a class of 150 that it would be surprising if the professor couldn’t capture at least some of it.)

But that isn’t what I want to talk about. I want to talk about testing and how I think testing, done right, could have an equal impact on higher education.

People say that these online classes will never replace colleges, and that may be true in some sense. I think if you break up the value of college into its constituent pieces there are really three parts:

  1. Learning. This gets a lot of the press though, as I said, it may not be the most central aspect for many students.
  2. Certification. It is not enough to know something, you need to prove to others that you know it. Companies don’t have the time to devise deep evaluations of everything you were supposed to learn in school so they use silly heuristics like your grades or the reputation of your school. For many people this is why they go to school, to get a degree.
  3. “The college experience.”

Let’s go through each of these.

Learning is what all articles about MOOCs and online education cover, and though I have a lot to add, I will save it for another time.

By “the college experience” I mean all the non-educational aspects of college. This is the friends, late night conversations, sex, drugs, alcohol, and all that. This is the first time many kids have to move out on their own, away from family and friends who have known them since grade school, and kind of start fresh. For many people who start a career directly after college this may be both the first and the last chance they have to reinvent themselves. But college administrators have no particular expertise in providing this, and fortunately the college experience isn’t that hard to replicate. I think you just need dorms—i.e. housing where lots of young people are close together—and you need kids to move away from where they grew up, and the rest takes care of itself. The housing could probably be cheap and nice too if it weren’t provided by universities, which, whatever their strengths, are not exactly the most efficient landlords.

So that leaves certification. That is what I really want to talk about.

Unlike this article, I think online education companies will make lots of money. The reasons are simple. Education takes a lot of time, so people will pay for better quality. If you are going to spend a few hundred hours in a class you want it to be good, and you would pay a few hundred dollars to get something 10% better for your time. And that doesn’t even address the more irrational reasons. People are used to paying a lot for education, and I think there is an irrational part of human nature that tends to assess prices relative to what they are used to.

But I don’t think producing classes is the best business in this space, and it may not even be the most transformative for education. The best business is certification or testing.

I think people can’t see how important this is because certification and testing is so broken now. How is it broken?

First, it has become customary that colleges get to assess their own students, which, since colleges are paid by the students, has led to the inevitable grade inflation.

Second, colleges have no motivation to make grades good—they don’t benefit by making grades comparable or accurate. How does a 3.3 average from Berkeley compare to a 3.5 from Georgia Tech? Nobody knows and I doubt either Berkeley or Georgia Tech would want to answer that question if they could.

Third, the GPA, the usual summary measurement of grades, is a terrible statistic. It averages together lots of subjects in a way that likely isn’t meaningful for anything in particular other than measuring how much, on average, the candidate cared about getting good grades. It is easily manipulated by taking easy classes, which is exactly the wrong thing to reward. And it values consistency over all else (since getting an A is pretty easy these days, getting a high GPA is more about never screwing up than being particularly good at anything).

I have done a fair amount of hiring, which lets you look at GPAs and then do an in-person assessment. GPAs aren’t worthless, but neither are they worth much.

In short, colleges do a terrible job at assessment, which has led hiring to rely on grades less than it otherwise would.

Outside of grades, most tests kind of suck. The normal “standardized” tests are overly general (one 3 hour test may cover your whole high school or college education). They also try to test subjects like English that are hard to test. Boiling your high school education down to an SAT score or your college education down to a GRE score is silly.

Interestingly the concept of “certification” has arisen in the professional context apart from schools. This is your “Microsoft Certified Systems Engineer” and the like. These certifications have a bad reputation purely because they are pass/fail and aimed at fairly low-end jobs. Having an MCSE is kind of like putting on your resume that you passed high school algebra. It’s not a bad thing, but if you have to say so (or you have to ask) that isn’t good. Harder certifications—an MD, for example—have a better reputation. But any pass/fail test will be aimed at preventing very bad quality rather than finding very good quality (having an MD, after all, doesn’t indicate a good doctor).

But imagine a group of people who care deeply about data working seriously on the idea of assessing skills. First, your score would have to be closer to a percentile rank than pass/fail, and that rank would have to be against other people taking the test. Percentiles are wonderful because you know exactly what they mean (a 99.9 means the candidate was better than 99.9 percent of all test takers) whereas an ‘A’ doesn’t come with that. There are plenty of hard problems to get right: you have to randomize the question pool to avoid cheating, but you have to guarantee a fixed difficulty (can’t have people lucking out and getting all easy questions).
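The core statistic is simple once you have the scores of all test takers; here is a sketch (hypothetical scoring, and it assumes raw scores are comparable across the randomized question pools, which is exactly the hard part noted above):

```python
def percentile_rank(score, all_scores):
    """Fraction of test takers strictly below `score`, as a percentage."""
    below = sum(1 for s in all_scores if s < score)
    return 100.0 * below / len(all_scores)

scores = [55, 62, 70, 70, 81, 88, 93, 97]
print(percentile_rank(93, scores))  # 75.0: better than 6 of 8 takers
```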

The other problem with existing tests is that they are too general. This makes studying for them stressful. Tests should cover a single specific area (i.e. “Linear Algebra”) not a general field (“math”). One can always create scores for general areas by taking scores in a set of tests for core subjects in that area.

I think this kind of testing would need to be done in person over a period of a day or so per subject. This sounds ridiculously time consuming compared to short tests like the SATs, but I think that is not an unreasonable percentage of time to spend on assessment and it would stand in for the “final” since this would be a much better, more comparable test.

I think it is easy to miss how powerful this is. To see it, you have to think about the hiring process as it works today. Let’s say there are three types of hiring: unskilled, skilled but inexperienced, and skilled and experienced. Unskilled hiring is not complicated (“have you ever been convicted of a felony?”). Hiring skilled, experienced people is generally done based on what they have accomplished; if they are really good and have been working for a while, they will have done some big things and have people who will vouch for them. This is going to be better than any test. In other words, LinkedIn can solve that problem. But hiring skilled, inexperienced people is pretty hard because they haven’t had an opportunity to do anything yet.

Let me illustrate this by describing the process for hiring new college graduates for a programming job. These are people who have specialized skills (programming) but no real work experience to judge. This process is pretty standard for good Silicon Valley companies and goes something like this. First you eliminate candidates from all but the best schools. You know that the best programmers at the worst schools are better than the worst programmers at the best schools, but this kind of blunt heuristic is all you have. Then you interview these candidates essentially at random (maybe you look at their projects or internships or whatever, but it is done so quickly it is basically random). The first round of interviews is usually a one-hour phone screen. Assessing a person’s skill set in one hour over the phone is, of course, totally impossible. So you reject lots of good people for silly reasons, like taking a little longer to answer the one question you had time for. Interviewers are generally poorly calibrated against one another, so unless you are a complete failure it matters almost as much whether you get an easy interviewer as how well you answer. This, if successful, will be followed up by a series of in-person one-hour interviews. Refining this process, standardizing the question set, avoiding cheating, and calibrating the scores of your interview process are a huge amount of work and usually done wrong.

But the real inefficiency is this. Once you have invested a few dozen hours in assessment of a candidate, what happens to that assessment? Well, for most candidates, say 95%, the answer is “no hire”. This means that another company does exactly the same thing (pick a good school, ask simplistic questions, etc). Basically all the information gained in the first interview is thrown away. In total a candidate may go through 40 hours of interviews at different companies, but the coverage is terrible since all the companies ask the same questions and don’t share the results.

This problem doesn’t just impact companies, it impacts candidates. Candidates who are fantastic programmers but who lack the right degree, or went to the wrong school will not be given an interview at all. It is just too expensive to risk it because the rejection rate for that group, on average, is a little higher. My friend just went through this. He has a math degree from Harvard, and taught himself programming. After four years working as a concert cellist, he wanted to get into software engineering. The problem was, how to convince companies that you know what you know? They don’t have the time to let you prove that, and most won’t even return your calls. Meanwhile anyone with a Stanford CS degree has companies stalking them at home. This is an inefficient market. The Stanford CS kids are better, on average, but they aren’t that much better.

Now imagine that there is a central organization that does assessment. Let’s say that this assessment is granular (say one test per few classes) and let’s say that it is put together by real data people with a lot of effort to make the test material and the ranking itself high quality.

Naive thinking would lead you to believe that companies would hire by taking resumes, applying their random filtering, and then requesting test scores from this central organization for those resumes. But of course that isn’t how it would work at all. Instead you would just query and rank ALL candidates who met your standards on the skills you cared about.

This is the key to why testing is such a good business. It isn’t about charging test takers and competing with ETS. It is about being the sole entity with the ability to query a database that has all the skills of all the professionals and having a deep and well-calibrated assessment of that skill. There is no reason this testing should be limited to things currently taught in college classes, I think it could extend to any quantifiable skill.

If you believe that education will become very cheap as it moves from a non-scalable to a scalable model then this will result in lots of people who can learn things from these classes, but without the usual ability to certify their knowledge (e.g. a degree from Stanford). Of course the online education providers can try to provide this, but what does it mean to have a degree from Coursera or Udacity? I think this is just a digital imitation of the current (bad) practice.

Obviously testing like this wouldn’t replace interviews. But it would replace the part of interviews that companies are bad at (well calibrated assessment of basic skills) and give more time for assessing fit and personality.

Likely people would resent being tested and scored. But people don’t like being interviewed either, and at least this kind of model would mean shorter, lower-pressure interviews and the ability to “do over” if you happen to get tested on a bad day. And because the query model changes things, taking the tests effectively applies you to all companies, rather than having to interview at each one.

This idea is only really easy to scale for quantitative areas. For math, engineering, and the sciences, testing, when done right, can be very high quality. For these areas there is no reason for silly pre-computer relics like multiple choice; you can give open-ended problems without a pre-defined set of answers. In computer programming you could actually have the person write a computer program.

Non-quantitative disciplines like English are harder to scale, but they are assessable. I think writing can be graded, and I think a really good writing certification could be a lot more useful than college literature classes (which focus on literary critique more than writing) for most uses. So the humanities could be assessed as well, but it would cost more since a human being would need to read the writing.

It’s worth noticing the impact this would have on the world if it succeeded. First of all, I think having a well-calibrated measurement would put a lot more focus on learning the material and much less on how you learned it. No one will care too much which book you read, which online class you took, or what exercises you did. Those are all just ways to learn. Second, this would truly democratize education—the self-taught Bangladeshi is on equal footing with the legacy Harvard admittee.

Another way to say this is that testing, if it is good enough, commoditizes education. This sounds bad, since somehow commodities have gotten a bad rap. But I mean this in the technical sense. A commodity is something that has a measurement of quality which is accurate enough that the details of the good’s production are irrelevant. If you know the ratings for a diamond’s size, clarity, and color, you don’t need to worry about what mine it came from. In the absence of good measurements we are forced to rely on brands or other heuristics. But this is exactly how learning should work. If you learned linear algebra it shouldn’t somehow count for more because that learning happened in a Yale classroom instead of in your bedroom. All that should matter is how well you learned it.

Actually doing this would be a lot of work. Hopefully someone will steal this idea and do it.

Why do write calls sometimes block for a long time in Linux?

Update: Explanation at the bottom.

Anyone know why fwrite() calls sometimes block?

Here is a test I did. I sequentially append 4096 byte chunks to 10 files in a round robin fashion. Throughput is artificially limited to a fixed rate set at about 1/4 the maximum throughput of the drives, which I accomplish by sleeping at subsecond intervals in between my writes. I time each call to write. My expectation is that writes go immediately to pagecache and are asynchronously written out to disk (unless I were to call fsync, which I don’t).

In fact this is usually what happens, the average time is just a few microseconds, but sometimes the write calls block for somewhere between 400 ms and a few seconds. I am using Linux 2.6.32 (RHEL 6) with ext4. I am using default configurations otherwise (no change to any of the /proc/sys/vm stuff, and fiddling with those parameters doesn’t seem to help).
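Here is a sketch of the experiment (Python rather than my original test harness, with the write count scaled way down so it runs quickly): round-robin appends of fixed-size chunks across several files, a sleep between writes to throttle throughput, and per-call timing.

```python
import os
import tempfile
import time

CHUNK = b"x" * 4096
NUM_FILES = 10
WRITES = 200    # scaled down from the original run
SLEEP = 0.001   # throttle: sleep between writes to cap throughput

dirname = tempfile.mkdtemp()
files = [open(os.path.join(dirname, "log%d" % i), "ab")
         for i in range(NUM_FILES)]

latencies = []
for i in range(WRITES):
    f = files[i % NUM_FILES]  # round-robin across the files
    start = time.time()
    f.write(CHUNK)            # should land in pagecache; no fsync
    latencies.append(time.time() - start)
    time.sleep(SLEEP)

for f in files:
    f.close()
print("avg %.6fs max %.6fs"
      % (sum(latencies) / len(latencies), max(latencies)))
```

At these small sizes the spikes won't reproduce; the effect shows up at sustained load over a long run.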

Here is a trace of average and max latencies taken over 1 second intervals. Note the regular spikes. What is the cause of this? Locking between the flush threads and the write thread? Is there anything you can do to mitigate it? This affects anything that does logging—i.e. almost everything.

I understand why this would happen if I exceeded the throughput that Linux’s background flush threads can handle, but that is clearly not the case here as the drive can sustain 400MB/sec of writes over a long period.

I tried this on a few different machines, some with RAID, some with a single disk and all showed the same behavior to varying degrees.

Throughput (MB/sec)   Avg. Latency (ms)   Max Latency (ms)
 99.915               0.000                0.054
 99.973               0.000                0.087
100.005               0.000                0.057
100.089               0.000                0.041
 99.965               0.000                0.071
 99.977               0.000                0.098
 99.999               0.000                0.076
 99.995               0.000                0.104
 99.961               0.000                0.057
100.016               0.000                0.226
 99.966               0.000                0.100
 99.925               0.000                0.093
100.001               0.000                3.070
100.074               0.000                0.084
100.193               0.000                0.054
100.207               0.000                0.053
 99.998               0.000                0.055
 99.985               0.000                0.054
 99.948               0.000                0.061
 99.991               0.000                0.090
 99.973               0.000                0.046
 99.989               0.000               11.923
100.035               0.000                0.041
100.355               0.000                2.698
 99.999               0.000                0.052
100.000               0.000                0.055
 99.963               0.000               12.534
 99.975               0.000                0.058
100.351               0.000                0.044
 99.990               0.000                2.889
100.284               0.000                0.042
 99.931               0.000                0.042
100.218               0.000                0.056
 99.992               0.000                0.065
100.191               0.000                0.057
100.023               0.000                0.401
 99.918               0.000                1.957
100.004               0.000               61.265
 99.938               0.000                0.092
100.179               0.000                0.057
 99.996               0.000                0.062


A number of people gave interesting and helpful suggestions, such as “this is your punishment for not using Solaris.” The best suggestion was from Mark Wolfe which was to install latencytop and measure it. To do this on Red Hat you need to install their debug kernel and reboot with that, then latencytop will capture the highest latency kernel operations for each process. This gives a great deal of insight into what is going on.

For those who are curious here are a few of the traces that pop up as causing hundreds of ms of latency:

        ext4_da_write_begin() [in this case da means delayed allocation]

This trace seems to be due to delayed allocation. Turning off delayed allocation makes it go away, though probably at the cost of some throughput.

Here is another one; this one seems to be related to journalling.
You can check out the details of the code in one of these nice Linux kernel code browsers. My takeaway from all this was that maybe it is time to look at XFS, since that allegedly also has better file locking for fsync, which is my other problem.

Alan Turing On Software Engineers

From Alan Turing’s lecture to the London Mathematical Society in 1947. For context, he had more or less invented the theoretical construct behind computation in 1936, but actual computers barely existed (the very primitive ENIAC was made public in 1946). This lecture was given while he was working on developing the ACE (Automatic Computing Engine), the British answer to the ENIAC.

Roughly speaking those who work in connection with the ACE will be divided into its masters and its servants. Its masters will plan out instruction tables for it, thinking up deeper and deeper ways of using it. Its servants will feed it with cards as it calls for them. They will put right any parts that go wrong…As time goes on the calculator itself will take over the functions both of masters and of servants…The masters are liable to get replaced because as soon as any technique becomes at all stereotyped it becomes possible to devise a system of instruction tables which will enable the electronic computer to do it for itself. It may happen however that the masters will refuse to do this. They may be unwilling to let their jobs be stolen from them in this way. In that case they would surround the whole of their work with mystery and make excuses, couched in well chosen gibberish, whenever any dangerous suggestions were made.

SSDs and Distributed Data Systems

Data systems have always been designed around the limitations of physical hardware. I think of the design of these systems as being a compromise between the external API you want to provide and the ingredients you have to build it with. In particular, a discouraging portion of the lines of code of a database or filesystem are there to try to mask the latency of disk drives. I think most people understand that SSDs are somehow faster, but I wanted to give a few thoughts on what can be done with the current batch of SSDs and what might be possible in the future if the trend in price and size for SSDs continues (and not everyone thinks it will).

Here are the facts you need to know about SSDs:

  1. They eliminate “seek time” and consequently make random reads very, very fast. A commodity SSD can do around 40k random reads per second or 20k random writes per second, which is about 200x what you would expect from a traditional magnetic hard drive.
  2. Linear read and write throughput is better too, but not that much better (say 200 MB/second compared to 50-100MB/sec).
  3. Random failures are less likely than on traditional hard drives as SSDs are not mechanical devices with moving parts. Estimates put this at about 20% of the failure rate, but these kinds of estimates are notoriously inaccurate.
  4. SSDs, like hard drives, are divided into blocks. On SSDs each block can be written only a small number of times before it becomes unusable. This limited number of writes is referred to as “write endurance” of the device.
  5. SSDs have a large block size, often 512KB. This means that if you write 1 byte, the SSD must erase and rewrite a full 512KB block just as if you had written 512KB (this is true of a normal drive, too, but the block size is much smaller). This phenomenon is called write amplification.

So clearly the big change here is the drop in latency for random request streams and the new weakness of limited write endurance.
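The write amplification fact above can be sketched as a quick model. This assumes the worst case, where every write costs at least one full 512KB program-erase cycle; real firmware coalesces writes and usually does better:

```python
# Hypothetical worst-case model of SSD write amplification: every write,
# however small, forces at least one full erase-block program-erase cycle.
ERASE_BLOCK = 512 * 1024  # bytes, the 512KB block size described above

def write_amplification(logical_bytes):
    """Ratio of bytes physically programmed to bytes logically written."""
    blocks_touched = -(-logical_bytes // ERASE_BLOCK)  # ceiling division
    return (blocks_touched * ERASE_BLOCK) / logical_bytes
```

A 1-byte write is amplified by a factor of 512K under this model, while a full-block write has no amplification at all.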

How SSDs may impact data system design

A large part of system design is driven by the cost, latency, and throughput ratios for different types of storage and networking technologies. This is not very different from how the laws of physics constrain the possible designs for a hammer or a kitchen sink, except that in the case of data systems the laws of physics change over time. One fun presentation of some of the trends in latency and throughput is "Latency Lags Bandwidth" by David Patterson. SSDs represent an area where latency suddenly gets many orders of magnitude better, invalidating a lot of design tradeoffs in existing systems. It is pure speculation how this kind of change will affect the design of systems, but a good heuristic is to assume that people eventually move towards the design that provides the best price-performance tradeoff.

For distributed data systems the big change SSDs introduce is the relative latency of a random disk access versus a remote network hop. In the case of a traditional hard drive a single seek may have a latency cost easily 10 or 20x that of a TCP request on a local network, which means a remote cache hit is much cheaper than a local cache miss. SSDs essentially erase this difference, making the two fairly close in terms of latency. The consequence should be favoring designs that store more data per machine and do fewer network requests.
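The comparison can be made concrete with some assumed, order-of-magnitude latency figures (the specific numbers here are illustrative, not measurements):

```python
# Ballpark latencies (assumed, order-of-magnitude figures only).
HDD_SEEK_MS   = 10.0   # random access on a spinning drive
LAN_RTT_MS    = 0.5    # TCP round trip on a local network
SSD_ACCESS_MS = 0.1    # random read on a commodity SSD

def local_miss_over_remote_hit(storage_ms, network_ms):
    """How many remote cache hits one local cache miss is worth."""
    return storage_ms / network_ms
```

With these numbers a local disk miss costs twenty remote hits, while an SSD miss costs only a fraction of one network round trip, which is what tilts the design toward storing more data locally.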

The extreme version of this for databases is just abandoning partitioning altogether and storing all data on all machines. After all, if the write load is small and the data size isn’t going to go beyond 1TB, then an unpartitioned, replicated MySQL or Postgres may be good enough. Since partitioning adds huge constraints on the richness of queries that can be implemented, there may be something to be said for unpartitioned designs in cases where the data size isn’t going to grow without bound and the number of writes is tolerable (since in an unpartitioned design all writes must go to all nodes).

Eliminating partitioning is one way to reduce network hops. To take this line of thinking further one could actually co-locate the storage and the application it serves. This is a radical re-architecture, but could make a certain amount of sense. A network server often peaks at around 40,000 requests-per-second. This is not a limitation for a data system serving a random request stream for potentially disk-resident data stored on traditional hard drives, as each drive can only do a few hundred random accesses per second. But if the drive were capable of hundreds of thousands of operations, the network request throughput limit might become a real limitation, and one might think about just co-locating the data and the application. The second reason this might make sense is that the data system is no longer a common integration layer for allowing multiple applications to share the same data. Implicit in the design of current RDBMSs is that multiple applications will access the same set of tables (hence the need for so much structure and correctness checking in the database, so it can be shared by multiple applications safely without validating the code in all the applications). In reality, the modern way of sharing data amongst applications is not through direct database access against a shared DB but via some kind of service API (REST, usually). There are still a lot of non-performance-related reasons to keep data systems in the client-server mode (for example, allowing you to scale the CPU-intensive part of the application separately from the IO-intensive part), but if the performance argument became strong enough these reasons might not be enough.

Less radically, SSDs will likely change how caching is done. Many web companies have large memcached installations. Memcached is very good at serving high throughput at low latency on a small dataset, but since everything is in RAM it is actually rather expensive if you are space- rather than CPU-bound. If you place 32GB of cache per server, then 5TB of total cache space requires 160 servers. Having 5 servers each with 1TB of SSD space may be a huge win. Furthermore caching in RAM has a practical problem: restarts dump a full server’s worth of cache. This is an annoyance if you need to restart your cache servers frequently, or if you need to bring up a new stack with a completely cold cache, since you may not actually be able to run your application without any caching (if you can, then why have caching at all?).
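The server-count arithmetic above is simple enough to write down (the cache sizes and the 5TB target are the illustrative figures from the paragraph, not a real deployment):

```python
# Sketch of the cache-sizing arithmetic: how many servers to hold a cache.
GIB = 2**30
TIB = 2**40

def servers_needed(total_cache_bytes, per_server_bytes):
    return -(-total_cache_bytes // per_server_bytes)  # ceiling division

ram_servers = servers_needed(5 * TIB, 32 * GIB)  # 32GB of RAM per server
ssd_servers = servers_needed(5 * TIB, 1 * TIB)   # 1TB of SSD per server
```

This reproduces the 160-server RAM deployment versus the 5-server SSD deployment.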

For real-time applications SSDs also enable access patterns that involve many seeks per request in a latency-sensitive application. For example light graph traversal against a disk-resident data set is feasible on SSDs but generally not on traditional hard drives. Graphs are generally not amenable to clean partitioning (they are, by nature, intertwined). As a result, a disk resident blob store running on SSDs may now be a pretty good way to implement social graph or “follower” functionality.

For offline applications, reduction in latency makes random access a possibility once again. MapReduce was designed, in large part, to work with only linear I/O patterns and eliminate random I/O. This is a huge performance win for batch processing on disk-resident data sets. But to accomplish this requires providing only a fairly limited programming model. This means Hadoop cannot easily implement the equivalent of a “hash join” or similar things except when the hash fits in memory. It is interesting to think how MapReduce might have been differently designed in a world where local random reads and writes were cheap.
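For reference, the "hash join" mentioned above is easy to sketch. This is a generic in-memory hash join in plain Python (the function and its parameters are illustrative, not Hadoop API), and it shows exactly the restriction the paragraph describes: the build side must fit in memory.

```python
# In-memory hash join: build a hash table on the smaller input,
# then stream the larger input past it, probing with one lookup per row.
def hash_join(small, large, key_small, key_large):
    table = {}
    for row in small:                       # build phase (must fit in RAM)
        table.setdefault(key_small(row), []).append(row)
    out = []
    for row in large:                       # probe phase: streamed
        for match in table.get(key_large(row), []):
            out.append((match, row))
    return out
```

With cheap random I/O the build table could instead live on SSD, relaxing the fits-in-memory restriction.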

Making SSDs Cheap Without Losing All Your Data

A move to SSDs also impacts the internals of data systems. Traditional B+Trees or hashes are no longer the most appropriate persistent data structures. This is not due to the drop in latency but due to the write endurance problem. Moving a database with a traditional storage engine to commodity SSDs will likely be quite fast, but the SSDs may stop working after a few months!

A little background. SSDs currently come in two flavors: enterprise-grade (SLC) and consumer-grade (MLC). Enterprise-grade SSDs are quite expensive (comparable to memory prices) and so won’t be an option for many scale-out deployments. They are a good option for small deployments, though, where the high cost is less of an issue. SSDs have other differences in the sophistication of the firmware and whether they attach through the PCI bus or via SATA, but these things are less important. If you are used to mechanical hard drives, virtually any SSD will be faster than you know what to do with. The network access time will probably eliminate any more nuanced performance differences between SSD types.

For example, when I started doing experiments with MLC drives for storage I was warned that consumer-grade devices would have periodic large latency spikes as they did various internal compaction operations. This is true: there were occasional large spikes in 99th percentile time. But although the spikes are huge in relative terms, they are extremely minor in absolute terms (under 1 ms). In comparison to memory this kind of variance is terrible, but in comparison to disk access the SSD’s 99th percentile looks closer to the hard drive’s uncached 1st percentile access time.

The important difference between MLC and SLC is the number of writes they can handle. I will walk through the basic arithmetic on how to model this. SSDs are broken into blocks that are usually around 512KB, and each write must first erase and then rewrite an entire 512KB block. Each block can only be erased some number of times before it starts corrupting and returning the wrong data. To avoid this the SSD manufacturers seem to cap each block at a fixed number of program-erase cycles. This means that after a certain number of writes to a particular block that block will stop accepting writes. SLCs and MLCs both work the same in this respect, except that SLCs will take roughly an order of magnitude more writes per block before crapping out.

Here is a table that compares prices and write-capacity per block for MLC, SLC, RAM, and SAS drives. These are taken at random off the internet, and your mileage would certainly vary, but this gives some idea of the pricing as of May 2012.

                        Cost/GB   Program-Erase Cycles
RAM                     $5-6      Unlimited
15k RPM SAS Hard Drive  $0.75     Unlimited
MLC SSD                 $1        5,000-10,000
SLC SSD                 $4-6      ~100,000

One obvious conclusion is that SLC SSDs are priced roughly the same as memory. In a sense SLC SSDs are better than memory since they are persistent, but if keeping all your data in memory sounds expensive then so will SLCs. And in any case you can’t eliminate memory caching entirely, as at least some part of the index likely needs to reside in memory even with the faster access times SSDs provide.

The MLCs, on the other hand, are actually very close in price to good hard disks. There is a slight price premium, but for most online data systems this is misleading. Since most real-time storage systems are bound by seek capacity, not data size or CPU, increasing the number of seeks available may well reduce overall capacity needs very significantly. With more seeks per machine each machine can handle more requests. For our uses we found we could comfortably take at least 5-10x more requests per machine before we started to hit limits around disk space and CPU. So the question is, is there a way to live with the low write-endurance of the MLC devices and still get the great performance and cost?

Here is where the complex dependency between the storage format and the SSD comes in. If your storage engine does large linear writes (say a full 512KB block or more) then calculating the number of writes you can do on one drive before it is all used up is easy. If the drive has 300GB and each block can be rewritten 5,000 times then each drive will allow writing 5000 x 300GB (about 1.4 petabytes). Let’s say you have 8 of these in a box with no RAID and that box takes 50MB/sec of writes 24 hours a day evenly balanced over the drives, then the drives will last around 7.8 years. This should be plenty of time, for most systems. But this lifetime is only realistic for large linear writes—the best possible case for SSD write-endurance.

The other case is that you are doing small random writes immediately sync’d to disk. If you are doing 100 byte random writes and the SSD’s internal firmware can’t manage to somehow coalesce these into larger physical writes then each 100 byte write will turn into a full program-erase cycle of a full 512KB block. In this case you would expect to be able to do only 5000*100 bytes = 500KB of writes per block before it died; so a 300GB drive with 300GB/512KB = 614,400 blocks would only take around 286GB of writes total before crapping out. Assuming, again, 8 of these drives with no RAID and 50MB/sec, you get a lifetime of only about half a day. This is the worst case for SSDs and is obviously totally unworkable.
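The endurance arithmetic of the last two paragraphs can be checked in a few lines of code. The figures are the ones assumed above (300GB drives, 512KB erase blocks, 5,000 program-erase cycles, 8 drives, 50MB/sec of writes); the exact lifetimes shift a little depending on whether you use decimal or binary gigabytes:

```python
# Model SSD wear-out: each program-erase cycle carries some amount of
# useful data -- a full block in the linear case, 100 bytes in the
# worst-case small-random-write case.
BLOCK = 512 * 1024              # 512KB erase block
CAPACITY = 300 * 2**30          # ~300GB drive
PE_CYCLES = 5_000               # MLC-class write endurance per block
SECONDS_PER_DAY = 86_400

def drive_lifetime_bytes(useful_bytes_per_cycle):
    """Total useful bytes writable before the drive wears out."""
    blocks = CAPACITY // BLOCK  # 614,400 blocks on a 300GB drive
    return blocks * PE_CYCLES * useful_bytes_per_cycle

def box_lifetime_days(drives, per_drive_bytes, write_rate=50 * 10**6):
    """Days a box of drives survives a steady, evenly balanced write load."""
    return drives * per_drive_bytes / write_rate / SECONDS_PER_DAY

linear_days = box_lifetime_days(8, drive_lifetime_bytes(BLOCK))  # ~8 years
worst_days  = box_lifetime_days(8, drive_lifetime_bytes(100))    # ~half a day
```

The two results reproduce the roughly eight-year lifetime for large linear writes and the roughly half-day lifetime for 100-byte sync'd writes.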

It is worth noting that it doesn’t actually matter if the writes to the filesystem are large or small, so long as the writes to the physical disk are large. If the writes are linear (that is, written in order on disk) and are not sync’d to the physical drive by the OS until significant data has accumulated, that will suffice. A number of small sequential writes to the filesystem will be coalesced into a single large write to the physical device by the operating system’s I/O scheduler, provided there is no intervening call to fsync (or the equivalent).

To address this limitation SSDs attempt to implement some kind of internal write-ahead format to try to turn random I/O into a larger linear set of writes. However the efficacy of this seems to vary greatly by device, and I consider it a bit dangerous to bet your data on it working for all workloads. Likewise this can introduce a new read fragmentation problem since the updates that are meant to be co-located are actually scattered all over in different blocks.

A better option is just to use a storage format that naturally does linear writes. Traditional storage formats include the B+Tree and linear hashing. These formats group data into blocks by key, and hence writes are randomly scattered on the disk unless the write order happens to match the key ordering (which you can’t count on except in bulk-load situations where you can choose the order in which records are updated). Buffering may help this a bit, but when the buffer is flushed it is likely that a random subset of blocks will be rewritten with small changes to each. Log-structured formats are the other alternative: they store data in the order in which it was written, and hence always do linear writes. Log-structured merge trees and Google’s SSTable variation are examples of this. Various hashing and tree formats can all be designed in a log-structured manner.
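The core idea of a log-structured format fits in a few lines. This is a hypothetical, Bitcask-flavored simplification (no real file, no compaction): writes are pure appends, and an in-memory index maps each key to the latest offset in the log.

```python
# Minimal sketch of a log-structured key-value store: every put is an
# append (a linear write), and reads go through an in-memory offset index.
class LogStore:
    def __init__(self):
        self.log = bytearray()   # stand-in for an append-only file
        self.index = {}          # key -> (offset, length) of latest value

    def put(self, key, value: bytes):
        offset = len(self.log)
        self.log += value        # append only: never rewrites old blocks
        self.index[key] = (offset, len(value))

    def get(self, key):
        offset, length = self.index[key]
        return bytes(self.log[offset:offset + length])
```

Updating a key just appends a new copy and moves the index pointer; the stale copy is reclaimed later by compaction, which a real implementation would need.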

On traditional hard drives the trade-off between in-place and log-structured storage is a bit of a wash. Write performance for log-structured storage is vastly better, but most applications have more reads than writes. Read performance may be better or worse depending on the details of the implementation (a traditional log-structured merge tree is definitely worse for reads since it does multiple seeks for each uncached read, but hashing variants or SSTables that use bloom filters to avoid unnecessary lookups need not be). However the move to SSDs completely changes this dynamic. Since SSDs have fast seek performance grouping data by key is much less important. Using a log-structured storage format makes it possible to use cheap consumer grade MLC SSDs even under high write loads that would wear out the drive in a few months if run on an in-place storage format.

A particularly important factor in making this work is whether the storage engine requires an immediate fsync to disk with each write. Many systems do require this for data integrity. An immediate fsync will, of course, require a small write unless the record being written is itself large. On a single-node system avoiding fsync may mean that the last few records may be lost in a crash. On a properly designed distributed system, though, this need not be the case—replication to other nodes can take the place of the flush to disk. Replication and disk sync have different failure modes (i.e. in the case of a power outage where all nodes fail simultaneously replication won’t help, whereas in the case of disk corruption or total machine death flush generally doesn’t help). So an important question to ask about any data system design to determine if it is going to be SSD friendly is whether it requires immediate flushing of data to disk. In particular a good system should be able to give replication guarantees without waiting on the disk flush.

I did a pretty detailed evaluation of some SSDs about a year back in coordination with the Voldemort team at LinkedIn. One of the tests was replaying a production I/O trace at high speed to simulate the equivalent of 5 years of production load on BDB JE. The wear does indeed work exactly as modeled: buffered linear writes make it possible to get almost a decade of high-write usage out of a cheap MLC. The Voldemort guys wrote up a few more details on work they did to make SSDs work well with BDB JE.

If you are evaluating storage engines, here is a quick overview of which ones are in-place or log-structured. InnoDB, BDB, Tokyo and Kyoto Cabinet, as well as MongoDB, all have in-place storage formats and are not well-suited to cheap SSDs under high write load. LevelDB, BDB-JE, Krati, Cassandra's SSTable implementation, and Bitcask are all log-structured. Riak and Voldemort both support pluggable storage engines and default to a log-structured format. The Cassandra folks at Datastax did a pretty good presentation on running Cassandra on SSDs.

An interesting question is whether cloud hosting providers will rent instances with SSDs any time soon. The write-endurance problem makes SSDs somewhat problematic for a shared hosting environment, so they may need to add a way to bill on a per block-erase basis. I have heard rumors that Amazon will offer them, but I have no idea in what form or how much they will cost (which is the critical detail).

Mike Daisy and Globalization

People may have seen the episode of This American Life that retracted the previous piece with Mike Daisy. It is a pretty uncomfortable thing to listen to. The relevant thing is that Mike Daisy fabricated many of the details of his first-hand account to make Chinese labor conditions look much harsher than they are. Then he lied to the NPR fact-checkers to avoid their uncovering this.

The question is, does it matter that he lied? Everyone agrees Chinese labor conditions are harsh in comparison to western standards. His argument is that although, yes, he did make up much of his story, it could have been true. Critics say that this doesn’t really undermine his main point; counter critics say that it does.

Personally I think it is mostly an irrelevant distraction. Globalization is literally remaking the lives of billions of people, and whether you believe the change is good or bad should be primarily an empirical discussion about whether things are getting better or worse in these countries. It shouldn’t be carried out in overly emotional, under-reasoned stage performances. Whether or not Mike Daisy’s anecdotes about his visit to China actually took place shouldn’t really matter to us because we shouldn’t make decisions about global trade policy on the basis of theatrical anecdotes.

The important debate we should be having is about whether globalization—particularly shipping low-skill work to other countries with lower wages—is a good or bad thing for these countries. Popular opinion is that it is bad because it enriches large corporations by exploiting workers. However there is an argument to be made that it is good. That argument is best laid out in this 1997 article by Paul Krugman. I won’t repeat the content of the article because I think it’s short and worth reading.

I read that article in the late nineties and it greatly changed my perception of economics and the debate around globalization. The reason I think everyone should read it is because it phrases the debate in terms of what is good for third world countries rather than emotional appeals. Economists see globalization as a huge transfer of money, technology, and skills from the first world to the third world. They see this as permanently transforming the standard of living in these countries in a way that a century of poverty relief and charitable aid hasn’t. I think the focus of the discussion should be less about whether Mike Daisy lied, and more about whether these beliefs are true.

I am not an economist and I haven’t studied this, so I don’t know if this is correct. But my values are pretty straightforward. I think people born in the US won the equivalent of the geographical lottery and have opportunities that are either completely out of reach or exceedingly unlikely for people in other parts of the world. We have a moral obligation to strive for equalization; I would support this process even if it were bad for the US (which it doesn’t seem to be). I come from a particularly liberal family, so it particularly galls me that the left seems willing to decide an issue that one way or another deeply impacts the lives of billions of people purely on the basis of a knee-jerk belief that large corporations are evil.

Look, we could learn that this doesn’t work the way we thought. Maybe the manufacturing jobs do not lead to the development of industry the way they’re supposed to. Or maybe increasing economic vitality doesn’t lead to greater political freedom in repressive regimes. Or maybe there are other unforeseen consequences. Economists are, after all, the people most likely to fuck up a country on an unproven but intellectually stimulating theory. But let’s talk about the evidence of these things happening or not happening.

For people who oppose globalization, I think the question is how you will accomplish this global equalization. It isn’t enough to have a vague esthetic icky feeling about buying a phone made by people who make less than you; you have to have a plausible way to fix this inequality. The people who have thought the most about this think that buying the phone is the best way to fix it, so it is really important to be sure they are wrong before you try to stop people from buying phones.

Getting Real About Distributed System Reliability

There is a lot of hype around distributed data systems, some of it justified. It’s true that the internet has centralized a lot of computation onto services like Google, Facebook, Twitter, LinkedIn (my own employer), and other large web sites. It’s true that this centralization puts a lot of pressure on system scalability for these companies. It’s true that incremental and horizontal scalability is a deep feature that requires redesign from the ground up and can’t be added incrementally to existing products. It’s true that, if properly designed, these systems can be run with no planned downtime or maintenance intervals in a way that traditional storage systems make harder. It’s also true that software that is explicitly designed to deal with machine failures is a very different thing from traditional infrastructure. All of these properties are critical to large web companies, and are what drove the adoption of horizontally scalable systems like Hadoop, Cassandra, Voldemort, etc. I was the original author of Voldemort and have worked on distributed infrastructure for the last four years or so. So insofar as there is a “big data” debate, I am firmly in the “pro-” camp. But one thing you often hear is that this kind of software is more reliable than the traditional alternatives it replaces, and this just isn’t true. It is time people talked honestly about this.

You hear this assumption of reliability everywhere. Now that scalable data infrastructure has a marketing presence, it has really gotten bad: if Hadoop or Cassandra or what-have-you can tolerate machine failures, then they must be unbreakable, right? Wrong.

Where does this come from? Distributed systems need to partition data or state over lots of machines to scale. Adding machines increases the probability that some machine will fail, and to address this these systems typically have some kind of replicas or other redundancy to tolerate failures. The core argument that gets used for these systems is that if a single machine has probability P of failure, and if the software can replicate data N times to survive N-1 failures, and if the machines fail independently, then the probability of losing a particular piece of data must be P^N. So for any desired reliability R and any single-node failure probability P you can pick some replication factor N so that P^N < R. This argument is the core motivation behind most variations on replication and fault tolerance in distributed systems. It is true that without this property the system would be hopelessly unreliable as it grew. But this leads people to believe that distributed software is somehow innately reliable, which unfortunately is utter hogwash.
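The P^N argument can be written down directly. This is a sketch of the textbook calculation under the independence assumption that the next paragraphs pick apart (the example probabilities are illustrative, not measurements):

```python
import math

def replicas_needed(p, target):
    """Smallest N with p**N < target, assuming independent node failures."""
    n = math.ceil(math.log(target) / math.log(p))
    while p ** n >= target:  # guard against floating-point boundary cases
        n += 1
    return n
```

For example, a 1% single-node failure probability and a 1e-9 data-loss target yields N = 5 replicas on paper; the point of this article is that correlated failures make the real answer much worse.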

Where is the flaw in the reasoning? Is it the dreaded Hadoop single points of failure? No, it is far more fundamental than that: the problem is the assumption that failures are independent. Surely no belief could possibly be more counter to our own experience or just common sense than believing that there is no correlation between failures of machines in a cluster. You take a bunch of pieces of identical hardware, run them on the same network gear and power systems, have the same people run and manage and configure them, and run the same (buggy) software on all of them. It would be incredibly unlikely that the failures on these machines would be independent of one another in the probabilistic sense that motivates a lot of distributed infrastructure. If you see a bug on one machine, the same bug is on all the machines. When you push bad config, it is usually game over no matter how many machines you push it to.

P^N is an upper bound on reliability, but one that you could never, never approach in practice. For example Google has a fantastic paper that gives empirical numbers on system failures in Bigtable and GFS, reporting groups of failures at rates several orders of magnitude higher than the independence assumption would predict. And this is what one of the best system and operations teams in the world can get: your numbers may be far worse.

The actual reliability of your system depends largely on how bug-free it is, how good you are at monitoring it, and how well you have protected against the myriad issues and problems it has. This isn’t any different from traditional systems, except that the new software is far less mature. I don’t mean this disparagingly; I work in this area, and it is just a fact. Maturity comes with time, usage, and effort. This software hasn’t been around for as long as MySQL or Oracle, and, worse, the expertise to run it reliably is much less common. MySQL and Oracle administrators are plentiful, but folks with, say, serious production ZooKeeper operations experience are much rarer.

Kernel filesystem engineers say it takes about a decade for a new filesystem to go from concept to maturity. I am not sure these systems will be mature much faster—they are not easier systems to build and the fundamental design space is much less well explored. This doesn’t mean they won’t be useful sooner, especially in domains where they solve a pressing need and are approached with an appropriate amount of caution, but they are not yet by any means a mature technology.

Part of the difficulty is that distributed system software is actually quite complicated in comparison to single-server code. Code that deals with failure cases and is “cluster aware” is extremely tricky to get right. The root of the problem is that dealing with failures effectively explodes the possible state space that needs testing and validation. For example it doesn’t even make sense to expect a single-node database to be fast if its disk system suddenly gets really slow (how could it be?), but a distributed system does need to carry on in the presence of a single degraded machine, because with so many machines one is sure to be degraded. These kinds of “semi-failures” are common and very hard to deal with. Correctly testing these kinds of issues in a realistic setting is brutally hard, and the newer generation of software doesn’t have anything like the QA processes of its more mature predecessors. (If you get a chance, get someone who has worked at Oracle to describe the kind of testing a line of code goes through before it gets shipped to customers.) As a result there are a lot of bugs. And of course these bugs are on all the machines, so they absolutely happen together.

Likewise distributed systems typically require more configuration and more complex configuration because they need to be cluster aware, deal with timeouts, etc. This configuration is, of course, shared; and this creates yet another opportunity to bring everything to its knees.

And finally these systems usually want lots of machines. And no matter how good you are, some part of operational difficulty always scales with the number of machines.

Let’s discuss some real issues. We had a bug in Kafka recently that led to the server incorrectly interpreting a corrupt request as a corrupt log, and shutting itself down to avoid appending to a corrupt log. Single-machine log corruption is the kind of thing that should happen due to a disk error, and bringing down the corrupt node is the right behavior—it shouldn’t happen on all the machines at the same time unless all the disks fail at once. But since this was due to corrupt requests, and since we had one client that sent corrupt requests, it was able to sequentially bring down all the servers. Oops. Another example is this Linux bug which causes the system to crash after ~200 days of uptime. Since machines are commonly restarted sequentially this led to a situation where a large percentage of machines went hard down one after another. Likewise any memory management problems—either leaks or GC problems—tend to happen everywhere at once or not at all. Some companies do public post-mortems for major failures and these are a real wealth of examples of failures in systems that aren’t supposed to fail. This paper has an excellent summary of HDFS availability at Yahoo—they note how few of the problems are of the kind that high availability for the namenode would solve. This list could go on, but you get the idea.

I have come around to the view that the real core difficulty of these systems is operations, not architecture or design. Both are important, but good operations can often work around the limitations of bad (or incomplete) software, while good software cannot run reliably with bad operations. This is quite different from the view of unbreakable, self-healing, self-operating systems that I see being pitched by the more enthusiastic NoSQL hypesters. Worse yet, you can’t easily buy good operations the way you can buy good software—you might be able to hire good people (if you can find them), but this is more than just people; it is practices, monitoring systems, configuration management, etc.

These difficulties are one of the core barriers to adoption for distributed data infrastructure. LinkedIn and other companies that have a deep interest in doing creative things with data have taken on the burden of building this kind of expertise in-house—we employ committers on Hadoop and other open source projects on both our engineering and operations teams, and have done a lot of from-scratch development in this space where there were gaps. This makes it feasible to take full advantage of an admittedly valuable but immature set of technologies, and lets us build products we couldn’t otherwise—but this kind of investment only makes sense at a certain size and scale. It may be too high a cost for small startups or for companies outside the web space trying to bootstrap this kind of knowledge inside a more traditional IT organization.

This is why people should be excited about things like Amazon’s DynamoDB. When DynamoDB was released, DataStax, the company that supports and leads development on Cassandra, released a feature comparison checklist. The checklist was unfair in many ways (as these kinds of vendor comparisons usually are), but the biggest thing missing from the comparison is that you don’t run DynamoDB, Amazon does. That is a huge, huge difference. Amazon is good at this stuff, and has shown that they can (usually) support massively multi-tenant operations with reasonable SLAs, in practice.

I think there is really only one thing to talk about with respect to reliability: continuous hours of successful production operations. That’s it. In some ways this is the most obvious thing, but it is not typically what you hear when people talk about these systems. I will believe a system can tolerate (some) failures when I see it tolerate those failures; I will believe it can run for a year without downtime when I see it run for a year without downtime. I call this empirical reliability (as opposed to theoretical reliability). And getting good empirical reliability is really, really hard. These systems end up being large hunks of monitoring, tests, and operational procedures with a teeny little distributed system strapped to the back.

You see this showing up in discussions of the CAP theorem all the time. The CAP theorem is a useful thing, but it applies more to system designs than to system implementations. A design is simple enough that you can maybe prove it provides consistency or tolerates partition failures under some assumptions. This is a useful lens for looking at system designs. You can’t hope to do this kind of proof with an actual system implementation: the thing you actually run. The difficulty of building these things means it is really unthinkable that these systems are, in actual reality, either consistent, available, or partition tolerant—they certainly all have numerous bugs that will break each of these guarantees. I really like this paper, which takes the approach of actually trying to calculate the observed consistency of eventually consistent systems—they seem to do it via simulation rather than measurement, which is unfortunate, but the idea is great.

It isn’t that system design is meaningless. It is worth discussing, since the design acts as a kind of limiting factor on certain aspects of reliability and performance as the implementation matures and improves. But don’t take it too seriously as guaranteeing anything.

So why isn’t this kind of empirical measurement more talked about? I don’t know. My pet theory is it has to do with the somewhat rigid and deductive mindset of classical computer science. This is inherited from pure math, and conflicts with the attitude in scientific disciplines. It leads to a preference for starting with axioms, and then proving various properties that follow from these axioms. This world view doesn’t embrace the kind of empirical measurements you would expect to justify claims about reality (for another example see this great blog post on programming language productivity claims in programming language research). But that is getting off topic. Suffice it to say, when making predictions about how a system will work in the real world I believe in measurements of reality a lot more than arguments from first-principles.

I think we should insist on a little more rigor and empiricism in this area.

I would love to see claims in academic publications around practicality or reliability justified the same way we justify performance claims—by doing it. I would be a lot more likely to believe an academic distributed system was practically feasible if it were run continuously under load for a year, successfully, and if information were reported on failures and outages. Maybe that isn’t feasible for an academic project, but few other allegedly scientific academic disciplines can get away with making claims about reality without evidence.

More broadly I would like to see more systems whose operation is verifiable. Many systems can log out information about their state in a way that makes programmatic checking of various invariants and guarantees (such as consistency or availability) possible. An example of such an invariant for a messaging system is that all the messages sent to it are received by all the subscribers. We actually measure the truth of this statement in real time, in production, for our Kafka data pipeline, for all 450 topics and all their subscribers. The number of corner cases one uncovers with this kind of check, run through a few hundred use cases and a few billion messages per day, is truly astounding. I think this is a special case of a broad class of verification that can be done on the running system, one that goes far, far deeper than what is traditionally considered either monitoring or testing. Call it unit testing in production, or self-proving systems, or next-generation monitoring, or whatever, but this kind of deep verification is what turns the theoretical claims a design makes into measured properties of the real running system.
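As a hypothetical sketch of this kind of check (the real pipeline audit is windowed by time and run per subscriber, which this simplification ignores), comparing per-topic send counts against what a subscriber reports receiving might look like:

```python
def audit(sent_counts, received_counts):
    """Check the invariant 'every message sent is received', per topic.

    Both arguments map topic name -> message count; returns a dict of
    topics where the counts disagree, as (sent, received) pairs.
    """
    discrepancies = {}
    for topic, sent in sent_counts.items():
        received = received_counts.get(topic, 0)
        if received != sent:
            discrepancies[topic] = (sent, received)
    return discrepancies

# A subscriber that silently dropped a message shows up immediately:
print(audit({"page-views": 1000, "ad-clicks": 50},
            {"page-views": 1000, "ad-clicks": 49}))
# → {'ad-clicks': (50, 49)}
```

The topic names here are made up; the point is only that the invariant becomes a running assertion rather than a design claim.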

Likewise if you have a “NoSQL vendor” I think it is reasonable to ask them to provide hard information on customer outages. They don’t need to tell you who the customer is, but they should let you know the observed real-life distribution of MTTF and MTTR they are able to achieve, not just highlight one or two happy cases. Make sure you understand how they measure this: do they run an automated test load, or just wait for people to complain? This is a reasonable thing for people paying for a service to ask for. To a certain extent, if they provide this kind of empirical data it isn’t clear why you should even care what their architecture is, beyond intellectual curiosity.

Distributed systems solve a number of key problems at the heart of scaling large websites. I absolutely think this is going to become the default approach to handling state in internet application development. But no one benefits from the kind of irrational exuberance that currently surrounds the “big data” or NoSQL systems communities. This area is experiencing a boom—but nothing takes us from boom to bust like unrealistic hype and broken promises. We should be a little more honest about where these systems already shine and where they are still maturing.

Scala Macros: “Oh God Why?”

Since this is now being read by people I don’t know, let me add this preface. Scala is one of the better things happening in programming. Any complaints come from the fact that I want it to work as more than a playground for innovation in language design and I think there is always a tension between innovation and refinement. Refutations from Martin Odersky, surely one of the world’s more calm and reasonable people, and Havoc Pennington provide a good counter-balance.

I saw this conversation on twitter:

This was my reaction to the Scala macros proposal too. Not because there is anything necessarily bad about macros or the proposal, but just because—is this really the most critical thing? For context Coda Hale works at Yammer, Stu Hood works at Twitter, Ted Nyman works at (Bank) Simple, and I work at LinkedIn. Each of these companies is trying to write stuff in Scala and so we have a high investment in Scala working out as a language that is good for writing production code (not just as a language that tests out new ideas, or that shows solidarity with the functional programming movement, or that acts as a signaling mechanism to tell potential applicants you are hipper than the average Java programmer).

I think the Scala folks have done a good job of putting together something that is significantly better than Java as a language. It’s fast, concise, statically type-checked, and inherits a lot of the good operational properties of the JVM. But languages aren’t just, well, languages; and they aren’t even runtime environments for that matter. They are platforms and tool chains and libraries and bodies of documentation, and cultures, and practices. And here Scala does not yet do so well.

My personal experience with Scala came from working on Apache Kafka, which is a distributed messaging system that serves as LinkedIn’s data pipeline. So I have been working in Scala for the last few years. I still think using Scala was a good decision for a hard, stand-alone piece of software like this, and I am glad we did. Scala fixes all kinds of nastiness in the Java language. But it isn’t all roses: Scala has all kinds of gaps everywhere else. You end up with very pretty code, but gobs of practical annoyances in trying to write that code. My strong hope is that the Scala folks will focus on fixing these bread-and-butter issues, so that those of us who just want to use programming languages are left with something that is a better all-around developer experience.

Here is a list of basic issues that hurt actual use of Scala for writing programs. These are pretty well known and much griped about elsewhere, but they bear repeating.

  • The compiler is slow. C++ slow. Waiting on a compiler is not a good way to spend your day.
  • This would be less of an issue except that IDE support is also weak. A lot of the advantages of static typing and rich libraries really surface when you have an IDE to help navigate, auto-complete, and continuously compile, and neither Eclipse nor IntelliJ is quite there yet. Progress has been made on Eclipse over the last few years, much of it recently, but there is a long way to go before it is the kind of tool you can work with seamlessly without the semi-regular pauses and crashes and spurious errors that are just really annoying in the kind of tool you stare at all day long.
  • The generated jar files are comically large.
  • There does seem to be some interest in release-to-release compatibility, but it is somewhat haphazard. Upgrades have been nightmarish for us because transitive dependencies between internal libraries, combined with non-compatibility, lead to a kind of “stop the company and everyone upgrade” firedrill each time we do it.
  • The error messages from the compiler are often deeply cryptic.
  • Comprehensible runtime stack traces and profiling are defeated by the many layers of Scala compiler magic.
  • There isn’t much in the way of a standard library, which means always relying on Java’s standard libraries. What is there often has some questionable parts (e.g. virtually any attempt to just instantiate a DoubleLinkedList in the shipped version of 2.8.0 seemed to throw a NullPointerException, which kind of implies no one tried running it). Java’s standard libraries are pretty awkward to use even in Java, and much more awkward when mixed in with all your pretty Scala code.
  • The documentation is pretty impenetrable. Unlike Java, type signatures in Scala don’t really explain idiomatic usage to mortals very well. Needless to say, the docs don’t have anything so pedestrian as examples. (Of course I would kill for the kind of user-annotated documentation that MySQL and PHP have).

Note how deeply unfancy these problems are. I think most programming time is taken up by this kind of deeply unfancy problem: renaming things, waiting for compiles, stupid errors, debugging, understanding libraries, etc. This is true whether you are working on the dreariest enterprise workflow thingy or the sexiest mobile social distributed bigdata cloud framework.

When it comes to Scala language development or community, I am barely even a bystander. I don’t follow the mailing list or contribute code, so I’m not really in a position to describe the values or focus of that community. Even this kind of commentary is a bit whiny. But from what I see the focus of Scala development seems to be on innovation. I am sympathetic. Innovation is fun. Scala folks seem to want to do things like macros, or actors, or parallel collections. Unfortunately my experience with Actors left me convinced we should stick with Doug Lea’s code for any production usage. Nor, frankly, is the lack of a parallel collections library the thing holding back Scala adoption by any stretch of the imagination. A parallel collections library is way high up on the Maslow hierarchy of programming needs; a reasonably complete, well-thought-out, bug-free, Scala-friendly I/O package in the standard library is much closer to “food, clothing, and shelter”.

Back to macros. Does Scala need them? I don’t know, maybe—I’m not knowledgeable enough to say. In most languages, I understand macros to provide two things: (1) forced code in-lining and (2) a kind of non-strict semantics or lazy evaluation of arguments. Scala already has the ability to do lazy evaluation of arguments with functions, and the JVM does make some reasonable attempt at on-the-fly code in-lining, so I don’t see the compelling need. But this is not my point. The Scala folks seem super smart, so I assume that Scala will be better off with macros and that surely I will understand why when they have done it. Rather, what I am asking is: why this, now? Given all the other issues, why keep adding deep, hard language features that must surely complicate fixing all the dreary, unsexy problems we poor Scala users already have?
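For what it’s worth, the lazy-argument half is already expressible without macros: Scala’s by-name parameters defer evaluation, and the same effect can be emulated in any language by passing a thunk. A minimal sketch of the idea (the function names here are made up for illustration, shown in Python rather than Scala):

```python
evaluations = []

def expensive_message():
    # stands in for an expensive computation we'd rather skip
    evaluations.append(1)
    return "debug detail"

def log_debug(enabled, message_thunk):
    # message_thunk plays the role of a by-name argument: it is only
    # evaluated if the log level actually requires it
    if enabled:
        print(message_thunk())

log_debug(False, expensive_message)  # expensive_message is never called
log_debug(True, expensive_message)   # evaluated exactly once, prints the detail
```

In Scala the thunk-passing is implicit (`def logDebug(enabled: Boolean, msg: => String)`), which is part of why the case for macros on these grounds is weak.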

I understand why no one wants to work on writing documentation or putting together a fast, rigorously regression-tested standard library. Those things are boring. It is reasonable that people who are smart enough to work on building a language like Scala don’t want to spend their time on that kind of thing.

My own experience is that you can only push the innovation envelope so far. You can club together maybe two or five or ten audacious new ideas, but at some point if you want to make something valuable you need to stop having new ideas and start fixing the ten thousand little things that prevent people from making use of your bold thoughts.

Does off-the-shelf machine learning need a benchmark?

I really like the blog post "Why Generic Machine Learning Fails" by Joe Reisinger of metamarkets. His point is that successful industrial applications of machine learning are usually not based on black box algorithms being blindly fed data; rather they come from humans thinking deeply about the nature of the problem and iteratively working with a data set and algorithm—fiddling with features, tuning parameters, loss functions, etc. Usually the results are dependent on domain-specific insights that are not transferable to any other problem. A related discussion on the specific importance of domain-specific insight versus modeling skill can be found here.

This take on the discipline of modeling isn’t algorithm-centric; instead, the person doing the modeling has access to a large bag of mathematical and algorithmic tricks and creatively applies them to solve the problem at hand. There may be little that is reusable between problems, because each trick has its pros and cons, and the tricks that work in one application are unlikely to be the tricks that work in another application. In school my adviser referred to this approach to prediction as “cooking”—namely experimenting with lots of ingredients until you get something that tastes good. I think this was meant to be both a little derogatory, in that cooking isn’t really based on a guiding theory, but also somewhat admiring, since cooking is a bit of an art (and he liked to cook).

The practical superiority of this “cooking approach” matches my experience as well. Most of the gains we have had in data and modeling problems I was involved in at LinkedIn came from creative feature engineering and from analyzing results to debug where our models did well and where they did poorly (and then brainstorming solutions). It’s easy to see why academia doesn’t focus on this: research should be broadly applicable, so domain-specific feature engineering isn’t very sexy. This isn’t only a matter of preference, either: in commercial settings there are certain advantages. Specifically, our data sets aren’t static. You often own the source of the data as well as the model, so you can always go get new independent predictors by collecting new things or creating new feedback loops. This is generally not the case for academic research, which works against fixed, static data sets by necessity. For a fixed, static data set, of course, the algorithm is the only thing that can be changed.

Since, after all, organizations often only have a few dozen key data problems, it is not too burdensome to do custom modeling work on each problem. In addition to producing better results, this is generally cheaper and simpler than building out generic infrastructure. Even for products which are very prediction-centric (e.g. recommendations) the actual model building portion is a small fraction of the overall effort of a production system, so a custom approach to modeling doesn’t add much to the work involved.

But—and here is what I really want to argue—even if the “cooking” approach to data problems is currently the best path for folks in industry, that doesn’t make it a good paradigm for machine learning and statistics research. After all one of the main motivations for machine learning as a discipline distinct from statistics was an interest in “black box” prediction without the bag of distribution-centric tricks and expertise that traditional statistical modeling favored. The end goal should be to automate the statistician as completely as possible, not just to add more tricks to her bag. I think this is an area where academia falls a little bit short.

A related point is discussed in a great opinion paper by Larry Wasserman (he of All of Statistics fame). The thesis of the paper is that although the primary distinction of the last decade or so in statistics has been frequentist versus Bayesian styles of modeling, a more important goal is achieving low-assumption high-dimensional modeling. “Assumptions” are, after all, what make a model fail to be “off-the-shelf” since each assumption requires checking and tweaking to make sure it is true and limits the approach to only certain applicable domains.

In some areas you can actually see things moving backwards. For example the fiddliness of Bayesian statistics is something that has always bothered me. On the theoretical side Bayesians have collapsed lots of disjoint heuristics into a core set of principles, but unfortunately the actual modeling practice advocated by most Bayesians seems to move further from off-the-shelf. Hierarchical modeling and priors and messing around with MCMC, and all of that, add to the assumptions and make the role of a good statistician conducting the analysis even more central. For example, I enjoy the fantastic walk-throughs of modeling problems in Andrew Gelman’s Bayesian Data Analysis, but clearly the conception of modeling there is of something done with a computer, not something done by a computer. Of course there is no reason this has to be true. Bayesian ideas can be used for black box approaches (though I gather that some Bayesians feel that non-informative priors and predictive checks are somewhat sacrilegious). But in the common practice of Bayesian modeling the statistician seems to play an even more central role than in traditional statistics.

Even for the fairly “off-the-shelf” algorithms we do have, there are numerous practical obstacles that prevent anyone but an expert from making effective use of them. These algorithms often go wrong and need a fair amount of hand-holding to produce good performance. The more complex the algorithm, the less likely you will be able to effectively do this analysis and debugging. I think the existence of a good set of theory and tools for debugging models is what, more than anything else, has led to the pervasive success in industry of simple models like logistic regression, and the relative rarity of a lot of what you would find in, say, a recent ICML publication. Part of the problem is that in publication most algorithms are kept as simple as possible to simplify presentation, but this tends to punt a large number of practical problems into hyper-parameters and other tricks that have to be guessed by magic or set via cross-validation. Likewise, producing a fast and robust implementation that converges quickly and reliably without manual tweaking can be an art form in and of itself, as can debugging any numerical algorithm. (For example, I love that there is a book called “Numerical Methods that Usually Work”.) In other words, there is an engineering aspect to the problem in addition to the research, and this is often not really common knowledge.

As an example, pick up almost any introductory machine learning book and read the description of nearest neighbor algorithms. Few actually describe a procedure that will work well on real data. Simple nearest neighbor prediction can do very well on certain kinds of problems, but it has a number of problems that need to be fixed when it is used with any common distance like the L2 norm. First, the scale of the feature values matters, so input data has to be scaled. Say you are trying to classify men and women using height and weight: if height is in feet and weight in pounds, weight will dominate the distance. Likewise it is not tolerant of irrelevant features or features with non-uniform predictive value, since the L2 norm weights all features equally, with no attempt to weight more predictive features more heavily. That is, if we add a weakly predictive feature, predictive performance may get worse. Both of these problems are pretty obvious and can easily be fixed, but the key point is that the algorithm as most commonly taught doesn’t actually work that well. For more sophisticated algorithms the fixes become less obvious.
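To make the height/weight example concrete, here is a hypothetical sketch of 1-nearest-neighbor with and without feature standardization (pure Python; the data is invented, and this makes no claim about the best fix in practice):

```python
import math

def standardize(rows):
    # rescale each column to zero mean and unit variance
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]

    def scale(row):
        return [(v - m) / s for v, m, s in zip(row, means, stds)]

    return [scale(r) for r in rows], scale

def nn_label(train_x, train_y, query):
    # 1-NN under the plain L2 norm (squared distances preserve the ordering)
    d2 = [sum((a - b) ** 2 for a, b in zip(row, query)) for row in train_x]
    return train_y[d2.index(min(d2))]

# height in feet, weight in pounds
people = [[6.1, 190.0], [5.9, 180.0],   # men
          [5.3, 130.0], [5.4, 125.0]]   # women
labels = ["m", "m", "f", "f"]
query = [6.0, 145.0]                    # tall but light

unscaled = nn_label(people, labels, query)
# weight dominates the raw distance, so the nearest neighbor is a woman
scaled_rows, scale = standardize(people)
rescaled = nn_label(scaled_rows, labels, scale(query))
# with comparable scales, height carries weight and the answer flips
```

The same query classifies differently purely because of the units the features happen to be measured in, which is exactly the failure the introductory description omits.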

This is unfortunate. I think there is a lot to be said for turning ideas into infrastructure. By infrastructure I mean something with well known characteristics, something you can rely on without thinking about the internals too much. A classic example would be the socket API which abstracts much of the gory details of networking from the system programmer. A programmer can be fairly successful writing network programs with a pretty naive understanding of how physical networks actually work. This kind of intellectual infrastructure is what makes the widespread use of deep, hard ideas possible, and I think not enough has been done for predictive machine learning to make it usable in this fashion.

Ignoring the nasty little details of a good implementation is something that made a lot more sense ten or fifteen years ago when machine learning was more a theoretical pursuit, just a more grounded offshoot of AI research. These days it is big business, and people have a practical and pressing need for algorithms that work.

Of course, no matter how successful researchers are in creating off-the-shelf algorithms, there will always be room for custom modeling to do better. Still, I think there is a lot to be said for algorithms that are cheap, easy to use, and predictable even if they perform worse than the best thing an expert could devise. Think of this as Ikea instead of high-end hand-crafted furniture: everyone agrees the hand-made stuff is better, but most people don’t want to pay for it in most cases.

Of course, no algorithm, no matter how sophisticated, can add new independent predictors not given to it as input. But I think guessing features is hardly the core difficulty.

In any case, thinking about all of this made me wonder if one of the reasons for lack of progress in this area is the absence of any kind of standard benchmark for “off-the-shelf” predictive performance. Typically in publication an algorithm is presented with comparisons to a few other state-of-the-art (or at least well known) algorithms on 2-5 common benchmark data sets. MNIST or what have you. There are a number of well known problems with this. First, the data sets are often chosen to put the algorithm’s best foot forward, showing only problems on which performance is good. It may well be that the algorithm is worse than the state of the art on many or most other problems, but there is no way to tell this from the paper. Second, computational performance and scalability information for larger data sets is often missing. This is a critical factor for commercial applicability—small improvements in predictive performance are not always worth crippling performance challenges. And of course predictive performance itself is a function of the volume of training data—the most accurate algorithm with a training set of ten thousand is often not the most accurate when run with a training set of ten million. Finally, and perhaps most importantly, there may not be a reference implementation of the algorithm available to test against, so the comparison algorithm in a publication may have a fairly naive implementation.

I wonder if most of these problems couldn’t be solved by a good standard benchmark and some reporting infrastructure. I have always found that groups of people work better when there is a clear, plausible metric for success. It isn’t possible to have a single data set that is representative of all problems, so I think the best approach is something that just throws many, many data sets at the algorithm and breaks down the results along various dimensions. What I am thinking is something along the lines of the computer language shootout. This would be a setup that allows people to submit algorithm implementations and have them run against a variety of data sets. These days one could easily set up some scripts that could run tests on EC2, making it fairly easy to run hundreds of algorithms on thousands of data sets for a modest cost. One might object that it would be hard to choose a representative sample of data sets, but I think as long as they represent real problems of interest, and as long as I can filter results to just problems with technical similarity to mine (say multiclass prediction, or high dimensional problems, or what have you), this would be fine.

In addition to predictive accuracy, it would be nice to have some measurement of efficiency and scalability in terms of actual CPU time. This would encourage efficient implementations to develop, which are often something of an art in themselves.

The infrastructure to run this would actually be useful for research as well since it would provide an easy test harness to run your own algorithm, a bunch of data sets for predictive problems in a standardized format, and a reference implementation for other algorithms.

The other part to this would be good metadata and reporting to understand how the algorithms perform in different scenarios. There are a lot of things you may want to know about an algorithm. How does performance change with the number of training examples? How well does it handle multiclass (if at all)? Missing features? Low probability classes? High dimensions? Outliers or mislabeled data? Categorical variables? Useless features? Furthermore what is the cost (in CPU seconds) per training example for a good implementation? What is the cost per prediction? There would likely need to be a few different benchmark categories for classification and regression, and perhaps one could argue for some special categories for structured problems like text and images.

A few thoughts on implementation. The framework should be open source and available for anyone to inspect and improve, as with anything in research. I don’t think the idea has much commercial value, so I doubt there would be any objection to this. It would be well suited to running on AWS. EC2 would make it easy to get hold of lots of computer time in an elastic fashion, and would handle some of the storage problems. Likewise it would allow a setup where the algorithm developer pays the cost of benchmarking through their own account. This is the only way to easily run other people’s code without fear of an infinite loop or a malicious user tying up your machines. It would also make it easy to parallelize the execution by running lots of tests at the same time. Having a high-quality driver that could run a test algorithm on hundreds of data sets in parallel and break down the results would be valuable in and of itself.

One would want the test framework to assume as little as possible about the implementation of the algorithm. I think all you would specify would be two driver scripts, train.sh and test.sh, that take the data set as input; these scripts could call whatever kind of binary they wanted. This would give people the freedom to use whatever programming languages or libraries they like without too many integration hassles.

It is important that the framework enforce a purely black-box approach: any tuning or hyper-parameters would have to be hard-coded for all data sets or set automatically by the algorithm itself as part of its training. If algorithm implementers want to set custom hyper-parameters for each data set, they will need some heuristic to do so, or else do some kind of brute-force search with cross-validation.
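The brute-force cross-validation route is itself mechanical; a hypothetical sketch in plain Python (`fit_predict` is a stand-in for whatever the submitted algorithm does, and the toy threshold classifier at the end is invented for illustration):

```python
def folds(n, k):
    # split indices 0..n-1 into k contiguous validation folds
    size = n // k
    for i in range(k):
        val = list(range(i * size, (i + 1) * size if i < k - 1 else n))
        val_set = set(val)
        train = [j for j in range(n) if j not in val_set]
        yield train, val

def select_by_cv(xs, ys, candidates, fit_predict, k=5):
    """Pick the candidate hyper-parameter with the lowest mean CV error."""
    best, best_err = None, float("inf")
    for cand in candidates:
        fold_errs = []
        for tr, va in folds(len(xs), k):
            preds = fit_predict(cand,
                                [xs[i] for i in tr], [ys[i] for i in tr],
                                [xs[i] for i in va])
            fold_errs.append(sum(p != ys[i] for p, i in zip(preds, va)) / len(va))
        err = sum(fold_errs) / len(fold_errs)
        if err < best_err:
            best, best_err = cand, err
    return best

# Toy use: pick a decision threshold for a one-feature classifier.
xs = [i / 10 for i in range(10)]          # 0.0 .. 0.9
ys = [1 if x > 0.5 else 0 for x in xs]
stump = lambda t, xtr, ytr, xva: [1 if x > t else 0 for x in xva]
best_t = select_by_cv(xs, ys, [0.2, 0.5, 0.8], stump)
```

The framework never learns which value was chosen or why; from its side the hyper-parameter search is just part of training, which is what keeps the comparison honest.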

You would want to keep metadata on each algorithm recording its run time and performance on each data set, as well as a browser for the source code and wikis giving a higher-level overview and links to applicable papers.

There are lots of other little implementation details. You would want a nice UI with a leaderboard. You would want to attach facets that describe the problem for each data set and algorithm (e.g. number of classes, training set size, etc) so that results could be broken down appropriately in reporting.

I am not aware of anyone doing anything quite like this. Kaggle is a very cool company that provides machine learning contests as a way to outsource predictive analytics. Their focus seems to be on single problem contests and I don’t think they supply the infrastructure to run the algorithm. Maybe they could be convinced to provide past challenge data sets as benchmark data?

Primarily, I think the reason this hasn’t already been done is that there is no publishable work involved, just mucky scripting and munging data sets into a common format. Second, AWS is fairly new, and I think it simplifies a lot of the problems of making such a framework available for use without the security and operational headaches of trying to let people run code on machines you own and operate.

Postscript: A few people pointed me towards a version of this idea that actually already exists. It is called mlcomp.org. Gotta love the internet.