Getting Real About Distributed System Reliability

There is a lot of hype around distributed data systems, some of it justified. It’s true that the internet has centralized a lot of computation onto services like Google, Facebook, Twitter, LinkedIn (my own employer), and other large web sites. It’s true that this centralization puts a lot of pressure on system scalability for these companies. It’s true that incremental and horizontal scalability is a deep feature that requires redesign from the ground up and can’t be retrofitted onto existing products. It’s true that, if properly designed, these systems can be run with no planned downtime or maintenance intervals in a way that traditional storage systems make harder. It’s also true that software that is explicitly designed to deal with machine failures is a very different thing from traditional infrastructure. All of these properties are critical to large web companies, and are what drove the adoption of horizontally scalable systems like Hadoop, Cassandra, Voldemort, etc. I was the original author of Voldemort and have worked on distributed infrastructure for the last four years or so. So insofar as there is a “big data” debate, I am firmly in the “pro-” camp. But one thing you often hear is that this kind of software is more reliable than the traditional alternatives it replaces, and this just isn’t true. It is time people talked honestly about this.

You hear this assumption of reliability everywhere. Now that scalable data infrastructure has a marketing presence, it has really gotten bad. If Hadoop or Cassandra or what-have-you can tolerate machine failures, then they must be unbreakable, right? Wrong.

Where does this come from? Distributed systems need to partition data or state across lots of machines to scale. Adding machines increases the probability that some machine will fail, and to address this these systems typically have some kind of replicas or other redundancy to tolerate failures. The core argument that gets used for these systems is that if a single machine has probability P of failure, and if the software can replicate data N times to survive N-1 failures, and if the machines fail independently, then the probability of losing a particular piece of data must be P^N. So for any desired reliability R and any single-node failure probability P you can pick some replication factor N so that P^N < R. This argument is the core motivation behind most variations on replication and fault tolerance in distributed systems. It is true that without this property the system would be hopelessly unreliable as it grew. But this leads people to believe that distributed software is somehow innately reliable, which unfortunately is utter hogwash.

Where is the flaw in the reasoning? Is it the dreaded Hadoop single points of failure? No, it is far more fundamental than that: the problem is the assumption that failures are independent. Surely no belief could possibly be more counter to our own experience or just common sense than believing that there is no correlation between failures of machines in a cluster. You take a bunch of pieces of identical hardware, run them on the same network gear and power systems, have the same people run and manage and configure them, and run the same (buggy) software on all of them. It would be incredibly unlikely that the failures on these machines would be independent of one another in the probabilistic sense that motivates a lot of distributed infrastructure. If you see a bug on one machine, the same bug is on all the machines. When you push bad config, it is usually game over no matter how many machines you push it to.
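
To put rough numbers on this (the per-machine failure probability and the correlated term below are invented purely for illustration), here is a tiny Scala sketch of the idealized P^N calculation next to a version that adds a single correlated failure mode:

    // Idealized argument: data is lost only if all N independent replicas fail.
    def independentLoss(p: Double, n: Int): Double = math.pow(p, n)

    // Add one correlated failure mode (a bad config push, a shared bug) that
    // takes out every replica together with probability c. Loss probability is
    // then c + (1 - c) * p^N, and c dominates no matter how large N gets.
    def correlatedLoss(p: Double, n: Int, c: Double): Double =
      c + (1 - c) * math.pow(p, n)

    // p = 1% per replica, c = 0.1% correlated; both numbers made up.
    (1 to 4).foreach { n =>
      println(f"N=$n independent=${independentLoss(0.01, n)}%.8f " +
              f"correlated=${correlatedLoss(0.01, n, 0.001)}%.8f")
    }

With these made-up numbers, adding replicas stops helping almost immediately once the correlated term enters the picture.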

P^N is an upper bound on reliability, but one that you could never, never approach in practice. For example, Google has a fantastic paper that gives empirical numbers on system failures in Bigtable and GFS, and it reports groups of correlated failures occurring at rates several orders of magnitude higher than the independence assumption would predict. This is what one of the best systems and operations teams in the world can get: your numbers may be far worse.

The actual reliability of your system depends largely on how bug free it is, how good you are at monitoring it, and how well you have protected against the myriad issues and problems it has. This isn’t any different from traditional systems, except that the new software is far less mature. I don’t mean this disparagingly; I work in this area, and it is just a fact. Maturity comes with time and usage and effort. This software hasn’t been around for as long as MySQL or Oracle, and worse, the expertise to run it reliably is much less common. MySQL and Oracle administrators are plentiful, but folks with, say, serious production ZooKeeper operations experience are much rarer.

Kernel filesystem engineers say it takes about a decade for a new filesystem to go from concept to maturity. I am not sure these systems will be mature much faster—they are not easier systems to build and the fundamental design space is much less well explored. This doesn’t mean they won’t be useful sooner, especially in domains where they solve a pressing need and are approached with an appropriate amount of caution, but they are not yet by any means a mature technology.

Part of the difficulty is that distributed system software is actually quite complicated in comparison to single-server code. Code that deals with failure cases and is “cluster aware” is extremely tricky to get right. The root of the problem is that dealing with failures effectively explodes the possible state space that needs testing and validation. For example, it doesn’t even make sense to expect a single-node database to be fast if its disk system suddenly gets really slow (how could it be?), but a distributed system does need to carry on in the presence of a single degraded machine, because with so many machines, one is sure to be degraded. These kinds of “semi-failures” are common and very hard to deal with. Correctly testing these kinds of issues in a realistic setting is brutally hard, and the newer generation of software doesn’t have anything like the QA processes its more mature predecessors had. (If you get a chance, get someone who has worked at Oracle to describe the kind of testing a line of code goes through before it gets shipped to customers.) As a result there are a lot of bugs. And of course these bugs are on all the machines, so they absolutely happen together.

Likewise distributed systems typically require more configuration and more complex configuration because they need to be cluster aware, deal with timeouts, etc. This configuration is, of course, shared; and this creates yet another opportunity to bring everything to its knees.

And finally these systems usually want lots of machines. And no matter how good you are, some part of operational difficulty always scales with the number of machines.

Let’s discuss some real issues. We had a bug in Kafka recently that led to the server incorrectly interpreting a corrupt request as a corrupt log, and shutting itself down to avoid appending to a corrupt log. Single-machine log corruption is the kind of thing that should only happen due to a disk error, and bringing down the corrupt node is the right behavior—it shouldn’t happen on all the machines at the same time unless all the disks fail at once. But since this was due to corrupt requests, and since we had one client that sent corrupt requests, it was able to sequentially bring down all the servers. Oops. Another example is this Linux bug which causes the system to crash after ~200 days of uptime. Since machines are commonly restarted sequentially, this led to a situation where a large percentage of machines went hard down one after another. Likewise any memory management problems—either leaks or GC problems—tend to happen everywhere at once or not at all. Some companies do public post-mortems for major failures and these are a real wealth of examples of failures in systems that aren’t supposed to fail. This paper has an excellent summary of HDFS availability at Yahoo—they note how few of the problems are of the kind that high availability for the namenode would solve. This list could go on, but you get the idea.

I have come around to the view that the real core difficulty of these systems is operations, not architecture or design. Both are important, but good operations can often work around the limitations of bad (or incomplete) software, whereas good software cannot run reliably with bad operations. This is quite different from the view of unbreakable, self-healing, self-operating systems that I see being pitched by the more enthusiastic NoSQL hypesters. Worse yet, you can’t easily buy good operations in the same way you can buy good software—you might be able to hire good people (if you can find them), but this is about more than just people; it is practices, monitoring systems, configuration management, etc.

These difficulties are one of the core barriers to adoption for distributed data infrastructure. LinkedIn and other companies that have a deep interest in doing creative things with data have taken on the burden of building this kind of expertise in-house—we employ committers on Hadoop and other open source projects on both our engineering and operations teams, and have done a lot of from-scratch development in this space where there were gaps. This makes it feasible to take full advantage of an admittedly valuable but immature set of technologies, and lets us build products we couldn’t otherwise—but this kind of investment only makes sense at a certain size and scale. It may be too high a cost for small startups or companies outside the web space trying to bootstrap this kind of knowledge inside a more traditional IT organization.

This is why people should be excited about things like Amazon’s DynamoDB. When DynamoDB was released, the company DataStax that supports and leads development on Cassandra released a feature comparison checklist. The checklist was unfair in many ways (as these kinds of vendor comparisons usually are), but the biggest thing missing in the comparison is that you don’t run DynamoDB, Amazon does. That is a huge, huge difference. Amazon is good at this stuff, and has shown that they can (usually) support massively multi-tenant operations with reasonable SLAs, in practice.

I really think there is only one thing to talk about with respect to reliability: continuous hours of successful production operations. That’s it. In some ways the most obvious thing, but not typically what you hear when people talk about these systems. I will believe the system can tolerate (some) failures when I see it tolerate those failures; I will believe it can run for a year without downtime when I see it run for a year without downtime. I call this empirical reliability (as opposed to theoretical reliability). And getting good empirical reliability is really, really hard. These systems end up being large hunks of monitoring, tests, and operational procedures with a teeny little distributed system strapped to the back.

You see this showing up in discussions of the CAP theorem all the time. The CAP theorem is a useful thing, but it applies more to system design than system implementations. A design is simple enough that you can maybe prove it provides consistency or tolerates partition failures under some assumptions. This is a useful lens to look at system designs. You can’t hope to do this kind of proof with an actual system implementation, the thing you run. The difficulty of building these things means it is really unthinkable that these systems are, in actual reality, either consistent, available, or partition tolerant—they certainly all have numerous bugs that will break each of these guarantees. I really like this paper that takes the approach of actually trying to calculate the observed consistency of eventually consistent systems—they seem to do it via simulation rather than measurement, which is unfortunate, but the idea is great.
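
You can get a feel for that kind of calculation with a toy simulation (this is my own sketch to illustrate the idea, not the method the paper uses): assume a write propagates to each replica with some random delay, and estimate how often a read issued shortly afterwards still sees the old value.

    import scala.util.Random

    // Toy model: a write propagates to each of R replicas with an independent,
    // exponentially distributed delay (mean lag in ms). A read issued t ms
    // after the write hits a uniformly random replica and is "consistent" if
    // that replica has already received the write.
    def consistentReadProbability(replicas: Int, meanLagMs: Double,
                                  readDelayMs: Double, trials: Int = 100000): Double = {
      val rng = new Random(42)
      def expSample(): Double = -meanLagMs * math.log(1.0 - rng.nextDouble())
      val hits = (1 to trials).count { _ =>
        val lags = Array.fill(replicas)(expSample())
        lags(rng.nextInt(replicas)) <= readDelayMs
      }
      hits.toDouble / trials
    }

    // e.g. 3 replicas with a 10ms mean replication lag: how consistent are
    // reads issued 5ms after the write? (Numbers are invented for illustration.)
    println(consistentReadProbability(replicas = 3, meanLagMs = 10.0, readDelayMs = 5.0))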

It isn’t that system design is meaningless. It is worth discussing, because design acts as a kind of limiting factor on certain aspects of reliability and performance as the implementation matures and improves. Just don’t take it too seriously as guaranteeing anything.

So why isn’t this kind of empirical measurement more talked about? I don’t know. My pet theory is it has to do with the somewhat rigid and deductive mindset of classical computer science. This is inherited from pure math, and conflicts with the attitude in scientific disciplines. It leads to a preference for starting with axioms, and then proving various properties that follow from these axioms. This world view doesn’t embrace the kind of empirical measurements you would expect to justify claims about reality (for another example see this great blog post on programming language productivity claims in programming language research). But that is getting off topic. Suffice it to say, when making predictions about how a system will work in the real world I believe in measurements of reality a lot more than arguments from first-principles.

I think we should insist on a little more rigor and empiricism in this area.

I would love to see claims in academic publication around practicality or reliability justified in the same way we justify performance claims—by doing it. I would be a lot more likely to believe an academic distributed system was practically feasible if it was run continuously under load for a year successfully and if information was reported on failures and outages. Maybe that isn’t feasible for an academic project, but few other allegedly scientific academic disciplines can get away with making claims about reality without evidence.

More broadly I would like to see more systems whose operation is verifiable. Many systems have the ability to log out information about their state in a way that makes programmatic checking of various invariants and guarantees possible (such as consistency or availability). An example of such an invariant for a messaging system is that all the messages sent to it are received by all the subscribers. We actually measure the truth of this statement in real time in production for our Kafka data pipeline, for all 450 topics and all their subscribers. The number of corner cases one uncovers with this kind of check, run across a few hundred use cases and a few billion messages per day, is truly astounding. I think this is a special case of a broad class of verification that can be done on the running system and that goes far, far deeper than what is traditionally considered either monitoring or testing. Call it unit testing in production, or self-proving systems, or next-generation monitoring, or whatever, but this kind of deep verification is what turns the theoretical claims a design makes into measured properties of the real running system.
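
To make the idea concrete, here is a minimal sketch of what such an invariant check might look like (an illustration of the general approach, not our actual audit code; the data layout and the tolerance knob are assumptions of mine): periodically compare the count of messages published to each topic against the count each subscriber reports having consumed, and flag any persistent shortfall.

    // Sketch of a production invariant check for a messaging pipeline: every
    // message published to a topic should eventually be consumed by every
    // subscriber of that topic. Counts are assumed to be reported periodically
    // by producers and consumers (e.g. per-minute buckets).
    case class Counts(published: Long, consumedBySubscriber: Map[String, Long])

    // Returns, per topic, the subscribers that are missing messages, ignoring
    // gaps smaller than `tolerance` to allow for messages still in flight.
    def findViolations(countsByTopic: Map[String, Counts],
                       tolerance: Long = 0L): Map[String, Map[String, Long]] =
      countsByTopic
        .map { case (topic, c) =>
          topic -> c.consumedBySubscriber.collect {
            case (subscriber, consumed) if c.published - consumed > tolerance =>
              subscriber -> (c.published - consumed)
          }
        }
        .filter { case (_, lagging) => lagging.nonEmpty }

    // Example: one subscriber of a hypothetical "page-views" topic is 25 short.
    val counts = Map(
      "page-views" -> Counts(1000000L, Map("hadoop-etl" -> 1000000L, "search-indexer" -> 999975L)))
    println(findViolations(counts)) // Map(page-views -> Map(search-indexer -> 25))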

Likewise if you have a “NoSQL vendor” I think it is reasonable to ask them to provide hard information on customer outages. They don’t need to tell you who the customer is, but they should let you know the observed real-life distribution of MTTF and MTTR they are able to achieve, not just highlight one or two happy cases. Make sure you understand how they measure this: do they run an automated test load, or do they just wait for people to complain? This is a reasonable thing for people paying for a service to ask for. To a certain extent, if they provide this kind of empirical data it isn’t clear why you should even care what their architecture is beyond intellectual curiosity.

Distributed systems solve a number of key problems at the heart of scaling large websites. I absolutely think this is going to become the default approach to handling state in internet application development. But no one benefits from the kind of irrational exuberance that currently surrounds the “big data” or NoSQL systems communities. This area is experiencing a boom—but nothing takes us from boom to bust like unrealistic hype and broken promises. We should be a little more honest about where these systems already shine and where they are still maturing.

Scala Macros: “Oh God Why?”

Since this is now being read by people I don’t know, let me add this preface. Scala is one of the better things happening in programming. Any complaints come from the fact that I want it to work as more than a playground for innovation in language design and I think there is always a tension between innovation and refinement. Refutations from Martin Odersky, surely one of the world’s more calm and reasonable people, and Havoc Pennington provide a good counter-balance.

I saw this conversation on twitter:

This was my reaction to the Scala macros proposal too. Not because there is anything necessarily bad about macros or the proposal, but just because—is this really the most critical thing? For context Coda Hale works at Yammer, Stu Hood works at Twitter, Ted Nyman works at (Bank) Simple, and I work at LinkedIn. Each of these companies is trying to write stuff in Scala and so we have a high investment in Scala working out as a language that is good for writing production code (not just as a language that tests out new ideas, or that shows solidarity with the functional programming movement, or that acts as a signaling mechanism to tell potential applicants you are hipper than the average Java programmer).

I think the Scala folks have done a good job of putting together something that is significantly better than Java as a language. It’s fast, concise, statically type-checked, and inherits a lot of the good operational properties of the JVM. But languages aren’t just, well, languages; and they aren’t even runtime environments for that matter. They are platforms and tool chains and libraries and bodies of documentation, and cultures, and practices. And here Scala does not yet do so well.

My personal experience with Scala came from working on Apache Kafka, which is a distributed messaging system that serves as LinkedIn’s data pipeline. So I have been working in Scala for the last few years. I still think using Scala was a good decision for a hard, stand-alone piece of software like this, and I am glad we did. Scala fixes all kinds of nastiness in the Java language. But it isn’t all roses; Scala has all kinds of gaps everywhere else. You end up with very pretty code, but gobs of practical annoyances in trying to write that code. My strong hope is that the Scala folks will focus on fixing these bread-and-butter issues, so that those of us who just want to use programming languages will be left with something that is a better all-around developer experience.

Here is a list of basic issues that hurt actual use of Scala for writing programs. These are pretty well known and much griped about elsewhere, but it bears repeating.

  • The compiler is slow. C++ slow. Waiting on a compiler is not a good way to spend your day.
  • This would be less of an issue except that IDE support is also weak. A lot of the advantages of static typing and rich libraries really surface when you have an IDE to help navigate, auto-complete, and continuously compile, and neither Eclipse nor IntelliJ is quite there yet. Progress has been made on Eclipse over the last few years, much of it recently, but there is a long way to go before it is the kind of tool you can work with seamlessly without the semi-regular pauses and crashes and spurious errors that are just really annoying in the kind of tool you stare at all day long.
  • The generated jar files are comically large.
  • There does seem to be some interest in release-to-release compatibility but it is somewhat haphazard. Upgrades have been nightmarish for us because transitive dependencies between internal libraries, combined with incompatibility, lead to a kind of “stop the company and everyone upgrade” fire drill each time we do it.
  • The error messages from the compiler are often deeply cryptic.
  • Comprehensible runtime stack traces and profiling are defeated by the many layers of Scala compiler magic.
  • There isn’t much in the way of a standard library, which means always relying on Java’s standard libraries. What is there often has some questionable parts (e.g. virtually any attempt to instantiate a DoubleLinkedList in the shipped version of 2.8.0 seemed to throw a NullPointerException, which kind of implies no one tried running it). Java’s standard libraries are pretty awkward to use even in Java and much more awkward when mixed in with all your pretty Scala code.
  • The documentation is pretty impenetrable. Unlike Java, type signatures in Scala don’t really explain idiomatic usage to mortals very well. Needless to say, the docs don’t have anything so pedestrian as examples. (Of course I would kill for the kind of user-annotated documentation that MySQL and PHP have).

Note how deeply unfancy these problems are. I think most programming time is taken up by this kind of deeply unfancy problem: renaming things, waiting for compiles, stupid errors, debugging, understanding libraries, etc. This is true whether you are working on the dreariest enterprise workflow thingy or the sexiest mobile social distributed bigdata cloud framework.

When it comes to Scala language development or community, I am barely even a bystander. I don’t follow the mailing list or contribute code, so I’m not really in a position to describe the values or focus of that community. Even this kind of commentary is a bit whiny. But from what I see, the focus of Scala development seems to be on innovation. I am sympathetic. Innovation is fun. Scala folks seem to want to do things like macros, or actors, or parallel collections. Unfortunately my experience with Actors left me convinced we should stick with Doug Lea’s code for any production usage. Neither, frankly, is the lack of a parallel collection library the thing holding back Scala adoption by any stretch of the imagination. A parallel collections library is way high up on the Maslow hierarchy of programming needs; a reasonably complete, well-thought-out, bug-free, Scala-friendly I/O package in the standard library is much closer to “food, clothing, and shelter”.

Back to macros. Does Scala need them? I don’t know, maybe—I’m not knowledgeable enough to say. In most languages, I understand macros to provide two things: (1) forced code in-lining and (2) a kind of non-strict semantics or lazy evaluation of arguments. Scala already has the ability to do lazy evaluation of arguments with by-name parameters and functions, and the JVM does make some reasonable attempt at on-the-fly code in-lining, so I don’t see the compelling need. But this is not my point. The Scala folks seem super smart, so I assume that Scala will be better off with macros and that surely I will understand why when they have done it. Rather, what I am asking is: why this, now? Given all the other issues, why keep on adding deep, hard language features that must certainly complicate fixing all the dreary unsexy problems we poor Scala users already have?
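
For what it’s worth, here is a small sketch of the by-name parameter mechanism just mentioned; it covers a lot of what lazy-argument macros are used for in other languages:

    // A by-name parameter (=> String) is not evaluated at the call site; the
    // body decides if and when to evaluate it.
    def debug(message: => String, enabled: Boolean): Unit =
      if (enabled) println(message)

    // The expensive interpolation below is never executed when debugging is
    // off, even though it appears at the call site.
    debug(s"state dump: ${(1 to 1000000).sum}", enabled = false)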

I understand why no one wants to work on writing documentation or putting together a fast, rigorously regression-tested standard library. Those things are boring. It is reasonable that people who are smart enough to work on building a language like Scala don’t want to spend their time on that kind of thing.

My own experience is that you can only push the innovation envelope so far. You can club together maybe two or five or ten audacious new ideas, but at some point if you want to make something valuable you need to stop having new ideas and start fixing the ten thousand little things that prevent people from making use of your bold thoughts.

Does off-the-shelf machine learning need a benchmark?

I really like the blog post "Why Generic Machine Learning Fails" by Joe Reisinger of metamarkets. His point is that successful industrial applications of machine learning are usually not based on black box algorithms being blindly fed data; rather they come from humans thinking deeply about the nature of the problem and iteratively working with a data set and algorithm—fiddling with features, tuning parameters, loss functions, etc. Usually the results are dependent on domain-specific insights that are not transferable to any other problem. A related discussion on the specific importance of domain-specific insight versus modeling skill can be found here.

This take on the discipline of modeling isn’t algorithm-centric; instead, the person doing the modeling basically has access to a large bag of mathematical and algorithmic tricks and creatively applies them to solve the problem at hand. There may be little that is reusable between problems, because each trick has its pros and cons, and the tricks that work in one application are unlikely to be the tricks that work in another application. In school my adviser referred to this approach to prediction as “cooking”—namely, experimenting with lots of ingredients until you get something that tastes good. I think this was meant to be a little derogatory, in that cooking isn’t really based on a guiding theory, but also somewhat admiring, since cooking is a bit of an art (and he liked to cook).

The practical superiority of this “cooking approach” matches my experience as well. Most of the gains we have had in the data and modeling problems I was involved in at LinkedIn came from creative feature engineering and from analyzing results to debug where our models did well and where they did poorly (and then brainstorming solutions). It’s easy to see why academia doesn’t focus on this: research should be broadly applicable, so domain-specific feature engineering isn’t very sexy. This isn’t only a matter of preference, either; in commercial settings there are certain advantages. Specifically, our data sets aren’t a static thing. You often own the source of the data as well as the model, so you can always go and get new independent predictors by collecting new things or creating new feedback loops. This is generally not the case for academic research, which works against fixed, static data sets by necessity. For a fixed, static data set, of course, the algorithm is the only thing that can be changed.

Since, after all, organizations often only have a few dozen key data problems, it is not too burdensome to do custom modeling work on each problem. In addition to producing better results, this is generally cheaper and simpler than building out generic infrastructure. Even for products which are very prediction-centric (e.g. recommendations) the actual model building portion is a small fraction of the overall effort of a production system, so a custom approach to modeling doesn’t add much to the work involved.

But—and here is what I really want to argue—even if the “cooking” approach to data problems is currently the best path for folks in industry, that doesn’t make it a good paradigm for machine learning and statistics research. After all one of the main motivations for machine learning as a discipline distinct from statistics was an interest in “black box” prediction without the bag of distribution-centric tricks and expertise that traditional statistical modeling favored. The end goal should be to automate the statistician as completely as possible, not just to add more tricks to her bag. I think this is an area where academia falls a little bit short.

A related point is discussed in a great opinion paper by Larry Wasserman (he of All of Statistics fame). The thesis of the paper is that although the primary distinction of the last decade or so in statistics has been frequentist versus Bayesian styles of modeling, a more important goal is achieving low-assumption high-dimensional modeling. “Assumptions” are, after all, what make a model fail to be “off-the-shelf” since each assumption requires checking and tweaking to make sure it is true and limits the approach to only certain applicable domains.

In some areas you can actually see things moving backwards. For example the fiddliness of Bayesian statistics is something that has always bothered me. On the theoretical side Bayesians have collapsed lots of disjoint heuristics into a core set of principles, but unfortunately the actual modeling practice advocated by most Bayesians seems to move further from off-the-shelf. Hierarchical modeling and priors and messing around with MCMC, and all of that, add to the assumptions and make the role of a good statistician conducting the analysis even more central. For example, I enjoy the fantastic walk-throughs of modeling problems in Andrew Gelman’s Bayesian Data Analysis, but clearly the conception of modeling there is of something done with a computer, not something done by a computer. Of course there is no reason this has to be true. Bayesian ideas can be used for black box approaches (though I gather that some Bayesians feel that non-informative priors and predictive checks are somewhat sacrilegious). But in the common practice of Bayesian modeling the statistician seems to play an even more central role than in traditional statistics.

Even for the fairly “off-the-shelf” algorithms we do have, there are numerous practical obstacles that prevent anyone but an expert from making effective use of them. These algorithms often go wrong and need a fair amount of hand-holding to produce good performance. The more complex the algorithm, the less likely you will be able to effectively do this analysis and debugging. I think the existence of a good set of theory and tools for debugging models is what, more than anything else, has led to the pervasive success in industry of simple models like logistic regression and the relative rarity of a lot of what you would find in, say, a recent ICML publication. Part of the problem is that in publication most algorithms are kept as simple as possible to simplify presentation, but this tends to punt a large number of practical problems into hyper-parameters and other tricks that have to be guessed by magic or set via cross-validation. Likewise, getting a fast and robust implementation of the algorithm that converges quickly and reliably without manual tweaking, and debugging any numerical algorithm at all, can be an art form in and of itself. (For example, I love that there is this book called "Numerical Algorithms that Usually Work"). In other words, there is an engineering aspect to the problem in addition to the research, and this is often not really common knowledge.

As an example, pick up almost any introductory machine learning book and read the description of nearest neighbor algorithms. Few actually describe a procedure that will work well on real data. Simple nearest neighbor prediction can do very well on certain kinds of problems, but it has a number of problems that need to be fixed when used with any common distance like the L2 norm. First, the scale of the feature values matters, so input data has to be scaled. Say you are trying to classify men and women using height and weight. If height is in feet and weight in pounds, weight will dominate the distance. Second, it is not tolerant of irrelevant features or features with non-uniform predictive value, since the L2 norm weights all features equally, with no attempt to weight more predictive features more heavily. That is, if we add a weakly predictive feature, predictive performance may get worse. Both of these problems are pretty obvious and can easily be fixed, but the key point is that the algorithm as it is most commonly taught doesn’t actually work that well. For more sophisticated algorithms the fixes become less obvious.
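
As a concrete illustration of the scaling point, here is a toy sketch in Scala (the heights, weights, and query are invented for the example): without standardization the weight coordinate swamps the L2 distance, and with it both features contribute.

    // Toy 1-nearest-neighbor on (height in feet, weight in pounds).
    val train = Seq(
      (Array(5.2, 120.0), "F"), (Array(5.4, 130.0), "F"),
      (Array(5.9, 175.0), "M"), (Array(6.1, 190.0), "M"))

    def l2(a: Array[Double], b: Array[Double]): Double =
      math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

    def nearestLabel(query: Array[Double], data: Seq[(Array[Double], String)]): String =
      data.minBy { case (x, _) => l2(query, x) }._2

    // Standardize each feature to zero mean and unit variance using the
    // training data, and apply the same transform to queries.
    val dims  = train.head._1.length
    val means = (0 until dims).map(j => train.map(_._1(j)).sum / train.size)
    val stds  = (0 until dims).map(j =>
      math.sqrt(train.map(t => math.pow(t._1(j) - means(j), 2)).sum / train.size))
    def scale(x: Array[Double]): Array[Double] =
      x.indices.map(j => (x(j) - means(j)) / stds(j)).toArray

    val scaledTrain = train.map { case (x, y) => (scale(x), y) }
    val query = Array(6.0, 135.0) // tall but light
    println(nearestLabel(query, train))              // weight alone decides the answer
    println(nearestLabel(scale(query), scaledTrain)) // height now contributes comparably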

This is unfortunate. I think there is a lot to be said for turning ideas into infrastructure. By infrastructure I mean something with well known characteristics, something you can rely on without thinking about the internals too much. A classic example would be the socket API which abstracts much of the gory details of networking from the system programmer. A programmer can be fairly successful writing network programs with a pretty naive understanding of how physical networks actually work. This kind of intellectual infrastructure is what makes the widespread use of deep, hard ideas possible, and I think not enough has been done for predictive machine learning to make it usable in this fashion.

Ignoring the nasty little details of a good implementation is something that made a lot more sense ten or fifteen years ago when machine learning was more a theoretical pursuit, just a more grounded offshoot of AI research. These days it is big business, and people have a practical and pressing need for algorithms that work.

Of course, no matter how successful researchers are in creating off-the-shelf algorithms, there will always be room for custom modeling to do better. Still, I think there is a lot to be said for algorithms that are cheap, easy to use, and predictable even if they perform worse than the best thing an expert could devise. Think of this as Ikea instead of high-end hand-crafted furniture: everyone agrees the hand-made stuff is better, but most people don’t want to pay for it in most cases.

Of course, no algorithm, no matter how sophisticated, can add new independent predictors not given to it as input. But I think guessing features is hardly the core difficulty.

In any case, thinking about all of this made me wonder if one of the reasons for lack of progress in this area is the absence of any kind of standard benchmark for “off-the-shelf” predictive performance. Typically in publication an algorithm is presented with comparisons to a few other state-of-the-art (or at least well-known) algorithms on 2-5 common benchmark data sets, MNIST or what have you. There are a number of well known problems with this. First, the data sets are often chosen to put the algorithm’s best foot forward, showing only problems on which performance is good. It may well be that the algorithm is worse than the state of the art on many or most other problems, but there is no way to tell this from the paper. Second, computational performance and scalability information for larger data sets is often missing. This is a critical factor for commercial applicability—small improvements in predictive performance are not always worth crippling performance challenges. And of course predictive performance itself is a function of the volume of training data—the most accurate algorithm with a training set of ten thousand is often not the most accurate when run with a training set of ten million. Finally, and perhaps most importantly, there may not be a reference implementation of the algorithm that is available to test against, so the comparison algorithm in a publication may have a fairly naive implementation.

I wonder if most of these problems couldn’t be solved by a good standard benchmark and some reporting infrastructure. I have always found that groups of people work better when there is a clear, plausible metric for success. It isn’t possible to have a single data set that is representative of all problems, so I think the best approach is something that just throws many, many data sets at the algorithm, and breaks down the results along various dimensions. What I am thinking of is something along the lines of the computer language shootout. This would be a setup that allows people to submit algorithm implementations and have them run against a variety of data sets. These days one could easily set up some scripts that run tests on EC2, making it fairly easy to run hundreds of algorithms on thousands of data sets for a modest cost. One might object that it would be hard to choose a representative sample of data sets, but I think as long as they represent real problems of interest, and as long as I can filter results to just problems with technical similarity to mine (say, multiclass prediction, or high-dimensional problems, or what have you), this would be fine.

In addition to predictive accuracy it would be nice to have some measurements of efficiency and scalability in terms of actual CPU time. This would encourage efficient implementations to develop, which are often something of an art in themselves.

The infrastructure to run this would actually be useful for research as well since it would provide an easy test harness to run your own algorithm, a bunch of data sets for predictive problems in a standardized format, and a reference implementation for other algorithms.

The other part to this would be good metadata and reporting to understand how the algorithms perform in different scenarios. There are a lot of things you may want to know about an algorithm. How does performance change with the number of training examples? How well does it handle multiclass (if at all)? Missing features? Low probability classes? High dimensions? Outliers or mislabeled data? Categorical variables? Useless features? Furthermore what is the cost (in CPU seconds) per training example for a good implementation? What is the cost per prediction? There would likely need to be a few different benchmark categories for classification and regression, and perhaps one could argue for some special categories for structured problems like text and images.

A few thoughts on implementation. The framework should be open source and available for anyone to inspect and improve, as with anything in research. I don’t think the idea has much commercial value so I doubt there would be any objection to this. This would be well suited to running on AWS. EC2 would make it easy to get ahold of lots of computer time in an elastic fashion and would handle some of the storage problems. Likewise it would allow a setup where the algorithm developer pays the cost of benchmarking through their own account. This is the only way to easily run other people’s code without fear of an infinite loop or a malicious user tying up your machines. This would also make it easy to parallelize the execution by running lots of tests at the same time. Having a high-quality driver that would run a test algorithm on hundreds of data sets in parallel and break down the results would be valuable in and of itself.

One would want the test framework to assume as little as possible about the implementation of the algorithm. I think probably all you would specify would be driver scripts train.sh and test.sh that take the data set as input; these scripts could call whatever kind of binary they wanted. This would give people the freedom to use whatever programming languages or libraries they like without too many integration hassles.
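
Here is a rough sketch of what such a driver could look like (the directory layout, the prediction-file convention, and accuracy-only scoring are all assumptions of mine, just to show the shape of the thing):

    import java.io.File
    import scala.sys.process._

    // Benchmark driver sketch: for each data set directory, invoke the
    // submitted algorithm's train.sh and test.sh, time them, and score the
    // predictions against held-out labels. Assumed layout:
    //   datasets/<name>/train.csv, test.csv, labels.csv (one label per line);
    // test.sh writes one prediction per line to the path given as its second argument.
    case class Result(dataset: String, accuracy: Double, trainSecs: Double, testSecs: Double)

    def timed[A](body: => A): (A, Double) = {
      val start = System.nanoTime()
      val out = body
      (out, (System.nanoTime() - start) / 1e9)
    }

    def runBenchmark(algorithmDir: String, datasetsDir: String): Seq[Result] =
      new File(datasetsDir).listFiles().filter(_.isDirectory).toSeq.map { ds =>
        val predictions = new File(ds, "predictions.txt").getPath
        val (_, trainSecs) =
          timed(Seq(s"$algorithmDir/train.sh", new File(ds, "train.csv").getPath).!)
        val (_, testSecs) =
          timed(Seq(s"$algorithmDir/test.sh", new File(ds, "test.csv").getPath, predictions).!)
        val predicted = scala.io.Source.fromFile(predictions).getLines().toSeq
        val actual    = scala.io.Source.fromFile(new File(ds, "labels.csv")).getLines().toSeq
        val accuracy  = predicted.zip(actual).count { case (p, a) => p == a }.toDouble / actual.size
        Result(ds.getName, accuracy, trainSecs, testSecs)
      }

    runBenchmark("./my-algorithm", "./datasets").foreach(println)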

It is important that the framework enforce a purely black-box approach: any tuning or hyper-parameters would have to be hard-coded for all data sets or set automatically by the algorithm itself as part of its training. If an algorithm implementer wants to set custom hyper-parameters for each data set, they will need some heuristic to do so, or else do some kind of brute-force search with cross-validation.
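
For example, an implementer could bake something like the following into their training step so that a hyper-parameter (say, k for a nearest-neighbor model) is chosen by cross-validation on the training data rather than hand-tuned per data set (a sketch only; the trainAndScore function and the candidate grid are placeholders of mine):

    // Choose a hyper-parameter automatically by k-fold cross-validation on the
    // training data, so nothing is hand-tuned per data set. `trainAndScore`
    // stands in for whatever fitting and evaluation the algorithm actually does:
    // it trains on the first argument, scores on the second, and returns accuracy.
    def crossValidate[P](data: Seq[(Array[Double], String)],
                         candidates: Seq[P],
                         folds: Int)
                        (trainAndScore: (Seq[(Array[Double], String)],
                                         Seq[(Array[Double], String)], P) => Double): P =
      candidates.maxBy { param =>
        val foldScores = (0 until folds).map { f =>
          val (holdout, rest) = data.zipWithIndex.partition(_._2 % folds == f)
          trainAndScore(rest.map(_._1), holdout.map(_._1), param)
        }
        foldScores.sum / folds
      }

    // Usage sketch: pick k for a nearest-neighbor model from a fixed grid.
    // val bestK = crossValidate(trainingData, Seq(1, 3, 5, 11, 25), folds = 5)(knnAccuracy)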

You would want to keep metadata on each algorithm recording its run time and performance on each data set, as well as a browser for the source code and wikis giving a higher-level overview and links to applicable papers.

There are lots of other little implementation details. You would want a nice UI with a leaderboard. You would want to attach facets that describe the problem for each data set and algorithm (e.g. number of classes, training set size, etc) so that results could be broken down appropriately in reporting.
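
A small sketch of what that facet metadata and reporting breakdown could look like (the field names here are invented, not a proposal for a schema):

    // Facet metadata attached to each data set so results can be broken down
    // in reporting. Field names are illustrative only.
    case class Facets(numClasses: Int, trainingExamples: Long, dimensions: Int,
                      hasMissingValues: Boolean)

    // Break average accuracy down by a chosen facet, e.g. "how does this
    // algorithm do on binary versus many-class problems?"
    def accuracyBy[K](results: Seq[(Facets, Double)])(facet: Facets => K): Map[K, Double] =
      results.groupBy { case (f, _) => facet(f) }
        .map { case (k, rs) => k -> rs.map(_._2).sum / rs.size }

    // accuracyBy(results)(f => if (f.numClasses == 2) "binary" else "multiclass")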

I am not aware of anyone doing anything quite like this. Kaggle is a very cool company that provides machine learning contests as a way to outsource predictive analytics. Their focus seems to be on single problem contests and I don’t think they supply the infrastructure to run the algorithm. Maybe they could be convinced to provide past challenge data sets as benchmark data?

Primarily I think the reason this hasn’t already been done is that there is no publishable work involved, just mucky scripting and munging data sets into a common format. Secondly, AWS is fairly new, and I think it simplifies a lot of the problems of making such a framework available for use without the security and operational problems of trying to let people run code on machines you own and operate.

Postscript: A few people pointed me towards a version of this idea that actually already exists. It is called mlcomp.org. Gotta love the internet.

Khan Academy

This article really irritates me.

It leads with a good point: the problem with math education is that it fails to connect math to the real world, and the author claims that Khan Academy has this same flaw. Fair enough. I agree that the central problem with math education is that students see no reason to learn math. Learning is hard, it requires a certain amount of struggle, and trying to get anyone to learn something they see no reason to learn is unlikely to produce lasting learning. They may memorize some procedures for a test, but they won’t be able to use them in their life or build on them in the future. This has little to do with Khan Academy except insofar as Khan Academy tends to teach a series of techniques without really motivating them or answering the larger question of relevance. I don’t know that this is necessarily true.

This is where things go downhill.

The next critique is that Khan Academy hired computer scientists instead of teachers. The author thinks teaching should be improved by helping to hone the craft of teaching. But let’s get real: we have been honing the craft of teaching for a few hundred years now; how much better is it going to get? Khan Academy is about the medium, not the message. They are not trying to reform the math curriculum, however desirable that would be. They are trying to change the process by which people learn.

Teaching is currently done in a non-scalable way, while software and video are completely scalable. A video, once shot, can be replayed a million times at any speed. A program can be executed almost for free. No one thinks that videos or programs can compete with the best a human being can do, but they don’t need to—most people don’t have Richard Feynman as their high school physics instructor. The point is that, despite how inferior video is as a medium, it is possible to take the best video instruction mankind can produce and let the poorest student in the world have access to it (yes, assuming they have a computer…but you get the point).

If someone has an idea about how to re-orient the math curriculum, they can make their own videos and programs and give them to the world. The point about scalability is that only one person needs to do this for everyone to benefit.

And the rest of the article is just the silly teachers’ union stuff about how teachers, unlike every other profession in the world, can’t possibly be evaluated. The fact is, teachers can be evaluated, and good teachers make a bigger difference than anything else in how much students learn. Standardized tests aren’t perfect, but they are the best tool we have for identifying good and bad teachers. I think this paper is a fantastic piece of work identifying the impact of good teachers. The obvious fix, to reward the good teachers and replace the bad ones, isn’t possible in most places because of politics, which makes teaching somewhat unique in that respect.

One of the main sinks of energy in the “developed” world is the creation of stuff. In its natural life cycle, stuff passes through three stages. First, a new-born stuff is displayed in shiny packaging on a shelf in a shop. At this stage, stuff is called “goods.” As soon as the stuff is taken home and sheds its packaging, it undergoes a transformation from “goods” to its second form, “clutter.” The clutter lives with its owner for a period of months or years. During this period, the clutter is largely ignored by its owner, who is off at the shops buying more goods. Eventually, by a miracle of modern alchemy, the clutter is transformed into its final form, rubbish. To the untrained eye, it can be difficult to distinguish this “rubbish” from the highly desirable “good” that it used to be. Nonetheless, at this stage the discerning owner pays the dustman to transport the stuff away. Quote from the fantastic Sustainable Energy Without the Hot Air by machine learning researcher and physics professor David MacKay.

Engineers and operations people are complementary goods to datacenters

There is a common perception that “cloud” computing will make operations people unnecessary. The reality is the opposite; I would expect the demand and price for operations people to rise on the whole. It is easy to see why: the final product of “cloud”-style data center operations and software engineering is a website or service. Insofar as “cloud” computing is successful it will lower the cost of running a data center by removing direct interaction with lower-level infrastructure. When the price of making and running websites and services goes down, you expect more of them (i.e. demand goes up). This means the demand/price for anything else needed to run a website goes up too. It turns out the other thing you need to run a web service is engineers and operations people. This is what is known as a complementary good (see the wonderfully readable textbook Principles of Economics if you are interested in this kind of thing).

But all operations people are not created equal. Some are actually being automated. Certain lower-level systems tasks are being replaced. The people who build out servers, data centers and networks are being replaced by centralized teams run by Amazon. The interaction with these things is now via an API.

So systems operations will be impacted, but by no means replaced. You still absolutely need a networking expert if you run on Amazon; you just no longer need to build out networks on your own. The result of this is that I expect most operations people in those areas to move up the stack.

One other impact is that, because many services like machine allocation that used to be manual are now behind APIs, I expect the importance of programming skills for operations people to increase. So if you are an application operations specialist and want to future-proof your skills, make sure you are a solid Python programmer.

Credit to Eric Sammer for reminding me of this.