Does off-the-shelf machine learning need a benchmark?

I really like the blog post "Why Generic Machine Learning Fails" by Joe Reisinger of metamarkets. His point is that successful industrial applications of machine learning are usually not based on black box algorithms being blindly fed data; rather they come from humans thinking deeply about the nature of the problem and iteratively working with a data set and algorithm—fiddling with features, tuning parameters, loss functions, etc. Usually the results are dependent on domain-specific insights that are not transferable to any other problem. A related discussion on the specific importance of domain-specific insight versus modeling skill can be found here.

This take on the discipline of modeling isn’t algorithm-centric, instead the person doing the modeling basically has access to large bag of mathematical and algorithmic tricks and creatively applies them to solve the problem at hand. There may be little that is reusable between problems, because each trick has its pros and cons, and the tricks that work in one application are unlikely to be the tricks that work in another application. In school my adviser referred to this approach to prediction as “cooking”—namely experimenting with lots of ingredients until you get something that tastes good. I think this was meant to be both a little derogatory in that cooking isn’t really based on a guiding theory, but also somewhat admiring, since cooking is a bit of an art (and he liked to cook).

The practical superiority of this “cooking approach” matches my experience as well. Most of the gains we have had in data and modeling problems I was involved in at LinkedIn came from creative feature engineering and analyzing results to debug where our models did well and where they did poorly (and then brainstorming solutions). It’s easy to see why academia doesn’t focus on this: research should be broadly applicable, so domain-specific feature engineering isn’t very sexy. This isn’t only a matter of preference, either, in commercial settings there are certain advantages. Specifically, our data sets aren’t a static thing. You often own the source of the data as well as the model, so you can always go and get new independent predictors by collecting new things or creating new feedback loops. This is generally not the case for academic research which works against fixed, static data sets by necessity. For a fixed, static data set, of course, the algorithm is the only thing that can be changed.

Since, after all, organizations often only have a few dozen key data problems, it is not too burdensome to do custom modeling work on each problem. In addition to producing better results, this is generally cheaper and simpler than building out generic infrastructure. Even for products which are very prediction-centric (e.g. recommendations) the actual model building portion is a small fraction of the overall effort of a production system, so a custom approach to modeling doesn’t add much to the work involved.

But—and here is what I really want to argue—even if the “cooking” approach to data problems is currently the best path for folks in industry, that doesn’t make it a good paradigm for machine learning and statistics research. After all one of the main motivations for machine learning as a discipline distinct from statistics was an interest in “black box” prediction without the bag of distribution-centric tricks and expertise that traditional statistical modeling favored. The end goal should be to automate the statistician as completely as possible, not just to add more tricks to her bag. I think this is an area where academia falls a little bit short.

A related point is discussed in a great opinion paper by Larry Wasserman (he of All of Statistics fame). The thesis of the paper is that although the primary distinction of the last decade or so in statistics has been frequentist versus Bayesian styles of modeling, a more important goal is achieving low-assumption high-dimensional modeling. “Assumptions” are, after all, what make a model fail to be “off-the-shelf” since each assumption requires checking and tweaking to make sure it is true and limits the approach to only certain applicable domains.

In some areas you can actually see things moving backwards. For example the fiddliness of Bayesian statistics is something that has always bothered me. On the theoretical side Bayesians have collapsed lots of disjoint heuristics into a core set of principles, but unfortunately the actual modeling practice advocated by most Bayesians seems to move further from off-the-shelf. Hierarchical modeling and priors and messing around with MCMC, and all of that, add to the assumptions and make the role of a good statistician conducting the analysis even more central. For example, I enjoy the fantastic walk-throughs of modeling problems in Andrew Gelman’s Bayesian Data Modeling but clearly the conception of modeling is of something done with a computer, not something done by a computer. Of course there is no reason this has to be true. Bayesian ideas can be used for black box approaches (though I gather that some Bayesian’s feel that non-informative priors and predictive checks and are somewhat sacrilegious). But in the common practice of Bayesian modeling the statistician seems to play an even more central role than in traditional statistics.

Even for the fairly “off-the-shelf” algorithms we do have, there are numerous practical obstacles that prevent anyone but an expert from making effective use of them. These algorithms often go wrong and need a fair amount of hand-holding to produce good performance. The more complex the algorithm the less likely you will be able to effectively do this analysis and debugging. I think the existence of a good set of theory and tools for debugging models are what, more than anything else, has lead to the pervasive success in industry of simple models like logistic regression and the relative rarity of a lot of what you would find in, say, a recent ICML publication. Part of the problem is that in publication most algorithms are kept as simple as possible to simplify presentation, but this tends to punt a large number of practical problems into hyper-parameters and other tricks that have to be guessed by magic or set via cross-validation. Likewise getting a fast and robust implementation of the algorithm that converges quickly and reliably without manual tweaking and debugging any numerical algorithm can be an art form in and of itself. (For example, I love that there is this book called "Numerical Algorithms that Usually Work"). In other words, there is an engineering aspect to the problem in addition to the research and this is often not really common knowledge.

As an example, pick up most introductory machine learning book and read the description of nearest neighbor algorithms. Few actually describe a procedure that will work well on real data. Simple nearest neighbor prediction can do very well on certain kinds of problems, but it has a number of problems that need to be fixed when used with any common distance like the L2 norm. First the scale of the feature values is important and input data has to be scaled. Say you are trying to classify men and women using height and weight. If height is in feet and weight in pounds, weight will dominate the distance. Likewise it is not tolerant of irrelevant features or features with non-uniform predictive value, since L2 norm weights all features equally, with no attempt to weight more predictive features more heavily. That is, if we add a weakly predictive feature predictive performance may get worse. Both of these problems are pretty obvious and can easily be fixed, but the key point is that the algorithm that is most commonly taught doesn’t actually work that well. For more sophisticated algorithms the fixes become less obvious.

This is unfortunate. I think there is a lot to be said for turning ideas into infrastructure. By infrastructure I mean something with well known characteristics, something you can rely on without thinking about the internals too much. A classic example would be the socket API which abstracts much of the gory details of networking from the system programmer. A programmer can be fairly successful writing network programs with a pretty naive understanding of how physical networks actually work. This kind of intellectual infrastructure is what makes the widespread use of deep, hard ideas possible, and I think not enough has been done for predictive machine learning to make it usable in this fashion.

Ignoring the nasty little details of a good implementation is something that made a lot more sense ten or fifteen years ago when machine learning was more a theoretical pursuit, just a more grounded offshoot of AI research. These days it is big business, and people have a practical and pressing need for algorithms that work.

Of course, no matter how successful researchers are in creating off-the-shelf algorithms, there will always be room for custom modeling to do better. Still, I think there is a lot to be said for algorithms that are cheap, easy to use, and predictable even if they perform worse than the best thing an expert could devise. Think of this as Ikea instead of high-end hand-crafted furniture: everyone agrees the hand-made stuff is better, but most people don’t want to pay for it in most cases.

Of course, no algorithm, no matter how sophisticated, can add new independent predictors not given to it as input. But I think guessing features is hardly the core difficulty.

In any case, thinking about all of this made me wonder if one of the reasons for lack of progress in this area is the absence of any kind of standard benchmark for “off-the-shelf” predictive performance. Typically in publication an algorithm is presented with comparisons to a few other state-of-the-art (or at least well known) algorithms on 2-5 common, benchmark data sets. MNIST or what have you. There are a number of well known problems with this. First, the data sets are often chosen to put the algorithm’s best foot forward, showing only problems on which performance is good. It may well be that the algorithm is worse than the state of the art on many or most other problems, but there is no way to tell this from the paper. Second computational performance and scalability information for larger data sets is often missing. This is a critical factor for commercial applicability—small improvements in predictive performance are not always worth crippling performance challenges. And of course predictive performance itself is a factor of the volume of training data—the most accurate algorithm with a training set of ten thousand is often not the most accurate when run with a training set of ten million. Finally, and perhaps most importantly, there may not be a reference implementation of the algorithm that is available to test against, so the comparison algorithm in publication may have a fairly naive implementation.

I wonder if most of these problems couldn’t be solved by a good standard benchmark and some reporting infrastructure. I have always found that groups of people work better when there is a clear, plausible metric for success. It isn’t possible to have a single data set that is representative of all problems, so I think the best approach is something that just throws many, many data sets at the algorithm, and breaks down the results along various dimension. What I am thinking is something along the lines of the computer language shootout. This would be a setup that allows people to submit algorithm implementations and have them run against a variety of data sets. These days one could easily set up some scripts that could run tests on EC2, making it fairly easy to run hundreds of algorithms on thousands of data sets for a modest cost. One might object that it would be hard to choose a representative sample of data sets, but I think as long as they represent real problems of interest, and as long as I can filter results to just problems with technical similarity to mine (say multiclass prediction, or high dimensional problems, or what have you), this would be fine.

In addition to predictive accuracy it would be nice to have some measurements of efficiency and scalability to measure actual CPU time. This would encourage efficient implementations to develop, which are often something of an art in themselves.

The infrastructure to run this would actually be useful for research as well since it would provide an easy test harness to run your own algorithm, a bunch of data sets for predictive problems in a standardized format, and a reference implementation for other algorithms.

The other part to this would be good metadata and reporting to understand how the algorithms perform in different scenarios. There are a lot of things you may want to know about an algorithm. How does performance change with the number of training examples? How well does it handle multiclass (if at all)? Missing features? Low probability classes? High dimensions? Outliers or mislabeled data? Categorical variables? Useless features? Furthermore what is the cost (in CPU seconds) per training example for a good implementation? What is the cost per prediction? There would likely need to be a few different benchmark categories for classification and regression, and perhaps one could argue for some special categories for structured problems like text and images.

A few thoughts on implementation. The framework should be open source and available for anyone to inspect and improve, as with anything in research. I don’t think the idea has much commercial value so I doubt there would be any objection to this. This would be well suited to running on AWS. EC2 would make it easy to get ahold of lots of computer time in an elastic fashion, handle some of the storage problems. Likewise it would allow a setup where the algorithm developer pays the cost of benchmarking through their own account. This is the only way to easily run other people’s code without fear of an infinite loop or a malicious user tying up your machines. This would also make it easy to parallelize the execution by running lots of tests at the same time. Having a high-quality driver that would run a test algorithm on hundreds of data sets in parallel and break down the results would be valuable in and of itself.

One would want the test framework to assume as little as possible about the implementation of the algorithm. I think probably all you would specify would be some driver script and that took the data set as input, this script could call whatever kind of binary it wanted. This would give freedom for people to use whatever programming languages or libraries they like without too many integration hassles.

It is important that the framework enforce a purely black box approach, any tuning or hyperparameters would have to be hard coded for all data sets or set automatically by the algorithm itself as part of its training. If an algorithm implementer wants to set custom hyper-parameters for each data set they will need some heuristic to do so, or else do some kind of brute force search with cross-validation.

You would want to keep metadata on each algorithm recording its run time and performance on each data set as well as a browser for the source code, and wikis giving a higher-level overview and links to applicable papers.

There are lots of other little implementation details. You would want a nice UI with a leaderboard. You would want to attach facets that describe the problem for each data set and algorithm (e.g. number of classes, training set size, etc) so that results could be broken down appropriately in reporting.

I am not aware of anyone doing anything quite like this. Kaggle is a very cool company that provides machine learning contests as a way to outsource predictive analytics. Their focus seems to be on single problem contests and I don’t think they supply the infrastructure to run the algorithm. Maybe they could be convinced to provide past challenge data sets as benchmark data?

Primarily I think the reason this hasn’t already been done is because there is no publishable work involved, just mucky scripting and munging data sets into a common format. Secondly AWS is fairly new, and I think that simplifies a lot of the problems of making such a framework available for use without the security and operational problems of trying to let people run code on machines you own and operate.

Postscript: A few people pointed me towards a version of this idea that actually already exists. It is called Gotta love the internet.


  1. jakehofman reblogged this from boredandroid
  2. boredandroid posted this