This article makes an analogy between concurrency and memory management. The claim is that since modern engineers almost always program to clusters of computers, what we need are tools targeted at building distributed systems. This is taken to mean that we need language support for distributed system development. The idea is that this will lead to the dominance of languages like Go and Erlang.
Go and Erlang may end up being popular, but I doubt that it will be for that reason. Distributed system development is not going to become a part of every application developers daily life because it is a massive pain in the ass. To the extent that it is happening today, I think it is because good distributed computing frameworks are lacking and so applications are forced to reinvent some distributed systems primitives. But this gap will not persist. Rather a handful of frameworks will provide different programming models that handle distribution and concurrency across an elastic pool of machines.
Those who have worked with MapReduce have already dealt with this. MapReduce programming is almost always single-threaded. Concurrency is managed by the framework. The last thing that would help you write good MapReduce programs is more user-level concurrency primitives. And yet MapReduce is highly parallel.
This is not a new thing. Even Java servlets, for all their faults, largely served to abstract away the threading model, at least for applications that only interacted with a single database using blocking I/O.
I see three basic domains for processing: online, near-real-time, and offline.
In the online world people build request/response services. Parallelism is found by treating each request as a unit of work and dividing requests over threads and machines. I have seen a number of variants on this model, from “service query languages” to DSLs for stitching together REST calls. What all these share is that they abstract away the concurrency of the processing without needing direct access to single-server concurrency mechanisms like threads.
In the near-real-time processing domain stream processing frameworks do a good job of providing asynchronous processing without directly thinking about concurrency at all. Again you interact with concurrency and parallelism only at the framework level, the code you write appears completely single-threaded.
The offline world seems to be moving in the direction of a handful of YARN frameworks for different specialized purposes. What almost all of these share is that they don’t require users to directly manage concurrency.
I think these high-level domains (online, asynchronous, and offline) will prove pretty durable. I doubt we need more than a dozen basic distributed computing abstractions to cover these.
This leads me to think that putting more time into language support for single-server concurrency (software transactional memory and all that) is of limited utility. It will only help the implementors of these frameworks, not the end user. Framework implementors will have a much different set of needs then application programmers. They care a lot more about performance and fine-grained control. Although this is apparently a controversial statement, I’m not sure that just threads and locks aren’t a workable model for framework developers to work with. After all, they work pretty well for a discipled team and give excellent performance and control.
I just got back from Foo Camp. Here are some random thoughts I was left with. These are not subjects I know very much about so these are mostly questions rather than comments.
Why is Bitcoin decentralized?
There were several Foo Camp sessions on Bitcoin. I think the case for a digital currency is overwhelming. Think for a second about the world’s most common payment mechanism: checks. To send money first someone writes you a check, then you take a picture of the check with your phone, and then send that picture to your bank via an iphone app, and they OCR it and depost it in your account. This is arguably the most ridiculous imaginable way of moving a number from one computer system to another. Just stop for a second and think about checks—I think it is amazing that that system can work at all. So I am totally sold on the need for a digital currency.
But one question I am left with is why do we want a decentralized currency? Many people think decentralized things are inherently better than centralized things, but this doesn’t seem to be true at all. There are many things that are very very hard to do in the adversarial model that is required for a fully decentralized system. By decentralized I mean run on top of a bunch of untrusted peers, rather than by some centralized controlling entity. The internet is mostly decentralized in its communication layer, but the services on the internet are almost all centralized (e.g. Google or Twitter are not things that run distributed over computers controlled by many different organizations).
The primary argument for a decentralized currency seems to be that it will be a kind of improved gold standard that will avoid the mettling of the Fed. Yet the Fed was something we added for a reason, no? Does that reason no longer make sense? This is not something I have the expertise to understand, but it seems most economists think the role of the Fed in controlling the supply of money is important and I don’t know enough to know why I should disagree.
Designing a fully decentralized currency is mind-blowingly cool, but I am not at all sure that it is the right practical approach. Most of the advantages and cool applications of Bitcoin seem to come from its being digital rather than from its being decentralized. I am not an expert but it does seem that a centralized implementation of a digital currency could actually be a bit better than Bitcoin at scaling and handling vast numbers of small transactions with very low latency (micropayments!).
There are another set of arguments from people who want to limit government’s ability to control what is bought and sold and collect taxes, but this is primarily a political agenda and largely one I don’t agree with. From my point of view as long as the government controls a large nuclear arsenal and a police and military apparatus, their control over the currency is the least of my worries.
So a question for experts in this area is why wouldn’t it be better for the government to provide a centrally administered digital currency as a service? I.e. have the fed run a central transaction log with a nice public REST API. I think that that could potentially provide many of the same benefits and be a much simpler technical undertaking, with none of the useless mining or 51% GHash problems. I am not advocating this since I don’t know much about the topic, but just asking Bitcoin people what we would lose in that implementation.
I should learn more about deep learning. A bunch of smart people are shockingly enthusiastic about this, and there were a couple of interesting sessions that touched on it. I have somewhat ignored it since there haven’t been huge improvements in the fundamental challenges in machine learning in some time. Most of the advances in machine learning, from my perspective have really been from tuning up the techniques we have had for a long time and putting them to work on real problems. So I had somewhat stopped paying attention to new research, and attributed the extensive PR push around deep learning to be part of the normal cyclical fads in machine learning. Maybe it’s time to tune back in and see what is going on.
As I understand it, deep learning is basically a hack for “pre-training” multi-layer neural networks using unlabeled data first and then using that representation for building the final prediction model. In combination with various good optimization techniques, and the use of GPUs for vector processing this enables many more levels to be practical is a neural network.
The promise of Deep Learning, as with many past techniques, is to significantly reduce the need for feature engineering (i.e. having to create complex derived features to feed into the algorithm). The claim is state-of-the-art results in a number of hard domains taking input data “as is”, without preprocessing (raw words, raw pixels, etc). Of course SVMs and other techniques have made similar claims.
The sessions were for a general audience, so I was left with a couple of questions I don’t know the answer to:
Government and Digital Infrastructure
Another set of sessions and talks was around Healthcare.gov and the team that helped get it back on its feet. This was a pretty fantastic turnaround. As someone who grew up without healthcare I think it is wonderful that healthcare is becoming a basic right of citizens of the US, and I think we have that group of people to thank for the last minute save. But although it is great that they got our healthcare website back on its feet, the larger inability of the government to successfully build digital infrastructure is a huge concern to me.
The real practical problem with my centralized Bitcoin proposal is that there is exactly zero chance that the government could have the foresight or technical expertise to understand the uses of a digital currency, and even less chance that they could put together a team that could build such a thing. Even if a centralized implementation is vastly easier to build, there is no chance that it will be. This is something of a tragedy as I see much of what the government doing as being a kind of information processing. Large parts of what are done by the DMV, the passport office, the IRS, the social security administration, medicare, and so on are things that seem to me could be done in large part by computers. For example, the DMV and passport office seem to be in large part a special case of an authentication and permissions system for citizens (they manage the permission to drive a class C vehicle, say, or leave the country). It is just that these systems were built before computers, so although they are largely information processing systems, they are ones built out of people, i.e. bureaucracies.
This actually gives you a very different take on politics. The traditional view of politics is around the “size” of government: Republicans want low taxes and small government that does little, and Democrats want higher taxes and larger government that does more. But this is actually far too simplistic. In reality what the government does is a combination of two different things: redistributing money (social security, medicare, etc) and running services (the military, schools, etc). Interestingly, though, the controversy over “big government” is almost entirely about the government’s redistribution of money. And yet that is exactly the part of the problem that could likely be solved without a large government bureaucracy. I think a well designed entitlement and tax system could be almost entirely automated.
In other-words, you can be in favor of reasonable levels of taxation and wealth redistribution and simultaneously support a reduction in the size of government as an organization. I would think that the efficiency aspect is something that both Republicans and Democrats could agree on, since regardless of the taxation level and set of benefits you favor, you would surely want the system that delivered these to work efficiently. And yet I believe there is simply no chance that the government will be able to tackle these large challenges, which would amount to rebuilding parts of its own bureaucracy to be vastly smarter, quicker, smaller, and lower overhead. These projects would be large infrastructure investments, on a scale we no longer really undertake, though perhaps smaller than things we have done in the past like the building of the interstate highway system or space exploration. Code for America is an organization that is attacking some of these problems in local governments, which is probably the most tractable approach given the state of the federal government. What I would love to see is the development of political will around some of these things that would allow the federal and state governments to undertake some ambitious projects in this area and attract the kind of people that could actually make this stuff work.
This New York Times has an interesting internal memo on innovation and the disruption of print media by new digital media.
What many commentators have pointed out is that newspapers have not needed to be tech companies for the last few hundred years. Innovation and the continual improvement of their own product is not naturally in their DNA. So for the last few hundred years the newspaper business hasn’t changed that much.
What few people have pointed out is that this is likely a temporary phenomenon. You would expect online delivery of news media to reach a certain point of maturity where further innovation on the delivery mechanism produces little real improvement. Say that takes 10 years.
This presents an interesting dilemma. For the last few hundred years newspapers have not had to be tech companies, and ten years from now they can stop, but to stay alive for the next ten years they have to temporarily acquire a core competency in technology.
We just released Apache Kafka 0.8.1. There are a couple of interesting things in this release.
The biggest new feature is log compaction. This allows de-duplicating the data in the partitions of a Kafka topic by primary key. This makes Kafka an excellent backend for really massive partitioned event sourcing, database data replication, and other log-centric architectures that model mutable data. This feature also acts as the basis for stateful stream processing in Samza. More details on how you can use this here.
This release also included a big cleanup of the log layer. Most noticeably this does a much better job of managing fsyncs. For those who like smooth latency graphs under load with no performance tuning this is a big win.
We also improved a lot of operational activities that previously required bouncing the brokers. We added commands for adding partitions and deleting topics online. Documentation for using this is here.
We also made all per-topic configurations dynamically managed. This means you can change the retention on a topic, or change it’s segment file size with a simple command and no bouncing brokers. These configs are documented here.
We added functionality to automatically balance leadership amongst brokers (previously you had to run a special command to make this happen). You can enabled this by setting
auto.leader.rebalance.enable=true. We also added code to have leaders proactively transfer leadership during intentional shutdowns, this is more graceful than the transfer that happens with a hard kill. You can enable this with
controlled.shutdown.enable=true. We will be enabling both by default once we have a little more experience using them.
There are also dozens of bug fixes and minor improvements.
This is an in-place, no-downtime release. You shouldn’t need to do much other than push out the updated code and do a rolling bounce on your servers. However you may want to glance over the new configs and tools first.
As usual we have been running pre-release versions of this code at LinkedIn for several months now so it should be pretty stable. But if you see anything unexpected please let us know.
So what’s next?
The Kafka committers have been working on a bunch of exciting new stuff. There will probably be a 0.8.2 release in the next month or so with improved consumer offset management (built on top of our new log compaction support) as well as a beta version of a completely rewritten Kafka producer. The final version of the producer and consumer will be in 0.9. If you have thoughts on the api or feature set for the producer or consumer we have been actively discussing them on the mailing list and would love to hear people’s thoughts.
Kyle has a good write-up on replication and partitions in Kafka. I am a big fan of this methodology (empiricism FTW), though it is a bit uncomfortable to watch ones own project strapped to the apparatus.
Kyle’s explanation is pretty clear, so I have only a few things to add.
First, as Kyle notes, we have unfortunately not yet beaten the CAP theorem, although to maintain parity with the rest of the distributed systems ecosystem we promise to start working on this as soon as possible.
It’s worth pointing out our application domain to understand our motivations. Kafka is meant as a data journal (like HDFS, but targeted at real-time streams) rather than a metadata journal (like Zookeeper). These often have slightly different design tradeoffs. For example the HDFS namenode handles metadata and now uses a majority vote quorum approach for HA (or just a single master for older versions), but the HDFS data nodes need to be a little bit more parsimonious with the actual data blocks (because people don’t want to have to store five copies of all their data to tolerate two failures). Kyle describes our motivation as handling more failures, which is correct and equivalent, but it is a little more intuitive to think of keeping fewer copies of data since your failure tolerance requirements are probably fixed and you add replicas to meet that. Essentially we want to make replication a practical default approach to handling high-volume real-time data problems.
At LinkedIn, for example, we are running Kafka with replication in production for all data and are doing north of 40 billion writes a day through this replicated log. So the number of writes required is a real and practical concern.
As Kyle points out there is no correct algorithm for guaranteeing consistency in the face of f failures with fewer than 2f+1 servers, but this turns out to not actually require 2f+1 copies of each piece of data. The trick to getting fewer copies of data while maintaining consistency is to treat data writes different from configuration changes (which for anyone who is not a log-replication nerds mean changes to the set of brokers replicating writes). This idea is by no means original to us. Cheap Paxos is one of Leslie Lamport’s Paxos variants that does something along these lines with the same goal, and PacificA is a system utilizing a similar technique. This split makes sense for a system designed to handle large data volume because the data will be much much larger than the metadata about configuration (you might have tens or hundreds of data nodes, but metadata remains tiny).
A log replication algorithm typically needs to guarantee something along the lines of “committed writes are not lost”. In a leader-based log replication algorithm this usually means that the leader must have all the committed writes. To ensure that this property can hold even when a new leader is chosen there must be some overlap between the replicas who have the write and the set of nodes who participate in choosing the new leader (to ensure that the chosen leader has all the committed writes). This is the quorum property. Any overlap between these sets will work: for example you can require a majority vote for the write and a majority vote for leader election, or you can require only a single acknowledgement on write but require unanimous vote to elect a leader (not at all useful, but correct!).
Kafka does something a little different, it maintains a dynamic set of in-sync replica brokers (the ISR) that grows and shrinks as failures occur. Each broker in this set must all acknowledge each write for it to be committed, as a result any broker in the ISR has all committed messages and is eligible for election. However failure to acknowledge will cause a broker to drop out of the in-sync set, reducing the set of nodes that must acknowledge. This is great, but pushes the consistency problem into maintaining the ISR. We hold the ISR in Zookeeper which does a full majority quorum for writes. In order to rejoin the ISR a failed node must catch up on replicating the log to come back into sync with the master.
In this sense (as with Cheap Paxos and other variants) there is actually a caveat on the failure tolerance: we can tolerate N-1 Kafka node failures, but only N/2-1 Zookeeper failures. This actually is sensible, though, as Kafka node count scales with data size but Zookeeper node count doesn’t. So we would commonly have five Zookeeper replicas but only replication factor 3 within Kafka (even if we have many, many Kafka servers—data in Kafka is partitioned so not all nodes are identical).
This approach has pros and cons. The pro is basically fewer copies of data which makes the kind of large data volume problems we target a lot more practical. The con is primarily having what amounts to two types of quorums (Kafka’s quorum and Zookeepers Quorum) and more nuanced failure detection. The criteria for a node to be alive thus includes both replicating the leader and maintaining a Zookeeper connection. Failure detection can always be a bit finicky in practice so this is a real issue. But since Kafka already had a Zookeeper dependency for a variety of other uses this seemed an okay tradeoff to make.
The issue Kyle demonstrates makes for a good illustration. In this scenario Kyle kills off all but one node in the ISR, then writes to the remaining node (which is now the leader), then kills this node and brings back the other nodes. I actually think the issue here is what we call “unclean leader election” rather than our approach to quorums or anything specific to network partitions. Any type of failure executed in this pattern should work just as well as network partitions to reproduce this case.
An equivalent scenario can be constructed for a majority vote quorum. For example consider using a majority vote quorum with 3 nodes (so you can tolerate 1 failure). Now say that one of your nodes in this three node quorum is obliterated. If you accept a write with only two servers the failure of another server breaks the quorum property so you will no longer be able to elect a new master or guarantee consistency. I think this is the same as accepting a write in Kafka with only a single remaining server given our quorum properties—both cases are one server failure away from data loss.
The data loss that Kyle causes actually comes from our behavior in handling the case where all nodes die. This is an interesting thing to consider. For example, in the majority vote quorum, what do you do if you have only a single server remaining out of your three node cluster? To maintain your formal distributed system guarantee you need not do anything since the precondition of your guarantee has been broken. At this point you can can technically just printf(“Sucks to be you”) and exit(666). But a practical system likely needs to think this through. After all you would argue, I still have almost all my data on my one remaining server. The crux is that if you use this data you potentially violate consistency and lose (or gain!) committed writes, if you don’t you remain consistent but if you can’t revive your other machines you’ve lost everything.
Is it better to be alive and wrong or right and dead?
As usual the right thing to do depends on the application. Perhaps you can recover the nodes and bring them back with data intact in which case waiting around with the system down may be the best thing. But it may also be that you are running a “system of record” for writes from online traffic in which case downtime equals data loss, and you had better bring back up whatever you’ve got right-the-fuck-now.
Currently we are more than a little aggressive about pursuing the later approach—when all nodes are down we will elect the first node to come up as the leader. This can be dangerous for use cases which require strong consistency: we will do this, print a nasty warning in the logs about data loss, and continue merrily along.
For Kafka we actually do have both types of application. When used as a log directly by applications, Kafka is the system of record so downtime generally means data loss. In this case I think availability is key and our consistency guarantees are already more than good enough. But for processing downstream of this—say a Samza job that pulls data out of upstream Kafka topics, processes it, and produces new data to downstream topics—downtime need not mean data loss, just delay (provided you can eventually restore on formerly in-sync node).
Kyle’s recommendation of disabling “unclean election” and thus requiring a node in the “in sync set” to come back before accepting further reads and writes is a good one, and I think this provides more flexibility around handling this case. We have had a JIRA to add this for a while, but haven’t quite gotten around to it.
The other suggestion of requiring additional acks beyond the minimum is interesting. This does reduce the probability of a committed message being lost, though to do so reduces the availability for writes. But you could actually make the same argument for a majority quorum algorithm to have a setting that allows setting the minimum acknowledgement higher than majority to avoid “risky” writes.
There is a lot more detail on Kafka’s approach to replication and the tradeoffs it involves here. There are a lot of other interesting bits there for fellow distributed systems nerds.
I think the online education movement is amazing. College classes, like youth, are wasted on the young. College students are usually getting their first chance to live away from their parents and figure out who they are, which, understandably, seems more immediate and alive than 19th century English lit. or linear algebra or whatever class material they happen to study. So for a lot of students college is just the last hoop in a series of hoops they have to jump through before they are allowed to start their own life. So I think classes full of people who want to learn that subject is a great idea.
The most interesting phenomenon in this transformation is that education is changing from a non-scalable medium like theater or opera performances to a scalable medium like film or mp3s. This has the usual effect: average quality rises dramatically, price drops, and demand rises. It is all well and good to debate whether the Stanford students who took the Machine Learning class in person got a better experience than those who took the class online, but very likely all but a handful or those who took the online class wouldn’t have been admitted as Stanford students at all.
I do think this will transform the university. With a non-scalable medium, taking a class from the third best algorithms professor in the world is a great opportunity, but with a scalable medium it isn’t clear why anyone wouldn’t just want the best. So expect huge pressure to improve the quality of teaching (which is completely lacking today). And expect these top professors to be treated much more like rock stars, and make a lot of money from teaching. (This isn’t suprising, the value created by a class of 200k people is just so much higher than from a class of 150, it would be surprising if the professor couldn’t capture at least some of that).
But that isn’t what I want to talk about. I want to talk about testing and how I think testing, done right, could have an equal impact on higher education.
People say that these online classes will never replace colleges, and that may be true in some sense. I think if you break up the value of college into its constituant pieces there are really three parts:
Let’s go through each of these.
Learning, is what all articles about MOOCs and online education cover, and though I have a lot to add, I will save it for another time.
By “the college experience” I mean all the non-educational aspects of college. This is the friends, late night conversations, sex, drugs, alcohol, and all that. This is the first time many kids have to move out on their own, away from family and friends who have known them sense grade school, and kind of start fresh. For many people who start a career directly after college this may be both the first and the last chance they have to reinvent themselves. But college administrators have no particular expertise in providing this, and fortunately the college experience isn’t that hard to replicate. I think you just need dorms—i.e. housing where lots of young people are close together—and you need kids to move away from where they grew up and the rest takes care of itself. The housing could probably be cheap and nice too if it weren’t provided by universities, which, whatever their strengths, are not exactly the most efficient landlords.
So that leaves certification. That is what I really want to talk about.
Unlike this article, I think online education companies will make lots of money. The reasons are simple. Education takes a lot of time, so people will pay for better quality. If you are going to spend a few hundred hours in a class you want it to be good, and you would pay a few hundred dollars to get something 10% better for your time. And that doesn’t even address the more irrational reasons. People are used to paying a lot for education, and I think there is an irrational part of human nature that tends to assess prices relative to what they are used to.
But I don’t think producing classes is the best business in this space, and it may not even be the most transformative for education. The best business is certification or testing.
I think people can’t see how important this is because certification and testing is so broken now. How is it broken?
First, it has become customary that colleges get to assess their own students, which, since colleges are paid by the students, has led to the inevitable grade inflation.
Second, colleges have no motivation to make grades good—they don’t benefit by making grades comparable or accurate. How does a 3.3 average from Berkeley compare to a 3.5 from Georgia Tech? Nobody knows and I doubt either Berkeley or Georgia Tech would want to answer that question if they could.
Third GPAs, the usual summary measurement of grades, is a terrible statistic. It averages together lots of subjects in a way that likely isn’t meaningful for anything in particular other than measuring how much, on average, the candidate cared about getting good grades. It is easily manipulated by taking easy classes, which is exactly the wrong thing to reward. And it values consistency over all else (since getting an A is pretty easy these days, getting a high GPA is more about never screwing up then being particularly good at anything).
I have done a fair amount of hiring which let’s you look at GPAs and then do an in person assessment. GPAs aren’t worthless but neither are they worth much.
In short colleges do a terrible job at assessment which has made hiring use grades less than they should.
Outside of grades, most tests kind of suck. The normal “standardized” tests are overly general (one 3 hour test may cover your whole high school or college education). They also try to test subjects like English that are hard to test. Boiling your high school education down to an SAT score or your college education down to a GRE score is silly.
Interestingly the concept of “certification” has arisen in the professional context appart from schools. This is your “Microsoft Certified Systems Engineer” and the like. These certifications have a bad reputation purely because they are pass/fail and aimed at fairly low-end jobs. Having an MCSE is kind of like putting on your resume that you passed high school algebra. It’s not a bad thing, but if you have to say so (or you have to ask) that isn’t good. Harder certifications—an MD, for example—has a better reputation. But any pass/fail test will be aimed at preventing very bad quality rather than finding very good quality (having an MD, after all, doesn’t indicate a good doctor).
But imagine a group of people who care deeply about data working seriously on the idea of assessing skills. First your score would have to be closer to a percentile rank not pass/fail, and that rank would have to be against other people taking the test. Percentiles are wonderful because you know exactly what it means (a 99.9 means the candidate was better than 99.9 percent of all test takers) where as an ‘A’ doesn’t come with that. There are plenty of hard problems to get right: you have to randomize the question pool to avoid cheating, but you have to guarantee a fixed performance (can’t have people lucking out and getting all easy questions).
The other problem with existing tests is that they are too general. This makes studying for them stressful. Tests should cover a single specific area (i.e. “Linear Algebra”) not a general field (“math”). One can always create scores for general areas by taking scores in a set of tests for core subjects in that area.
I think this kind of testing would need to be done in person over a period of a day or so per subject. This sounds ridiculously time consuming compared to short tests like the SATs, but I think that is not an unreasonable percentage of time to spend on assessment and it would stand in for the “final” since this would be a much better, more comparable test.
I think it is easy to miss how powerful this is. To see it, you have to think about the hiring process as it works today. Let’s say there are three types of hiring: unskilled, skilled but inexperienced, and skilled and experienced. Unskilled hiring is not complicated (“have you ever been convicted of a felony?”). Hiring skilled, experienced people is generally done based on what they have accomplished; if they are really good they have been working for a while then they will have done some big things and have people who will vouch for them. This is going to be better than any test. In other words, LinkedIn can solve that problem. But hiring skilled, inexperienced people is pretty hard because they haven’t had an opportunity to do anything yet.
Let me illustrate this by describing the process for hiring new college graduates for a programming job. These are people who have specialized skills (programming) but no real work experience to judge. This process is pretty standard for good silicon valley companies and goes something like this. First you eliminate candidates from all but the best schools. You know that the best programmers at the worse schools are better than the worse programmers at the best schools, but this kind of blunt heuristic is all you have. Then you interview these candidates essentially at random (maybe you look at their projects or internships or whatever but it is done so quickly it is basically random). The first round of interview is usually a one hour phone screen. Assessing a persons skill set in one hour over the phone is, of course, totally impossible. So you reject lots of good people for silly reasons like taking a little longer to answer the one question you had time for. Interviewers are generally poorly calibrated against one-another so it matters almost as much whether you get an easy interviewer as how well you answer unless you are a complete failure. This, if successful, will be followed up by a series of in person one hour interviews. Refining this process, standardizing the question set, avoiding cheating, and calibrating the scores of your interview process are a huge amount of work and usually done wrong.
But the real inefficiency is this. Once you have invested a few dozen hours in assessment of a candidate, what happens to that assessment? Well, for most candidates, say 95%, the answer is “no hire”. This means that another company does exactly the same thing (pick a good school, ask simplistic questions, etc). Basically all the information gained in the first interview is thrown away. In total a candidate may go through 40 hours of interviews at different companies, but the coverage is terrible since all the companies ask the same questions and don’t share the results.
This problem doesn’t just impact companies, it impacts candidates. Candidates who are fantastic programmers but who lack the right degree, or went to the wrong school will not be given an interview at all. It is just too expensive to risk it because the rejection rate for that group, on average, is a little higher. My friend just went through this. He has a math degree from Harvard, and taught himself programming. After four years working as a concert cellist, he wanted to get into software engineering. The problem was, how to convince companies that you know what you know? They don’t have the time to let you prove that, and most won’t even return your calls. Meanwhile anyone with a Stanford CS degree has companies stalking them at home. This is an inefficient market. The Stanford CS kids are better, on average, but they aren’t that much better.
Now imagine that there is a central organization that does assessment. Let’s say that this assessment is granular (say one test per few classes) and let’s say that it is put together by real data people with a lot of effort to make the test material and the ranking itself high quality.
Naive thinking would lead you to believe that companies would hire by taking resumes, applying their random filtering, and then requesting test scores from this central organization for those resumes. But of course that isn’t how it would work at all. Instead you would just query and rank ALL candidates who met your standards on the skills you cared about.
This is the key to why testing is such a good business. It isn’t about charging test takers and competing with ETS. It is about being the sole entity with the ability to query a database that has all the skills of all the professionals and having a deep and well-calibrated assessment of that skill. There is no reason this testing should be limited to things currently taught in college classes, I think it could extend to any quantifiable skill.
If you believe that education will become very cheap as it moves from a non-scalable to a scalable model then this will result in lots of people who can learn things from these classes, but without the usual ability to certify their knowledge (e.g. a degree from Stanford). Of course the online education providers can try to provide this, but what does it mean to have a degree from Coursera or Udacity? I think this is just a digital imitation of the current (bad) practice.
Obviously testing like this wouldn’t replace interviews. But it would replace the part of interviews that companies are bad at (well calibrated assessment of basic skills) and give more time for assessing fit and personality.
Likely people would resent being tested and scored. But people don’t like being interviewed either, and at least this kind of model would mean shorter lower pressure interviews and the ability to “do over” if you happen to get tested on a bad day. Because the “query model” changes, the tests effectively apply you to all companies, rather than having to interview at each one.
This idea is only really easy to scale for quantitative ares. For math, engineering, and the sciences testing, when done right, can be very high quality. For these areas there is no reason for silly pre-computer relics like multiple choice, you can give open ended problems without a pre-defined set of answers. In computer programming you could actually have the person write a computer program.
Non-quantitative disciplines like english are harder to scale, but they are assessable. I think writing can be graded, and I think a really good writing certification could be a lot more useful then college literature classes (which focus on literary critique more than writing) for most uses. So the humanities could be assessed as well, but it would cost more since a human being would need to read the writing.
Its worth noticing the impact this would have on the world if it succeeded. First of all I think having a well-calibrated measurement would put a lot more focus on learning the material and much less on how you learned it. No one will care too much which book you read, which online class you took, or what exercises you did. Those are all just ways to learn. Second, this would truly democratize education—the self taught Bangladeshi is on equal footing with the legacy Harvard admittee.
Another way to say this is that testing, if it is good enough, commoditizes education. This sounds bad, since somehow commodities have gotten a bad wrap. But I mean this in the technical sense. A commodity is something that has a measurement of quality which is accurate enough that the details of the good’s production are irrelevant. If you know the ratings for a diamonds size, clarity, and color, you don’t need to worry about what mine it came from. In the absence of good measurements we are forced to rely on brands or other heuristics. But this is exactly how learning should work. If you learned linear algebra it shouldn’t somehow count for more because that learning happened in a Yale classroom instead of in your bedroom. All that should matter is how well you learned it.
Actually doing this would be a lot of work. Hopefully someone will steal this idea and do it.
Update: Explanation at the bottom.
Anyone know why fwrite() calls sometimes block?
Here is a test I did. I sequentially append 4096 byte chunks to 10 files in a round robin fashion. Throughput is artificially limited to a fixed rate set at about 1/4 the maximum throughput of the drives, which I accomplish by sleeping at subsecond intervals in between my writes. I time each call to write. My expectation is that writes go immediately to pagecache and are asynchronously written out to disk (unless I were to call fsync, which I don’t).
In fact this is usually what happens, the average time is just a few microseconds, but sometimes the write calls block for somewhere between 400 ms and a few seconds. I am using Linux 3.6.32 (RHEL 6) with ext4. I am using default configurations otherwise (no change to any of the /proc/sys/vm stuff and fiddling with those parameters don’t seem to help).
Here is a trace of average and max latencies taken over 1 second intervals. Note the regular spikes. What is the cause of this? Locking between the flush threads and the write thread? Is there anything you can do to mitigate it? This effects anything that does logging—i.e. almost everything.
I understand why this would happen if I exceeded the throughput that Linux’s background flush threads can handle, but that is clearly not the case here as the drive can sustain 400MB writes over a long period.
I tried this on a few different machines, some with RAID, some with a single disk and all showed the same behavior to varying degrees.
|Throughput (mb/sec)||Avg. Latency (ms)||Max Latency (ms)|
A number of people gave interesting and helpful suggestions, such as “this is your punishment for not using Solaris.” The best suggestion was from Mark Wolfe which was to install latencytop and measure it. To do this on Red Hat you need to install their debug kernel and reboot with that, then latencytop will capture the highest latency kernel operations for each process. This gives a great deal of insight into what is going on.
For those who are curious here are a few of the traces that pop up as causing hundreds of ms of latency:
vfs_write() do_sync_write() ext4_file_write() generic_file_aio_write() ext4_da_write_begin() [in this case da means delayed allocation] block_write_begin() __block_prepare_write() ext4_da_get_block_prep() ext4_get_block_prep() ext4_get_blocks() call_rw_sem_down_read_failed()
This trace seems to be due to delayed allocation. Turing off delayed allocation makes it go away, though probably at the cost of some throughput.Here is another one, this one seems to be related to journalling.
sys_write() vfs_write() do_sync_write() ext4_file_write() generic_file_aio_write() __generic_file_aio_write() file_update_time() __mark_inode_dirty() ext4_dirty_inode() ext4_journal_start_sb() jbd2_journal_start() start_this_handle()You can check out the details of the code in one of these nice Linux kernel code browsers. My take away from all this was that maybe it is time to look at XFS since that allegedly also has better file locking for fsync which is my other problem.
From Alan Turing’s lecture to the London Mathematical Society in 1947. For context, he had more or less invented the theoretical construct behind computation in 1936, but actual computers barely existed (the very primitive ENIAC was made public in 1946). This lecture was given while he was working on developing ACE (Automatic Computing Engine) the British answer to the ENIAC.Roughly speaking those who work in connection with the ACE will be divided into its masters and its servants. Its masters will plan out instruction tables for it, thinking up deeper and deeper ways of using it. Its servants will feed it with cards as it calls for them. They will put right any parts that go wrong…As time goes on the calculator itself will take over the functions both of masters and of servants…The masters are liable to get replaced because as soon as any technique becomes at all stereotyped it becomes possible to devise a system of instruction tables which will enable the electronic computer to do it for itself. It may happen however that the masters will refuse to do this. They may be unwilling to let their jobs be stolen from them in this way. In that case they would surround the whole of their work with mystery and make excuses, couched in well chosen gibberish, whenever any dangerous suggestions were made.
Data systems have always been designed around the limitations of physical hardware. I think of the design of these systems as being a compromise between the external API you want to provide and the ingredients you have to build it with. In particular, a discouraging portion of the lines of code of a database or filesystem are there to try to mask the latency of disk drives. I think most people understand that SSDs are somehow faster, but I wanted to give a few thoughts on what can be done with the current batch of SSDs and what might be possible in the future if the trend in price and size for SSDs continues (and not everyone thinks it will).
Here are the facts you need to know about SSDs:
A large part of system design is driven by the cost, latency and throughput ratios for different types of storage and networking technologies. This is not very different from how the laws of physics constrain the possible designs for a hammer or a kitchen sink, except that in the case of data systems the laws of physics change over time. One fun presentations of some of the trends in latency and throughput are given in the presentation "Latency Lags Bandwidth" by David Patterson. SSDs represent an area where latency suddenly gets many orders of magnitude better, invalidating a lot of design tradeoffs in existing systems. It is a pure speculation as to how this kind of change will effect the design of systems, but it is a good heuristic is to assume that people eventually move towards the design that provides the best price-performance tradeoff.
For distributed data systems the big change SSDs introduce is the relative latency of a random disk access versus a remote network hop. In the case of a traditional hard drive a single seek may have a latency cost easily 10 or 20x that of TCP request on a local network, which means a remote cache hit is much cheaper than a local cache miss. SSDs essentially erase this difference making them fairly close in terms of latency. The consequence should be favoring designs that store more data per machine and do fewer network requests.
The extreme version of this for databases is just abandoning partitioning altogether and storing all data on all machines. After all if the write load is small and data size isn’t going to go beyond 1TB then an unpartitioned, replicated mysql or postgres may be good enough. Since partitioning adds huge constraints on the richness of queries that can be implemented there may be something to be said for unpartitioned designs in cases where the data size isn’t going to grow without bound and the number of writes is tolerable (since in an unpartitioned design all writes must go to all nodes).
Eliminating partitioning is one way to reduce network hops. To take this line of thinking further one could actually co-locate the storage and the application it serves. This is a radical re-architecture, but could make a certain amount of sense. A network server often peaks at around 40,000 requests-per-second. This is not a limitation for a data system serving a random request stream for potentially disk resident data stored on traditional hard drives as each drive can only do a few hundred random accesses per second. If the drive were capable of hundreds of thousands of operations, the network request throughput limit might become a real limitation, and one might think about just co-locating the data and the application. The second reason this might make sense is that the data system is no longer a common integration layer for allowing multiple applications to share the same data. Implicit in the design of current RDBMS is that multiple applications will access the same set of tables (hence the need for so much structure and correctness checking in the database so it can be shared by multiple applications safely without validating the code in all the applications). In reality, the modern way of sharing data amongst applications is not though direct database access against a shared DB but via some kind of service API (REST, usually). There are still a lot of non-performance-related reasons to recommend keeping data systems in the client-server mode (for example allowing you to scale the CPU intensive part of the application separate from the IO intensive part, for example), but if the performance argument became strong enough these reasons might not be enough.
Less radically SSDs will likely change how caching is done. Many web companies have large memcached installations. Memcached is very good at serving high throughput at low latency on a small dataset, but since everything is in RAM it is actually rather expensive if you are space rather than CPU bound. If you place 32GB of cache per server, then 5TB of total cache space requires 160 servers. Having 5 servers each with 1TB of SSD space may be a huge win. Furthermore caching in RAM has a practical problem: restarts dump a full server worth of cache. This is an annoyance if you need to restart your cache servers frequently or if you need to bring up a new stack with completely cold cache as you may not actually be able to run your application without any caching (if you can, then why have caching, after all).
For real-time applications SSDs also enable access patterns that involve many seeks per request in a latency-sensitive application. For example light graph traversal against a disk-resident data set is feasible on SSDs but generally not on traditional hard drives. Graphs are generally not amenable to clean partitioning (they are, by nature, intertwined). As a result, a disk resident blob store running on SSDs may now be a pretty good way to implement social graph or “follower” functionality.
For offline applications, reduction in latency makes random access a possibility once again. MapReduce was designed, in large part, to work with only linear I/O patterns and eliminate random I/O. This is a huge performance win for batch processing on disk-resident data sets. But to accomplish this requires providing only a fairly limited programming model. This means Hadoop cannot easily implement the equivalent of a “hash join” or similar things except when the hash fits in memory. It is interesting to think how MapReduce might have been differently designed in a world where local random reads and writes were cheap.
A move to SSDs also impacts the internals of data systems. Traditional B+Trees or hashes are no longer the most appropriate persistent data structure. This is not due to the drop in latency but due the the write endurance problem. Moving a database with a traditional storage engine to commodity SSDs will likely be quite fast but the SSDs may stop working after a few months!
A little background. SSDs currently come in two flavors: enterprise-grade (SLCs) and consumer-grade (MLCs). Enterprise-grade SSDs are quite expensive (comparable to memory prices) and so won’t be an option for many scale-out deployments. They are a good option for small deployments, though, where the high cost is less of an issue. SSDs have other differences in the sophistication of the firmware and whether they attach through the PCI bus or via SATA, but these things are less important. If you are used to mechanical hard drives, virtually any SSD will all be faster than you know what to do with. The network access time will probably eliminate any more nuanced performance differences between SSD types.
For example, when I started doing experiments with MLC drives for storage I was warned the consumer-grade devices would have periodic large latency spikes as they did various internal compaction operations. This is true, there were occational large spikes in 99th percentile time. But although the spikes are huge in relative terms, they are extremely minor in absolute terms (under 1 ms). In comparison to memory this kind of variance is terrible, but in comparison to disk access the SSD’s 99th percentile looks closer to the hard drive’s uncached 1st percentile access time.
The important difference between MLC and SLC is the number of writes they can handle. I will walk through the basic arithmetic on how to model this. SSDs are broken into blocks that are usually around 512KB and each write must first erase and then rewrite an entire 512KB block. Each block can only be erased some number of times before it starts corrupting and giving the wrong data. To avoid this the SSD manufactures seem to cap each block to a fixed number of program-erase cycles. This means that after a certain number of writes to a particular block that block will stop accepting writes. SLCs and MLCs both work the same in this respect, except that SLCs will take roughly an order of magnitude more writes per block before craping out.
Here is a table that compares prices and write-capacity per block for MLC, SLC, RAM, and SAS drives. These are taken at random off the internet, your milage would certainly vary, but this gives some idea of the pricing as of May 2012.
|15k RPM SAS Hardrive||$0.75||Unlimited|
A few obvious conclusions are that SLC SSDs are priced roughly the same as memory. In a sense SLC SSDs are better than memory since they are persistent, but if keeping all your data in memory sounds expensive then so will SLCs. And in any case you can’t eliminate memory caching entirely as at least some part of the index likely needs to reside in memory even with the faster access times SSDs provide.
The MLCs, on the other hand, are actually very close in price to good hard disks. There is a slight price premium, but for most online data systems this is misleading. Since most real-time storage systems are bound by seek capacity, not data size or CPU, increasing the number of seeks available may well reduce overall capacity needs very significantly. With more seeks per machine each machine can handle more requests. For our uses we found we could comfortably take at least 5-10x more requests per machine before we started to hit limits around disk space and CPU. So the question is, is there a way to live with the low write-endurance of the MLC devices and still get the great performance and cost?
Here is where the complex dependency between the storage format and the SSD comes in. If your storage engine does large linear writes (say a full 512KB block or more) then calculating the number of writes you can do on one drive before it is all used up is easy. If the drive has 300GB and each block can be rewritten 5,000 times then each drive will allow writing 5000 x 300GB (about 1.4 petabytes). Let’s say you have 8 of these in a box with no RAID and that box takes 50MB/sec of writes 24 hours a day evenly balanced over the drives, then the drives will last around 7.8 years. This should be plenty of time, for most systems. But this lifetime is only realistic for large linear writes—the best possible case for SSD write-endurance.
The other case is that you are doing small random writes immediately sync’d to disk. If you are doing 100 byte random writes and the SSD’s internal firmware can’t manage to somehow coalesce these into larger physical writes then each 100 byte write will turn into a full program-erase cycle of a full 512KB block. In this case you would expect to be able to do only 5000*100 bytes = 500KB of writes per block before it died; so a 300GB drive with 300GB/512KB = 614,400 blocks would only take around 286GB of writes total before crapping out. Assuming, again, 8 of these drives with no RAID and 50MB/sec, you get a lifetime of only about half a day. This is the worst case for SSDs and is obviously totally unworkable.
It is worth noting that it doesn’t actually matter if the writes to the filesystem are large or small so long as the writes to the physical disk are large. If the writes are linear, that is written in order on disk, and are not sync’d to the physical drive by the OS until significant data has accumulated that will suffice. A number of small sequential writes to the filesystem will be coalesced into a single large write to the physical device by the operating system’s I/O scheduler provided there is no intervening call to fsync (or the equivalent).
To address this limitation SSDs attempt to implement some kind of internal write-ahead format to try to turn random I/O into a larger linear set of writes. However the efficacy of this seems to vary greatly by device, and I consider it a bit dangerous to bet your data on it working for all workloads. Likewise this can introduce a new read fragmentation problem since the updates that are meant to be co-located are actually scattered all over in different blocks.
A better option is just to use a storage format that naturally does linear writes. Traditional storage formats include the B+Tree and linear hashing. These formats group data into blocks by key, and hence writes are randomly scattered on the disk unless the write order happens to match the key ordering (which you can’t count on except in bulk load situations where you can chose the order in which records are updated). Buffering may help this a bit, but when the buffer is flushed it is likely that a random subset of blocks will be rewritten with small changes to each. Log-structured formats are the other alternative, they store data in the order in which it was written, and hence always do linear writes. Log-structured merge trees and Google’s SSTable variation are examples of this. Various hashing and tree formats can all be designed in a log-structured manner.
On traditional hard drives the trade-off between in-place and log-structured storage is a bit of a wash. Write performance for log-structured storage is vastly better, but most applications have more reads than writes. Read performance may be better or worse depending on the details of the implementation (a traditional log-structured merge tree is definitely worse for reads since it does multiple seeks for each uncached read, but hashing variants or SSTables that use bloom filters to avoid unnecessary lookups need not be). However the move to SSDs completely changes this dynamic. Since SSDs have fast seek performance grouping data by key is much less important. Using a log-structured storage format makes it possible to use cheap consumer grade MLC SSDs even under high write loads that would wear out the drive in a few months if run on an in-place storage format.
A particularly important factor in making this work is whether the storage engine requires an immediate fsync to disk with each write. Many systems do require this for data integrity. An immediate fsync will, of course, require a small write unless the record being written is itself large. On a single-node system avoiding fsync may mean that the last few records may be lost in a crash. On a properly designed distributed system, though, this need not be the case—replication to other nodes can take the place of the flush to disk. Replication and disk sync have different failure modes (i.e. in the case of a power outage where all nodes fail simultaneously replication won’t help, whereas in the case of disk corruption or total machine death flush generally doesn’t help). So an important question to ask about any data system design to determine if it is going to be SSD friendly is whether it requires immediate flushing of data to disk. In particular a good system should be able to give replication guarantees without waiting on the disk flush.
I did a pretty detailed evaluation of some SSDs about a year back in co-ordination with the Voldemort team at LinkedIn. One of the tests was replaying a production I/O trace at high speed to simulate the equivalent of 5 years of production load on BDB JE. This model of wear does indeed work exactly as modeled: buffered linear writes make it possible to get almost a decade of high-write usage out of a cheap MLC. The Voldemort guys wrote up a few more details on work they did to make SSDs work well with BDB JE.
If you are evaluating storage engines, here is a quick overview of which ones are in-place or log-structured. InnoDB, BDB, Toyko and Kyoto Cabnet as well as MongoDB all have in-place storage formats and are are not well-suited to cheap SSDs under high write load. LevelDB, BDB-JE, Krati, Cassandra's SSTable implementation and Bitcask are all log-structured. Riak and Voldemort both support pluggable storage engines and default to a log-structured format. The Cassandra folks at Datastax did a pretty good presentation on running Cassandra on SSDs.
An interesting question is whether cloud hosting providers will rent instances with SSDs any time soon. The write-endurance problem makes SSDs somewhat problematic for a shared hosting environment, so they may need to add a way to bill on a per block-erase basis. I have heard rumors that Amazon will offer them, but I have no idea in what form or how much they will cost (which is the critical detail).
People may have seen the episode of This American Life that retracted the previous piece with Mike Daisy. It is a pretty uncomfortable thing to listen to. The relevant thing is that Mike Daisy fabricated much of the details of his first-hand account to make Chinese labor look much harsher than it is. Then he lied to the NPR fact checkers to avoid their uncovering this.
The question is, does it matter that he lied? Everyone agrees Chinese labor conditions are harsh in comparison to western standards. His argument is that although, yes, he did make up much of his story, it could have been true. Critics say that this doesn’t really undermine his main point; counter critics say that it does.
Personally I think it is mostly an irrelevant distraction. Globalization is literally remaking the lives of billions of people, and whether you believe the change is good or bad should be primarily an empirical discussion about whether things are getting better or worse in these countries. It shouldn’t be carried out in overly emotional, under-reasoned stage performances. Whether or not Mike Daisy’s anecdotes about his visit to China actually took place shouldn’t really matter to us because we shouldn’t make decisions about global trade policy on the basis of theatrical anecdotes.
The important debate we should be having is about whether globalization—particularly shipping low-skill work to other countries with lower wages—is a good or bad thing for these countries. Popular opinion is that it is bad because it enriches large corporations by exploiting workers. However there is an argument to be made that it is good. That argument is best layed out in this 1997 article, by Paul Krugman. I won’t repeat the content of the article because I think it’s short and worth reading.
I read that article in the late nineties and it greatly changed my perception of economics and the debate around globalization. The reason I think everyone should read it is because it phrases the debate in terms of what is good for third world countries rather than emotional appeals. Economists see globalization as a huge transfer of money, technology, and skills from the first world to the third world. They see this as permanently transforming the standard of living in these countries in a way that a century of poverty relief and charitable aid hasn’t. I think the focus of the discussion should be less about whether Mike Daisy lied, and more about whether these beliefs are true.
I am not an economist, I haven’t studied this, so I don’t know if this is correct. But my values are pretty straight-forward. I think people born in the US won the equivalent of the geographical lottery and have opportunities that are either completely out of reach or exceedingly unlikely for people in other parts of the world. We have a moral obligation to strive for equalization; I would support this process even if it was bad for the US (which it doesn’t seem to be). I come from a particularly liberal family so it particularly galls me that the left seems to be willing to decide an issue that one way or another deeply impacts the lives of billions of people purely based on a kind of kneejerk belief that large corporations are evil.
Look, we could learn that this doesn’t work the way we thought. Maybe the manufacturing jobs do not lead to a development of industry in the way it’s supposed to. Or maybe increasing economic vitality doesn’t lead to greater political freedom in repressive regimes. Or maybe there are other unforeseen consequences. Economists are, after all, the people most likely to fuck up a country on an unproven but intellectually stimulating theory. But let’s talk about the evidence of these things happening or not happening.
For people who oppose globalization, I think the question is how will you accomplish this global equalization? It isn’t enough to have a vague esthetic icky feeling about buying a phone made by people who make less than you, you have to have a plausible way to fix this inequality. The people who have thought the most about this think that buying the phone is the best way to fix it, it is really important to be sure they are wrong before you try to stop people from buying phones.