The papers I read are available here. I went through Mike Burrows’ The Chubby lock service for loosely-coupled distributed systems and Tushar Deepak Chandra, Robert Griesemer and Joshua Redstone’s Paxos Made Live – An Engineering Perspective.
This is mainly what I highlighted from the papers and what I found interesting/noteworthy.
To provide high availability and correctness in the presence of possible transient faults such as the network dropping packets, data corruption or hardware failure, they relied on asynchronous distributed consensus through an implementation of the Paxos protocol. Their implementation is discussed in the Paxos Made Live paper. Incidentally, I’ve read a few papers on Paxos now and the Google paper was by far the most accessible, even though it’s a single chapter and they say it’s only meant to be an “informal overview” before they move on to the implementation details and challenges. If you’re interested in Paxos, I suggest you start with that paper rather than Paxos Made Simple, which doesn’t live up to its abstract’s claim that “The Paxos algorithm, when presented in plain English, is very simple.”
Chubby is split into two parts. One part is the server. Due to the way Paxos works, a Chubby “cell” is really a small collection of replicas (servers running the same code) where one replica is elected master and all client communication goes to that master. Operations on the master are replicated to the other replicas in the cell. The other part is a client library that any application that wishes to communicate with Chubby can use.
To help scalability, the client library has an internal cache. This cache invalidates data on a change and never updates it. This seemed bizarre to me at first since I figured it’s normally more efficient to update rather than invalidate. But they make a good point: “update-only protocols can be arbitrarily inefficient; a client that accessed a file might receive updates indefinitely, causing an unbounded number of unnecessary updates”.
Imagine an app that touches a dozen files and then does work for days. While it’s doing that work, it would constantly receive updates to those dozen cached items even though it no longer cares about them. I was framing the problem with my usual web-app-centric mind, where I’ll have a hook so that whenever there are db writes that will likely also be looked up in cache soon, I update rather than invalidate. That’s more efficient there since you have only one location to update. But if you have up to 90,000 clients subscribed to your cell (as can happen with Chubby cells), you don’t want to publish up to 90k messages unnecessarily every time a write happens.
They use long polling as a KeepAlive mechanism. Basically, the client sends a message to the server saying “I’m here” and the server holds the request open. The server keeps a request timeout (lease time) and the client a conservative approximation thereof. When the lease ends, the server returns a response and the client immediately makes another request. If all goes well (no network problems, no machine failures, etc.), this repeats indefinitely. The noteworthy thing is that they cleverly used the KeepAlive as a communication mechanism. If the server needs to communicate with the client (send it a message of some kind, such as telling the client to invalidate its cache for a file), it can return the KeepAlive early and pass the command along. The client processes the command and makes another request, resetting its approximation of the lease time. The first thing I wondered about was how Chubby handled its callbacks to the client, and this is a pretty good solution. As they mention, it “simplifies the client, and allows the protocol to operate through firewalls that allow initiation of connection only in one direction.”
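The mechanics above can be simulated in a few lines. This is a sketch, not Chubby’s wire protocol: the server holds each KeepAlive for up to the lease time, but returns early when it has an event (such as a cache invalidation) to piggyback on the response.

```python
# Minimal simulation of a long-poll KeepAlive with piggybacked commands
# (hypothetical names; not Chubby's actual protocol or message format).
import queue

class KeepAliveServer:
    LEASE_SECONDS = 12  # Chubby's default lease time

    def __init__(self):
        self.events = queue.Queue()

    def push_event(self, event):
        # e.g. ("invalidate", "/ls/foo") when a file the client cached changes
        self.events.put(event)

    def keep_alive(self):
        """Block until the lease expires or an event arrives, whichever is first."""
        try:
            return self.events.get(timeout=self.LEASE_SECONDS)
        except queue.Empty:
            return ("lease-renewed", None)  # plain timeout: client just re-polls

def client_poll_once(server, cache):
    kind, arg = server.keep_alive()
    if kind == "invalidate":
        cache.pop(arg, None)  # process the piggybacked command...
    # ...then a real client would immediately issue the next KeepAlive,
    # resetting its conservative approximation of the lease.
    return kind

server = KeepAliveServer()
cache = {"/ls/foo": "v1"}
server.push_event(("invalidate", "/ls/foo"))
assert client_poll_once(server, cache) == "invalidate"
assert "/ls/foo" not in cache
```

Because the client always initiates the connection, the server never needs to open a connection back to the client, which is why this works through one-way firewalls.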
Mechanisms for scaling:
- Dynamic lease times: the lease time is adaptive. The default is 12s but it can increase up to around 60s under heavy load.
- Client library cache: While the cache itself saves a lot, the most surprising thing they found that helped performance enormously was to cache the absence of files. Clients were making many requests for files that never existed, and just caching the non-existence on the client side helped a lot. Developers were writing infinite loops that simply retried until a file existed; Google first tried exponentially-increasing delays along with developer education, but eventually gave up on that and cached the absence instead.
- Chubby’s protocol can be proxied: They use protocol-conversion servers that translate the Chubby protocol into less-complex protocols such as DNS and others. When writing your own protocol it’s definitely good to keep in mind that compatibility with mature and stable proxy servers is a very nice feature.
- Chubby’s data fits in RAM: This makes most operations cheap. Chubby can store small files (a few KB) but it isn’t meant to store large files. Storing large files degrades the performance of the Chubby cell enormously, and I imagine you’ll probably have angry sysadmins coming after you if you try. =)
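The negative-caching idea from the list above is worth sketching, because it’s the part that defeats retry loops. This is a hypothetical illustration, not Chubby’s client library: a sentinel distinguishes “known absent” from “not cached”, so a busy-looping client hits the server exactly once.

```python
# Sketch of caching the absence of files, i.e. negative caching
# (hypothetical names; not the real Chubby client library).

_ABSENT = object()  # sentinel distinguishing "known absent" from "not cached"

class NegativeCache:
    def __init__(self, fetch):
        self.fetch = fetch  # callable: path -> contents, or None if missing
        self.cache = {}

    def read(self, path):
        if path not in self.cache:
            value = self.fetch(path)
            self.cache[path] = _ABSENT if value is None else value
        value = self.cache[path]
        return None if value is _ABSENT else value

    def invalidate(self, path):
        # Called when the server signals the path changed (or was created).
        self.cache.pop(path, None)

calls = []
def fetch(path):
    calls.append(path)
    return None  # the file doesn't exist yet

nc = NegativeCache(fetch)
for _ in range(10_000):      # a client busy-looping "until the file exists"
    nc.read("/ls/pending")
assert len(calls) == 1       # only the first miss ever reached the server
```

The invalidation protocol already exists for positive entries, so clearing a negative entry when the file is finally created falls out for free.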
My absolute favourite quote of the paper is about scaling as well.
Essentially, it’s the old idea that the fastest code is no code, all over again. If your web page makes 100+ database hits per page load, I don’t care how good you are at tuning; you’re just plain screwed.
Google had a problem with DNS lookups. Basically, it’s common for developers to write jobs with thousands of processes, each process talking to every other. This leads to quadratic growth in lookups: a “small” job of 3 thousand processes would require 150,000 lookups per second. Because Chubby uses explicit invalidation, only the first lookup would not be local (unless your local cache is invalidated, which I imagine is rare for what’s probably mostly internal DNS addresses). “A 2-CPU 2.6 GHz Xeon Chubby has been seen to handle 90 thousand clients communicating with it directly (no proxies); the clients included large jobs with communication parts as described above.” Chubby is so much better than regular DNS servers that Chubby’s most common use is now as a highly available name server with fast updates.
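The quadratic arithmetic is worth spelling out: with every process resolving every other process’s name, and a 60-second DNS TTL (the figure assumed here) forcing each entry to be refreshed, the numbers work out like this.

```python
# The quadratic-lookup arithmetic, spelled out: 3,000 processes each resolving
# every other process's name, with an assumed 60-second DNS TTL.
processes = 3_000
ttl_seconds = 60

pairs = processes * processes          # 9,000,000 name resolutions to keep fresh
lookups_per_second = pairs // ttl_seconds
assert lookups_per_second == 150_000   # the "150 thousand lookups per second"
```

With Chubby-style explicit invalidation there is no TTL-driven refresh at all; the steady-state rate drops to roughly zero, which is why one cell can absorb jobs like this.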
This quote I noted simply because I thought Skrud would enjoy it.
Overall, it seems that most of the Google papers are extremely well written. They are by far the most interesting and practical papers I’ve read. The Map/Reduce and Bigtable papers, along with these two, are all excellent, with many examples, very little assumption of prior knowledge beyond basic computer/programming terminology, and usually very light on math, unlike most academic papers I’ve seen. It would be really nice if the Google style of paper caught on more in academia.