A penny for thoughts?

About the correct valuation

Archive for May, 2009

Google Chubby papers: notes and highlights

one comment

The papers I read are available here. I went through Mike Burrows’ The Chubby lock service for loosely-coupled distributed systems and Tushar Deepak Chandra, Robert Griesemer and Joshua Redstone’s Paxos Made Live – An Engineering Perspective.

This is mainly what I highlighted from the papers and what I found interesting/noteworthy.

To provide high availability and correctness in the presence of possible transient faults such as the network dropping packets, data corruption or hardware failure, they relied on asynchronous distributed consensus through an implementation of the Paxos protocol. Their implementation is discussed in the Paxos Made Live paper. Incidentally, I’ve read a few papers on Paxos now and the Google paper was by far the most accessible, even though it’s a single chapter and they say it’s only meant to be an “informal overview” before they move on to the implementation details/challenges. If you’re interested in Paxos, I suggest you start with that paper rather than the Paxos Made Simple paper which doesn’t live up to its abstract of: The Paxos algorithm, when presented in plain English, is very simple.

Chubby is split into two parts. One part is a server. This server, due to the way Paxos works, is a really a collection of “cells” (cells are a running instance of a server) where one cell is elected the master and all client communication is with that cell. All operations on that cell are replicated to the other cells in the cluster. The other part is a client library that any applications that with to communicate with Chubby can use.

To help scalability, the client library has an internal cache. This cache invalidates data on a change and never updates it. This seemed bizarre to me at first since I figured it’s normally more efficient to update rather than invalidate. But they make a good point: “update-only protocols can be arbitrarily inefficient; a client that accessed a file might receive updates indefinitely, causing an unbounded number of unnecessary updates“.

Imagine an app touches a dozen files then does work for days. While it’s doing its work it’s constantly receiving updates to those dozen cached items even though it couldn’t care less about them anymore. I was framing the problem with my usual web-app centric mind where I’ll have a hook so that whenever there are db writes that will likely also be looked for in cache soon, rather than invalidate I update. This is more efficient because you have since you have only one location to update. If you have up to 90 000 clients subscribed to your cell (as can happen to Chubby cells), you don’t want to publish up to 90k messages unnecessarily every time a write happens.

They use long polling as a KeepAlive mechanism. Basically, the client sends a message to the server saying I’m here and the server blocks it. The server keeps a request timeout (lease time) and the client a conservative approximation thereof. When the lease ends, the server returns a response and the client immediately makes another request. If all goes well (no network problems, no machine failures, etc), this happens indefinitely. The noteworthy thing they did is that they cleverly used the KeepAlive as a communication mechanism. If the server needs to communicate with the client (send it a message of some kind such as telling the client to invalidate its cache for a file) it can return the KeepAlive early and pass the command along. The client will process the command and make another request, resetting its approximation of the lease time. The first thing I wondered about was how Chubby handled its callbacks to the client and this is a pretty good solution. As they mention, it “simplifies the client, and allows the protocol to operate through firewalls that allow initiation of connection only in one direction.

Mechanisms for scaling:

  • Dynamic lease times: the lease time is adaptive. The default is 12s but it can increase up to around 60s under heavy load.
  • Client library cache: While the cache itself saves a lot, the most surprising thing they found that helped performance enormously was to cache the absence of files. Clients were making many requests for files that never existed and just caching the non-existence on the client side helped a lot. Because developers were writing infinite loops that simple retried until the file existed, they first tried exponentially-increasing delays along with education. They eventually gave up on that and went with caching the absence instead.
  • Chubby’s protocol can be proxied: They use protocol-conversion servers that translate the Chubby protocol into less-complex protocols such as DNS and others. When writing your own protocol it’s definitely good to keep in mind that compatibility with mature and stable proxy servers is a very nice feature.
  • Chubby’s data fits in RAM: This makes most operations cheap. Chubby can store small files (a few k) but it isn’t meant to store large files. This will degrade performance of the Chubby cell enormously and I imagine you’ll probably have angry sysadmins coming after you if you try. =)

My absolute favourite quote of the paper is about scaling as well.

We have found that the key to scaling Chubby is not server performance; reducing communication to the server can have far greater impact. No significant effort has been applied to tuning read/write server code paths; we checked that egregious buge were present, then focused on the scaling mechanisms that could be more effective.

Essentially, it’s the old idea that the fastest code is no code all over again. If your web page makes 100+ database hits per page load I don’t care how good you are at tuning you’re just plain screwed.

Google had a problem with DNS lookups. Basically, it’s common for developers to write jobs with thousands of processes, each process talking to every other. This leads to quadratic growth of lookups. A “small” job of 3 processes would require 150 000 lookups per second. Because Chubby uses explicit invalidation, only the first lookup would not be local (unless your local cache is invalidated which I imagine is rare for what’s probably mostly internal DNS addresses). “A 2-CPU 2.6 GHZ Xeon Chubby has been seen to handle 90 thousand clients communicating with it directly (no proxies); the clients included large jobs with communication parts as described above.” Chubby is so much better than regular DNS servers that Chubby’s most common use is now as a highly available name server with fast updates.

This quote I noted simply because I thought Skrud would enjoy it.

Google’s infrastructure is mostly in C++, but a growing number of systems are being written in Java. [...] The usual mechanism for accessing non-native libraries is JNI, but it is regarded as slow and cumbersome. Our Java programmers so dislike JNI that to avoid its use they prefer to translate large libraries into Java, and commit to supporting them.

Overall, it seems that most of the Google papers are extremely well written. They are by far the most interesting and practical papers I’ve read. The Map/Reduce and Bigtable papers, along with these are all excellent with many examples and very little assumption of prior knowledge other than basic computer/programming terminology and are usually very light on math, unlike most academic papers I’ve seen. It would be really nice if the Google style paper would catch on more in academia.

Written by Smokinn

May 27th, 2009 at 10:57 pm

Posted in Uncategorized

Bob Ippolito is my hero

leave a comment

This is a follow-up from a previous post.

When Zed retired ZSFA, he replaced his most notorious rant with a plea to find someone else to look up to. Not someone loud, brash and arrogant, someone gentle. For me, this is Bob Ippolito.

His name first stuck last December when I was playing around with some Erlang. Bob is the author of mochiweb, an Erlang library for building lightweight HTTP servers. It’s what Facebook used to build Facebook Chat. Since I was just starting to learn Erlang, this video of Bob talking Erlang basics was a big help.

I haven’t done any Erlang since December but his name kept popping up anyway. I was checking out Tokyo Cabinet and found that Bob had written the Python client for the Tokyo Tyrant protocol.

Recently, he gave an absolutely excellent talk summarizing the current state-of-the-open-source-art in database alternatives. This video is highly worth your time and highscalability’s notes on the talk are also quite excellent.

The talk itself was great, but so was the way he handled the “questions”, especially the one where the guy tried to fight him on CouchDB. The question asker would comment on some future work to be “coming soon” or try to downplay the importance of the drawback. Bob simply calmly explained the basic facts of the problem at hand and how the current approach taken by CouchDB was fundamentally flawed for his use case along with how and why MongoDB would work better for what he needs to do.

Having just the facts about what the product was designed for and what its pitfalls are was a refreshing contrast to the usual tunnel vision people have where they first decide they like a software product (mysql is the best! let’s use it for everything!), shoehorn it into doing things it wasn’t designed for (instead of using a message queue like RabbitMQ or Beanstalkd, let’s write to a mysql table and empty it by cron job! (Mike Malone did this at pownce (Ghetto queue: slide 57) and I’m guilty of it too)) and then go around giving presentations on how their hack works, thereby trying to validate the idea that their favourite tool should be used for everything all the time. Fred Brooks would be not be proud of us, but he’d probably like Bob’s presentation.

Written by Smokinn

May 7th, 2009 at 9:41 am

Posted in Uncategorized

On screaming loudly

2 comments

Of all the speakers I’ve met at CUSEC, my favourite is Zed Shaw. His talk was honest, fun and (for me) life-changing. But that’s not why he’s my favourite speaker. CUSEC has lots of amazing speakers. He’s my favourite because he’s so honest and puts up with the flak he gets because of it. Granted, he’s not always the most diplomatic character online, but his views represent mine perfectly when it comes to online community fiascoes that keep cropping up.

Recently, we had the whole charade of Matt Amionetti’s porn presentation blow up and spew out all over blogs and social news sites. This happens all. the. time. And I’m sick of it.

It’s one of the reasons I deleted all social news sites from google reader. The other is Zed’s ZSFA character. He’s retired it now but the point seems to have been lost on pretty much everyone. People only seem to listen to people who scream loudly and often. People that are outraged. Outraged I say! Scala can’t be faster then Ruby!? If Ruby isn’t fast enough for you, you’re doing it wrong!

Whatever. I’m out. Bye.

I’ve started reading Dive Into Python. I went through the Django book and I’m starting a Django project. Seems like the Python folks can get their act together without the huge amount of drama in the Ruby community. Yes, Ruby. For a long time I stuck around writing Ruby while avoiding Rails, internalizing the “Rails is not Ruby” argument, but I’m through. The drama is just inescapable so now I’m just ignoring anything Ruby-related. Sorry Ruby. It was good times but Python is a perfectly acceptable substitute. I’ll check back in in maybe a year or two. Maybe Rails Bridge will have reversed the trend by then. I hope they do.

Written by Smokinn

May 5th, 2009 at 5:03 pm

Posted in Uncategorized