(click for larger view)
Bar chart of pies:
Pie chart of bars:
I wish I could take credit for this awesome idea but I read it here.
(click for larger view)
Bar chart of pies:
Pie chart of bars:
I wish I could take credit for this awesome idea but I read it here.
I planned on writing a post on the Paxos algorithm with an example implementation but it turns out someone already wrote an excellent post explaining it and followed it up with some toy code.
I was beaten to the punch but I'll probably still try implementing it myself soon. Like he said, you rarely truly understand the complexities of something unless you try doing it yourself.
The papers I read are available here. I went through Mike Burrows' The Chubby lock service for loosely-coupled distributed systems and Tushar Deepak Chandra, Robert Griesemer and Joshua Redstone's Paxos Made Live - An Engineering Perspective.
This is mainly what I highlighted from the papers and what I found interesting/noteworthy.
To provide high availability and correctness in the presence of possible transient faults such as the network dropping packets, data corruption or hardware failure, they relied on asynchronous distributed consensus through an implementation of the Paxos protocol. Their implementation is discussed in the Paxos Made Live paper. Incidentally, I've read a few papers on Paxos now and the Google paper was by far the most accessible, even though it's a single chapter and they say it's only meant to be an "informal overview" before they move on to the implementation details/challenges. If you're interested in Paxos, I suggest you start with that paper rather than the Paxos Made Simple paper which doesn't live up to its abstract of: The Paxos algorithm, when presented in plain English, is very simple.
Chubby is split into two parts. One part is a server. This server, due to the way Paxos works, is a really a collection of "cells" (cells are a running instance of a server) where one cell is elected the master and all client communication is with that cell. All operations on that cell are replicated to the other cells in the cluster. The other part is a client library that any applications that with to communicate with Chubby can use.
To help scalability, the client library has an internal cache. This cache invalidates data on a change and never updates it. This seemed bizarre to me at first since I figured it's normally more efficient to update rather than invalidate. But they make a good point: "update-only protocols can be arbitrarily inefficient; a client that accessed a file might receive updates indefinitely, causing an unbounded number of unnecessary updates".
Imagine an app touches a dozen files then does work for days. While it's doing its work it's constantly receiving updates to those dozen cached items even though it couldn't care less about them anymore. I was framing the problem with my usual web-app centric mind where I'll have a hook so that whenever there are db writes that will likely also be looked for in cache soon, rather than invalidate I update. This is more efficient because you have since you have only one location to update. If you have up to 90 000 clients subscribed to your cell (as can happen to Chubby cells), you don't want to publish up to 90k messages unnecessarily every time a write happens.
They use long polling as a KeepAlive mechanism. Basically, the client sends a message to the server saying I'm here and the server blocks it. The server keeps a request timeout (lease time) and the client a conservative approximation thereof. When the lease ends, the server returns a response and the client immediately makes another request. If all goes well (no network problems, no machine failures, etc), this happens indefinitely. The noteworthy thing they did is that they cleverly used the KeepAlive as a communication mechanism. If the server needs to communicate with the client (send it a message of some kind such as telling the client to invalidate its cache for a file) it can return the KeepAlive early and pass the command along. The client will process the command and make another request, resetting its approximation of the lease time. The first thing I wondered about was how Chubby handled its callbacks to the client and this is a pretty good solution. As they mention, it "simplifies the client, and allows the protocol to operate through firewalls that allow initiation of connection only in one direction."
Mechanisms for scaling:
Dynamic lease times: the lease time is adaptive. The default is 12s but it can increase up to around 60s under heavy load.
Client library cache: While the cache itself saves a lot, the most surprising thing they found that helped performance enormously was to cache the absence of files. Clients were making many requests for files that never existed and just caching the non-existence on the client side helped a lot. Because developers were writing infinite loops that simple retried until the file existed, they first tried exponentially-increasing delays along with education. They eventually gave up on that and went with caching the absence instead.
Chubby's protocol can be proxied: They use protocol-conversion servers that translate the Chubby protocol into less-complex protocols such as DNS and others. When writing your own protocol it's definitely good to keep in mind that compatibility with mature and stable proxy servers is a very nice feature.
Chubby's data fits in RAM: This makes most operations cheap. Chubby can store small files (a few k) but it isn't meant to store large files. This will degrade performance of the Chubby cell enormously and I imagine you'll probably have angry sysadmins coming after you if you try. =)
My absolute favourite quote of the paper is about scaling as well.
Essentially, it's the old idea that the fastest code is no code all over again. If your web page makes 100+ database hits per page load I don't care how good you are at tuning you're just plain screwed.
Google had a problem with DNS lookups. Basically, it's common for developers to write jobs with thousands of processes, each process talking to every other. This leads to quadratic growth of lookups. A "small" job of 3 processes would require 150 000 lookups per second. Because Chubby uses explicit invalidation, only the first lookup would not be local (unless your local cache is invalidated which I imagine is rare for what's probably mostly internal DNS addresses). "A 2-CPU 2.6 GHZ Xeon Chubby has been seen to handle 90 thousand clients communicating with it directly (no proxies); the clients included large jobs with communication parts as described above." Chubby is so much better than regular DNS servers that Chubby's most common use is now as a highly available name server with fast updates.
This quote I noted simply because I thought Skrud would enjoy it.
Overall, it seems that most of the Google papers are extremely well written. They are by far the most interesting and practical papers I've read. The Map/Reduce and Bigtable papers, along with these are all excellent with many examples and very little assumption of prior knowledge other than basic computer/programming terminology and are usually very light on math, unlike most academic papers I've seen. It would be really nice if the Google style paper would catch on more in academia.
This is a follow-up from a previous post.
When Zed retired ZSFA, he replaced his most notorious rant with a plea to find someone else to look up to. Not someone loud, brash and arrogant, someone gentle. For me, this is Bob Ippolito.
His name first stuck last December when I was playing around with some Erlang. Bob is the author of mochiweb, an Erlang library for building lightweight HTTP servers. It's what Facebook used to build Facebook Chat. Since I was just starting to learn Erlang, this video of Bob talking Erlang basics was a big help.
I haven't done any Erlang since December but his name kept popping up anyway. I was checking out Tokyo Cabinet and found that Bob had written the Python client for the Tokyo Tyrant protocol.
Recently, he gave an absolutely excellent talk summarizing the current state-of-the-open-source-art in database alternatives. This video is highly worth your time and highscalability's notes on the talk are also quite excellent.
The talk itself was great, but so was the way he handled the "questions", especially the one where the guy tried to fight him on CouchDB. The question asker would comment on some future work to be "coming soon" or try to downplay the importance of the drawback. Bob simply calmly explained the basic facts of the problem at hand and how the current approach taken by CouchDB was fundamentally flawed for his use case along with how and why MongoDB would work better for what he needs to do.
Having just the facts about what the product was designed for and what its pitfalls are was a refreshing contrast to the usual tunnel vision people have where they first decide they like a software product (mysql is the best! let's use it for everything!), shoehorn it into doing things it wasn't designed for (instead of using a message queue like RabbitMQ or Beanstalkd, let's write to a mysql table and empty it by cron job! (Mike Malone did this at pownce (Ghetto queue: slide 57) and I'm guilty of it too)) and then go around giving presentations on how their hack works, thereby trying to validate the idea that their favourite tool should be used for everything all the time. Fred Brooks would be not be proud of us, but he'd probably like Bob's presentation.
Of all the speakers I've met at CUSEC, my favourite is Zed Shaw. His talk was honest, fun and (for me) life-changing. But that's not why he's my favourite speaker. CUSEC has lots of amazing speakers. He's my favourite because he's so honest and puts up with the flak he gets because of it. Granted, he's not always the most diplomatic character online, but his views represent mine perfectly when it comes to online community fiascoes that keep cropping up.
Recently, we had the whole charade of Matt Amionetti's porn presentation blow up and spew out all over blogs and social news sites. This happens all. the. time. And I'm sick of it.
It's one of the reasons I deleted all social news sites from google reader. The other is Zed's ZSFA character. He's retired it now but the point seems to have been lost on pretty much everyone. People only seem to listen to people who scream loudly and often. People that are outraged. Outraged I say! Scala can't be faster then Ruby!? If Ruby isn't fast enough for you, you're doing it wrong!
Whatever. I'm out. Bye.
I've started reading Dive Into Python. I went through the Django book and I'm starting a Django project. Seems like the Python folks can get their act together without the huge amount of drama in the Ruby community. Yes, Ruby. For a long time I stuck around writing Ruby while avoiding Rails, internalizing the "Rails is not Ruby" argument, but I'm through. The drama is just inescapable so now I'm just ignoring anything Ruby-related. Sorry Ruby. It was good times but Python is a perfectly acceptable substitute. I'll check back in in maybe a year or two. Maybe Rails Bridge will have reversed the trend by then. I hope they do.
I have a pretty simple system for sorting books. If they're standing up, I've read them. If they're flat on their side in front of books standing up I haven't.
Right now these books are flat:
This one's hard to get through. I'm not a big fan of sci-fi with a huge amount of invented words.
Sequel of the awesome Plague Year. Will probably be done soon.
Not sure why I stopped reading it. I was reading this one this summer. It's very interesting. I'll probably pick it back up once I'm done Plague War.
It's taking me longer than I thought to finish this one because I thought it would be better. It's readable, there are some interesting anecdotes explained but overall it's one of those books you read for the content despite the way it's written.
This one was recommended by a friend. He'd recommended another that I really enjoyed but I don't like this one at all. I imagine I'll finish it eventually but it's slow going.
This thing is over 1000 pages long. It's going to take forever to finish.
Got this one as a Christmas gift. I've been kinda overdosing on Heinlein recently though so it's going to take a while before I want to read this one.
See above.
See above. (again)
I'm a big fan of militar science fiction. I highly recommend the Orphan's series if you are too.
When I entered my books into LibraryThing, it figured out I was a military sci-fi fan and recommended this. It looked interesting and I imagine I'll get to it eventually.
Another LibraryThing recommendation. It recommended this one because I liked S.M. Stirling's Dies The Fire serious. It's alternate history and those are often interesting. (They're also the only Fantasy books I tend to enjoy.)
I probably wouldn't have bought this book just because the guy's name resembles Darcy Tucker's too much. I got it as a Christmas gift though and I'll get to it sometime this summer.
Mostly I bought it for the title. Hopefully it'll be an interesting read.
This book seems to be recommended all over the internet. That's why I bought it. No clue when I'll actually end up reading it though.
While certainly amazing, true, it's becoming one of my pet peeves to see statements like:
As costs for internet start-ups decreases, amazing open-source technologies like hadoop continue to spread, and talent realizes [...]
Why always Hadoop? To help balance things out a bit here's a list of amazing open source products I've come across, most of which I've used in production and all of which I've at least prototyped something in that I would put up at the level of Hadoop, some even higher, particularly the ones that have as high a level of quality but a more broad applicability.
I had to start with the darling of web developers everywhere. No doubt memcached has saved an uncountable number of us from massive infrastructure costs and numerous outages. It makes scaling SO much easier. Scaling dynamic websites to millions of concurrent users used to be near impossible without massive investment in talent and hardware. With the current crop of available tools, memcached primary among them, costs have plummeted and any good developer can do it himself.
Sphinx, technically, is a full-text search engine. It's a testament to its brilliance though that I don't even use it for its primary purpose. I use it for general search indexing. The more you add search criteria in mysql the slower it gets. The more you add search criteria with sphinx, the *faster* it gets. Some queries could take 30+ seconds on mysql (it the tables got locked up it could be worse) and now take ~0.12s with sphinx. I run a distributed index on a single machine. I split up the index into 4 chunks so that any search executed will run in parallel on 4 cores (the machine has 8), merge the results and send them back. Millions of rows searched by arbitrary criteria in 0.12s.
I actually lied about the full-text search. There's a varchar field that can be searched. A search approximating SELECT * FROM table WHERE field LIKE '%word%' always runs in 0.000s on sphinx. It's so fast sphinx would need more than 3 decimals of precision to measure it. Amazing.
Sphinx is also very good at geo-location searches (only return results within a radius around a certain point, return the distance from a certain point with all results, etc).
Were I to rework an existing system using my currently tools today, I'd use mysql mainly as a key/value store that I would only fall back on when the data wasn't in memcached (and putting that data in memcached before returning it) and run any queries that are the least bit heavy against sphinx.
Any scalable system needs a way to do work asynchronously. If you always do everything synchronously you're in for a world of pain once you get traffic. Beanstalkd lets me queue work. For simple stuff that isn't absolutely critical a fast in-memory queue is perfect. With beanstalkd you connect to a "tube" (any name you want, if it doesn't exist yet it'll create it) and write to it. A consumer "watches" all tubes it's interested in and requests jobs from any of those tubes. That's it. It doesn't do anything else. If someone pulls the plug from the server or the server just plain crashes and needs to reboot everything in those tubes is gone forever.
I use beanstalkd because everything I do asynchronously isn't critical and I like the extremely simple model. If you a need serious message broker for serious business though RabbitMQ is what you want. Now, that isn't to say RabbitMQ is over-complicated. It isn't. But it can do a whole lots more than put 1 job in 1 tube and listen on 1 or more tubes at a time.
RabbitMQ keeps your queue persistent. If your server shuts down, when you come back up your messages are still there. (And hopefully you have transparent failover if this is an important system.) I'm not going to try and enumerate everything you can do with RabbitMQ since this post would never end but be sure to take a look at some of the messaging scenarios described in their FAQ. People have done everything from implementing chat rooms to collaborative editing to file streaming.
This beauty was the thing I needed most without realizing it. Everyone should have a lightweight profiler they can run in production. Most of my code that runs synchronously when someone requests a page is PHP. What xhprof does is it profiles a request and outputs the aggregate wall time, cpu usage and memory usage that php functions take. It calculates inclusive time (which includes calls to other functions and waiting for them to return) and exclusive time (time spend in that function only) for all function calls. It lets you drill up and down function calls to see who calls a particular function and what that function calls. This profiler let me find several bottlenecks that could be fixed very simply with a few lines of code. They hadn't been fixed because no one knew they were a problem. Very low hanging fruit sitting there right in front of us but we were blind. Now we can see.
Just to prove I don't hold any grudges against Hadoop, here's an amazing Hadoop related tool that I came across a while ago. Cascading is a framework that makes it conceptually easier to write map/reduce scripts. Instead of working out the low level map/reduce yourself, (which can be very frustrating since it's very different from how we normally program) Cascading lets you break down the problem into a "Source" (the raw data), a series of transformations on data streams and a final "Sink" (which could possibly be used as another Source as necessary). It compiles the appropriate map/reduce script and runs it. An added bonus is not only that it runs the script but it optimizes it too. My first map/reduce scripts were technically correct but very slow. Rewriting my goals in Cascading sped things up enormously.
AKA memcached-of-the-future. I haven't actually prototyped this one yet but plan on doing it very soon. (As in, later this week.) Basically it's a key-value store like memcached but it adds support for data structures such as lists and sets and atomic push/pop operations. Very useful. To stay fast the data is kept in memory and periodically written to disk. (The frequency of disk writes is configurable.) They're also adding Master/Slave replication support!
If all you need is a blazingly fast but persistent key-value store I highly recommend Tokyo Cabinet (TC). I used it on a small internal project with lots of data containing lots of associations and it performed beautifully. Technically, I didn't use TC directly. I used plurk.com's LightCloud which is a set of management tools and a Python client that talks to TC through Tokyo Tyrant, a high speed network interface. Check out TC's benchmarks. It's only slightly slower than memcached and it's persistent on disk!
I specifically avoided talking about web development frameworks because I wanted to discuss open source tools that are independent of frameworks. I added this section simply to mention that many of these tools can integrate seamlessly into some of the more popular frameworks. (Like how HyperRecord integrates Hypertable into Ruby on Rails and Carrot adds RabbitMQ support to Django among many other examples.)
I mentioned HyperRecord above. HyperRecord integrates with Ruby on Rails' ActiveRecord ORM but uses Hypertable instead of a traditional database. Hypertable is basically a BigTable clone. Given the success BigTable has brought Google I'm pretty sure everybody would love to have their very own. Thanks to Doug Judd and his employer zvents, we can. I've only done some basic prototyping but, when Rails 3 comes out, I'll probably try using it as a backend for a Rails project. Could definitely be interesting.
This is some of the more promising/amazing stuff I've been playing around with over the past 6 months. There are so many promising projects to explore. If any have worked particularly well for you I would love to hear about it and check it out.
My bug was in I had a script that ran forever. Whenever someone changed their info the object was reloaded in the script but the reloaded object was always keeping the old values.
A while back I noticed that on certain pages the majority of the time was spent talking to memcached. I then saw that the code was requesting the same keys over and over and over again. The quick fix was simply to add a static array in the Cacher class so that whenever a memcached get is called it first looks at that static array to see if the result is already there thereby saving a network round-trip.
Basically I put a cache in my cache so that I could cache my cache.
