Wednesday Apr 15 2009 11:00 pm by Smokinn

While Hadoop is certainly amazing, it's becoming one of my pet peeves to see statements like:

As costs for internet start-ups decreases, amazing open-source technologies like hadoop continue to spread, and talent realizes [...]

Hacker News (sorry about singling you out)

Why always Hadoop? To help balance things out a bit, here's a list of amazing open source products I've come across. I've used most of them in production and have at least prototyped something with all of them. I would put every one of them at the level of Hadoop, and some even higher, particularly the ones that match its quality but have broader applicability.

memcached

I had to start with the darling of web developers everywhere. No doubt memcached has saved an uncountable number of us from massive infrastructure costs and numerous outages. It makes scaling SO much easier. Scaling dynamic websites to millions of concurrent users used to be near impossible without massive investment in talent and hardware. With the current crop of available tools, memcached primary among them, costs have plummeted and any good developer can do it himself.

Sphinx

Sphinx, technically, is a full-text search engine. It's a testament to its brilliance, though, that I don't even use it for its primary purpose. I use it for general search indexing. The more search criteria you add in mysql, the slower it gets. The more search criteria you add with sphinx, the *faster* it gets. Some queries could take 30+ seconds on mysql (if the tables got locked up it could be worse) and now take ~0.12s with sphinx. I run a distributed index on a single machine. I split up the index into 4 chunks so that any search executed will run in parallel on 4 cores (the machine has 8), merge the results and send them back. Millions of rows searched by arbitrary criteria in 0.12s.
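The split-search-merge idea can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not Sphinx's API or configuration: the row layout, `search_chunk` and `search_all` are hypothetical names, and threads are used purely for brevity (a real engine would spread the work across cores with native threads or processes).

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

# Hypothetical data model: every row is a (doc_id, score) tuple.

def search_chunk(chunk, predicate, limit):
    """Scan one chunk and return its best `limit` matches by score."""
    matches = [row for row in chunk if predicate(row)]
    return heapq.nlargest(limit, matches, key=lambda row: row[1])

def search_all(chunks, predicate, limit=10):
    """Query every chunk concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        partials = pool.map(lambda c: search_chunk(c, predicate, limit), chunks)
        merged = [row for part in partials for row in part]
    # Each chunk already returned its own top `limit`, so the global
    # top `limit` is guaranteed to be somewhere in `merged`.
    return heapq.nlargest(limit, merged, key=lambda row: row[1])

# 100 rows with distinct scores, split into 4 chunks like the index above.
rows = [(doc_id, (doc_id * 37) % 101) for doc_id in range(100)]
chunks = [rows[i::4] for i in range(4)]
top = search_all(chunks, lambda row: row[0] % 2 == 0, limit=5)
```

The merge step only has to look at 4 x `limit` rows, which is why splitting the index costs almost nothing on the way back.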

I actually lied about the full-text search. There's a varchar field that can be searched. A search approximating SELECT * FROM table WHERE field LIKE '%word%' always runs in 0.000s on sphinx. It's so fast sphinx would need more than 3 decimals of precision to measure it. Amazing.

Sphinx is also very good at geo-location searches (only return results within a radius around a certain point, return the distance from a certain point with all results, etc).
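To make "results within a radius, with the distance attached" concrete, here is a minimal sketch of the calculation in plain Python. It is not how Sphinx is queried; the haversine formula, the function names and the sample coordinates are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, good enough for a sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def within_radius(points, center, radius_km):
    """Return (name, distance_km) for points inside the radius, nearest first."""
    lat0, lon0 = center
    hits = [(name, haversine_km(lat0, lon0, lat, lon)) for name, lat, lon in points]
    return sorted((h for h in hits if h[1] <= radius_km), key=lambda h: h[1])

# Hypothetical sample data: (name, latitude, longitude).
places = [
    ("Montreal", 45.5017, -73.5673),
    ("Ottawa", 45.4215, -75.6972),
    ("Toronto", 43.6532, -79.3832),
]
nearby = within_radius(places, (45.5017, -73.5673), 200.0)
```

With a 200 km radius around Montreal, Ottawa makes the cut and Toronto does not.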

Were I to rework an existing system using my current tools, I'd use mysql mainly as a key/value store that I would only fall back on when the data wasn't in memcached (putting that data in memcached before returning it), and I'd run any queries that are the least bit heavy against sphinx.
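The read path described here is usually called cache-aside. A minimal sketch, with loud assumptions: a plain dict stands in for memcached and an in-memory SQLite table stands in for mysql, so that the example runs on its own. The table name, key format and `get` helper are hypothetical.

```python
import sqlite3

cache = {}                        # stand-in for memcached (assumption)
db = sqlite3.connect(":memory:")  # stand-in for mysql (assumption)
db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
db.execute("INSERT INTO kv VALUES ('user:1', 'Smokinn')")

stats = {"db_reads": 0}  # counts how often we had to fall back to the database

def get(key):
    """Return the value for `key`, touching the database only on a cache miss."""
    if key in cache:
        return cache[key]
    stats["db_reads"] += 1
    row = db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    if row is None:
        return None
    cache[key] = row[0]  # populate the cache before returning
    return row[0]

first = get("user:1")   # miss: reads the database, fills the cache
second = get("user:1")  # hit: served from the cache
```

A real setup would also need expiry and invalidation on writes, which this sketch leaves out.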

beanstalkd

Any scalable system needs a way to do work asynchronously. If you always do everything synchronously you're in for a world of pain once you get traffic. Beanstalkd lets me queue work. For simple stuff that isn't absolutely critical, a fast in-memory queue is perfect. With beanstalkd you connect to a "tube" (any name you want; if it doesn't exist yet it'll create it) and write to it. A consumer "watches" all tubes it's interested in and requests jobs from any of those tubes. That's it. It doesn't do anything else. If someone pulls the plug on the server, or the server just plain crashes and needs to reboot, everything in those tubes is gone forever.
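The tube model is small enough to mimic in a toy. The class below is a hypothetical in-memory imitation of the idea (named tubes created on first use, consumers watching several tubes); it is not the beanstalkd protocol or any client library's API.

```python
from collections import deque

class TubeQueue:
    """Toy model of named tubes. Everything lives in memory, so it is lost on exit."""

    def __init__(self):
        self.tubes = {}

    def put(self, tube, job):
        # A tube is created the first time something is written to it.
        self.tubes.setdefault(tube, deque()).append(job)

    def reserve(self, watched):
        """Return (tube, job) from the first watched tube that has work, else None."""
        for tube in watched:
            jobs = self.tubes.get(tube)
            if jobs:
                return tube, jobs.popleft()
        return None

q = TubeQueue()
q.put("emails", "welcome user 1")
q.put("thumbnails", "resize photo 7")
q.put("emails", "welcome user 2")

# A consumer watching both tubes drains them, then finds nothing left.
jobs = [q.reserve(["thumbnails", "emails"]) for _ in range(4)]
```

Producers and consumers never talk to each other directly; they only agree on a tube name.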

RabbitMQ

I use beanstalkd because everything I do asynchronously isn't critical and I like the extremely simple model. If you need a serious message broker for serious business, though, RabbitMQ is what you want. Now, that isn't to say RabbitMQ is over-complicated. It isn't. But it can do a whole lot more than put 1 job in 1 tube and listen on 1 or more tubes at a time.

RabbitMQ keeps your queue persistent. If your server shuts down, when you come back up your messages are still there. (And hopefully you have transparent failover if this is an important system.) I'm not going to try and enumerate everything you can do with RabbitMQ since this post would never end but be sure to take a look at some of the messaging scenarios described in their FAQ. People have done everything from implementing chat rooms to collaborative editing to file streaming.

xhprof

This beauty was the thing I needed most without realizing it. Everyone should have a lightweight profiler they can run in production. Most of my code that runs synchronously when someone requests a page is PHP. What xhprof does is profile a request and output the aggregate wall time, cpu usage and memory usage that php functions take. It calculates inclusive time (which includes calls to other functions and waiting for them to return) and exclusive time (time spent in that function only) for all function calls. It lets you drill up and down function calls to see who calls a particular function and what that function calls. This profiler let me find several bottlenecks that could be fixed very simply with a few lines of code. They hadn't been fixed because no one knew they were a problem. Very low hanging fruit sitting there right in front of us but we were blind. Now we can see.
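The inclusive/exclusive distinction is easy to show with a hand-made call tree. This is a sketch of the arithmetic only, in Python rather than PHP; it is not xhprof's output format, and the function names and timings are made up.

```python
from dataclasses import dataclass, field

@dataclass
class Call:
    name: str
    inclusive_ms: float               # wall time including everything it called
    children: list = field(default_factory=list)

def exclusive_ms(call):
    """Time spent in the function itself: inclusive time minus its direct children."""
    return call.inclusive_ms - sum(c.inclusive_ms for c in call.children)

def flatten(call, out=None):
    """Aggregate inclusive and exclusive time per function name over the tree."""
    out = {} if out is None else out
    totals = out.setdefault(call.name, {"inclusive": 0.0, "exclusive": 0.0})
    totals["inclusive"] += call.inclusive_ms
    totals["exclusive"] += exclusive_ms(call)
    for child in call.children:
        flatten(child, out)
    return out

# Hypothetical trace of one page request.
trace = Call("render_page", 120.0, [
    Call("load_user", 30.0, [Call("db_query", 25.0)]),
    Call("build_html", 70.0, [Call("db_query", 40.0)]),
])
report = flatten(trace)
```

Here `render_page` looks expensive by inclusive time, but the exclusive numbers show that `db_query` is where the time actually goes.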

Cascading

Just to prove I don't hold any grudges against Hadoop, here's an amazing Hadoop-related tool that I came across a while ago. Cascading is a framework that makes it conceptually easier to write map/reduce scripts. Instead of working out the low-level map/reduce yourself (which can be very frustrating since it's very different from how we normally program), Cascading lets you break down the problem into a "Source" (the raw data), a series of transformations on data streams and a final "Sink" (which could be used as another Source as necessary). It compiles the appropriate map/reduce script and runs it. An added bonus is that it doesn't just run the script, it optimizes it too. My first map/reduce scripts were technically correct but very slow. Rewriting my goals in Cascading sped things up enormously.
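The Source, transformations, Sink shape can be sketched with plain Python generators. This is only an illustration of the way of thinking, run in a single process; it is not Cascading's Java API, and every function name here is hypothetical.

```python
from collections import defaultdict

def source(lines):
    """Source: yields the raw records."""
    yield from lines

def split_words(lines):
    """Transformation (map-like): emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def group_and_sum(pairs):
    """Transformation (reduce-like): group pairs by key and sum the values."""
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value
    yield from sorted(groups.items())

def sink(records):
    """Sink: collect the final stream (it could feed another pipeline)."""
    return list(records)

def run(src, transforms, snk):
    """Chain the source through each transformation and into the sink."""
    stream = src
    for transform in transforms:
        stream = transform(stream)
    return snk(stream)

counts = run(
    source(["Hadoop is amazing", "Sphinx is amazing too"]),
    [split_words, group_and_sum],
    sink,
)
```

You describe what flows where, and the framework's job is to turn that description into map and reduce steps.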

redis

AKA memcached-of-the-future. I haven't actually prototyped this one yet but plan on doing it very soon. (As in, later this week.) Basically it's a key-value store like memcached but it adds support for data structures such as lists and sets and atomic push/pop operations. Very useful. To stay fast the data is kept in memory and periodically written to disk. (The frequency of disk writes is configurable.) They're also adding Master/Slave replication support!
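To show what "a key-value store plus lists and sets with atomic push/pop" means in practice, here is a toy in-process model. It is a hypothetical sketch, not a redis client: the method names are borrowed loosely from the ideas above, and a single lock stands in for the server doing one operation at a time.

```python
import threading
from collections import deque

class TinyStore:
    """Toy key-value store whose values can also be lists or sets."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def lpush(self, key, value):
        """Push onto the head of the list at `key`; returns the new length."""
        with self._lock:
            items = self._data.setdefault(key, deque())
            items.appendleft(value)
            return len(items)

    def rpop(self, key):
        """Pop from the tail of the list at `key`, or None if it is empty."""
        with self._lock:
            items = self._data.get(key)
            return items.pop() if items else None

    def sadd(self, key, member):
        """Add to the set at `key`; returns True only if the member was new."""
        with self._lock:
            members = self._data.setdefault(key, set())
            is_new = member not in members
            members.add(member)
            return is_new

store = TinyStore()
store.lpush("jobs", "a")
store.lpush("jobs", "b")
oldest = store.rpop("jobs")            # first in, first out
added_twice = [store.sadd("tags", "php"), store.sadd("tags", "php")]
```

Pushing on one end and popping from the other gives you a queue for free, which plain memcached can't do without read-modify-write races.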

Tokyo Cabinet / LightCloud

If all you need is a blazingly fast but persistent key-value store I highly recommend Tokyo Cabinet (TC). I used it on a small internal project with lots of data containing lots of associations and it performed beautifully. Technically, I didn't use TC directly. I used plurk.com's LightCloud which is a set of management tools and a Python client that talks to TC through Tokyo Tyrant, a high speed network interface. Check out TC's benchmarks. It's only slightly slower than memcached and it's persistent on disk!

Frameworks

I specifically avoided talking about web development frameworks because I wanted to discuss open source tools that are independent of frameworks. I added this section simply to mention that many of these tools can integrate seamlessly into some of the more popular frameworks. (Like how HyperRecord integrates Hypertable into Ruby on Rails and Carrot adds RabbitMQ support to Django among many other examples.)

Hypertable

I mentioned HyperRecord above. HyperRecord integrates with Ruby on Rails' ActiveRecord ORM but uses Hypertable instead of a traditional database. Hypertable is basically a BigTable clone. Given the success BigTable has brought Google I'm pretty sure everybody would love to have their very own. Thanks to Doug Judd and his employer zvents, we can. I've only done some basic prototyping but, when Rails 3 comes out, I'll probably try using it as a backend for a Rails project. Could definitely be interesting.

Closing

This is some of the more promising/amazing stuff I've been playing around with over the past 6 months. There are so many promising projects to explore. If any have worked particularly well for you I would love to hear about it and check it out.

Comments
Thursday 4 2010 3:50 pm by Pavel

Nice Blog!

I'm running sphinx as well

I wonder if sphinx can handle terabytes of data :-/


About the Site:

I might update. Don't hold your breath though.

About Me:

Name: Guillaume Theoret

Age: 841567684 seconds

Job: Mostly web dev
