A penny for thoughts?

About the correct valuation

H1B Salaries in the US

23 comments

Today I was playing around with the excellent Tableau Desktop and figured I’d take a look at the H1B datasets that are published to see if I could find anything interesting.

The first thing I looked up was which cities spent the most on H1Bs. While the top city isn’t particularly surprising, the gap between it and number 2 caught me a little off guard.

Spend by city on H1Bs

As the chart above makes clear, New York is far and away the number 1 spender on H1Bs, spending over 4 times as much as second-place Houston. Speaking of Houston: while I wasn’t surprised to see Redmond high on the list at number 4, I did wonder what Houston was spending so much on.

Houston spend on H1Bs

Turns out they’re spending most of their money on software people, albeit with a good chunk of it on accountants. And while they’re certainly spending a lot in aggregate, being an H1B programmer in Houston doesn’t seem to be all that lucrative:

Houston average H1B spend

If we look at averages across the whole country, on the other hand, the best spots to be hired as an H1B stand out quite easily.

Average US spend on H1Bs by city

It becomes painfully clear that software hubs are where H1Bs make the most on average, thanks to all those nicely paid Google, Apple and Microsoft jobs. But wait, all the averages are under 90k? Some entry-level dev positions get better offers than that in major software centers. Let’s see what happens when we drill into the data a little, starting with Redmond.

Microsoft spend on H1Bs

The initial version of this post looked only at prevailing_wage (which is the industry benchmark, not what Microsoft and others are actually paying), and I said that perhaps there was some justification for feeling that H1B was used as a tool to depress wages. However, as the chart above now shows, Microsoft is definitely paying more than 95k to senior software engineers. The numbers I was using earlier seemed fishy to me, and this makes a lot more sense given that new grads can sometimes get offers approaching 90k.

But enough of that. Let’s move on.

I’m in software myself, so I figured I’d filter out industries I didn’t care about and see if I could find anything else that was interesting. I filtered on *developer*, *software*, *computer*, *database*, *dba*, *web*, *programmer*, *systems* and architect (selected titles from the list rather than all of them). This is what I got:

US H1B software spend

While New York is still number 1, its lead is cut down drastically. Redmond is now second and… Edison? Where is Edison? Turns out it’s Edison, New Jersey. This tiny little place:

Edison, New Jersey

(click on the image to head to google maps and check it out)

Now why would Edison, NJ spend almost as much on H1Bs as Google and Apple (Mountain View & Cupertino) combined? I decided to take a quick look and see which companies were the ones hiring so many software H1Bs.

Edison, New Jersey spend on software H1Bs

Yeah… ok. I’ve never really heard of any of these guys other than Fujitsu and Oracle way down at the bottom there. Wait a minute, Oracle Financial Services Software? Let’s take a closer look at those company names. Hmm, they sound mostly like consulting firms. Let’s Google them.

Diaspark has expertise in Enterprise Software, Mobile Applications, Jewelry Software and IT Consulting.

Welcome to EATeam Inc.
We are Enterprise IT Solutions company based in New Jersey, with a dedicated goal of providing quality Information Technology consulting and staffing

Kaizen Technologies provides consulting and IT services to clients globally – as partners to conceptualize and realize technology driven business transformation initiatives.
We provide solutions for a dynamic environment where business and technology strategies converge.

etc. etc. etc. You get the idea. Why are there so many enterprise software companies based in Edison? It’s almost certainly a tax thing. Turns out if you zoom out a little, Edison is less than an hour’s drive from Wall Street, which I imagine makes it a convenient location to base your “headquarters” while in reality all the work is actually done in NY.

Update: my guess about Edison was completely wrong. As JoeW in the comments here and others in this news.yc thread have pointed out, Edison, New Jersey is something of an India within the US. Given the conditions there (workers rarely socializing outside their immediate surroundings), it seems that wages can stay depressed to the minimum set out in H1B law because the employees don’t know that, outside their little bubble, others are earning more for comparable work. It would be a fascinating place for a sociologist to go in and study, I’m sure.

Directions from Edison, NJ to Wall Street, NY

And while the pay is better than in Houston:

Edison, NJ avg spend on software H1Bs

if you look at the standard deviation, it seems you don’t exactly have all that much negotiation power:

Edison, NJ stddev on software H1B spend

Now that we’re on the topic of negotiation power, how much salary range do H1Bs actually span?

Software H1B stddev by city

Whoa! What’s going on there? Is there really such a huge variation in Chicago, New York, San Diego and Columbus? Turns out, not really. If we drill down into the raw H1B data for New York for example we see this:

H1B data for New York

So it’s really just a few extremes that are skewing the data, and it’s similar for the other cities with large stddevs. San Diego has someone being paid 6,050,000.00 before it drops to 221,848.00, while Columbus has someone being paid 5,526,600 which then drops to 191,184. It would’ve been nice if Tableau had an option to only work with the middle 95 or 98% of the data so as to throw out the aberrations that are skewing things, but for now we’ll just exclude those locations altogether and take a look again.

Software H1B stddev by city

So as it turns out, in software hubs at least, there seems to be a fairly decent range that H1B salaries can span, unlike places like Houston or Edison (Edison has a single 10.5 million wage that skews the results massively).
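For what it’s worth, if you export the raw wage data you can do the trimming yourself. Here’s a rough Java sketch of a “middle 95%” standard deviation (nothing Tableau-specific, just to show the idea):

import java.util.Arrays;

class TrimmedStdDev {
    // Standard deviation of the middle keepFraction of the data
    // (e.g. 0.95 drops the top and bottom 2.5%), which blunts the effect
    // of a handful of extreme wages.
    static double trimmedStdDev(double[] wages, double keepFraction) {
        double[] sorted = wages.clone();
        Arrays.sort(sorted);
        int drop = (int) Math.round(sorted.length * (1 - keepFraction) / 2);
        double[] middle = Arrays.copyOfRange(sorted, drop, sorted.length - drop);

        double mean = 0;
        for (double w : middle) mean += w;
        mean /= middle.length;

        double variance = 0;
        for (double w : middle) variance += (w - mean) * (w - mean);
        variance /= middle.length;

        return Math.sqrt(variance);
    }
}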

To conclude, if you were curious about which companies in the US spend the most on software H1Bs, well, here it is:

Companies that hire software H1Bs

Unsurprisingly, Microsoft is first, but surprisingly I actually haven’t heard all that much about several of those companies. Patni America, for example, is actually much closer to Microsoft than it seems because they hold both the second and seventh positions (there’s an extra comma in the second name), so really they should be up at 107,746,662. That’s interesting because it’s incredibly close to their prevailing wage total of 107,310,618; it seems Patni pays pretty much exactly the industry average and no more. I tried to figure out what exactly they do from their website but they seem to be yet another generic business software consulting company, albeit one that hires a massive number of H1Bs. As for the rest, feel free to look them up on your own. =)

Written by Smokinn

September 28th, 2011 at 8:36 pm

Posted in Uncategorized

Paris

2 comments

Part I

Well, Paris is certainly an interesting place, but it’s the kind of place to visit, not the kind of place to live. It’s… packed. Like solid wall-to-wall people. NYC was pretty hectic but even that feels empty compared to trying to walk around Paris. I think part of that feeling might simply be that there is very little room for cars, so most people walk or take the subway (as we did), but either way it made the city feel at least twice as dense as the next densest city I’ve visited (NYC).

On the other hand, if you stick to the old parts it’s quite beautiful. Here’s the view from our crappy (but comfortable) hostel:

The View

And that’s up northish, in a more immigrant-filled area we think. A view like that would be considered absolutely spectacular anywhere in North America. Even the old port in Montreal doesn’t compare.

We only had a few days in Paris before we headed off to Berlin so we packed in as much touristy stuff as we could. We’re going to be back in Paris for several days at the end of the trip, but by then we may or may not be touristed out. We visited the Louvre:

The Louvre

The Eiffel Tower:

Eiffel Tower

And the Arc de Triomphe:

Arc de Triomphe

Also, we ate really well. Here’s one of our breakfasts, for example:

Breakfast

Right by our hostel there were two grocery stores where we picked up some terrine, and right next door was a bakery, so in the morning I would just head downstairs to get some fresh bread. I’d have a terrine, ham and bread breakfast while Vijeta would have a terrine, cheese and bread breakfast. It was quite delicious. =) We also had amazing food at pretty much any random restaurant we walked into. We walked down the street and found a little cafe serving duck and rabbit, so we tried it and of course it was great. As was the wine.

But…

Oh the beer. Oh the awful, expensive and awful beer. I got desperate and curious enough, after a few terrible beers at restaurants and bars, to try something that was basically Kronenbourg mixed with orange Campari. It was definitely the worst “beer” I’ve ever tasted. If you go to Paris, just swear off beer for the duration; if you go in expecting not to touch beer at all, you’ll be much better off than trying to find something decent. Although if you drink nothing but Coors Light or something similar you might do fine: just order a light beer and you’ll probably get Kronenbourg, which is basically the same.

We also ended our trip in Paris (and spent more time there at the end) so Paris part II will come at the end of this series. Next up: Berlin!

Written by Smokinn

September 13th, 2011 at 6:45 pm

Posted in Uncategorized

It’s shit like this Java

2 comments

Sure, this is a minor quirk, and there are plenty of more annoying things to bitch about when it comes to Java, but it’s the accumulation of all these stupid decisions that makes Java annoying to use.

Have you ever heard of Arrays.copyOfRange? Apparently it’s new in Java 6; before that there was a completely different way of doing it: System.arraycopy.

Anyway, I was using Arrays.copyOfRange today expecting it to work the way I’d assumed it does in pretty much every programming language I know of. If you do something like copyOfRange(arr, 0, 0), you’d think it should give you the first element. But in Java, no, it doesn’t do that. It gives you nothing at all, because the from is inclusive but the to is exclusive.

So basically, if you want the whole array, instead of 0 to length – 1, which is what you’d expect to provide (I want from this index to that index), you have to ask for index 0 to the last index + 1, which is about the least intuitive thing I can think of for an array slice. I totally would’ve expected an OutOfBoundsException.
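To spell it out (a quick sketch; the array contents are obviously made up):

import java.util.Arrays;

class CopyOfRangeDemo {
    public static void main(String[] args) {
        int[] arr = {10, 20, 30, 40};

        // from is inclusive, to is exclusive: this is an empty array, not {10}
        int[] nothing = Arrays.copyOfRange(arr, 0, 0);

        // to get the first element you have to ask for 0..1
        int[] first = Arrays.copyOfRange(arr, 0, 1);

        // and to get the whole array you ask for 0..length, not 0..length-1
        int[] whole = Arrays.copyOfRange(arr, 0, arr.length);

        System.out.println(nothing.length);            // 0
        System.out.println(Arrays.toString(first));    // [10]
        System.out.println(Arrays.toString(whole));    // [10, 20, 30, 40]
    }
}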

On the other hand, if we take a look at how System.arraycopy works, you basically have to specify your source array, pass in a destination array Java can write to, tell it where to start in each, and give it the length. Saner, but still a pain to actually use and not the least bit typesafe.
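The equivalent “give me the first two elements” with System.arraycopy looks something like this (again, just a sketch):

class ArrayCopyDemo {
    public static void main(String[] args) {
        int[] src = {10, 20, 30, 40};
        int[] dest = new int[2];
        // arguments: source, source start, destination, destination start, length
        System.arraycopy(src, 0, dest, 0, 2);           // dest is now {10, 20}
        System.out.println(java.util.Arrays.toString(dest));
        // Nothing stops you from passing arrays of mismatched types here;
        // you only find out at runtime (ArrayStoreException).
    }
}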

Why can’t Java convenience methods be convenient? As far as I can tell, one of the few things they’ve gotten right recently is the for(type var : collection) notation, which is actually quite nice to use when you’re iterating over a collection. (Though why they overloaded for instead of using something like foreach(type var in collection) I have no idea.)

Written by Smokinn

April 20th, 2011 at 5:02 am

Posted in Uncategorized

Bloom Filters Revisited

leave a comment

Bloom filters have always fascinated me for some reason, probably because they seem like such an optimal data structure to use in caching problems. I have yet to use a bloom filter in a production environment (other than indirectly, because some piece of infrastructure I rely on uses them internally), but they seem too efficient for me not to find a use for them one day. Back in 2008 I wrote a bloom filter implementation in pure PHP that you definitely don’t want to use. (Slow and uses too much memory.)

Anyway, now that I’m more or less back up to speed in C, I decided to have another go at implementing a bloom filter, though this time the implementation is certainly going to be a lot faster and more space efficient. =)

Basically, the way a bloom filter works is that you run N hash functions on your input and together they map to certain bits in an array; if all of those bits are set, your input should be available somewhere. According to this page, where I found good implementations of common hash functions, an optimal bloom filter can be built with “at most two distinct hash functions also known as pairwise independent hash functions”. So you run your two pairwise independent hash functions on your input, end up with some bit locations, and look them up to see if they’re set. If they aren’t both set, you know the input isn’t part of your dataset, because with bloom filters it’s impossible to get a false negative. If they *are* set, you then have to try to find the actual data wherever you put it before you return a true/false result to the user, because it’s possible to have a false positive.

But wait, what the hell are pairwise independent hash functions? Ah, well, after reading the excellent “Introduction to pairwise independent hashing” available here, the technically-wrong-in-details-but-not-far-off summary is that what we need is a family of functions that are not actually random but behave as if they were. Once we have this family of functions we simply pick two at random and we can build our bloom filter. That’s a nice definition, but how do we find this family, or how do we know whether two hash functions we already have and would like to use (because they’re fast, for example) are suitable? Turns out most simple hash functions are suitable. A good ppt presentation on why (with an intro to bloom filters as well) is available here.

Ok, so back to the show. In my case I chose the DJB and DEK hash functions. Why? Because DJB and Donald Knuth are ridiculously smart so their functions must be good. =) We also need to set up a bit array. I based mine off a Stack Overflow answer. All we need to do is decide how many bits we want to use (N), look at how many bits are in a word (W) and set up an array of N/W buckets. From there, whenever we look up a hash value we find which bucket to look in and then the bit offset within that bucket. I won’t paste the whole .c file here but you can check it out here.

If we start with the string “blah”, we get the DJB Hash: 1306998344 and DEK Hash: 697811854. This means that we should be looking at bit 1306998344 % SIZE and bit 697811854 % SIZE in our bloom filter. If we’re building our bloom filter we should be setting both those bits to true. If we’re looking up membership we can simply return the && of whether those bits were set. So basically the bloom filter itself is really just a very thin wrapper around the bitset functions:

void bloom_add(char *s)
{
    /* Hash the string with both functions and set the corresponding bits.
       The second argument is the string length (strlen needs <string.h>). */
    bitset_set_bit(DJBHash(s, strlen(s)) % BITSET_SIZE);
    bitset_set_bit(DEKHash(s, strlen(s)) % BITSET_SIZE);
}

int bloom_has(char *s)
{
    /* Possibly present only if *both* bits are set; a clear bit means
       the string was definitely never added. */
    return bitset_has(DJBHash(s, strlen(s)) % BITSET_SIZE) &&
           bitset_has(DEKHash(s, strlen(s)) % BITSET_SIZE);
}

void bloom_reset()
{
    bitset_clear_all();
}

And now we’re done. We have a space-efficient data structure that supports fast constant-time lookups and uses a constant amount of space. If we know the number of items we plan to store, we can even calculate the number of bits we should be using to hit a desired false positive rate.
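For the record, the standard sizing formulas are m = -n * ln(p) / (ln 2)^2 bits for n items at a target false positive rate p, with an optimal k = (m/n) * ln 2 hash functions. For example, a million items at a 1% false positive rate works out to roughly 9.6 million bits (about 1.2 MB) and about 7 hash functions.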

If you want to check out the full program, download the following files and compile them with: gcc bloom.c GeneralHashFunctions.c bitset.c -o bloom (on Linux; it should compile fine on Windows too but that’s up to you).

bloom.c
GeneralHashFunctions.h
GeneralHashFunctions.c
bitset.h
bitset.c

Written by Smokinn

April 13th, 2011 at 8:11 pm

Posted in Uncategorized

Adding a jit to the web stack

leave a comment

I’m still not entirely sure how to restore transactions to the write side if the following idea is implemented, but it could probably be done with a temporary transaction staging area where each transaction is either applied or rejected and then written in order. The fact that writes would be done in order would greatly reduce write capacity though. There must be some sort of distributed MVCC described in the literature. But I digress; this post is about the read side.

I’m kind of surprised that no one has yet come up with (or at least that I haven’t heard of) a middleware for your database that essentially acts like a jit compiler. It would allow you to keep relations while distributing data at will.

Let’s say you have an object of type A that is linked to many objects of type B, but when loading an A you rarely access the Bs. If your ORM loads all the Bs every time you load an instance of A, that’s a lot of wasted work, and it typically requires a join in the database. Joins absolutely kill scalability because joins across machines and across datacenters are essentially impossible. (Theoretically possible, but so slow as to be impractical.) So instead, you lazy-load the objects: you load up the A and then, when a B is needed, you load that instance of B (or possibly all related Bs). The problem is that when you deploy, everything will be going fine until you end up with some power users that have hundreds or thousands (or hundreds of thousands) of Bs, and you either take down your database when you run out of open connections or end up with a bad experience because certain pages take forever to load. (Because you’re making 576 round trips over the network to the database if A has 575 Bs. This is called the N+1 problem.)
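Roughly, in plain JDBC terms (the table and column names are made up; this is just to show the shape of the two access patterns):

import java.sql.*;
import java.util.*;

class LoadingPatterns {
    // Lazy loading taken to its extreme: one query for the list of B ids,
    // then one query (and network round trip) per B. With 575 Bs that's the
    // 576 round trips mentioned above.
    static List<String> loadBsLazily(Connection conn, long aId) throws SQLException {
        List<Long> bIds = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement("SELECT id FROM b WHERE a_id = ?")) {
            ps.setLong(1, aId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) bIds.add(rs.getLong(1));
            }
        }
        List<String> payloads = new ArrayList<>();
        for (long bId : bIds) {
            try (PreparedStatement ps = conn.prepareStatement("SELECT payload FROM b WHERE id = ?")) {
                ps.setLong(1, bId);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) payloads.add(rs.getString(1));
                }
            }
        }
        return payloads;
    }

    // Eager loading: a single join, one round trip, but it always drags in
    // every B whether or not the request actually needs them.
    static List<String> loadBsEagerly(Connection conn, long aId) throws SQLException {
        List<String> payloads = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT b.payload FROM a JOIN b ON b.a_id = a.id WHERE a.id = ?")) {
            ps.setLong(1, aId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) payloads.add(rs.getString(1));
            }
        }
        return payloads;
    }
}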

So both solutions kind of suck in different scenarios. Most good ORMs included in frameworks essentially punt the problem over to the developer: when defining a has-many it lazy-loads by default unless you add an extra optional “eager” option to load the children in with a join. Or at least that’s what I’ve seen in the frameworks I’ve used.

Instead of doing that, why don’t we put a sort of jit “compiler” in between the app and the database? (Preferably the jit would integrate with the app, but more on this later.) This jit would actually be part of the ORM you use and could read your definitions. Those definitions could include the traditional advice about eager loading or not, but the jit would be free to ignore it; that advice is likely only helpful on cold startup. Given the ORM integration, the jit would know that A has-many Bs, C has-one D, E is-a Z, etc. With that info it could build up knowledge and run heuristics on when to load what data and how. Even better, as the app runs and it gains extra knowledge about your data access patterns, it could return only the data that is likely to be relevant.

Another huge advantage is that it would bring a scalable join. Since writes go through the ORM, it would know (or at least have a pretty good guess at) what data is where. So when you load up an A and it figures it should return all the Bs, it can run its “join” across the relevant machines. Suddenly you have a scalable join. It involves an extra parallel network round trip, but I think that’s probably worth it, especially if the jit has access to a large cache that it can manage intelligently. This join of course isn’t a real join, but once you have the id of A, you can run a series of parallel “select * from table where foreign_key = A.id” queries, merge the results and return them together as if a join had happened.
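A rough sketch of that fan-out in Java (the Shard interface here is just a stand-in for however you would actually talk to each database machine):

import java.util.*;
import java.util.concurrent.*;

class ScatterGatherJoin {
    // Hypothetical handle to one shard; in reality this would wrap a
    // connection to that machine's database.
    interface Shard {
        List<Map<String, Object>> query(String sql, Object param) throws Exception;
    }

    // Run "select * from b where foreign_key = ?" on every shard in parallel
    // and merge the results, as if a single join had happened.
    static List<Map<String, Object>> fetchBsForA(List<Shard> shards, long aId)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            List<Future<List<Map<String, Object>>>> futures = new ArrayList<>();
            for (Shard shard : shards) {
                futures.add(pool.submit(() ->
                        shard.query("select * from b where foreign_key = ?", aId)));
            }
            List<Map<String, Object>> merged = new ArrayList<>();
            for (Future<List<Map<String, Object>>> f : futures) {
                merged.addAll(f.get()); // one extra (parallel) round trip, then merge
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}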

The problem with making an intelligent jit, though, is the amount of knowledge you have available for your heuristics. The more information the better, so tight integration into the app would really be best. It could look at the request that came in, the data that came over the wire from the database/cache, and then look at what data was actually read/accessed to produce the final output. By looking at all that it could realize that for certain types of requests the application is actually requesting more data than it needs, and tune itself to return only the relevant data, with a mock/proxy standing in for what should theoretically have been returned but was unlikely to be used. (If it turns out to be a special case and the mock/proxy is accessed, this basically causes a cache miss: the tooling has to go back to the database/cache for the information and likely update its knowledge so that the heuristics become more accurate.)

If this were all available you could build a massively scalable cloud database.

Of course, as this jit gets iterated on again and again, eventually it can start doing more advanced stuff like managing indexes. For example, if it finds queries that are running slow but would be quick with an index, it could bring up a new machine replicating from an original, add the index, let replication catch back up, lock briefly while it swaps the original for the new machine, and then shut the original off entirely. It just added an index to your schema that it decided you needed, without costing you any downtime and at very little additional expense. Running two machines instead of one for each db server for a little while would basically cost you a few dollars to get an index added across your entire database fleet and your app’s performance upgraded, all without any downtime. Pretty good bargain.

And there’s a company that happens to be in the absolutely perfect position to implement this, a company with tight integration between their language, the tools developers use, and even the environment their deployments typically run on: Microsoft. It would be awesome if you could start up a new .NET MVC project, code it all up in C# connected to a SQL Server running on localhost in development, and then deploy to Azure with a massively scalable backend given to you “for free” (no extra work) as long as you stick with mostly LINQ for data requests, or something like that. They may be working on this already, I don’t know, but there definitely isn’t anyone in a better position to pull this off right now. The scenario I described above, especially if it’s billed pay-only-for-what-you-use cloud style (which, given Azure, is probable), would make developing on the .NET platform a very interesting proposition for startups and a no-brainer for large enterprises.

Maybe I’m missing something and maybe it’s already been done but if it hasn’t I’d be surprised if Microsoft didn’t already have it in the works.

Written by Smokinn

April 9th, 2011 at 2:52 am

Posted in Uncategorized

Charting life

leave a comment

(click for larger view)

Bar chart of pies:

Pie chart of bars:

I wish I could take credit for this awesome idea but I read it here.

Written by Smokinn

October 27th, 2009 at 9:27 pm

Posted in Uncategorized

Apple fanboys annoy me

3 comments

interest fail

Written by Smokinn

June 20th, 2009 at 4:30 pm

Posted in Uncategorized

Paxos makes the (future) world go round

2 comments

I planned on writing a post on the Paxos algorithm with an example implementation but it turns out someone already wrote an excellent post explaining it and followed it up with some toy code.

I was beaten to the punch but I’ll probably still try implementing it myself soon. Like he said, you rarely truly understand the complexities of something unless you try doing it yourself.

Written by Smokinn

June 1st, 2009 at 1:00 pm

Posted in Uncategorized

Google Chubby papers: notes and highlights

one comment

The papers I read are available here. I went through Mike Burrows’ The Chubby lock service for loosely-coupled distributed systems and Tushar Deepak Chandra, Robert Griesemer and Joshua Redstone’s Paxos Made Live – An Engineering Perspective.

This is mainly what I highlighted from the papers and what I found interesting/noteworthy.

To provide high availability and correctness in the presence of possible transient faults such as the network dropping packets, data corruption or hardware failure, they relied on asynchronous distributed consensus through an implementation of the Paxos protocol. Their implementation is discussed in the Paxos Made Live paper. Incidentally, I’ve read a few papers on Paxos now and the Google paper was by far the most accessible, even though it’s a single chapter and they say it’s only meant to be an “informal overview” before they move on to the implementation details/challenges. If you’re interested in Paxos, I suggest you start with that paper rather than the Paxos Made Simple paper, which doesn’t live up to its abstract: “The Paxos algorithm, when presented in plain English, is very simple.”

Chubby is split into two parts. One part is the server side which, due to the way Paxos works, is really a Chubby “cell”: a small set of replica servers where one replica is elected master and all client communication goes to that master. All operations on the master are replicated to the other replicas in the cell. The other part is a client library that any application that wishes to communicate with Chubby can use.

To help scalability, the client library has an internal cache. This cache invalidates data on a change and never updates it. This seemed bizarre to me at first since I figured it’s normally more efficient to update rather than invalidate. But they make a good point: “update-only protocols can be arbitrarily inefficient; a client that accessed a file might receive updates indefinitely, causing an unbounded number of unnecessary updates“.

Imagine an app touches a dozen files then does work for days. While it’s doing its work it’s constantly receiving updates to those dozen cached items even though it couldn’t care less about them anymore. I was framing the problem with my usual web-app-centric mind, where I’ll have a hook so that whenever there’s a db write that will likely also be looked for in cache soon, rather than invalidate I update. That’s more efficient there because you have only one location to update. But if you have up to 90 000 clients subscribed to your cell (as can happen with Chubby cells), you don’t want to publish up to 90k messages unnecessarily every time a write happens.

They use long polling as a KeepAlive mechanism. Basically, the client sends a message to the server saying “I’m here” and the server blocks it. The server keeps a request timeout (the lease time) and the client keeps a conservative approximation of it. When the lease ends, the server returns a response and the client immediately makes another request. If all goes well (no network problems, no machine failures, etc.), this happens indefinitely. The noteworthy thing they did is that they cleverly used the KeepAlive as a communication mechanism: if the server needs to communicate with the client (send it a message of some kind, such as telling the client to invalidate its cache for a file), it can return the KeepAlive early and pass the command along. The client processes the command and makes another request, resetting its approximation of the lease time. The first thing I wondered about was how Chubby handled its callbacks to the client, and this is a pretty good solution. As they mention, it “simplifies the client, and allows the protocol to operate through firewalls that allow initiation of connection only in one direction.”
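Here’s a toy sketch of the long-poll shape in Java (nothing like Chubby’s actual RPC protocol; the queue simply stands in for whatever messages the server has pending for this client):

import java.util.concurrent.*;

class KeepAliveDemo {
    // Server side: block the KeepAlive until either the lease runs out or
    // there's a message (e.g. a cache invalidation) to piggyback on the reply.
    static String handleKeepAlive(BlockingQueue<String> pendingEvents, long leaseMillis)
            throws InterruptedException {
        String event = pendingEvents.poll(leaseMillis, TimeUnit.MILLISECONDS);
        return event != null ? event : "LEASE_RENEWED"; // early return carries the command
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> pendingEvents = new LinkedBlockingQueue<>();
        long leaseMillis = 2000;

        // Somewhere on the "server", a write makes a client cache entry stale.
        new Thread(() -> {
            try {
                Thread.sleep(3000);
                pendingEvents.put("INVALIDATE some-file");
            } catch (InterruptedException ignored) { }
        }).start();

        // Client loop: issue a KeepAlive, process whatever comes back, repeat immediately.
        for (int i = 0; i < 3; i++) {
            String reply = handleKeepAlive(pendingEvents, leaseMillis);
            System.out.println("KeepAlive returned: " + reply);
            // A real client would also reset its conservative lease estimate here.
        }
    }
}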

Mechanisms for scaling:

  • Dynamic lease times: the lease time is adaptive. The default is 12s but it can increase up to around 60s under heavy load.
  • Client library cache: While the cache itself saves a lot, the most surprising thing they found that helped performance enormously was to cache the absence of files. Clients were making many requests for files that never existed, and just caching the non-existence on the client side helped a lot. Because developers were writing infinite loops that simply retried until the file existed, they first tried exponentially-increasing delays along with education, but eventually gave up on that and went with caching the absence instead (a tiny sketch of that kind of negative cache follows this list).
  • Chubby’s protocol can be proxied: They use protocol-conversion servers that translate the Chubby protocol into less-complex protocols such as DNS and others. When writing your own protocol it’s definitely good to keep in mind that compatibility with mature and stable proxy servers is a very nice feature.
  • Chubby’s data fits in RAM: This makes most operations cheap. Chubby can store small files (a few k) but it isn’t meant to store large files. This will degrade performance of the Chubby cell enormously and I imagine you’ll probably have angry sysadmins coming after you if you try. =)
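The negative-caching idea is simple enough to sketch (hypothetical names, nothing Chubby-specific; the point is just that “doesn’t exist” is cached exactly like a real value):

import java.util.*;

class NegativeCache {
    // Cache both hits and misses: Optional.empty() records "this file doesn't
    // exist", so a client that keeps polling for a missing file never leaves
    // its own cache.
    private final Map<String, Optional<String>> cache = new HashMap<>();

    String lookup(String name) {
        Optional<String> cached = cache.get(name);
        if (cached == null) {                 // never asked before
            cached = fetchFromServer(name);
            cache.put(name, cached);          // cache the absence too
        }
        return cached.orElse(null);
    }

    // Stand-in for the real RPC to the Chubby cell.
    private Optional<String> fetchFromServer(String name) {
        return Optional.empty();              // pretend the file doesn't exist
    }
}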

My absolute favourite quote of the paper is about scaling as well.

We have found that the key to scaling Chubby is not server performance; reducing communication to the server can have far greater impact. No significant effort has been applied to tuning read/write server code paths; we checked that no egregious bugs were present, then focused on the scaling mechanisms that could be more effective.

Essentially, it’s the old idea that the fastest code is no code, all over again. If your web page makes 100+ database hits per page load, I don’t care how good you are at tuning, you’re just plain screwed.

Google had a problem with DNS lookups. Basically, it’s common for developers to write jobs with thousands of processes, each process talking to every other one, which leads to quadratic growth in lookups. A “small” job of 3 thousand processes would require 150 000 lookups per second (roughly 3,000 × 3,000 name resolutions refreshed on a 60-second DNS TTL). Because Chubby uses explicit invalidation, only the first lookup would not be local (unless your local cache is invalidated, which I imagine is rare for what are probably mostly internal DNS addresses). “A 2-CPU 2.6 GHz Xeon Chubby has been seen to handle 90 thousand clients communicating with it directly (no proxies); the clients included large jobs with communication patterns as described above.” Chubby is so much better than regular DNS servers that Chubby’s most common use is now as a highly available name server with fast updates.

This quote I noted simply because I thought Skrud would enjoy it.

Google’s infrastructure is mostly in C++, but a growing number of systems are being written in Java. [...] The usual mechanism for accessing non-native libraries is JNI, but it is regarded as slow and cumbersome. Our Java programmers so dislike JNI that to avoid its use they prefer to translate large libraries into Java, and commit to supporting them.

Overall, it seems that most of the Google papers are extremely well written. They are by far the most interesting and practical papers I’ve read. The Map/Reduce and Bigtable papers, along with these two, are all excellent, with many examples, very little assumption of prior knowledge beyond basic computer/programming terminology, and usually very little math, unlike most academic papers I’ve seen. It would be really nice if the Google style of paper caught on more in academia.

Written by Smokinn

May 27th, 2009 at 10:57 pm

Posted in Uncategorized

Bob Ippolito is my hero

leave a comment

This is a follow-up from a previous post.

When Zed retired ZSFA, he replaced his most notorious rant with a plea to find someone else to look up to. Not someone loud, brash and arrogant, someone gentle. For me, this is Bob Ippolito.

His name first stuck last December when I was playing around with some Erlang. Bob is the author of mochiweb, an Erlang library for building lightweight HTTP servers. It’s what Facebook used to build Facebook Chat. Since I was just starting to learn Erlang, this video of Bob talking Erlang basics was a big help.

I haven’t done any Erlang since December but his name kept popping up anyway. I was checking out Tokyo Cabinet and found that Bob had written the Python client for the Tokyo Tyrant protocol.

Recently, he gave an absolutely excellent talk summarizing the current state-of-the-open-source-art in database alternatives. This video is highly worth your time and highscalability’s notes on the talk are also quite excellent.

The talk itself was great, but so was the way he handled the “questions”, especially the one where a guy tried to fight him on CouchDB. The asker would point to some future work that was “coming soon” or try to downplay the importance of a drawback, and Bob would simply, calmly explain the basic facts of the problem at hand, how the current approach taken by CouchDB was fundamentally flawed for his use case, and how and why MongoDB would work better for what he needs to do.

Having just the facts about what the product was designed for and what its pitfalls are was a refreshing contrast to the usual tunnel vision, where people first decide they like a software product (mysql is the best! let’s use it for everything!), shoehorn it into doing things it wasn’t designed for (instead of using a message queue like RabbitMQ or Beanstalkd, let’s write to a mysql table and empty it by cron job! Mike Malone did this at pownce (Ghetto queue: slide 57) and I’m guilty of it too), and then go around giving presentations on how their hack works, thereby trying to validate the idea that their favourite tool should be used for everything all the time. Fred Brooks would not be proud of us, but he’d probably like Bob’s presentation.

Written by Smokinn

May 7th, 2009 at 9:41 am

Posted in Uncategorized