A penny for thoughts?

About the correct valuation

Archive for April, 2011

It’s shit like this Java

2 comments

Sure this is a minor quirk and there are plenty of more annoying things to bitch about when it comes to Java but it’s the collection of all these stupid decisions that makes Java annoying to use.

Have you every heard of Arrays.copyOfRange? Apparently it’s new in Java 6. Apparently before Java 6 there was a completely different method of doing it: System.arraycopy

Anyway, I was using Arrays.copyOfRange today as it’s typically implemented in pretty much every programming language I know of. If you do something like split(0,0) it should give you the first element. But in Java, no it doesn’t do that. It gives you nothing at all. Because the to is inclusive but the from is exclusive.

So basically if you want the whole array, instead of 0 to length – 1 which you would expect to need to provide (I want from this index to that index), you have to ask for index 0 to index end + 1 which is about the least intuitive thing I can think of for an array split. I totally would’ve expected an OutOfBoundsException.

On the other hand, if we take a look at how System.arraycopy works, you basically have to specify your source array, pass in a destination array Java can write to, tell it where to start in each and the length. Saner, but still a pain to actually use and not the least bit typesafe.

Why can’t Java convenience methods be convenient? As far as I can tell, one the the few things they’ve gotten right recently is the for(type var : collection) notation which is actually quite nice to use when you’re iterating over a collection. (Though why they overloaded for instead of using something like foreach(type var in collection) I have no idea.)

Written by Smokinn

April 20th, 2011 at 5:02 am

Posted in Uncategorized

Bloom Filters Revisited

leave a comment

Bloom filters have always fascinated me for some reason, probably because they seem like such an optimal data structure to use in caching problems. I have yet to use a bloom filter in a production environment (other than indirectly because some piece of infrastructure I rely on uses them internally) but they seem too efficient for me to not one day have a use for them. Back in 2008 I wrote a bloom filter implementation in pure PHP that you definitely don’t want to use. (Slow and uses too much memory.)

Anyway, now that I’m more or less back up to speed in C, I decided to have another go at implementing a bloom filter, though this time the implementation is certainly going to be a lot faster and more space efficient. =)

Basically the way a bloom filter works is that you use N hashing functions on your input and together they map to certain bits in a an array which, if all are set means your input should be available somewhere. According to this page where I found good implementations of common hash functions, an optimal bloom filter can be built with “at most two distinct hash functions also known as pairwise independent hash functions”. So you run your two pairwise independent hash functions on your input, end up with some bit locations and look them up to see if they’re set. If they aren’t both set you know the input isn’t part of your dataset because with bloom filters it’s impossible to get a false negative. If they “are* set, you then have to try and find the actual data wherever you put it before you return a true/false result to the user because it’s possible to have a false positive.

But wait, what the hell are pairwise independent hash functions? Ah, well after reading the excellent “Introduction to pairwise independent hashing” available here, the technically-wrong-in-details-but-not-far-off-summary is that what we need is a family of functions that are not actually random but behave as if they were. Once we have this family of functions we simply pick a random two and we can build our bloom filter. That’s a nice definition but how do we find this family or how do we know if two hash functions we already have and would like to use (because they’re fast for example) are suitable? Turns out most simple hash functions are suitable. A good ppt presentation on why (with an intro to bloom filters as well) is available here.

Ok so back to the show. In my case I chose the DJB and the DEK hash functions. Why? Because DJB and Donald Knuth are ridiculously smart so their functions must be good. =) We also need to set up a bit array. I based mine off a Stack Overflow answer. All we need to do is look at how many bits we want to use (N), how many bits are in a word (W) and set up an array of N/W buckets. From there whenever we look up a hash value we find which bucket to look in and then the bit offset within that bucket. I won’t paste the whole .c file here but you can check it out here

If we start with the string “blah”, we get the DJB Hash: 1306998344 and DEK Hash: 697811854. This means that we should be looking at bit 1306998344 % SIZE and bit 697811854 % SIZE in our bloom filter. If we’re building our bloom filter we should be setting both those bits to true. If we’re looking up membership we can simply return the && of whether those bits were set. So basically the bloom filter itself is really just a very thin wrapper around the bitset functions:

void bloom_add(char *s)
{
bitset_set_bit(DJBHash(s, 36) % BITSET_SIZE);
bitset_set_bit(DEKHash(s, 36) % BITSET_SIZE);
}

int bloom_has(char *s)
{
return bitset_has(DJBHash(s, 36) % BITSET_SIZE) &&
bitset_has(DEKHash(s, 36) % BITSET_SIZE);
}

void bloom_reset()
{
bitset_clear_all();
}

And now we’re done. We have a space efficient data structure that support both fast constant time lookups and uses a constant amount of space. If we know the number of items we plan to store in the data structure we can even calculate the number of bits we should be using to hit a desired false positive rate.

If you want to check out the full program download the following files and compile them with: gcc bloom.c GeneralHashFunctions.c bitset.c -o bloom (on linux.. It should compile fine on windows too but that’s up to you.)

bloom.c
GeneralHashFunctions.h
GeneralHashFunctions.c
bitset.h
bitset.c

Written by Smokinn

April 13th, 2011 at 8:11 pm

Posted in Uncategorized

Adding a jit to the web stack

leave a comment

I’m still not entirely sure how to restore transactions to the write side if the following idea is implemented but it could probably be done with a temporary transaction staging area where the transaction can either be applied or rejected and then written in order. The fact that writes would be done in order would greatly reduce write capacity though. There must be some sort of distributed MVCC described in the literature. But I digress. This post is about the read side.

I’m kind of surprised that no one has yet come up with (or at least that I’ve heard of) a middleware for your database that essentially acts like a jit compiler. This would allow you to keep relations while distributing data at will.

Let’s say you have an object of type A that is linked to many of type B. However, when loading A you rarely access the Bs. If your ORM is loading in all the Bs every time you load an instance of A, that’s a lot of wasted work and typically requires a join done in the database. Joins absolutely kill scalability because joins across machines and across datacenters are essentially impossible. (Theoretically it would be possible but it would be so slow as to be impractical.) So instead, you lazy-load the objects. You load up the A and then, when a B is needed you load up that instance of B (or possibly all related Bs). The problem with this is that when you deploy everything will be going fine until you end up with some power users that have hundreds or thousands (or hundreds of thousands) of Bs and either take down your database when you run out of open connections or end up with a bad experience because certain pages take forever to load. (Because you’re making 576 round trips over the network to the database, if A has 575 Bs. This is called the N+1 problem.)

So both solutions kind of suck in different scenarios. Most good ORMs included in frameworks essentially punt the problem over to the developer. When defining a has-many maybe it lazy-loads by default unless you add in an extra optional “eager” statement to load them in with a join. Or at least that’s what I’ve seen in the frameworks I’ve used.

Instead of doing that why don’t we put a sort of jit “compiler” in-between the app and the database. (Preferably the jit would integrate with the app but more on this later.) This jit would actually be part of the ORM you use and could read your definitions. Those definitions could include the traditional advice about eager or not but the jit would be free to ignore it, advice likely only being helpful on cold startup. Given the ORM integration, the jit would know that A has-many Bs and C has-one D and E is-a Z, etc. With that info it could build up knowledge and run heuristics on when to load what data how and even better, as the app is run, with the extra knowledge it gains about your data access patterns, it could only return what is likely to be relevant data.

Another huge advantage is that it would bring a scalable join. Since writes go through the ORM it would know (or at least have a pretty good guess at) what data is where. So when you load up an A and it figures it should return all the Bs, it can run its “join” across the relevant machines. Suddenly you have a scalable join. It involves an extra parallel network round-trip but I think that’s probably worth it. Especially if the jit has access to a large cache that it can manage intelligently. This join of course isn’t a real join but once you have the id of A, you can run a series of parallel “select * from table where foreign_key = A.id”, merge the results and return them together as if a join had happened.

The problem with making an intelligent jit though is the amount of knowledge you have available for your heuristics. The more information the better. So tight integration into the app would really be the best. It could look at the request that came in, the data that came over the wire from the database/cache and then look at what data was actually read/accessed to produce the final output. By looking at all that it could realize that in certain types of requests the application is actually requesting more data than it needs and tune itself to only return the relevant data with a mock/proxy in what should theoretically have been returned but was unlikely to be used (if it turns out to be a special case and the mock/proxy is accessed, this basically causes a cache miss and the tooling has to go back to the database/cache for the information and likely update its knowledge so that the heuristics become more accurate).

If this were all available you could build a massively scalable cloud database.

Of course, as this jit gets iterated on again and again eventually it can start doing more advanced stuff like managing indexes. For example if it finds queries that are running slow but would be quick with an index it could bring up a new machine, replicating from an original, add an index, let replication catch back up, lock for a little while it swaps the original for that one and then shut the original off entirely. It suddenly added an index to your schema that it decided you needed without costing you any downtime and very little additional expense. Running two machines instead of one for each db server for a little while would basically cost you a few dollars to get your entire database fleet to have an index added to them, your app’s performance upgraded and this without any downtime incurred. Pretty good bargain.

And there’s a company that happens to be in the absolutely perfect position to implement this. A company that has tight integration between their language, the tools developers use, and even control over the environment their deployments typically run on. That company is Microsoft. It would be awesome if you could start up a new .NET MVC project, code it all up in C# connected to an SQL Server running on localhost in development and then deploy to Azure with a massive scalable backend given to you “for free” (no extra work) as long as you stick with mostly LINQ for data requests or something that like that. They may be working on this already, I don’t know, but there definitely isn’t anyone in a better position to pull this off right now and the scenario I described above, especially if it’s billed pay-only-for-what-you-use cloud style (which, given Azure, is probable), would make developing on the .NET platform a very interesting proposal for startups and a no-brainer for large enterprises.

Maybe I’m missing something and maybe it’s already been done but if it hasn’t I’d be surprised if Microsoft didn’t already have it in the works.

Written by Smokinn

April 9th, 2011 at 2:52 am

Posted in Uncategorized