Bloom filters have always fascinated me for some reason, probably because they seem like such an optimal data structure to use in caching problems. I have yet to use a bloom filter in a production environment (other than indirectly because some piece of infrastructure I rely on uses them internally) but they seem too efficient for me to not one day have a use for them. Back in 2008 I wrote a bloom filter implementation in pure PHP that you definitely don’t want to use. (Slow and uses too much memory.)
Anyway, now that I’m more or less back up to speed in C, I decided to have another go at implementing a bloom filter, though this time the implementation is certainly going to be a lot faster and more space efficient. =)
Basically the way a bloom filter works is that you use N hashing functions on your input and together they map to certain bits in a an array which, if all are set means your input should be available somewhere. According to this page where I found good implementations of common hash functions, an optimal bloom filter can be built with “at most two distinct hash functions also known as pairwise independent hash functions”. So you run your two pairwise independent hash functions on your input, end up with some bit locations and look them up to see if they’re set. If they aren’t both set you know the input isn’t part of your dataset because with bloom filters it’s impossible to get a false negative. If they “are* set, you then have to try and find the actual data wherever you put it before you return a true/false result to the user because it’s possible to have a false positive.
But wait, what the hell are pairwise independent hash functions? Ah, well after reading the excellent “Introduction to pairwise independent hashing” available here, the technically-wrong-in-details-but-not-far-off-summary is that what we need is a family of functions that are not actually random but behave as if they were. Once we have this family of functions we simply pick a random two and we can build our bloom filter. That’s a nice definition but how do we find this family or how do we know if two hash functions we already have and would like to use (because they’re fast for example) are suitable? Turns out most simple hash functions are suitable. A good ppt presentation on why (with an intro to bloom filters as well) is available here.
Ok so back to the show. In my case I chose the DJB and the DEK hash functions. Why? Because DJB and Donald Knuth are ridiculously smart so their functions must be good. =) We also need to set up a bit array. I based mine off a Stack Overflow answer. All we need to do is look at how many bits we want to use (N), how many bits are in a word (W) and set up an array of N/W buckets. From there whenever we look up a hash value we find which bucket to look in and then the bit offset within that bucket. I won’t paste the whole .c file here but you can check it out here
If we start with the string “blah”, we get the DJB Hash: 1306998344 and DEK Hash: 697811854. This means that we should be looking at bit 1306998344 % SIZE and bit 697811854 % SIZE in our bloom filter. If we’re building our bloom filter we should be setting both those bits to true. If we’re looking up membership we can simply return the && of whether those bits were set. So basically the bloom filter itself is really just a very thin wrapper around the bitset functions:
void bloom_add(char *s)
bitset_set_bit(DJBHash(s, 36) % BITSET_SIZE);
bitset_set_bit(DEKHash(s, 36) % BITSET_SIZE);
int bloom_has(char *s)
return bitset_has(DJBHash(s, 36) % BITSET_SIZE) &&
bitset_has(DEKHash(s, 36) % BITSET_SIZE);
And now we’re done. We have a space efficient data structure that support both fast constant time lookups and uses a constant amount of space. If we know the number of items we plan to store in the data structure we can even calculate the number of bits we should be using to hit a desired false positive rate.
If you want to check out the full program download the following files and compile them with: gcc bloom.c GeneralHashFunctions.c bitset.c -o bloom (on linux.. It should compile fine on windows too but that’s up to you.)