After reading a great article on bloom filters here, I ported his bloom filter to PHP:
EDIT: After some tests I've realized I have a bug somewhere. My false positive rate is WAY higher than the python script.. I'll repost the code once I've found and fixed the bug.
And.. we're back
class BloomFilter {
public $population;
public $vector;
public function __construct($bit_size, $num_hashes) {
$this->bit_size = $bit_size;
$this->num_hashes = $num_hashes;
$this->vector = array();
$this->population = 0;
}
public function contains($string) {
foreach($this->_hash_indices($string) as $index) {
if($this->vector[$index] != 1)
return false;
}
return true;
}
public function insert($string) {
foreach($this->_hash_indices($string) as $index) {
$this->vector[$index] = 1;
}
$this->population++;
}
private function _hash_indices($string) {
$indices = array();
for($i=1; $i < ($this->num_hashes + 1); $i++){
$indices[] = ($this->_num_hash($string) + $i * $this->_hash_string($string)) % $this->bit_size;
}
return $indices;
}
private function _num_hash($string) {
$size = strlen($string);
if ($size==0) return 0;
$chars = str_split($string);
$i = 0;
foreach($chars as $char) {
$i += ($i>>14) + ($char * 0xd2d84a61);
}
$i += ($i>>14) + ($size * 0xd2d84a61);
return $i;
}
private function _hash_string($string) {
$val = 0;
$array = str_split($string);
foreach($array as $char) {
$val = ($val << 4) + ord($char);
$tmp = $val & 0xf0000000;
if ($tmp != 0) {
$val = $val ^ ($tmp >> 24);
$val = $val ^ $tmp;
}
}
return $val;
}
}
Turns out the problem was that I was hashing with crc32b for the first hash and that was causing way too many false positives. Tim gave me the code for _num_hash (He found it on a python mailing list as a fast potential replacement for their internal hash function) and now with this hash it works well.





