Tuesday Aug 26 2008 4:15 pm by Smokinn

After reading a great article on bloom filters here, I ported the author's bloom filter to PHP:

EDIT: After some tests I've realized I have a bug somewhere. My false positive rate is WAY higher than the Python script's. I'll repost the code once I've found and fixed the bug.

And.. we're back

class BloomFilter {
    public $population;
    public $vector;
    public $bit_size;
    public $num_hashes;

    public function __construct($bit_size, $num_hashes) {
        $this->bit_size = $bit_size;
        $this->num_hashes = $num_hashes;
        $this->vector = array();
        $this->population = 0;
    }

    // false means the string was definitely never inserted;
    // true means it probably was (false positives are possible).
    public function contains($string) {
        foreach ($this->_hash_indices($string) as $index) {
            // Positions that were never set count as 0; empty() avoids an undefined index notice.
            if (empty($this->vector[$index]))
                return false;
        }
        return true;
    }

    public function insert($string) {
        foreach ($this->_hash_indices($string) as $index) {
            $this->vector[$index] = 1;
        }
        $this->population++;
    }

    // Double hashing: index_i = (h1 + i * h2) mod bit_size, for i = 1..num_hashes.
    private function _hash_indices($string) {
        $h1 = $this->_num_hash($string);
        $h2 = $this->_hash_string($string);
        $indices = array();
        for ($i = 1; $i <= $this->num_hashes; $i++) {
            $indices[] = ($h1 + $i * $h2) % $this->bit_size;
        }
        return $indices;
    }

    private function _num_hash($string) {
        $size = strlen($string);
        if ($size == 0) return 0;
        $i = 0;
        foreach (str_split($string) as $char) {
            // ord() is needed here: multiplying the raw character treats it as a
            // numeric string, so letters would contribute nothing to the hash.
            $i += ($i >> 14) + (ord($char) * 0xd2d84a61);
        }
        $i += ($i >> 14) + ($size * 0xd2d84a61);
        return $i;
    }

    private function _hash_string($string) {
        $val = 0;
        foreach (str_split($string) as $char) {
            $val = ($val << 4) + ord($char);
            // Fold the top four bits back in so the value stays within 32 bits.
            $tmp = $val & 0xf0000000;
            if ($tmp != 0) {
                $val = $val ^ ($tmp >> 24);
                $val = $val ^ $tmp;
            }
        }
        return $val;
    }
}

Turns out the problem was that I was hashing with crc32b for the first hash, and that was causing way too many false positives. Tim gave me the code for _num_hash (he found it on a Python mailing list as a fast potential replacement for their internal hash function), and with this hash it works well.
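When chasing a false positive rate that looks too high, it helps to have a number to compare against. Here's a minimal sketch of the standard estimate p = (1 - e^(-kn/m))^k for m bits, k hashes and n inserted items. It assumes the hashes spread values uniformly. The function name is made up for illustration; the arguments mirror bit_size, num_hashes and population from the class above.

```php
<?php
// Expected false positive rate of a bloom filter with $bit_size bits and
// $num_hashes hash functions after $population insertions.
// Assumes uniformly distributed hashes: p = (1 - e^(-k*n/m))^k.
function expected_false_positive_rate($bit_size, $num_hashes, $population) {
    if ($bit_size <= 0 || $num_hashes <= 0) {
        throw new InvalidArgumentException('bit_size and num_hashes must be positive');
    }
    // Fraction of bits expected to be set after $population insertions.
    $fill = 1.0 - exp(-$num_hashes * $population / $bit_size);
    return pow($fill, $num_hashes);
}

// Example: 1000 bits, 3 hashes, 100 inserted strings.
printf("%.4f\n", expected_false_positive_rate(1000, 3, 100));
```

If the rate you measure is far above this estimate, the hash functions are a likely suspect.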

Friday Jun 20 2008 2:33 pm by Smokinn

Last month AVG put out a new version of their anti-virus, version 8.0. Version 8.0 comes with LinkScanner, and AVG LinkScanner is broken. It doesn't handle base href properly, and that's why you're seeing crazy URLs with js/js/js/js/js/js in your access log.

Here are a couple of (anonymized) examples from our own logs:

255.255.255.255 - - [20/Jun/2008:15:03:52 -0400] "GET /article/92572/js/js/gui.js HTTP/1.1" 500 624 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"

255.255.255.255 - - [20/Jun/2008:15:03:33 -0400] "GET /article/77673/js/js/js/js/js//""+sWOUrl+"/" HTTP/1.1" 500 624 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"
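To see how much of your own traffic comes from one user agent, you can tally the last quoted field of each log line. This is a rough sketch that assumes the Apache "combined" log format shown in the lines above. The function names and the log path in the usage comment are made up for illustration.

```php
<?php
// Return the user agent, i.e. the last double-quoted field of an Apache
// "combined" log line, or null when the line has no such field.
function user_agent_from_log_line($line) {
    if (preg_match('/"([^"]*)"\s*$/', $line, $m)) {
        return $m[1];
    }
    return null;
}

// Tally requests per user agent, most frequent first.
function count_requests_by_user_agent(array $lines) {
    $counts = array();
    foreach ($lines as $line) {
        $ua = user_agent_from_log_line($line);
        if ($ua === null) {
            continue; // skip lines that do not look like combined-format entries
        }
        $counts[$ua] = isset($counts[$ua]) ? $counts[$ua] + 1 : 1;
    }
    arsort($counts);
    return $counts;
}

// Hypothetical usage (the path is a placeholder):
// $counts = count_requests_by_user_agent(file('/path/to/access.log', FILE_IGNORE_NEW_LINES));
// print_r(array_slice($counts, 0, 10, true));
```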

I would like everyone to do as we did and redirect this User-Agent so that AVG gets the message.

This is what we now have as our first htaccess rule:

RewriteCond %{HTTP_USER_AGENT} "^Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1;1813\)$"

RewriteRule ^.*$ http://www.grisoft.com?linkscanner=spamming_us&please_fix_it_kthx [R,L]

AVG LinkScanner uses 1813 to identify itself, so as long as they keep a unique identifier we can cut them off.

The problem is how aggressive the program is. From AVG's own forum:
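If you can't touch your htaccess, the same check can be sketched in PHP. This is only a sketch: it assumes the ";1813)" suffix from the log lines in this post is enough to single out LinkScanner, and the function name is made up for illustration.

```php
<?php
// True when the user agent ends with the ";1813)" marker seen in the log
// lines quoted in this post. Matching on the marker alone is an assumption;
// tighten the pattern if legitimate browsers turn out to send the same suffix.
function is_linkscanner_user_agent($user_agent) {
    return (bool) preg_match('/;\s*1813\)\s*$/', $user_agent);
}

// Hypothetical usage at the top of a front controller:
// $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
// if (is_linkscanner_user_agent($ua)) {
//     header('Location: http://www.grisoft.com?linkscanner=spamming_us&please_fix_it_kthx', true, 302);
//     exit;
// }
```

The htaccess rule is still the better option when it's available, since the request never reaches PHP at all.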

I had Linkscanner turned on and have Google set to display 100 hits per page. In the process of scanning links, Linkscanner downloaded over 900 MegaBytes of data in one day! Use this feature with care if you have a download quota on your internet account!

You'd think a business whose entire reason to exist is to stop maliciousness wouldn't pretty much spam every site on the web and drive up their bandwidth costs like mad (and slow down Google a lot).

I suggest you add that htaccess rule if you don't want to end up like this poor guy:

Wow. I cannot believe this. We have been fighting performance issues on our web site for the last month, and just commissioned a new server. Then we got our bandwidth overage bill for May, and our bandwidth was more than double (and we got billed huge overages). The bandwidth on our site was going up EXPONENTIALLY! For June, we were looking at being 4-5 times more than our allocated bandwidth, and were looking at more than $5K this month in overages!

What made us realize something was off, was that the page views according to Google Analytics were flat, yet traffic and bandwidth were EXPLODING. Most of this started in Early April, and started heading north in a really scary J curve. But when we ran Webalyzer stats, it indicated that the traffic on the site WAS going up. But since Google analytics only logs page views that the browser renders (via Javascript), none of this showed up in the Google stats. So clearly something OTHER than normal browser traffic was sucking up our bandwidth and CPU time.

That comment was posted on this blog entry and it's that blog article (actually, one of the comments) that first tipped me off. Everything else I found elsewhere just confirmed it. One of the best articles I've found is from The Register. It's called AVG scanner blasts internet with fake traffic.

I figure if AVG wants to chew up other people's bandwidth, they can chew up their own. I can't seem to register on their forum and their technical support form requires a product license, so while we're at it we might as well send them a message in case they actually monitor their access logs.

UPDATE: I've been contacted by AVG. They're putting together a group to address the concerns of webmasters and asked me to be part of it. If you have any comments or suggestions on what they can do to improve LinkScanner, let me know and I'll pass them along once I get the group invite.

Friday Jun 20 2008 9:56 am by Smokinn

I learned a really cool trick recently. Don't try this at home.

Sunday Jun 15 2008 5:10 pm by Smokinn

I think yesterday (and just this weekend in general) was one of my best days ever.

It started off rather uneventfully with a haircut. Then I headed off to the old port to check out the Eureka festival. It was an entire afternoon of science prevailing. I even got a plaque carved with a laser! At first I met up with Vijeta but Skrud, Kyle and Heather showed up a little later. After beating the heart (who's in charge now, heart? You think you can flatline on me? I don't think so) and lifting matter with my mind (I still think that one is bullshit.. I want an explanation!) we headed off to Chinatown for some food. Harley joined us at Pho Bang and I had a tasty dinner.

After dinner we headed back to the festival for the laser show. I guess I built it up too much in my mind (we were debating who would win in a fight between a laser and lightning since a storm seemed to be coming) and it was my only letdown of the day. We stuck around for a while amusing ourselves and we tried out the pedophile trap but got bored and managed to get out unarrested.

By now it was getting a bit late so Heather, Kyle and Vijeta headed home. Skrud, Harley and I headed up to the fringe park to wait for the 13th hour show. On the way though, I broke a flip flop. On St-Laurent. At 10:30 pm. Crap. So I ended up walking up and down a good part of St-Laurent on one flip-flop and one bare foot until Clare called and her brother had the great idea of using a shoelace to try and fix it somehow. So we found an open grocery store (after going by 2 closed pharmacies which is where we were actually trying to go) and I bought some shoelaces, MacGyvered my flip flop and had a temporary walking implement. Science prevails!

We met up with Clare, her brother, Sean and Bridget at the 13th hour and it was amazing as usual, plenty of dance parties and lots of funny and entertaining acts. I'm definitely going to be seeing a lot of fringe stuff this week (next time is probably on thrustday) so if you're interested in comedy or just an all-around good time give me a call or leave me a note and I'll let you know what we're up to in the near future.

The best fringe of them all, is the fringe in Montreal.
The fringe in Montreal, is the best fringe of them all.

Wednesday Apr 30 2008 11:22 pm by Smokinn

This post is so incredibly spot-on it hurts.

There are really two types of files: files that are private and files you don't care about. No one wants to save their company's Word and Excel docs in a vague cloud, both because of the security concerns and because you're not quite sure you'll ever get the files back. Not to mention people won't understand what happened. Where did the file go? I don't see it? Oh no my file is gone! How do I get it back? I have to search? Why isn't it on the computer? etc.

The other files are.. well.. just random stuff. They're photos of your summer vacation, photos that are already on Facebook and Flickr because you wanted to share them. They're MP3s you bought *cough*, but those are already synced between your computer and your iPod, and maybe you've even moved past files and use something like Deezer or, if you're American, Pandora to get your music fill.

And what other media do you really want? Most of it is probably on YouTube unless it's a full movie or TV episode, in which case you can probably get it off a torrent or an FLV streaming site. Soon enough (if the stupid media industry ever wakes up) it'll be cheaply available on-demand.

So what's really left to back up? I'm an outlying case because I have source code files I want to keep, but even then I have a Subversion server running on my PC and I can check my code into Google Code (or SourceForge if you prefer, or any hosted service if you'd rather keep your code closed source) whenever I want.

I think Joel makes a very compelling point that it's a service we just don't need. It's something we will hopefully very soon be taking for granted. Something every developer will have to deal with if they expect their app to get any sort of market traction. The app will have to work on the web, on the phone, offline, etc. If it doesn't people will instead scratch their head and go for another that's maybe not as good but provides what people will by then consider basic features. An actual service to do this for consumers just won't work.

Of course it's really only partly aimed at consumers. It's mostly aimed at developers. I suppose it's another try to control the API developers mainly write code with but it's not going to work. No one trusts Microsoft. While some people might have their reservations and go ahead anyway because they consider the platform to be better, it certainly won't be the majority. The vast majority will say no thanks I'll develop my app on my own. Then I'll add in some backend "cloud" features. (Which really isn't hard.) Then I'll make a mobile version that works with my "cloud". Then I'll know my stack up and down and be able to fix any problem that comes up (provided I'm competent enough) and won't be dependent on the good graces of any other company, especially one that has proven, time and time again, to have very little grace.

Saturday Apr 19 2008 7:06 pm by Smokinn

No, this isn't about politics, but it's related.

Political campaigns are well known for slandering the other candidates but it seems Apple is descending to their level more and more. The Mac vs PC commercials were a very good idea but their execution is terrible. Up until recently I was mostly indifferent to the commercials. I don't like negative campaigns (who cares if the other product sucks? I don't want to know why I shouldn't buy their stuff, I want to know why I should buy your stuff) in general but it was fairly mild stuff.

Until this latest one.

Watch this:

Isn't that terrible? It doesn't even make any sense! How the hell did Vista, an operating system, break billing software? After being subjected to that commercial I was actually angry. How is that effective advertising?

Tuesday Apr 8 2008 9:36 am by Smokinn

I wonder if blatantly copying Campfire is included? Check out HuddleChat, a complete copy of Campfire, right down to the layout. Watch the video on the right of the HuddleChat page then take a look at the screenshots/video page for Campfire.

I know HuddleChat is supposed to be a demo of the Google App Engine and not an actual product but they could've put at least a little effort in and not just ripped off 37 Signals.

Saturday Mar 22 2008 1:38 pm by Smokinn

Well, I've been "in the industry" for nearly a year now. And what do people "in the industry" do, really? They write frameworks and white papers of course. I already wrote my own framework, so the next obvious step in my ascension to architecture astronomy was to write a white paper. Which I did. I hope you all enjoy it.

Friday Mar 21 2008 7:32 pm by Smokinn

First, you need to read this essay, Paul Graham's latest. Or at least as much of it as you can stomach before you stop.

Go ahead, I'll wait.

Back?

Ok.

See a problem with the article? I'll give you a hint.

Paul Graham is a human.

Paul Graham likes startups.

Therefore, humans like startups.

Given the above, humans not in startups are obviously sub-human.

That was pretty much the article, just not in so many words.

EDIT: Ok, maybe not sub-human, just unhappy.

EDIT2: Paul wrote an explanation of what he was really trying to say in the essay. I guess this was another issue of the missing body language and non-verbal cues causing people like me to get the wrong message.

Wednesday Mar 12 2008 5:32 pm by Smokinn

I hope the new change works out for the best.

I guess if you follow him on Twitter it's not too hard to guess who in that list is the mysterious new partner though. =)

About the Site:

I might update. Don't hold your breath though.

About Me:

Name: Guillaume Theoret

Age: 793217677 seconds

Job: Mostly web dev
