Sunday Feb 22 2009 1:15 am by Smokinn

Reading blogs is basically the same thing as watching the cooking channel. Neither will make you good at what they're displaying but damn is it fun to watch. It's pretty rare that a blog will get any large number of followers if it's actually technical enough to be useful. The blogs that get all the traffic are the utterly useless (from an education point of view) ones.

It's not that blogs don't serve a purpose. They're a great way for both the writer and the readers to stay passionate about what they do and the technical information posted is great when googled. But make no mistake, reading blogs religiously every day will not make you a better programmer. While plenty of above average programmers read lots of blogs I have a feeling that the truly great programmers don't. They're too busy actually coding and learning to waste their time on such pointless drivel as I'm writing now. (Yes, I am fully aware of the irony here.)

It's very easy to get lulled into thinking you're making yourself a better programmer by reading programming blogs. You aren't. Put the feed reader down and pick up a book. A good one. A single book on a topic you're only vaguely familiar with will teach you more that is useful than a full year of reading blogs. Do yourself a favor and do what I recently did. Delete at least half your RSS feed subscriptions, starting with any feeds that have more than one entry a day. You'll have a lot more time and probably spend it a lot better too.

Sunday Feb 15 2009 1:09 pm by Smokinn

Before I start, I have an obvious bias: I've been editing all the talks from the recent CUSEC 2009 conference to put online. It's time consuming, but I think it's worthwhile. So worthwhile actually that I wonder why today, with cheap-to-free bandwidth available, there are still conferences that do not publish their talks online.

My completely unfounded theory is that the organizers think that by putting their talks online people will avoid paying for a ticket, preferring to simply watch the videos when they get online. However, I think that that sends exactly the wrong message. While some conferences are clearly state-of-the-art and at the cutting edge of current knowledge, others are less so and would benefit from showing just how great the presentations are, confidently putting the presentations out in the world.

Another reason why conferences should be putting their talks online is that doing so would help them accomplish their stated mission: disseminating knowledge. Even conferences that are clearly run for profit and make massive gobs of money like TED (though TED does it for charity) put their talks online now. The cutting edge conferences have nothing to lose by putting the talks online since a year later there will be new and hopefully better solutions to the same or similar problems to discuss. The cutting edge information is simply put out there for everyone to have access to. Speakers at tech conferences routinely put up their slides but the slides alone are often just a fraction of the information conveyed in the talk.

It isn't even hard or expensive. If you're worried about the cost of hosting the videos yourself, you can just pay Vimeo $60 and put up all your talks in "HD quality" and they'll foot the bandwidth bill.

If you run or are involved in running a conference, please do me, yourself, and everyone else a favor. Rent an HD camcorder if you don't have one available and tape the talks. Then throw them up online and blog about them and use them when you're promoting the next year's edition. Everyone wins.

Tuesday Feb 10 2009 11:18 am by Smokinn

This post is about an absolutely mind bending program called crm114. To illustrate the absolute insane complexity it can handle, consider this description:

CRM114 is a system to examine incoming e-mail, system log streams, data files or other data streams, and to sort, filter, or alter the incoming files or data streams according to the user's wildest desires. Criteria for categorization of data can be via a host of methods, including regexes, approximate regexes, a Hidden Markov Model, Bayesian Chain Rule Orthogonal Sparse Bigrams, Winnow, Correlation, KNN/Hyperspace, Bit Entropy, CLUMP, SVM, Neural Networks ( or by other means- it's all programmable).

That's, ummm, a LOT of ridiculously complex stuff.

Essentially, the program is a text-stream processing system. It has its own programming language (that is incredibly mind bending but also very obviously powerful, I just haven't used it much so I don't know it much) and can do about any sort of statistical processing of text you can think of. The name comes from the movie Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb.

"Now, in order to prevent the enemy from issuing fake or confusing orders, the CRM114 Discriminator is designed not to receive at all...

That is, not unless the message is preceded by the proper three letter code group."

- George C. Scott, playing the role of General Buck Turgidson, in Stanley Kubrick's Dr. Strangelove

By default, crm114 does nothing with what you give it. You could feed it text all day long and it would happily just take the input, throw it away and sit there idle. Any time it doesn't know what to do it defaults to doing nothing. Which is a refreshing change from the default being to horribly crash.

When you're building a program with crm114 you would normally define a goal and rules where text would match that goal. Most of the time when a text-processing situation shows up we, as programmers, whip out the old regexes and make a brittle mess of processing it all with what amount to massive nested case statements. The beauty of crm114 is how easily you can use very advanced machine learning and statistical processing algorithms without having to go and implement them yourself.

Now that I've explained the how, let's wade a little into the why. When I got hired at my new job last summer my first task was to get the "spammer/scammer problem" under control. When most people think of the world, they form a mental image that looks somewhat like this.

Well, unless you're American.

Sorry.

Anyway, the site mainly uses a freemium model where basic members can only communicate with paying members. So what spammers would do is simply make a new account and spam every single one of our paying customers.

Those kinds of spam attacks are relatively easy to detect and block. Running the stats on the scammers we found, you end up noticing a trend. As far as the website is concerned, the world looks more like this:

Basically you have places like the USA and Canada that give money along with the rich countries of Europe in general and India. Places like Nigeria, South Africa and the Philippines on the other hand give nothing but massive amounts of spam. So our first attempt (before I was hired) at blocking spammers was a strategy I like to refer to as "scorched earth". Visually, it looks something like this:

Of course, banning entire countries doesn't entirely work (but it does help) because of a little thing called proxies. The more sophisticated spammers would simply route their traffic through open proxies in the US and keep their operations running. Since these are the more sophisticated ones they also don't make the mistake of signing up and mass-messaging. They keep their message traffic low so as to stay under the radar. What to do?

This is where crm114 comes in. As far as I can tell, the biggest deployment of crm114 is distributed. There's a crm114 script called mailreaver.crm that I hear is exceptionally good at blocking spam. I've never needed it (thank you Google) so I wouldn't know, but that's essentially what I want to do. I want to classify messages based on their spamminess/scamminess.

With crm114 it's actually ridiculously easy to do. All you need are 3 files.

#!/usr/bin/crm
{
    # osb and microgroom are the classifier options described below
    learn <osb microgroom> (/home/crmrunner/good.css)
}

- learngood.crm

#!/usr/bin/crm
{
    # osb and microgroom are the classifier options described below
    learn <osb microgroom> (/home/crmrunner/bad.css)
}

- learnbad.crm

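#!/usr/bin/crm
{
    {
        # When bad.css is the better match, execution continues to exit /1/
        # Otherwise classify fails and control skips to the end of the inner block
        classify <osb microgroom> (/home/crmrunner/bad.css | /home/crmrunner/good.css)
        exit /1/
    }
    exit /0/
}

- pick.crm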

I use the orthogonal sparse bigram algorithm (which is apparently a good default according to the documentation) with the microgroom option (which removes old entries so as to not overly weight old scam schemes that are no longer used). Combined with beanstalkd (because it's better to do this processing asynchronously), I simply drop the member id into a queue (either learngood, learnbad or pick) and have programs listening on those queues. Every message sent on the site triggers a decision to be made by crm114 based on the last 20 messages in that member's message history. When crm114 says the member might be a spammer, the account is flagged and reviewed by a moderator. If the person was erroneously flagged as a potential spammer, crm114 updates the good statistics file and if the person was confirmed as a spammer, crm114 updates the bad statistics file.
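To make the flow above a bit more concrete, here is a minimal Python sketch of a queue worker. It is not the code we run: an in-memory deque stands in for beanstalkd, and `run_crm`, `sample_for`, `process` and `CRM_DIR` are names made up for illustration. The only things taken from the setup above are the three script names, the 20-message window, and the fact that pick.crm exits with 1 when the bad statistics file matches best.

```python
import subprocess
from collections import deque

CRM_DIR = "/home/crmrunner"  # assumed location of the three .crm scripts


def run_crm(script, text):
    """Pipe text into a crm114 script and return its exit code.

    Assumes the `crm` binary is on PATH and reads the sample from stdin.
    """
    result = subprocess.run(["crm", f"{CRM_DIR}/{script}"], input=text, text=True)
    return result.returncode


def sample_for(history, limit=20):
    """Build the text sample from the member's most recent messages."""
    return "\n".join(history[-limit:])


def process(queue, histories, review, run=run_crm):
    """Drain (job, member_id) pairs; job is 'pick', 'learngood' or 'learnbad'."""
    while queue:
        job, member_id = queue.popleft()
        text = sample_for(histories[member_id])
        exit_code = run(f"{job}.crm", text)
        # pick.crm exits 1 when the bad statistics file is the better match,
        # so that member goes to a moderator for review.
        if job == "pick" and exit_code == 1:
            review.append(member_id)
```

Passing `run` as a parameter is only there so the loop can be exercised with a fake classifier; a moderator's verdict would then be fed back by queueing a `learngood` or `learnbad` job for that member.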

We also do some other processing to flag accounts but this is definitely the coolest, considering how ridiculously easy it was to implement. So far we've been very happy with the results, and the more we use it the more accurate it gets. It's great!

Saturday Feb 7 2009 2:06 pm by Smokinn

Yes, my blog is back.

I've been meaning to get back to it for quite a while now but I've been busy doing other stuff, and as time went on it got easier and easier to ignore the blog.

It also hasn't helped that I started using twitter. Twitter ate my blog.

The other thing that didn't help is that I now have a girlfriend that I love very much. She's perfect.

And I also started a new job. I think for once I'm finally going to be able to stick somewhere for a while. Last summer I had graduated only a little more than a year earlier but I was already on my third job. Thankfully now I like my boss and I like what I do. I don't much like the company (they treat us, as one of my friends perfectly summed it up, like Zellers employees) or the location (I work out in the West (Waste) Island) but those are just minor annoyances compared to what you actually work on for so many hours a week.

In the coming weeks I'll be posting about some tools I've found and been using mostly due to work requirements. Some of them are very cool.

A hint for the next installment:

Stay tuned.

Friday Aug 29 2008 3:32 pm by Smokinn

PHP arrays have pretty huge size overhead so I ported this ruby class to PHP and use it as my vector of bits.
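For anyone wondering what a "vector of bits" buys you, here is a rough Python sketch of the general idea, not the ported class itself: eight flags are packed into each byte instead of spending a whole array slot per flag. The `BitVector` name and its methods are made up for illustration.

```python
class BitVector:
    """Fixed-size bit vector that packs eight bits into each byte."""

    def __init__(self, size):
        self.size = size
        # Round up so a size that is not a multiple of 8 still fits.
        self.data = bytearray((size + 7) // 8)

    def set(self, index):
        # index >> 3 picks the byte, index & 7 picks the bit inside it.
        self.data[index >> 3] |= 1 << (index & 7)

    def get(self, index):
        return (self.data[index >> 3] >> (index & 7)) & 1
```

With this layout a million-bit filter needs about 125,000 bytes of storage, whatever the per-element overhead of the host language's arrays happens to be.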

I threw up all the code on google code so you can check it out if you're interested.

http://code.google.com/p/phpbloom/

Tuesday Aug 26 2008 4:15 pm by Smokinn

After reading a great article on bloom filters here, I ported his bloom filter to PHP:

EDIT: After some tests I've realized I have a bug somewhere. My false positive rate is WAY higher than the python script.. I'll repost the code once I've found and fixed the bug.

And.. we're back

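class BloomFilter {
    public $population;  // number of strings inserted so far
    public $vector;      // sparse bit vector: index => 1
    public $bit_size;    // number of bit positions in the filter
    public $num_hashes;  // number of indices derived per string

    public function __construct($bit_size, $num_hashes) {
        $this->bit_size = $bit_size;
        $this->num_hashes = $num_hashes;
        $this->vector = array();
        $this->population = 0;
    }

    // True if the string may have been inserted, false if it definitely was not.
    public function contains($string) {
        foreach ($this->_hash_indices($string) as $index) {
            // Positions that were never set count as 0.
            if (!isset($this->vector[$index]))
                return false;
        }
        return true;
    }

    public function insert($string) {
        foreach ($this->_hash_indices($string) as $index) {
            $this->vector[$index] = 1;
        }
        $this->population++;
    }

    // Double hashing: index = (h1 + i * h2) mod bit_size, for i = 1..num_hashes.
    private function _hash_indices($string) {
        // Both hashes depend only on the string, so compute them once.
        $h1 = $this->_num_hash($string);
        $h2 = $this->_hash_string($string);
        $indices = array();
        for ($i = 1; $i <= $this->num_hashes; $i++) {
            $indices[] = ($h1 + $i * $h2) % $this->bit_size;
        }
        return $indices;
    }

    private function _num_hash($string) {
        $size = strlen($string);
        if ($size == 0) return 0;
        $chars = str_split($string);
        $i = 0;
        foreach ($chars as $char) {
            // ord() turns the character into its byte value; a raw character is not a number.
            $i += ($i >> 14) + (ord($char) * 0xd2d84a61);
        }
        $i += ($i >> 14) + ($size * 0xd2d84a61);
        return $i;
    }

    // PJW-style string hash, used as the step size for the double hashing above.
    private function _hash_string($string) {
        $val = 0;
        $array = str_split($string);
        foreach ($array as $char) {
            $val = ($val << 4) + ord($char);
            $tmp = $val & 0xf0000000;
            if ($tmp != 0) {
                // Fold the top four bits back in, then clear them.
                $val = $val ^ ($tmp >> 24);
                $val = $val ^ $tmp;
            }
        }
        return $val;
    }
}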

Turns out the problem was that I was hashing with crc32b for the first hash and that was causing way too many false positives. Tim gave me the code for _num_hash (He found it on a python mailing list as a fast potential replacement for their internal hash function) and now with this hash it works well.
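If you want to check whether your own hash functions are misbehaving the way mine were, one way is to compare the measured false positive rate against the textbook estimate (1 - e^(-kn/m))^k, where m is the bit size, k the number of hashes and n the population. Here's a rough Python sketch of that comparison. It is not the PHP class above: two halves of a SHA-256 digest stand in for the two hashes, and the function names and the "member-"/"probe-" strings are made up for illustration.

```python
import hashlib
import math


def expected_fp_rate(bit_size, num_hashes, population):
    """Textbook estimate (1 - e^(-k*n/m))^k for an ideal Bloom filter."""
    return (1 - math.exp(-num_hashes * population / bit_size)) ** num_hashes


def indices(item, bit_size, num_hashes):
    """Double hashing, with SHA-256 halves as stand-ins for the two hashes."""
    digest = hashlib.sha256(item.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big")
    return [(h1 + i * h2) % bit_size for i in range(1, num_hashes + 1)]


def measured_fp_rate(bit_size, num_hashes, population, probes=10000):
    """Insert `population` strings, then probe with strings never inserted."""
    bits = bytearray(bit_size)  # one byte per bit keeps the sketch short
    for n in range(population):
        for idx in indices(f"member-{n}", bit_size, num_hashes):
            bits[idx] = 1
    hits = sum(
        all(bits[idx] for idx in indices(f"probe-{n}", bit_size, num_hashes))
        for n in range(probes)
    )
    return hits / probes
```

If the measured rate comes out far above the estimate for the same bit size, hash count and population, the hashes are probably not spreading the indices evenly, which is the kind of symptom I was seeing.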

Friday Jun 20 2008 2:33 pm by Smokinn

Last month AVG put out a new version of their anti-virus, version 8.0. It's 8.0 that comes with LinkScanner and AVG LinkScanner is broken. It doesn't handle base href properly and that's why you're seeing crazy urls with js/js/js/js/js/js in your access log.

Here's a couple of (anonymized) examples from our own logs:

255.255.255.255 - - [20/Jun/2008:15:03:52 -0400] "GET /article/92572/js/js/gui.js HTTP/1.1" 500 624 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"

255.255.255.255 - - [20/Jun/2008:15:03:33 -0400] "GET /article/77673/js/js/js/js/js//""+sWOUrl+"/" HTTP/1.1" 500 624 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"
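I don't know what LinkScanner does internally, but here's a small Python sketch of one way ignoring base href could produce paths like the first one above. The example.com URLs are made up, and the assumption is that the scanner resolves a relative script path against the URL it is currently looking at instead of against the base href.

```python
from urllib.parse import urljoin

page = "http://example.com/article/92572/"  # hypothetical page URL
base_href = "http://example.com/"           # hypothetical <base href> value
script = "js/gui.js"                        # relative path as written in the page

# What a browser that honors base href would request.
correct = urljoin(base_href, script)

# Resolving against the page URL instead puts the script under the article.
wrong_once = urljoin(page, script)

# Resolving the same relative path against that wrong URL stacks another js/.
wrong_twice = urljoin(wrong_once, script)

print(correct)
print(wrong_once)
print(wrong_twice)
```

Each extra round of resolving against the previous wrong URL adds another js/ segment, which would fit the js/js/js/js/js pattern.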

I would like everyone to do like we did and redirect its User Agent so that AVG gets the message.

This is what we now have as our first htaccess rule:

RewriteCond %{HTTP_USER_AGENT} "^Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1;1813\)$"

RewriteRule ^.*$ http://www.grisoft.com?linkscanner=spamming_us&please_fix_it_kthx [R,L]

AVG Linkscanner uses 1813 to identify itself so as long as they keep a unique identifier we can cut them off.
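If you'd rather check your own logs before cutting anyone off, here's a small Python sketch of the same match. The `is_linkscanner` name is made up. The thing to watch is that parentheses and dots are regex metacharacters, so they have to be escaped for the user agent to match literally.

```python
import re

# Parentheses and dots are escaped so they match literally.
LINKSCANNER_UA = re.compile(
    r"^Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1;1813\)$"
)


def is_linkscanner(user_agent):
    """True when the user agent carries the ;1813 LinkScanner marker."""
    return LINKSCANNER_UA.search(user_agent) is not None


scanner = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"
real_ie6 = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

# Unescaped, the parentheses become a regex group and the literal "(" never matches.
unescaped = re.search(
    r"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)", scanner
)
```

Here `is_linkscanner(scanner)` is true, `is_linkscanner(real_ie6)` is false, and `unescaped` is `None`, which is why the escaping matters.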

The problem is how aggressive the program is. From avg's own forum:

I had Linkscanner turned on and have Google set to display 100 hits per page. In the process of scanning links, Linkscanner downloaded over 900 MegaBytes of data in one day! Use this feature with care if you have a download quota on your internet account!

You'd think a business whose entire reason to exist is to stop maliciousness wouldn't pretty much spam every site on the web and drive up their bandwidth costs like mad (and slow down Google a lot).

I suggest you add that htaccess rule if you don't want to end up like this poor guy:

Wow. I cannot believe this. We have been fighting performance issues on our web site for the last month, and just commissioned a new server. Then we got our bandwidth overage bill for May, and our bandwidth was more than double (and we got billed huge overages). The bandwidth on our site was going up EXPONENTIALLY! For June, we were looking at being 4-5 times more than our allocated bandwidth, and were looking at more than $5K this month in overages!

What made us realize something was off, was that the page views according to Google Analytics were flat, yet traffic and bandwidth were EXPLODING. Most of this started in Early April, and started heading north in a really scary J curve. But when we ran Webalyzer stats, it indicated that the traffic on the site WAS going up. But since Google analytics only logs page views that the browser renders (via Javascript), none of this showed up in the Google stats. So clearly something OTHER than normal browser traffic was sucking up our bandwidth and CPU time.

That comment was posted on this blog entry and it's that blog article (actually, one of the comments) that first tipped me off. Everything else I found elsewhere just confirmed it. One of the best articles I've found is from TheRegister. It's called AVG scanner blasts internet with fake traffic.

I figure if AVG wants to chew up other people's bandwidth, they can chew up their own. I can't seem to register on their forum and their technical support form requires a product license so while we're at it might as well send them a message in case they actually monitor their access logs.

UPDATE: I've been contacted by AVG. They're putting together a group to address the concerns of webmasters and asked me to be part of it. If you have any comments or suggestions on what they can do to improve LinkScanner let me know and I'll pass it along once I get the group invite.

Friday Jun 20 2008 9:56 am by Smokinn

I learned a really cool trick recently. Don't try this at home.

Sunday Jun 15 2008 5:10 pm by Smokinn

I think yesterday (and just this weekend in general) was one of my best days ever.

It started off rather uneventfully with a haircut. Then I headed off to the old port to check out the Eureka festival. It was an entire afternoon of science prevailing. I even got a plaque carved with a laser! At first I met up with Vijeta but Skrud, Kyle and Heather showed up a little later. After beating the heart (who's in charge now, heart? You think you can flatline on me? I don't think so) and lifting matter with my mind (I still think that one is bullshit.. I want an explanation!) we headed off to Chinatown for some food. Harley joined us at Pho Bang and I had a tasty dinner.

After dinner we headed back to the festival for the laser show. I guess I built it up too much in my mind (we were debating who would win in a fight between a laser and lightning since a storm seemed to be coming) and it was my only letdown of the day. We stuck around for a while amusing ourselves and we tried out the pedophile trap but got bored and managed to get out unarrested.

By now it was getting a bit late so Heather, Kyle and Vijeta headed home. Skrud, Harley and I headed up to the fringe park to wait for the 13th hour show. On the way though, I broke a flip flop. On St-Laurent. At 10:30 pm. Crap. So I ended up walking up and down a good part of St-Laurent on one flip-flop and one bare foot until Clare called and her brother had the great idea of using a shoelace to try and fix it somehow. So we found an open grocery store (after going by 2 closed pharmacies which is where we were actually trying to go) and I bought some shoelaces, MacGyvered my flip flop and had a temporary walking implement. Science prevails!

We met up with Clare, her brother, Sean and Bridget at the 13th hour and it was amazing as usual, plenty of dance parties and lots of funny and entertaining acts. I'm definitely going to be seeing a lot of fringe stuff this week (next time is probably on thrustday) so if you're interested in comedy or just an all-around good time give me a call or leave me a note and I'll let you know what we're up to in the near future.

The best fringe of them all, is the fringe in Montreal.
The fringe in Montreal, is the best fringe of them all.

Wednesday Apr 30 2008 11:22 pm by Smokinn

This post is so incredibly spot-on it hurts.

There are really two types of files. There are files that are private and there are files you don't care about. Because of the security concerns, no one wants to save their company's Word and Excel docs in a vague cloud they're not quite sure they'll ever get the files back from. Not to mention people won't understand what happened. Where did the file go? I don't see it? Oh no my file is gone! How do I get it back? I have to search? Why isn't it on the computer? etc.

The other files are.. well.. just random stuff. They're photos of your summer vacation, photos that are already on facebook and flickr because you wanted to share them. They're mp3s you bought *cough*, but those are already synced between your computer and your ipod and maybe you've even moved past files and use something like deezer or, if you're american, pandora to get your music fill.

And what other media do you really want? Most of it is probably on youtube unless it's a full movie or tv episode in which case you can probably get it off a torrent or an flv streaming site. Soon enough (if the stupid media industry ever wakes up) it'll be cheaply available on-demand.

So what's really left to back up? I'm an outlying case because I have source code files I want to keep but even then I have a subversion server running on my pc and I can check my code into google code (or sourceforge if you prefer or any hosted service if you prefer to keep your code closed source) whenever I want.

I think Joel makes a very compelling point that it's a service we just don't need. It's something we will hopefully very soon be taking for granted. Something every developer will have to deal with if they expect their app to get any sort of market traction. The app will have to work on the web, on the phone, offline, etc. If it doesn't people will instead scratch their head and go for another that's maybe not as good but provides what people will by then consider basic features. An actual service to do this for consumers just won't work.

Of course it's really only partly aimed at consumers. It's mostly aimed at developers. I suppose it's another try to control the API developers mainly write code with but it's not going to work. No one trusts Microsoft. While some people might have their reservations and go ahead anyway because they consider the platform to be better, it certainly won't be the majority. The vast majority will say no thanks I'll develop my app on my own. Then I'll add in some backend "cloud" features. (Which really isn't hard.) Then I'll make a mobile version that works with my "cloud". Then I'll know my stack up and down and be able to fix any problem that comes up (provided I'm competent enough) and won't be dependent on the good graces of any other company, especially one that has proven itself to, time and time again, have very little grace.

About the Site:

I might update. Don't hold your breath though.

About Me:

Name: Guillaume Theoret

Age: 841567596 seconds

Job: Mostly web dev
