This post is about an absolutely mind-bending program called crm114. To illustrate the insane complexity it can handle, consider this description from the project itself:
CRM114 is a system to examine incoming e-mail, system log streams, data files or other data streams, and to sort, filter, or alter the incoming files or data streams according to the user's wildest desires. Criteria for categorization of data can be via a host of methods, including regexes, approximate regexes, a Hidden Markov Model, Bayesian Chain Rule Orthogonal Sparse Bigrams, Winnow, Correlation, KNN/Hyperspace, Bit Entropy, CLUMP, SVM, Neural Networks ( or by other means- it's all programmable).
That's, ummm, a LOT of ridiculously complex stuff.
Essentially, the program is a text-stream processing system. It has its own programming language (incredibly mind-bending but also very obviously powerful; I haven't used it much, so I don't know it well) and can do just about any sort of statistical processing of text you can think of. The name comes from the movie Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb.
"Now, in order to prevent the enemy from issuing fake or confusing orders, the CRM114 Discriminator is designed not to receive at all...
That is, not unless the message is preceded by the proper three letter code group."
- George C. Scott, playing the role of General Buck Turgidson, in Stanley Kubrick's Dr. Strangelove
By default, crm114 does nothing with what you give it. You could feed it text all day long and it would happily take the input, throw it away and sit there idle. Any time it doesn't know what to do, it defaults to doing nothing, which is a refreshing change from crashing horribly.
When you're building a program with crm114 you would normally define a goal and rules where text would match that goal. Most of the time when a text-processing situation shows up we, as programmers, whip out the old regexes and make a brittle mess of processing it all with what amount to massive nested case statements. The beauty of crm114 is how easily you can use very advanced machine learning and statistical processing algorithms without having to go and implement them yourself.
Now that I've explained the what, let's wade a little into the why. When I got hired at my new job last summer, my first task was to get the "spammer/scammer problem" under control. When most people think of the world, they form a mental image that looks somewhat like this.
Well, unless you're American.
Sorry.
Anyway, the site mainly uses a freemium model where basic members can only communicate with paying members. So what spammers would do is simply make a new account and spam every single one of our paying customers.
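A burst like that stands out in the message log. As a minimal sketch of one way to catch it (an assumption for illustration, not the check the site actually runs), you can count the distinct recipients each sender has messaged inside a sliding time window. The class name BurstDetector and both thresholds are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class BurstDetector:
    """Flags senders who message too many distinct recipients in a time window."""

    def __init__(self, max_recipients: int = 30, window_seconds: float = 3600.0):
        # Both thresholds are made-up defaults; tune them against real traffic.
        self.max_recipients = max_recipients
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[Tuple[float, str]]] = defaultdict(deque)

    def record(self, sender: str, recipient: str, now: float) -> bool:
        """Record one message and return True if the sender now looks like a mass-mailer."""
        events = self._events[sender]
        events.append((now, recipient))
        # Drop messages that have fallen out of the window.
        while events and now - events[0][0] > self.window_seconds:
            events.popleft()
        distinct = {r for _, r in events}
        return len(distinct) > self.max_recipients
```

A new account that messages hundreds of paying members within minutes trips this almost immediately, while a spammer who paces their messages never does.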
Those kinds of spam attacks are relatively easy to detect and block. Running the stats on the scammers we found, you end up noticing a trend. As far as the website is concerned, the world looks more like this:
Basically, places like the USA, Canada, the richer countries of Europe and India bring in money. Places like Nigeria, South Africa and the Philippines, on the other hand, bring in nothing but massive amounts of spam. So our first attempt (before I was hired) at blocking spammers was a strategy I like to refer to as "scorched earth". Visually, it looks something like this:
Of course, banning entire countries doesn't entirely work (but it does help) because of a little thing called proxies. The more sophisticated spammers would simply route their traffic through open proxies in the US and keep their operations running. Since these are the more sophisticated ones they also don't make the mistake of signing up and mass-messaging. They keep their message traffic low so as to stay under the radar. What to do?
This is where crm114 comes in. As far as I can tell, the biggest deployment of crm114 is distributed. There's a crm114 script called mailreaver.crm that I hear is exceptionally good at blocking spam. I've never needed it (thank you Google) so I wouldn't know, but that's essentially what I want to do: classify messages based on their spamminess/scamminess.
With crm114 it's actually ridiculously easy to do. All you need are 3 files.
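#!/usr/bin/crm
# Train the "good" statistics file on the text read from standard input.
{
    learn <osb microgroom> (/home/crmrunner/good.css)
}
- learngood.crm
#!/usr/bin/crm
# Train the "bad" statistics file on the text read from standard input.
{
    learn <osb microgroom> (/home/crmrunner/bad.css)
}
- learnbad.crm
#!/usr/bin/crm
# Exit with status 1 if the input looks bad, 0 otherwise.
{
    {
        # classify succeeds only when the first file (bad.css) is the best match;
        # on failure, control jumps past the end of this inner block.
        classify <osb> ( /home/crmrunner/bad.css | /home/crmrunner/good.css)
        exit /1/
    }
    exit /0/
}
- pick.crm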
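Each script reads the text to process on standard input, and pick.crm reports its verdict through its exit status, so calling them from another program is mostly plumbing. Here is a minimal Python sketch of that plumbing. The post doesn't say what language the site's workers are written in or where the scripts live, so the command paths and the function names looks_like_spam and train are assumptions.

```python
import subprocess
import sys
from typing import Sequence

# Assumed locations: the post names the scripts but not where they are installed.
PICK_CMD = ("/usr/bin/crm", "/home/crmrunner/pick.crm")
LEARN_GOOD_CMD = ("/usr/bin/crm", "/home/crmrunner/learngood.crm")
LEARN_BAD_CMD = ("/usr/bin/crm", "/home/crmrunner/learnbad.crm")


def looks_like_spam(text: str, cmd: Sequence[str] = PICK_CMD) -> bool:
    """Pipe text to the classifier script; exit status 1 means the bad file matched best."""
    result = subprocess.run(list(cmd), input=text.encode("utf-8"))
    return result.returncode == 1


def train(text: str, cmd: Sequence[str]) -> None:
    """Pipe text to one of the learn scripts; raises CalledProcessError if it fails."""
    subprocess.run(list(cmd), input=text.encode("utf-8"), check=True)
```

With this in place, a moderator's verdict becomes train(text, LEARN_BAD_CMD) or train(text, LEARN_GOOD_CMD), and a check is just looks_like_spam(text).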
I use the orthogonal sparse bigram algorithm (which is apparently a good default according to the documentation) with the microgroom option (which removes old entries so that old scam schemes that are no longer used aren't weighted too heavily).
This processing is better done asynchronously, so I run it through beanstalkd: I simply drop the member id into a queue (learngood, learnbad or pick) and have programs listening on those queues. Every message sent on the site triggers a decision by crm114 based on the last 20 messages in that member's message history. When crm114 says the member might be a spammer, the account is flagged and reviewed by a moderator. If the person was erroneously flagged as a potential spammer, crm114 updates the good statistics file, and if the person was confirmed as a spammer, crm114 updates the bad statistics file.
We also do some other processing to flag accounts, but this is definitely the coolest considering how ridiculously easy it was to implement. So far we've been very happy with the results, and the more we use it the more accurate it gets. It's great!
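To make that flow concrete, here is a rough Python sketch of what a worker does with one job pulled off a queue. The beanstalkd client is left out on purpose; handle_job and every callable passed into it are hypothetical names, and feeding the same 20-message history to the learn scripts is my assumption about what gets trained on.

```python
from typing import Callable, List

HISTORY_SIZE = 20  # the decision is based on the member's last 20 messages


def history_text(messages: List[str], size: int = HISTORY_SIZE) -> str:
    """Join the most recent messages into one block of text for crm114."""
    return "\n".join(messages[-size:])


def handle_job(
    queue_name: str,
    member_id: int,
    *,
    load_messages: Callable[[int], List[str]],  # hypothetical: fetch a member's messages, oldest first
    classify: Callable[[str], bool],            # hypothetical: True when the text looks like spam
    learn_good: Callable[[str], None],          # hypothetical: train the good statistics file
    learn_bad: Callable[[str], None],           # hypothetical: train the bad statistics file
    flag: Callable[[int], None],                # hypothetical: queue the account for moderator review
) -> str:
    """Process one job; the queue name decides whether we classify or train."""
    text = history_text(load_messages(member_id))
    if queue_name == "pick":
        if classify(text):
            flag(member_id)
            return "flagged"
        return "clean"
    if queue_name == "learngood":
        learn_good(text)
        return "learned good"
    if queue_name == "learnbad":
        learn_bad(text)
        return "learned bad"
    raise ValueError(f"unknown queue: {queue_name}")
```

Passing the collaborators in as callables keeps the worker logic testable without a running beanstalkd or crm114; the real versions would wrap the queue client and the three crm scripts.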