A penny for thoughts?

About the correct valuation

On screaming loudly

2 comments

Of all the speakers I’ve met at CUSEC, my favourite is Zed Shaw. His talk was honest, fun and (for me) life-changing. But that’s not why he’s my favourite speaker. CUSEC has lots of amazing speakers. He’s my favourite because he’s so honest and puts up with the flak he gets because of it. Granted, he’s not always the most diplomatic character online, but his views represent mine perfectly when it comes to online community fiascoes that keep cropping up.

Recently, we had the whole charade of Matt Aimonetti’s porn presentation blow up and spew out all over blogs and social news sites. This happens all. the. time. And I’m sick of it.

It’s one of the reasons I deleted all social news sites from Google Reader. The other is Zed’s ZSFA character. He’s retired it now but the point seems to have been lost on pretty much everyone. People only seem to listen to people who scream loudly and often. People who are outraged. Outraged I say! Scala can’t be faster than Ruby!? If Ruby isn’t fast enough for you, you’re doing it wrong!

Whatever. I’m out. Bye.

I’ve started reading Dive Into Python. I went through the Django book and I’m starting a Django project. Seems like the Python folks can get their act together without the huge amount of drama in the Ruby community. Yes, Ruby. For a long time I stuck around writing Ruby while avoiding Rails, internalizing the “Rails is not Ruby” argument, but I’m through. The drama is just inescapable so now I’m just ignoring anything Ruby-related. Sorry Ruby. It was good times but Python is a perfectly acceptable substitute. I’ll check back in in maybe a year or two. Maybe Rails Bridge will have reversed the trend by then. I hope they do.

Written by Smokinn

May 5th, 2009 at 5:03 pm

Posted in Uncategorized

On my bookshelf

I have a pretty simple system for sorting books. If they’re standing up, I’ve read them. If they’re flat on their side in front of books standing up I haven’t.

Right now these books are flat:

Started

Not-yet-started

Written by Smokinn

April 20th, 2009 at 4:41 pm

Sheeple

Written by Smokinn

April 16th, 2009 at 8:57 pm

Hadoop is not the best

2 comments

While certainly amazing, true, it’s becoming one of my pet peeves to see statements like:

As costs for internet start-ups decreases, amazing open-source technologies like hadoop continue to spread, and talent realizes [...]

Hacker News (sorry about singling you out)

Why always Hadoop? To help balance things out a bit, here’s a list of amazing open source products I’ve come across. I’ve used most of them in production and have at least prototyped something in all of them. I would put every one of them at the level of Hadoop, and some even higher, particularly the ones that match its quality but have broader applicability.

memcached

I had to start with the darling of web developers everywhere. No doubt memcached has saved an uncountable number of us from massive infrastructure costs and numerous outages. It makes scaling SO much easier. Scaling dynamic websites to millions of concurrent users used to be near impossible without massive investment in talent and hardware. With the current crop of available tools, memcached primary among them, costs have plummeted and any good developer can do it himself.

Sphinx

Sphinx, technically, is a full-text search engine. It’s a testament to its brilliance though that I don’t even use it for its primary purpose. I use it for general search indexing. The more search criteria you add in mysql, the slower it gets. The more search criteria you add with sphinx, the *faster* it gets. Some queries could take 30+ seconds on mysql (if the tables got locked up it could be worse) and now take ~0.12s with sphinx. I run a distributed index on a single machine. I split up the index into 4 chunks so that any search executed will run in parallel on 4 cores (the machine has 8), merge the results and send them back. Millions of rows searched by arbitrary criteria in 0.12s.

I actually lied about the full-text search. There’s a varchar field that can be searched. A search approximating SELECT * FROM table WHERE field LIKE ‘%word%’ always runs in 0.000s on sphinx. It’s so fast sphinx would need more than 3 decimals of precision to measure it. Amazing.

Sphinx is also very good at geo-location searches (only return results within a radius around a certain point, return the distance from a certain point with all results, etc).

Were I to rework an existing system using my current tools, I’d use mysql mainly as a key/value store that I would only fall back on when the data wasn’t in memcached (putting that data in memcached before returning it) and run any queries that are the least bit heavy against sphinx.
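To make that read path concrete, here is a minimal sketch in Python. It is not code from the real system: the class name, the plain dict standing in for a memcached client and the `load_from_db` callable standing in for a primary-key lookup in mysql are all made up for illustration.

```python
from typing import Callable, Dict, Optional


class ReadThroughStore:
    """Look a key up in the cache first and fall back to the database on a miss."""

    def __init__(self, cache: Dict[str, str],
                 load_from_db: Callable[[str], Optional[str]]) -> None:
        self.cache = cache                # hypothetical stand-in for a memcached client
        self.load_from_db = load_from_db  # hypothetical stand-in for a key lookup in mysql
        self.db_reads = 0                 # counts how often we had to hit the database

    def get(self, key: str) -> Optional[str]:
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1
        value = self.load_from_db(key)
        if value is not None:
            self.cache[key] = value       # populate the cache before returning
        return value


rows = {"member:1": "alice", "member:2": "bob"}
store = ReadThroughStore(cache={}, load_from_db=rows.get)
first = store.get("member:1")   # miss: read from the database, then cached
second = store.get("member:1")  # hit: served from the cache
```

Only the first lookup touches the database. Anything heavier than a key lookup would go to sphinx rather than through this path.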

beanstalkd

Any scalable system needs a way to do work asynchronously. If you always do everything synchronously you’re in for a world of pain once you get traffic. Beanstalkd lets me queue work. For simple stuff that isn’t absolutely critical a fast in-memory queue is perfect. With beanstalkd you connect to a “tube” (any name you want, if it doesn’t exist yet it’ll create it) and write to it. A consumer “watches” all tubes it’s interested in and requests jobs from any of those tubes. That’s it. It doesn’t do anything else. If someone pulls the plug from the server or the server just plain crashes and needs to reboot everything in those tubes is gone forever.
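The tube model is small enough to imitate in a few lines. The sketch below is a toy in-process imitation written in Python, not beanstalkd’s protocol or any client library’s API. The class and method names are my own, and real beanstalkd does this over the network.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Optional, Tuple


class TubeQueue:
    """Toy in-memory work queue with named tubes; nothing is persisted."""

    def __init__(self) -> None:
        # A tube springs into existence the first time it is used.
        self.tubes: Dict[str, Deque[str]] = defaultdict(deque)

    def put(self, tube: str, job: str) -> None:
        """Producer side: append a job to the named tube."""
        self.tubes[tube].append(job)

    def reserve(self, watched: Iterable[str]) -> Optional[Tuple[str, str]]:
        """Consumer side: return (tube, job) from the first watched tube with work."""
        for tube in watched:
            if self.tubes[tube]:
                return tube, self.tubes[tube].popleft()
        return None


queue = TubeQueue()
queue.put("email", "welcome:42")
queue.put("thumbnails", "photo:7")

watched = ["thumbnails", "email"]
job1 = queue.reserve(watched)
job2 = queue.reserve(watched)
job3 = queue.reserve(watched)  # both tubes are empty by now
```

Since everything lives in one process’s memory, killing the process loses every queued job, which is the same trade-off described above.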

RabbitMQ

I use beanstalkd because everything I do asynchronously isn’t critical and I like the extremely simple model. If you need a serious message broker for serious business though, RabbitMQ is what you want. Now, that isn’t to say RabbitMQ is over-complicated. It isn’t. But it can do a whole lot more than put 1 job in 1 tube and listen on 1 or more tubes at a time.

RabbitMQ keeps your queue persistent. If your server shuts down, when you come back up your messages are still there. (And hopefully you have transparent failover if this is an important system.) I’m not going to try and enumerate everything you can do with RabbitMQ since this post would never end but be sure to take a look at some of the messaging scenarios described in their FAQ. People have done everything from implementing chat rooms to collaborative editing to file streaming.

xhprof

This beauty was the thing I needed most without realizing it. Everyone should have a lightweight profiler they can run in production. Most of my code that runs synchronously when someone requests a page is PHP. xhprof profiles a request and outputs the aggregate wall time, cpu usage and memory usage that php functions take. It calculates inclusive time (which includes calls to other functions and waiting for them to return) and exclusive time (time spent in that function only) for all function calls. It lets you drill up and down function calls to see who calls a particular function and what that function calls. This profiler let me find several bottlenecks that could be fixed very simply with a few lines of code. They hadn’t been fixed because no one knew they were a problem. Very low hanging fruit sitting there right in front of us but we were blind. Now we can see.
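The inclusive/exclusive distinction is easy to show with made-up numbers. This Python sketch is not xhprof’s data format or API. The `Call` class, the function names and the timings are invented, and it only shows the arithmetic: exclusive time is inclusive time minus the inclusive time of the direct callees.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Call:
    name: str
    inclusive_ms: float  # wall time from entry to return, callees included
    children: List["Call"] = field(default_factory=list)


def exclusive_times(root: Call) -> Dict[str, float]:
    """Exclusive time = inclusive time minus the inclusive time of direct callees."""
    totals: Dict[str, float] = {}
    stack = [root]
    while stack:
        call = stack.pop()
        own = call.inclusive_ms - sum(c.inclusive_ms for c in call.children)
        totals[call.name] = totals.get(call.name, 0.0) + own
        stack.extend(call.children)
    return totals


# Hypothetical request: the page takes 100 ms in total.
page = Call("render_page", 100.0, [
    Call("load_member", 70.0, [Call("mysql_query", 60.0)]),
    Call("build_html", 20.0),
])
times = exclusive_times(page)
```

Here `render_page` and `load_member` both look expensive by inclusive time, but the exclusive numbers show that nearly all of it is spent inside `mysql_query`.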

Cascading

Just to prove I don’t hold any grudges against Hadoop, here’s an amazing Hadoop related tool that I came across a while ago. Cascading is a framework that makes it conceptually easier to write map/reduce scripts. Instead of working out the low level map/reduce yourself (which can be very frustrating since it’s very different from how we normally program), Cascading lets you break down the problem into a “Source” (the raw data), a series of transformations on data streams and a final “Sink” (which could possibly be used as another Source as necessary). It compiles the appropriate map/reduce script and runs it. An added bonus is that it doesn’t just run the script, it optimizes it too. My first map/reduce scripts were technically correct but very slow. Rewriting my goals in Cascading sped things up enormously.

redis

AKA memcached-of-the-future. I haven’t actually prototyped this one yet but plan on doing it very soon. (As in, later this week.) Basically it’s a key-value store like memcached but it adds support for data structures such as lists and sets and atomic push/pop operations. Very useful. To stay fast the data is kept in memory and periodically written to disk. (The frequency of disk writes is configurable.) They’re also adding Master/Slave replication support!

Tokyo Cabinet/ LightCloud

If all you need is a blazingly fast but persistent key-value store I highly recommend Tokyo Cabinet (TC). I used it on a small internal project with lots of data containing lots of associations and it performed beautifully. Technically, I didn’t use TC directly. I used plurk.com’s LightCloud which is a set of management tools and a Python client that talks to TC through Tokyo Tyrant, a high speed network interface. Check out TC’s benchmarks. It’s only slightly slower than memcached and it’s persistent on disk!

Frameworks

I specifically avoided talking about web development frameworks because I wanted to discuss open source tools that are independent of frameworks. I added this section simply to mention that many of these tools can integrate seamlessly into some of the more popular frameworks. (Like how HyperRecord integrates Hypertable into Ruby on Rails and Carrot adds RabbitMQ support to Django among many other examples.)

Hypertable

I mentioned HyperRecord above. HyperRecord integrates with Ruby on Rails’ ActiveRecord ORM but uses Hypertable instead of a traditional database. Hypertable is basically a BigTable clone. Given the success BigTable has brought Google I’m pretty sure everybody would love to have their very own. Thanks to Doug Judd and his employer zvents, we can. I’ve only done some basic prototyping but, when Rails 3 comes out, I’ll probably try using it as a backend for a Rails project. Could definitely be interesting.

Closing

This is some of the more promising/amazing stuff I’ve been playing around with over the past 6 months. There are so many promising projects to explore. If any have worked particularly well for you I would love to hear about it and check it out.

Written by Smokinn

April 15th, 2009 at 11:00 pm

Xhibit bugs in my code

My bug was in a script that ran forever. Whenever someone changed their info the object was reloaded in the script, but the reloaded object always kept the old values.

A while back I noticed that on certain pages the majority of the time was spent talking to memcached. I then saw that the code was requesting the same keys over and over and over again. The quick fix was simply to add a static array in the Cacher class so that whenever a memcached get is called it first looks at that static array to see if the result is already there thereby saving a network round-trip.

Basically I put a cache in my cache so that I could cache my cache.
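Here is a minimal Python sketch of that idea, and of how it can bite a script that never exits. It is not the actual Cacher class (which was PHP). The dict standing in for memcached and the `forget` method are assumptions of mine, and the stale-read scenario is just one plausible reading of the bug described above.

```python
from typing import Dict, Optional


class Cacher:
    """Wraps a remote cache with a per-process dict to skip repeated round-trips."""

    _local: Dict[str, str] = {}  # plays the role of the static array

    def __init__(self, remote: Dict[str, str]) -> None:
        self.remote = remote     # hypothetical stand-in for a memcached client
        self.remote_gets = 0     # counts simulated network round-trips

    def get(self, key: str) -> Optional[str]:
        if key in Cacher._local:
            return Cacher._local[key]
        self.remote_gets += 1
        value = self.remote.get(key)
        if value is not None:
            Cacher._local[key] = value
        return value

    @classmethod
    def forget(cls, key: str) -> None:
        """Drop one key so the next get() goes back to the remote cache."""
        cls._local.pop(key, None)


remote = {"member:1": "old info"}
cacher = Cacher(remote)
before = cacher.get("member:1")

remote["member:1"] = "new info"  # someone changes their info
stale = cacher.get("member:1")   # a long-running process still sees the old value

Cacher.forget("member:1")
fresh = cacher.get("member:1")   # now it reloads from the remote cache
```

On a normal page request the local copy dies with the request, so it never has time to go stale. In a script that runs forever it lives as long as the process does unless something clears it.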

Written by Smokinn

March 23rd, 2009 at 9:56 pm

Blogs are edutainment

Reading blogs is basically the same thing as watching the cooking channel. Neither will make you good at what they’re displaying but damn is it fun to watch. It’s pretty rare that a blog will get any large number of followers if it’s actually technical enough to be useful. The blogs that get all the traffic are the utterly useless (from an education point of view) ones.

It’s not that blogs don’t serve a purpose. They’re a great way for both the writer and the readers to stay passionate about what they do and the technical information posted is great when googled. But make no mistake, reading blogs religiously every day will not make you a better programmer. While plenty of above average programmers read lots of blogs I have a feeling that the truly great programmers don’t. They’re too busy actually coding and learning to waste their time on such pointless drivel as I’m writing now. (Yes, I am fully aware of the irony here.)

It’s very easy to get lulled into thinking you’re making yourself a better programmer by reading programming blogs. You aren’t. Put the feed reader down and pick up a book. A good one. A single book on a topic you’re only vaguely familiar with will teach you more useful things than a full year of reading blogs. Do yourself a favor and do what I recently did. Delete at least half your RSS feed subscriptions, starting with any feeds that have more than one entry a day. You’ll have a lot more time and probably spend it a lot better too.

Written by Smokinn

February 22nd, 2009 at 1:15 am

Why are you hoarding your talks?

Before I start, I have an obvious bias: I’ve been editing all the talks from the recent CUSEC 2009 conference to put online. It’s time consuming, but I think it’s worthwhile. So worthwhile actually that I wonder why today, with cheap-to-free bandwidth available, there are still conferences that do not publish their talks online.

My completely unfounded theory is that the organizers think that by putting their talks online people will avoid paying for a ticket, preferring to simply watch the videos when they get online. However, I think that that sends exactly the wrong message. While some conferences are clearly state-of-the-art and at the cutting edge of current knowledge, others are less so and would benefit from showing just how great the presentations are, confidently putting the presentations out in the world.

Another reason why conferences should be putting their talks online is that doing so would help them accomplish their stated mission: disseminating knowledge. Even conferences that are clearly run for profit and make massive gobs of money like TED (though TED does it for charity) put their talks online now. The cutting edge conferences have nothing to lose by putting the talks online since a year later there will be new and hopefully better solutions to the same or similar problems to discuss. The cutting edge information is simply put out there for everyone to have access to. Speakers at tech conferences routinely put up their slides but the slides alone are often just a fraction of the information conveyed in the talk.

It isn’t hard or expensive either. If you’re worried about the costs of hosting the videos yourself, you can just pay vimeo $60 and put up all your talks in “HD quality” and they’ll foot the bandwidth bill.

If you run or are involved in running a conference, please do me, yourself, and everyone else a favor. Rent an HD camcorder if you don’t have one available and tape the talks. Then throw them up online and blog about them and use them when you’re promoting the next year’s edition. Everyone wins.

Written by Smokinn

February 15th, 2009 at 1:09 pm

The CRM114 Discriminator vs the Spammers

3 comments

This post is about an absolutely mind bending program called crm114. To illustrate the absolutely insane complexity it can handle, consider this description:

CRM114 is a system to examine incoming e-mail, system log streams, data files or other data streams, and to sort, filter, or alter the incoming files or data streams according to the user’s wildest desires. Criteria for categorization of data can be via a host of methods, including regexes, approximate regexes, a Hidden Markov Model, Bayesian Chain Rule Orthogonal Sparse Bigrams, Winnow, Correlation, KNN/Hyperspace, Bit Entropy, CLUMP, SVM, Neural Networks ( or by other means- it’s all programmable).

That’s, ummm, a LOT of ridiculously complex stuff.

Essentially, the program is a text-stream processing system. It has its own programming language (that is incredibly mind bending but also very obviously powerful, I just haven’t used it much so I don’t know it well) and can do about any sort of statistical processing of text you can think of. The name comes from the movie Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb.

“Now, in order to prevent the enemy from issuing fake or confusing orders, the CRM114 Discriminator is designed not to receive at all…
That is, not unless the message is preceded by the proper three letter code group.”

- George C. Scott, playing the role of General Buck Turgidson, in Stanley Kubrick’s Dr.Strangelove

By default, crm114 does nothing with what you give it. You could feed it text all day long and it would happily just take the input, throw it away and sit there idle. Any time it doesn’t know what to do it defaults to doing nothing. Which is a refreshing change from the default being to horribly crash.

When you’re building a program with crm114 you would normally define a goal and rules where text would match that goal. Most of the time when a text-processing situation shows up we, as programmers, whip out the old regexes and make a brittle mess of processing it all with what amount to massive nested case statements. The beauty of crm114 is how easily you can use very advanced machine learning and statistical processing algorithms without having to go and implement them yourself.

Now that I’ve explained the how, let’s wade a little into the why. When I got hired at my new job last summer my first task was to get the “spammer/scammer problem” under control. When most people think of the world, they form a mental image that looks somewhat like this.

Well, unless you’re American.

Sorry.

Anyway, the site mainly uses a freemium model where basic members can only communicate with paying members. So what spammers would do is simply make a new account and spam every single one of our paying customers.

Those kinds of spam attacks are relatively easy to detect and block. Running the stats on the found scammers you end up noticing a trend. As far as the website is concerned, the world looks more like this:

Basically you have places like the USA and Canada that give money along with the rich countries of Europe in general and India. Places like Nigeria, South Africa and the Philippines on the other hand give nothing but massive amounts of spam. So our first attempt (before I was hired) at blocking spammers was a strategy I like to refer to as “scorched earth”. Visually, it looks something like this:

Of course, banning entire countries doesn’t entirely work (but it does help) because of a little thing called proxies. The more sophisticated spammers would simply route their traffic through open proxies in the US and keep their operations running. Since these are the more sophisticated ones they also don’t make the mistake of signing up and mass-messaging. They keep their message traffic low so as to stay under the radar. What to do?

This is where crm114 comes in. As far as I can tell, the biggest deployment of crm114 is distributed. There’s a crm114 script called mailreaver.crm that I hear is exceptionally good at blocking spam. I’ve never needed it (thank you Google) so I wouldn’t know but that’s essentially what I want to do. I want to classify messages based on their spammy/scamminess.

With crm114 it’s actually ridiculously easy to do. All you need are 3 files.

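#!/usr/bin/crm
# Train the "good" statistics file on the text read from stdin.
{
learn <osb microgroom> (/home/crmrunner/good.css)
}

- learngood.crm

#!/usr/bin/crm
# Train the "bad" statistics file on the text read from stdin.
{
learn <osb microgroom> (/home/crmrunner/bad.css)
}

- learnbad.crm

#!/usr/bin/crm
{
{
# classify succeeds only if bad.css (left of the |) is the best match;
# otherwise it fails and skips to the end of the inner block.
classify <osb microgroom> ( /home/crmrunner/bad.css | /home/crmrunner/good.css)
# Best match was bad.css: exit with status 1.
exit /1/
}
# Best match was good.css: exit with status 0.
exit /0/
}

- pick.crm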

I use the orthogonal sparse bigram algorithm (which is apparently a good default according to the documentation) with the microgroom option (which removes old entries so as to not overly weight old scam schemes that are no longer used). Combined with beanstalkd (because it’s better to do this processing asynchronously), I simply drop the member id into a queue (either learngood, learnbad or pick) and have programs listening on those queues. Every message sent on the site triggers a decision to be made by crm114 based on the last 20 messages in that member’s message history. When crm114 says the member might be a spammer, the account is flagged and reviewed by a moderator. If the person was erroneously flagged as a potential spammer, crm114 updates the good statistics file and if the person was confirmed as a spammer, crm114 updates the bad statistics file.
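To give a feel for what the classifier is counting, here is a rough Python sketch of how orthogonal sparse bigram features are usually described: each word is paired with each of the few words before it, and the number of skipped words becomes part of the feature. This follows the general description of the technique, not crm114’s source. The function name, the `<skip>` marker and the window size of 5 are assumptions for illustration, and the hashing and weighting crm114 does afterwards are left out.

```python
from typing import List


def osb_features(tokens: List[str], window: int = 5) -> List[str]:
    """Pair each token with each of the up-to (window - 1) tokens before it."""
    features: List[str] = []
    for i, current in enumerate(tokens):
        for gap in range(1, window):
            j = i - gap
            if j < 0:
                break
            # The number of skipped tokens is part of the feature itself.
            skips = ["<skip>"] * (gap - 1)
            features.append(" ".join([tokens[j], *skips, current]))
    return features


features = osb_features("send money via wire".split())
# "send money", "money via", "send <skip> via", "via wire",
# "money <skip> wire", "send <skip> <skip> wire"
```

Because the pairs tolerate gaps, a scammer who slips a filler word between “send” and “wire” still produces features the classifier has seen before.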

We also do some other processing to flag accounts but this is definitely the coolest considering how ridiculously easy it was to implement. So far we’ve been very happy with the results, and the more we use it the more accurate it gets. It’s great!

Written by Smokinn

February 10th, 2009 at 11:18 am

IT’S ALIIIVE!!

one comment

Yes, my blog is back.

I’ve been meaning to get back to it for quite a while now but I’ve been busy doing other stuff, and as time went on it got easier and easier to ignore the blog.

It also hasn’t helped that I started using twitter. Twitter ate my blog.

The other thing that didn’t help is that I now have a girlfriend that I love very much. She’s perfect.

And I also started a new job. I think for once I’m finally going to be able to stick somewhere for a while. Last summer I had been out of school for only a little more than a year but I was already on my third job. Thankfully now I like my boss and I like what I do. I don’t much like the company (they treat us, as one of my friends perfectly summed it up, like Zellers employees) or the location (I work out in the West (Waste) Island) but those are just minor annoyances compared to what you actually work on for so many hours a week.

In the coming weeks I’ll be posting about some tools I’ve found and been using mostly due to work requirements. Some of them are very cool.

A hint for the next installment:

Stay tuned.

Written by Smokinn

February 7th, 2009 at 2:06 pm

Connecting to teamspeak

First, download the Client for your OS from here:

http://www.goteamspeak.com/?page=downloads

(The 5.59 MB one if you’re on windows, the 7.18 MB one if you’re on linux but if you’re on ubuntu teamspeak is in the package manager so you can just go to add/remove programs)

Once downloaded and installed, start it up.

The procedure to connect to the server is this:

Connection -> Connect
Label: CUSEC
Server Address: 216.99.51.131
(Be careful not to put in any spaces; it’s picky about that.)
Nickname: Your name
Anonymous
Server Password: cusec
Default channel: cusec
Default subchannel: cusec2
Channel Password: cusec

I also suggest you change from voice activated talking to “push to talk” mode.

To do this, go to Settings->Sound Input/Output Settings
Select Push to talk
Click Set
Press any key on your keyboard
Click close

From then on, only when you’re pressing that key (no matter what app you’re using so pick something weird like F12) will your mic be activated.

Also, we’re chatting on irc on the freenode network. We’re in #cusec so either connect there with your favorite irc client or if you don’t want/have an irc client you can connect through www.mibbit.com

Written by Smokinn

September 16th, 2008 at 7:55 pm
