17th May 2008
The history of The Rarest Words site or how one person could crawl the whole Web
In a faraway country, back in the snowy month of October, one guy was eating a hamburger and thinking about how rarely people call it “abreadwithmeat”. So he decided to find out if there were any guys out there who did call it that. I mean - that’s what it is, why call it a “hamburger”? Turns out nobody thought of that. But he thought it might be cool to find out what other weird names weird people give their things.
He sat down at a computer and typed in “rarest words”. Weird as it seems, Google didn’t help here. So, what’s the guy to do? Take a six-pack and start coding! Well… He took a six-pack, and by the time he finally regained consciousness he had forgotten all about it. So it wasn’t until January that he actually got to coding that stuff.
Well, that’s about how this project started in my imagination. In reality it was much more boring, but I did want to find out the rarest words that people use. As it turns out, I needed a lot of money, knowledge and patience to do that.
As usual, the idea came first and the implementation came later. Still, even when I started coding the site I had no idea what it was going to be, what for, or how. All I had was the server I was renting for $100 per month, some money to spend on this hobby (yes, it’s not my job), some knowledge of the English language (yes, I’m not American - you can guess that from my crappy English anyway. Vodka is our drink - try to guess which bear-filled, snow-covered, matrioshka-for-tourists-making country I’m living in. And BTW, I speak Russian exactly as badly as English) and a domain list of about 100 million domains.
I started walking around sites at 1000 per hour. Wow! That’s a lot. Not! If you have 100 million to crawl, that’s 100 000 hours, or more than 11 years to go. Nice number, but not very impressive. I don’t have that time, and I don’t want to yell “Yay! It’s finished” when a hover-car runs over a hover-pedestrian on the porch of my hover-house and mp3 is something defined in ancient-history books.
Anyway, I needed some limits and something called a high-performance queue. The limit was that I would crawl only the main pages of sites for now, as there are sites out there that have millions upon millions of pages (take that stupid therarestwords.com site - there are 100 million pages on it right now - it’s like a copy of the whole Internet, for God’s sake!). And the second thing, which I could hardly understand, wasn’t anywhere to be found. I’ve probably put it somewhere.
Jokes aside - the queue was the worst part, as feeding 100 million entries into MySQL was bad, but not as bad as selecting them. So I wrote my own implementation - the plain-text-file one. I was soon to find out that writing to disk 100 000 times per hour is not such a smart idea…
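To make the disk-write problem concrete, here is a minimal sketch of a plain-text-file queue that saves its position once per batch instead of once per entry. It is written in Python purely for illustration; this post doesn't describe the real queue's format, so the `FileQueue` class, the checkpoint file and the `batch_size` parameter are all made-up names and assumptions, not the actual implementation.

```python
import os


class FileQueue:
    """Hypothetical plain-text queue: one entry per line, resumable position."""

    def __init__(self, path, checkpoint_path, batch_size=1000):
        self.path = path                        # text file with one domain per line
        self.checkpoint_path = checkpoint_path  # small file holding a byte offset
        self.batch_size = batch_size            # entries consumed between disk writes
        self.checkpoint_writes = 0              # counter, only here to show the saving

    def _load_offset(self):
        # Resume from the last saved byte offset, or from the start of the file.
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                return int(f.read().strip() or 0)
        return 0

    def _save_offset(self, offset):
        # Write to a temporary file and rename it, so a crash in the middle
        # of the write never leaves a half-written checkpoint behind.
        tmp = self.checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            f.write(str(offset))
        os.replace(tmp, self.checkpoint_path)
        self.checkpoint_writes += 1

    def __iter__(self):
        # Binary mode keeps tell()/seek() offsets exact.
        with open(self.path, "rb") as f:
            f.seek(self._load_offset())
            since_checkpoint = 0
            while True:
                line = f.readline()
                if not line:
                    break
                yield line.decode("utf-8").strip()
                # We only get here once the consumer asks for the next entry,
                # i.e. after the previous one has been handled.
                since_checkpoint += 1
                if since_checkpoint >= self.batch_size:
                    self._save_offset(f.tell())
                    since_checkpoint = 0
            self._save_offset(f.tell())
```

With `batch_size=1000`, a run over 100 000 entries touches the checkpoint file about 100 times instead of 100 000. The price is that after a crash up to `batch_size` entries get processed again, which is harmless when the work is just re-downloading a page.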
Long story short - I uploaded that script to the server and started parsing those 1700 GB of data at a speed of 300 000 sites per hour. I compiled my first big list of words from something like 100 000 sites, and something called “the law of large numbers” suggested that if the word “the” appeared there often, it would be really popular on the whole Web.
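The sampling idea can be sketched in a few lines: count words over a sample of pages and treat each word's share of the sample as an estimate of its share on the whole Web. This is a Python illustration only; the `sample_frequencies` function and the crude "runs of ASCII letters" tokenizer are assumptions, since this post doesn't say how the real word list was built.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, compared in lowercase.
WORD_RE = re.compile(r"[a-z]+")


def sample_frequencies(pages):
    """Return each word's share of all words seen in the sampled page texts."""
    counts = Counter()
    for text in pages:
        counts.update(WORD_RE.findall(text.lower()))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}
```

The bigger the sample, the closer these shares should get to the real ones, which is why a list built from 100 000 sites is a reasonable stand-in for 100 million.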
So I made 3 lists: words that are too popular (“the, that, there, even, about, shopping, …”), somewhat popular (“compare, revenue, never, blogs, …”) and not-very-popular (“oversee, drew, scholar, …”), and wrote a PHP script to download a page, analyze its text and match it against those words.
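Here is a rough sketch of the matching step, in Python rather than the PHP the real script was written in. The three sets contain only the example words quoted above; the `classify_words` function is a made-up name, and the "unlisted" bucket (words found in none of the lists, so possible rare-word candidates) is my guess at how such a match could be used, not a description of the real script.

```python
import re

# Only the sample words quoted in the post; the real lists were much longer.
TOO_POPULAR = {"the", "that", "there", "even", "about", "shopping"}
SOMEWHAT_POPULAR = {"compare", "revenue", "never", "blogs"}
NOT_VERY_POPULAR = {"oversee", "drew", "scholar"}


def classify_words(text):
    """Split the distinct words of a page's text into the three lists plus leftovers."""
    # Assumption: a "word" is a run of ASCII letters, compared in lowercase.
    words = set(re.findall(r"[a-z]+", text.lower()))
    listed = TOO_POPULAR | SOMEWHAT_POPULAR | NOT_VERY_POPULAR
    return {
        "too_popular": words & TOO_POPULAR,
        "somewhat_popular": words & SOMEWHAT_POPULAR,
        "not_very_popular": words & NOT_VERY_POPULAR,
        "unlisted": words - listed,  # in none of the lists
    }
```

Sets make each lookup cheap, which matters when the same check runs against every downloaded page.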
And there I was, about to be discovered by the Web, attacked, fooled by my own stupidity, etc. But that’s part 2.