18th May 2008
Site’s history: Part II - ah, destiny, thou art a heartless bitch
That’s how the original site arrived. I decided it would be cool to let people know that they’re using some of the rarest words on the whole Web, so I added a referer link to my crawler and the crowd came in. The problem was - there was nothing to see yet. In a hurry I had to create a site (now that’s what you call dumb: let people know there’s something about them when that something hasn’t even appeared yet). Okay, so the first site was made in a hurry, but it did serve its purpose of showing the rarest words. I had also begun my Journey of Great Stupidity, involving destroyed hardware and a lot of my nerves.
Soon I was to discover that the high-performance queue I had written, based on a plain-text file, was like a swarm of genetically modified thumbtacks on my chair, jumping around and expecting fun. It was saving the results many times each second. Actually, about 250 thousand times per hour. Those of you who know more about web crawlers than I do are free to make a correct guess about what happened.
Yep.. my HDD died. The random accesses the queue was doing, plus the saves to delete the already parsed sites, were practically burning the hard drive. No backups, no site, nothing… My datacenter offered me a replacement, but I was soon to find out that this one was already heading toward the light too. (Not because of me.)
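To make the problem concrete, here is a minimal sketch of that kind of queue. It is not the original PHP code, just a guess at the pattern written in Python: every pop removes the first line by rewriting the rest of the file. The function names and the temp file are made up for illustration. At 250 thousand saves per hour that is roughly 69 rewrites per second.

```python
import os
import tempfile


def pop_by_rewrite(path):
    """Pop the first line of a plain-text queue by rewriting the whole file.

    Returns (item, bytes_rewritten). This is the costly pattern: every pop
    writes all remaining entries back to disk.
    """
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    if not lines:
        return None, 0
    head = lines[0].rstrip("\n")
    rest = "".join(lines[1:])
    with open(path, "w", encoding="utf-8") as f:
        f.write(rest)
    return head, len(rest.encode("utf-8"))


def drain_queue(urls):
    """Fill a temporary queue file with urls, pop everything, count the writes."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(u + "\n" for u in urls)
        popped, rewritten = [], 0
        while True:
            item, n = pop_by_rewrite(path)
            if item is None:
                break
            popped.append(item)
            rewritten += n
        return popped, rewritten
    finally:
        os.remove(path)


if __name__ == "__main__":
    items, total = drain_queue(["a.example", "b.example", "c.example"])
    print(items, total)
```

The bytes written grow roughly with the square of the queue length, which is the kind of load that wears a disk out. An append-only file plus a separate read offset would avoid the rewrites, though that is a suggestion and not something the original setup did.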
I was really discouraged by that, and by the fact that the server wasn’t getting anywhere near the speed I needed (at that time it was a Celeron with 256MB, maybe 5 years old). I did remember that Amazon had something high-performance to use. Some quick googling around, and there it was: Amazon Web Services and, in particular, EC2. Voila!
So, I read up on how to start those things (little virtual servers, each a hundred times more powerful than my server) and uploaded my PHP scripts to them. They were maybe destroying Amazon’s HDDs (who knows), but they were working.
At that time posts started appearing around the Internet about a mysterious project. Domain-by-proxy? Owner unknown! Amazon Web Services? What’s that at all? There were more questions, but I really didn’t want to answer any of them. Especially since some people got to me through e-mail (yep, that long e-mail in my Whois info does work) and I was attacked and harassed by them. A lot of them had the annoying idea that my message «your site isn’t found in our database, maybe we haven’t been to it or maybe there is no text on main page or frames or flash-based-site or maybe your site is porn-related» definitely meant there was porn somewhere on their site (whereas they had no text at all on the main page), and that I should stop going to their site or they’d sue me. An international lawsuit over a message that is correct.. A lot of fun. But that’s something you have to be prepared for when doing a big-scale project (even if it’s a hobby).
How much did I spend to crawl the whole web with Amazon services? $121.05. Yes, really! That’s computer time and traffic. I also spent $130 on another server I was renting, plus a lot on my own Internet traffic (which was about $30 per GB in my location at the time). Those guys (Amazon + my server) did a great job and I ended up with a 2TB file with a list of all sites and words. Great. If I were to download it, I’d pay $60,000 for that.
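The download figure is simple arithmetic, sketched here in Python. The function name is made up for illustration, and the sketch assumes 1 TB = 1000 GB, which is what the $60,000 figure implies.

```python
def download_cost_usd(size_gb, price_per_gb=30):
    """Cost of pulling size_gb over a link billed at price_per_gb dollars per GB."""
    return size_gb * price_per_gb


if __name__ == "__main__":
    two_tb_in_gb = 2 * 1000  # assumption: decimal terabytes
    print(download_cost_usd(two_tb_in_gb))  # 60000
```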
After some more easy programming I ended up with a file that was about 7 gigs = $210 more for ya. At that time I thought that I really needed to pay my bills with some advertising. So I put up the ads that were most relevant to the users of the site - SEO, affiliate programs, etc. - since there were only webmasters on the site.
But I was soon to find out that this would come back and bite me in some fleshy parts of the body.
BTW. The quote in the title is from “The Big Bang Theory”.