26th Jul 2008
I will probably release my 4th hobby project within 24 hours (in case you’ve just turned on your TV, the first three are The Rarest Words, The Rarest News and The Craziest Ideas). The 4th is called “Suggestan”.
Webster defines “Suggestan” as “1. geo. A little country where everyone is suggesting something.” OK, I’m just kidding.
The project is going to be yet another joke project which has some meaning. Expect something kind of like “The Rarest Words”, but for hobbies, professions and ideas instead of words. Expect something slightly more complex than 1 textbox under a hobby name.
This project is going to once again “tap into crowdsourcing”, and maybe even into “semantics” as it’s going to define some relations between words, hobbies, ideas, places, etc. It’s definitely not something revolutionary, but if you like “The Rarest Words” - you’re going to enjoy Suggestan too.
Posted by admin under ideas | Comments Off
20th Jul 2008
I love Django for many reasons, and here’s one of them. Measuring the average query time, which I did today to compare engines (MySQL vs PostgreSQL), is easy with Django.
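The code from the full entry isn’t reproduced here. As a minimal sketch of the idea, assuming DEBUG = True in settings.py so that Django records each executed query in django.db.connection.queries (a list of dicts whose 'time' value is the duration in seconds, stored as a string), the average could be computed with a small helper. The helper name average_query_time is hypothetical:

```python
def average_query_time(queries):
    """Return the mean duration, in seconds, of the recorded queries.

    `queries` is expected to look like django.db.connection.queries:
    a list of dicts with 'sql' and 'time' keys, where 'time' is a string.
    """
    if not queries:
        return 0.0
    return sum(float(q["time"]) for q in queries) / len(queries)


# Hypothetical usage in a Django shell (requires DEBUG = True):
#   from django.db import connection, reset_queries
#   reset_queries()
#   ... run the ORM calls you want to measure ...
#   print(average_query_time(connection.queries))
```

Switching the database engine in settings.py and repeating the same calls would then give comparable averages for each backend.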
Read the rest of this entry »
Posted by admin under python, site | No Comments »
20th Jul 2008
UPDATE: This has been on Reddit. Read the comments. The main thing to understand is that these results are for the default settings of both databases, for my case and my priorities. Yours could (and maybe even should) be different.
Well, this year and last I’ve heard everywhere that PostgreSQL is the way to go and that using MySQL in 2008 makes people puke… but without any real arguments (besides “Postgres is the way to go”). Well, I don’t usually buy into fashion-style technology shopping (that’s when someone can’t prove that something is better than what I use), and this time wouldn’t be an exception.
OK, so scouring the Internet I found some comparative tests, mostly in the form of “INSERT 10000 items WITH COMMIT AT THE END”. Okay, how many people have actually inserted 10000 items in a real web application (besides dumping, restoring and moving data)? Some people did, but they were both unavailable for comment.
Just kidding.
OK, so since I’m with Django, moving to Postgres and testing my application (RarestNews) should be a snap, shouldn’t it? Just change the database string in settings.py and install PostgreSQL, right? Wrong!
But everything in its time, step by step.
Read the rest of this entry »
Posted by admin under site | No Comments »
20th Jul 2008
Well, after trying hundreds of ways to make Python’s garbage collector work with Django’s ORM and threading (see here - scroll to “Python 2.5 bug”) and using many tools (heapy, valgrind) to try to find the leaks (all the tools show 15-30MB used and no leaks, but in reality the program uses all available memory and starts to swap within minutes), I have to conclude that there doesn’t seem to be a workaround.
I’ve tried:
- passing only integers instead of Django objects;
- adding +'' to strings to make copies of the strings, not references (copy.copy and copy.deepcopy too);
- creating threads inside of threads, hoping that would lose the references somehow;
- del Object; del everything;
- weakrefs;
- moving all Django code into a function and only passing integer to it;
- disabling Django and only leaving lxml (parsing library) in thread, and vice versa - still leaks;
- something else too, but can’t remember all.
The worst part is that nothing detects where those GIGABYTES are going, every tool I used shows 15-30MB memusage.
The workaround I’ve settled on is running separate child processes that connect to a parent “queuer” process via XML-RPC. It takes a lot of memory (each process is 17MB vs. a few KB for a thread), and my guess is that XML isn’t the most efficient transport, but at least there are no memleaks, even if a child is running an infinite loop.
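The actual queuer code isn’t shown in this post. A minimal sketch of the general shape, using only the standard library (written for Python 3, which is newer than the Python 2.5 mentioned above), might look like the following. The function names next_job and report, the squaring “work” and the use of -1 as a stop signal are all assumptions made for illustration:

```python
import subprocess
import sys
import threading
from xmlrpc.server import SimpleXMLRPCServer

# Code run by each child process: pull jobs from the parent until told to stop.
CHILD_CODE = """
import sys
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy(sys.argv[1])
while True:
    job = proxy.next_job()
    if job < 0:          # -1 means the queue is empty
        break
    proxy.report(job, job * job)   # placeholder for the real work
"""


def run_queue(jobs, workers=2):
    """Serve non-negative integer jobs to child processes over XML-RPC."""
    pending = list(jobs)
    results = {}

    def next_job():
        return pending.pop() if pending else -1

    def report(job, value):
        results[job] = value
        return True      # XML-RPC cannot marshal None by default

    # Port 0 lets the OS pick a free port for the parent "queuer".
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(next_job)
    server.register_function(report)
    threading.Thread(target=server.serve_forever, daemon=True).start()

    url = "http://127.0.0.1:%d" % server.server_address[1]
    children = [
        subprocess.Popen([sys.executable, "-c", CHILD_CODE, url])
        for _ in range(workers)
    ]
    for child in children:
        child.wait()     # memory held by a child is returned to the OS on exit

    server.shutdown()
    server.server_close()
    return results
```

Because each child is a separate process, whatever it leaks goes away when it exits, at the price of the per-process memory overhead described above.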
If you have other ideas to try - let me know.
Posted by admin under site | No Comments »
19th Jul 2008
UPDATE: The response to this post by one of the developers of CouchDB.
The RarestNews project is currently being re-tooled, for several reasons. First of all, there’s a design flaw: accumulating 100 000 news articles every day in a single MySQL database is a bad idea. The database started to crawl (i.e. became really slow) on the 10th day, and on the 20th day it nearly came to a stop. Actually, the site stopped being responsive at all (even at the start it wasn’t really responsive, but that was all due to normalization, which I was taught is good, but it’s not… in some cases). So the technical description of the scalability problems I’m facing is as follows.
The only thing that hasn’t given me problems is the new Amazon EC2 High-CPU instances. They’re terrific. Everything else is crap.
OK, not everything, but when misused the way I was misusing it, it probably is.
MySQL problems
So, to be technical: I used MyISAM tables (I never really liked InnoDB because of its slow writes, and at 100k new articles a day with lots of metadata to write about them, like tags, dates, snippets, word frequencies, etc., it seemed like a good decision). The bad part is that MyISAM locks the whole table on write. So 50 bots scouring the Web for news, each writing and locking the whole table, made the site almost unresponsive.
I’m not yet sure how to solve it - with InnoDB, with PostgreSQL or with some kind of new-age databases like CouchDB, StrokeDB, maybe Amazon’s SimpleDB, etc…
Read the rest of this entry »
Posted by admin under site | No Comments »
27th Jun 2008
I’ve upgraded the bot for TheRarestWords (about TheRarestWords) to 0.4b today, the new version has these improvements:
- Umlauts are now recognized as letters, and actually… all national letters are recognized, except for Japanese, Chinese, etc. - the words there are actually phrases, and since they don’t use spaces to separate words, I’ve no idea how to split them into words. (Ideas, anyone?)
- External domain redirects are recognized and ignored (these are usually either misspellings or a spam-like technique)
- Internal domain redirects are recognized (META redirects too)
- Multiple pages instead of just one (if your main page has fewer than 100 words, the bot goes further, up to 10 pages deep, to find some)
- Frames are now recognized too
- Improved HTML support (more tolerant of errors)
Also, the new bot stores a datetime component for words, so trends can now be built after a few walks around the web (one walk takes about 55 days, since this project still can’t make any money to cover the expenses and is still on a single server).
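How the bot actually splits text isn’t shown in this post. One way to treat umlauts and other national letters as letters, assuming Python 3’s Unicode-aware re module, is to match runs of characters that are word characters but neither digits nor underscores. The function name split_words is hypothetical, and this sketch does not solve the Chinese/Japanese case from the list above, since it still relies on separators between words:

```python
import re

# [^\W\d_] means "a word character that is not a digit or an underscore",
# which for str patterns covers letters of any script, umlauts included.
_LETTER_RUN = re.compile(r"[^\W\d_]+")


def split_words(text):
    """Return the lowercased runs of letters found in `text`."""
    return [word.lower() for word in _LETTER_RUN.findall(text)]
```

For example, split_words("Über die Brücke, 2 mal!") keeps "über" and "brücke" intact and drops the digit and punctuation.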
Posted by admin under site | No Comments »
22nd Jun 2008
Well, I had barely left the TV today when Russia won against Holland in soccer and I went out to shout a few slogans (there are lots of people shouting and cars honking right now because this is our greatest achievement in the last decade or two), but by the time I was back at the computer there were already 4 news stories about it in the soccer section of The Rarest News. Man, that’s fast! Right now there are more than 10 links to stories about it on the main page.

At the same time Google News has not a word about it - only politics!

Posted by admin under site | No Comments »
20th Jun 2008
This post is outdated, as it describes the previous version of RarestNews. The current version is under heavy development. A preview can be seen here.
Please welcome The Rarest News
It’s not yet quite what I bragged it would be, but there’s only one reason for that: not enough server power.
Hopefully, AdSense will help with that.
OK, so what is it? It’s yet another hobby project of mine (The Rarest Words being the first). It started because I couldn’t find anything interesting on Google/Yahoo News. Politics - politics - politics. I don’t care for politics. There’s much more to the world than presidents meeting and Britney’s sister giving birth. Oh yeah - I don’t care for pop either.
So, I’ve decided that I could write something better and in fact I tried. Read the rest of this entry »
Posted by admin under site | No Comments »
20th Jun 2008
The news project I’ve been talking about is now much closer to public release than ever. Currently it uses 9000 news sources, but that number depends only on how much I can optimize the software, as it’s fully automatic (I don’t even have to point it at the sources - it finds them on its own). I first started by adding news sources manually, but then I realized that I’d have to pay attention, so I decided to use the Applauso-meter (The Simpsons).
OK, jk, I decided that I didn’t want to do that, so the next few weeks were a matter of writing an automated news-adder.
The project is moving very slowly due to a lot of setbacks. Like when I started it with 100 000 sources and it took the scheduler 28 hours to get the idea that it was 28 hours late picking up the news (it was still doing the first pass).
The schedule optimizer was also one of the worst things I’ve had to develop so far. It tries to predict when news will arrive at a site in order to optimize the number of visits to that site. The problem is that it takes 5-10 hours to build the schedule, and only then can I figure out whether it works or not.
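The optimizer’s actual algorithm isn’t described here. Purely to illustrate the idea of predicting the next visit from past arrivals, a naive sketch could use the average gap between previously seen news items. The function name and the averaging approach are assumptions, not the project’s method:

```python
def predict_next_visit(arrival_times):
    """Guess when the next item will appear on a site.

    `arrival_times` is a sorted list of timestamps (e.g. seconds) at which
    earlier items were seen. The guess is the last arrival plus the mean gap.
    """
    if len(arrival_times) < 2:
        raise ValueError("need at least two arrivals to estimate a gap")
    gaps = [later - earlier
            for earlier, later in zip(arrival_times, arrival_times[1:])]
    return arrival_times[-1] + sum(gaps) / len(gaps)
```

A site that posted at hours 0, 10 and 20 would be revisited around hour 30, so rarely updated sites get visited rarely and busy ones often.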
So, the project has already been restarted like 30-40 times from the ground up, but it finally seems to be working. It still needs a complete rewrite, but for now it should do.
Anyway, stay tuned, the release date is hopefully going to be within a week or maybe two. And if this project could bring in some money to pay for all the servers it uses - who knows - maybe it could even be up to 100 000 news sources this year 
Posted by admin under site | No Comments »
04th Jun 2008
I’ve come across an interesting comment by Steven Dowd from http://newton-le-willows.com and thought that it should be posted here as well:
I have been log watching, and noticed quite a number of hits to my sites with rarestwords as the referrer, any hits are a bonus for a personal homepage site such as mine, so I am glad that I found your project and added a little bit into it..
What I have found useful is the ability to lookup similar sites, and sites also using the same key words. I have found that over this last week, I have managed to get my sites onto the top of Google for certain searches, which I have failed to get #1 position before, I believe this is purely down to filtering the content I have on the front page and key word usage that I have fine tuned through the use of TheRarestWords system.
I still think its brilliant, though I really do now think that its most definatly missing a ’search a word’ input box.
Well, the search box is definitely missing, but I’m still thinking of a way to do it without putting 3 input boxes at the top (my site, site search and word search).
So, have you found this site useful? Share your story.
Posted by admin under site | No Comments »