22nd May 2008

Auto-categorizer by content, and LSI/SEO with competition and link relevancy left out

Finally, this has arrived. The algorithm is my own original development. It often shows pretty good results, but sometimes it can be stubbornly dumb. :) One site I own, about some algorithms and technologies in the area of GIS, was categorized best as “Boring” and “Unfinished”. :) So don’t blame the computer - it’s dumb. It’s also really slow right now, and lazy. “Lazy” is meant in the technical sense: categories are generated when they are needed, not before, like all the other stuff on the site, because it would take my single server more than 50 days to categorize all 70 million pages in the database. Unfortunately, that’s as fast as I can go without resorting to a compiled language such as C.
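To make “lazy” concrete, here is a minimal sketch in Python. It is not the site’s actual code: the `LazyCategorizer` class, the `categorize` callback and the URL-keyed cache are all assumptions made for illustration. The helper at the end just redoes the arithmetic behind the 50-day figure.

```python
from typing import Callable, Dict, List


class LazyCategorizer:
    """Compute categories only when a page is first requested, then cache them.

    Hypothetical wrapper: `categorize` stands in for whatever slow
    text -> categories function the site really uses.
    """

    def __init__(self, categorize: Callable[[str], List[str]]) -> None:
        self._categorize = categorize
        self._cache: Dict[str, List[str]] = {}
        self.computed = 0  # how many pages actually went through the slow path

    def categories_for(self, url: str, text: str) -> List[str]:
        if url not in self._cache:
            # Slow path: runs once per URL, on first request only.
            self._cache[url] = self._categorize(text)
            self.computed += 1
        return self._cache[url]


def pages_per_second(total_pages: int, days: float) -> float:
    """Throughput needed to process `total_pages` within `days`."""
    return total_pages / (days * 24 * 60 * 60)


if __name__ == "__main__":
    lazy = LazyCategorizer(lambda text: sorted(set(text.lower().split())))
    lazy.categories_for("http://a.example/", "GIS maps GIS")
    lazy.categories_for("http://a.example/", "GIS maps GIS")  # served from cache
    print(lazy.computed)
    print(round(pages_per_second(70_000_000, 50), 1))
```

With the numbers from the post, 70 million pages in 50 days works out to roughly 16 pages per second, so “more than 50 days” means the server manages somewhat less than that.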

The algorithm bases its results purely on the words in the text - HTML tags are totally ignored (the title tag is not even considered). It can also return some words that aren’t even in your text. That’s normal - it uses the words to guess a category, and a category can include other words. Just braggin’ :) However, the test suite showed a very big overlap with Google’s results, i.e. if you search for something in Google, let’s say “widgets”, and then run the same sites through my algo, you’d see a lot of “widgets” in my results too. That should suggest some ideas to those who understand what SEO is, what LSA is, and how they’re related.
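The general shape of “words only, tags ignored, category names that may not occur in the text” can be sketched like this. To be clear, this is a toy stand-in and not the algorithm described above, whose details are not published here: the `CATEGORY_CUES` table, the scoring by simple word counts and every function name are assumptions.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from typing import Dict, List, Set


class _TextOnly(HTMLParser):
    """Collects text nodes; tag names and attributes are dropped."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def visible_words(html: str) -> List[str]:
    """Lower-cased words from the text of `html`, with no weight given to tags."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z]+", " ".join(parser.chunks).lower())


# Hypothetical table: category name -> words that hint at it.
# The category name itself does not have to be one of the cue words.
CATEGORY_CUES: Dict[str, Set[str]] = {
    "mapping": {"gis", "projection", "coordinates", "raster"},
    "cooking": {"recipe", "oven", "flour"},
}


def guess_categories(html: str, cues: Dict[str, Set[str]] = CATEGORY_CUES) -> List[str]:
    """Rank categories by how many of their cue words occur in the text."""
    counts = Counter(visible_words(html))
    scores = {cat: sum(counts[w] for w in words) for cat, words in cues.items()}
    ranked = sorted(scores.items(), key=lambda item: -item[1])
    return [cat for cat, score in ranked if score > 0]


if __name__ == "__main__":
    page = "<p class='recipe'>A GIS raster with a projection</p>"
    print(guess_categories(page))
```

In the usage example the page is filed under “mapping” even though that word never appears in it, and the `class='recipe'` attribute has no effect because only text nodes are counted.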

I’m not suggesting that Google is using LSI/LSA or something like what I do, but it surely seems like my algo could be on to something.
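The comparison described above (take the sites a search returns, categorize each one, and see how often the query shows up among the categories) can be expressed as a simple hit rate. This is only a sketch of one way to measure it: the function name, the input shape and the example sites are made up for illustration and are not real results.

```python
from typing import Dict, List


def query_hit_rate(query: str, categories_by_site: Dict[str, List[str]]) -> float:
    """Fraction of sites whose guessed categories contain the search query.

    `categories_by_site` maps a site URL to the categories a categorizer
    produced for it (hypothetical input format).
    """
    if not categories_by_site:
        return 0.0
    wanted = query.lower()
    hits = sum(
        1
        for categories in categories_by_site.values()
        if wanted in (c.lower() for c in categories)
    )
    return hits / len(categories_by_site)


if __name__ == "__main__":
    # Placeholder data, not measurements.
    sample = {
        "a.example": ["Widgets", "Tools"],
        "b.example": ["Gadgets"],
        "c.example": ["widgets"],
        "d.example": [],
    }
    print(query_hit_rate("widgets", sample))
```

A hit rate close to 1.0 for sites taken from the top of a search would be the “very big overlap” the post talks about. A low one would mean the categorizer and the search engine disagree about what those pages are about.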

And if you’re too lazy to guess the idea about SEO - if Google were considering only on-site factors, its results would be almost identical to my categorizer’s. Now think about how you can use that to your advantage. :) But remember that TheRarestWords can re-index your site (use the “add url” form) only once a day, and only main pages right now.

As usual - comments, funny stuff or totally wrong stuff are all welcome in the comments.

I’m thinking of releasing this tool as an AJAX-style text box, so that changes in the text would show the change in categorization almost instantly AND suggest other words to use to achieve a better position in some categories and avoid writer’s block. But that stays in the plans until this project stops eating my money :)

P.S. Honestly… I haven’t read about LSA, just heard the basic idea and I think my algo could be close to it. But I’m too lazy to
