Archive for May, 2008

23rd May 2008

New feature: Fight your site! :)

Another fun feature rolls out today - fight your site! It’s a joke, so don’t take it too seriously. It’s actually more of an SEO tool.

See an example here or go to your own site and click the [VS] link under “Related sites”. If you want to compare with another site - just change the page address (it’s pretty self-explanatory).

The address is http://therarestwords.com/vs/your-site.com/competitors-site.com

What this tool does is compare your site’s rare and rarest words against the competition’s and find the words they use that you don’t (but you should!).

Green stuff is the words your site has but the competitor doesn’t. Red stuff is the words your competitor uses but you don’t :)
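
The comparison boils down to two set differences. A minimal sketch in Python, assuming each site is reduced to a plain set of lowercase words; the `words` and `fight` helpers are hypothetical names, and the real tool works on rare/rarest word lists rather than on every word of the text:

```python
import re

def words(text):
    """Lowercase word set of a page's text (a stand-in for its rare-word list)."""
    return set(re.findall(r"[a-z]+", text.lower()))

def fight(your_text, competitor_text):
    """Return the 'green' words (only yours) and 'red' words (only theirs)."""
    yours, theirs = words(your_text), words(competitor_text)
    return {
        "green": sorted(yours - theirs),  # you use them, the competitor doesn't
        "red": sorted(theirs - yours),    # the competitor uses them, you don't
    }

result = fight(
    "cheap domain hosting and whois lookup",
    "domain registrar with dns hosting",
)
print(result["green"])  # ['and', 'cheap', 'lookup', 'whois']
print(result["red"])    # ['dns', 'registrar', 'with']
```

The "red" list is the interesting one for SEO: those are the candidate words to add to your own pages.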
P.S. Congrats, Mircea! Your site won the round :)

Posted in seo, site | No Comments »

23rd May 2008

SEO by example words

While waiting for domain dissection to wrap up, I thought that maybe I could build another index - the rarest/rare words corpus by category. Like this:

       Category: Domain Names

words most used: domain, dns, ip, registrar, hosting, registration
  commonly used: register,  whois
    rarely used: here comes a thousand or so other words
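
A rough sketch of how such buckets could be built, assuming each site in a category is already reduced to a set of words. The function name `bucket_words` and the two thresholds are made up for illustration; the actual cut-offs would need tuning on real data:

```python
from collections import Counter

def bucket_words(site_word_sets, most_used=0.6, commonly_used=0.3):
    """Group words by the share of sites in the category that use them.

    The thresholds are illustrative assumptions, not measured values.
    """
    total = len(site_word_sets)
    counts = Counter(word for site in site_word_sets for word in set(site))
    buckets = {"most used": [], "commonly used": [], "rarely used": []}
    for word, count in sorted(counts.items()):
        share = count / total
        if share >= most_used:
            buckets["most used"].append(word)
        elif share >= commonly_used:
            buckets["commonly used"].append(word)
        else:
            buckets["rarely used"].append(word)
    return buckets

# Four hypothetical sites from the "Domain Names" category
sites = [
    {"domain", "dns", "whois"},
    {"domain", "dns", "register"},
    {"domain", "whois", "parking"},
    {"domain", "dns", "escrow"},
]
for name, bucket in bucket_words(sites).items():
    print(name, bucket)
```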

(more…)

Posted in ideas, seo | No Comments »

22nd May 2008

How to split glued words in domains into parts in PHP/Python

Well, I’ve finally got an idea for an algorithm to split things like belfastjobs.com into Belfast Jobs.com - i.e. to detect words glued together. So, as soon as the algo finishes crunching 70 million domains (which, as you can see, is going to take more than 24 hours), you’re going to have a title on your site, and all the links to related sites will finally become readable.

If you’re going to do that yourself - here’s how I did it:

(more…)
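
The full write-up is behind the link above. Separately from it, a common textbook way to split glued words is dictionary-based dynamic programming, sketched here in Python. This is an assumption about how one could do it, not necessarily the method used on the site; `split_glued` and the tiny `VOCAB` set are hypothetical, and a real run would need a large word list:

```python
def split_glued(text, vocabulary):
    """Split `text` into the fewest dictionary words, or return None."""
    # best[i] is the shortest known segmentation of text[:i], or None
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]

VOCAB = {"bel", "fast", "belfast", "job", "jobs"}

parts = split_glued("belfastjobs", VOCAB)
print(parts)                                      # ['belfast', 'jobs']
print(" ".join(word.capitalize() for word in parts))  # Belfast Jobs
print(split_glued("xyz", VOCAB))                  # None
```

Preferring the fewest words is what picks "belfast" over "bel" + "fast". The double loop is quadratic in the length of the name, which is fine for a single domain but adds up over 70 million of them.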

Posted in php, python | No Comments »

22nd May 2008

Auto-categorizer by content and LSI/SEO which has no competition and link relevancy

Finally this has arrived. The algorithm is my original development. It often shows pretty good results, but sometimes it can be stubbornly dumb. :) One site that I own, which is about algorithms and technologies in the area of GIS, it categorized best as “Boring” and “Unfinished”. :) So don’t blame the computer - it’s dumb. It’s also really slow right now, and lazy. “Lazy” is meant in the technical sense: categories are generated when needed, not in advance like all the other stuff on the site, because it would take more than 50 days for my single server to categorize all 70 million pages in the database. Unfortunately that’s as fast as I can go without resorting to a compiled language such as C.
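
To put numbers on that: 70 million pages in 50 days is roughly 16 pages per second. The lazy part can be sketched in Python as a cache in front of the slow step. Here `slow_categorize` is a hypothetical placeholder for the real categorizer, and `lru_cache` is just one way to get compute-on-first-request behaviour:

```python
from functools import lru_cache

PAGES = 70_000_000
SECONDS_IN_50_DAYS = 50 * 24 * 60 * 60

# Throughput needed to finish the whole database in 50 days
pages_per_second = PAGES / SECONDS_IN_50_DAYS
print(round(pages_per_second, 1))  # 16.2

calls = []  # records which URLs actually hit the slow path

def slow_categorize(url):
    """Hypothetical stand-in for the expensive categorization step."""
    calls.append(url)
    return "category-for-" + url

@lru_cache(maxsize=None)
def category_for(url):
    """Compute a category on first request, then serve it from the cache."""
    return slow_categorize(url)

category_for("your-site.com")
category_for("your-site.com")  # second request is answered from the cache
print(calls)  # ['your-site.com']
```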

The algorithm bases its results purely on the words in the text - HTML tags are totally ignored (even the title tag is not considered). It can also return some words that aren’t even in your text - that’s normal: it uses words to guess the category, and a category can include other words. Just braggin’ :) However, the test suite showed a very big overlap with Google’s results - i.e. if you search for something in Google, let’s say “widgets”, and then run the same sites through my algo, you’d see a lot of “widgets” in my results too. Which should suggest some ideas to those who understand what SEO is, what LSA is, and how they’re related.

I’m not suggesting that Google is using LSI/LSA or something like I do, but it surely seems like my algo could be on to something.

And if you’re too lazy to guess the idea about SEO - if Google were considering only on-site factors, its results would be almost identical to my categorizer’s. Now think how you can use it to your advantage. :) But remember that TheRarestWords can re-index your site (use the “add url” form) only once a day, and only main pages right now.

As usual - any feedback, funny stuff or totally wrong stuff is welcome in the comments.

I’m thinking of releasing this tool as an AJAX-style text box, so that changes in the text would almost instantly show the change in categorization AND suggest other words to use to achieve a better position in some categories and avoid writer’s block. But this is only planned for when this project stops eating my money :)
P.S. Honestly… I haven’t read about LSA, just heard the basic idea and I think my algo could be close to it. But I’m too lazy to

Posted in seo | No Comments »

22nd May 2008

Sites that are related to yours by the rarest words

A new feature accidentally emerged. :) Please welcome - the related sites. Now you don’t have to click each of your rarest words to see the sites that share them (and therefore, possibly, your interests). Some matches are plain wrong - for example, one of my other sites matched a porn site. Well, it’s auto-matching… and it’s pretty dumb… So you’ll have to excuse it.

Posted in site | No Comments »

22nd May 2008

Correct word index is building

Finally got around the bugs that prevented the word index from building correctly. Now you can really see all the sites that use a word, capped at ~600 sites (there are words that don’t make it into the Top25000 but are still used on tens of thousands of sites - I don’t think there’s any reason to list all of those thousands upon thousands of sites for each word).

However, traversing the word index of your rarest words seems to be an interesting way to find similar sites. One idea I’m thinking about is to actually build a cross-list of sites by word index, and from it a list of the sites most related to yours.
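
A minimal Python sketch of that cross-list idea, assuming every site is already reduced to a set of its rarest words. The function names and the sample sites are made up, and ranking by the raw count of shared words is only the simplest possible scoring:

```python
from collections import Counter, defaultdict

def build_word_index(site_words):
    """Invert {site: words} into {word: sites that use it}."""
    index = defaultdict(set)
    for site, site_word_set in site_words.items():
        for word in site_word_set:
            index[word].add(site)
    return index

def related_sites(site, site_words, index, top=5):
    """Rank other sites by how many rarest words they share with `site`."""
    shared = Counter()
    for word in site_words[site]:
        for other in index[word]:
            if other != site:
                shared[other] += 1
    return shared.most_common(top)

# Hypothetical sample data
site_words = {
    "a.example": {"doodles", "gis", "tiles"},
    "b.example": {"gis", "tiles", "raster"},
    "c.example": {"doodles", "recipes"},
}
index = build_word_index(site_words)
print(related_sites("a.example", site_words, index))
# [('b.example', 2), ('c.example', 1)]
```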

Posted in site | No Comments »

21st May 2008

More new features

Thanks to the input from visitors of the site, new features are appearing. For example, the Top 100 sites now have up and down arrows whenever something moves up or down there.

Another new feature is what happened last. It’s kind of a journal of what has been written here in the last 24 hours.

Also, note that you can re-index your sites yourself - put the URL into the “add url” form (at the top of every page on the main site) and press the “re-index” button - the site will be updated. It also works if you want to remove your site - block it with robots.txt and re-index.
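
The removal path relies on the crawler honoring robots.txt. A rough sketch of such a check using Python’s standard `urllib.robotparser`; the user-agent name `ExampleBot` and the helper `is_blocked` are hypothetical, since the crawler’s real name isn’t given here:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that blocks every crawler from the whole site
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

def is_blocked(robots_txt, url, user_agent="ExampleBot"):
    """Return True if robots.txt forbids `user_agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)

print(is_blocked(ROBOTS_TXT, "http://your-site.com/"))  # True
print(is_blocked("", "http://your-site.com/"))          # False
```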

Most of the new stuff appearing looks raw, and it actually is. I have reasons for that: 1) I want to implement a lot of stuff and see what you will actually use; 2) I don’t really have time for this project - I can only do it in my free time, and it’s currently eating all of my free time and even some of my sleep time :)
So the plan is to implement all of the stuff, see what’s going on and what’s useful and then re-implement it with a nice face-lift. Or we’ll leave everything as it is. Time will tell.

Posted in site | No Comments »

20th May 2008

Word-index feature

Finally you can see who else uses the same rarest words as you and therefore potentially find related sites. However, this is a tech preview only. It contains bugs, errors and problems! It’s by no means complete yet. I’m working on it, but just for fun check it out - doodles, help, home, google, appear… or click your own rarest word.

Posted in site | No Comments »

20th May 2008

[BUG] Google isn’t so “inadvanced” after all - new crawl is going on

Today I discovered a nasty bug which caused pages to be parsed incorrectly. A lot of “rarest words” weren’t words at all, but rather two separate words glued together.

That’s why there are words like “inadvanced” or “toolsover” on Google’s page, or other glued words on other pages. I’ve smashed the bug, but the bad news is that it was skewing the results. Now I have to re-crawl the pages. Your pages will be updated within the next 9-10 days (while the crawl goes on). If you want, you can force an update of your pages (or any others, for that matter) via the “add url” link at the top of the page.
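
The exact bug isn’t described here, but one common way to end up with glued words like these is stripping HTML tags without putting whitespace between adjacent text nodes. A small Python illustration of that assumption, using a made-up snippet (not Google’s actual markup) and a hypothetical `extract_words` helper:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes of a page, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def extract_words(html, separator):
    """Join text nodes with `separator`, then split into lowercase words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks).lower().split()

HTML = "<span>Sign in</span><a>Advanced Search</a>"

print(extract_words(HTML, ""))   # ['sign', 'inadvanced', 'search']
print(extract_words(HTML, " "))  # ['sign', 'in', 'advanced', 'search']
```

Joining with an empty string produces the bogus word; joining with a space keeps the words apart.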

Looks like there aren’t 3 million new words on the Web :( Too bad.

Well, this is kind of good and bad news. The bad news is that Trendspotting and other keyword-related features are put away due to lack of data… The good news is that I’ll be concentrating on the “fun” part of the site, making it better (there are a lot of ideas, and with Trends away I finally have time to do them).

I still can’t believe nothing raised my suspicion about so many pages having such weird words. In other good news - I’ve found the lost right sidebar!

Posted in site | No Comments »
