

Archive for the 'word frequency' Category

14th Nov 2008

It’s like the Chinese room experiment

This is more about the upcoming SEmangic - I’ve improved the algorithm even more.

Damn, I love science. The computer doesn’t know squat about what those words mean or how they’re related to “skiing”, but look at it go:

1000 ngrams analyzed: skiing,winter,mountain,industry,beach,map,summer,unique,enjoy,culture,
welcome,season,offers,room,beautiful,built,shop,outdoor,golf,areas

6000 ngrams analyzed: skiing,winter,snow,shops,ski,fishing,hiking,village,accommodation,resort,
alpine,zealand,finest,magnificent,guests,springs,unit,bathroom,vacation,attractions

15000 ngrams analyzed: skiing,ski,shops,hiking,village,accommodation,alpine,resort,magnificent,
guests,mountains,climbing,scenic,trails,harbour,comfortable,bookings,prestigious,seasons,magazines

32000 ngrams analyzed: skiing,ski,hiking,alpine,magnificent,climbing,scenic,trails,harbour,
bookings,prestigious,seasons,magazines,coastal,majestic,situated,renowned,picturesque,superb,lodge

With each iteration the words get more and more closely related. Damn, and that’s with only 7000 random sites as training data! Only 24 (yep, twenty-four) of them contain the word “skiing”!
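The post doesn’t say how the algorithm actually works, so the snippet below is not it. It’s just a minimal sketch of one common way to get “related words” out of n-grams: count which words share an n-gram with the seed word and rank them by that count. The function name `related_words` and the sample n-grams are made up for illustration. The more n-grams go in, the less noisy the counts become, which is the same kind of effect as the lists above tightening with each iteration.

```python
from collections import Counter

def related_words(ngrams, seed, top=5):
    """Rank words by how often they share an n-gram with the seed word.

    Hypothetical helper: a plain co-occurrence count, not the blog's algorithm.
    """
    counts = Counter()
    for ngram in ngrams:
        if seed in ngram:
            # Count every other word that appears alongside the seed.
            counts.update(w for w in ngram if w != seed)
    return [word for word, _ in counts.most_common(top)]

# Made-up trigrams standing in for n-grams collected from crawled sites.
sample_ngrams = [
    ("alpine", "skiing", "resort"),
    ("skiing", "and", "hiking"),
    ("alpine", "skiing", "trails"),
    ("beach", "resort", "bookings"),
]
print(related_words(sample_ngrams, "skiing", top=3))
```

A real version would at least need to filter out filler words like “and”, which co-occur with everything.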

Science is magic! I’m even having second thoughts about whether I should release this at all :) I really feel like I’m the “tester” in the Chinese room experiment and the computer is playing me.

Posted in Uncategorized, word frequency | Comments Off

01st Aug 2008

All Wikipedia’s n-grams are REALLY belong to us

A few days ago I was thinking about Google releasing its n-grams a while back. Damn, that was the second thing I’d wanted to get, after the TLD Zone Access Program. [I did apply to it, but never heard back from them. In our open information age, the most wanted information, even though it seems open, is usually out of reach, as in those 2 cases.]

So, I’ve decided to do the next best thing - build an n-gram frequency list from Wikipedia. It’s not quite as big as Google’s (with a trillion or so n-grams and 5 DVDs), but the licensing terms are much better (“Free” vs Google’s “$180”, “you can use it” vs Google’s “you can’t use it” license).
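For anyone wondering what “building an n-gram frequency list” boils down to, here is a minimal sketch. It assumes the article text has already been extracted to plain strings; the `tokenize` and `count_ngrams` helpers and the two sample sentences are made up for illustration, and this is not the code used for the actual download.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def count_ngrams(texts, n=2):
    """Count every run of n consecutive words across all texts."""
    counts = Counter()
    for text in texts:
        words = tokenize(text)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Two made-up sentences standing in for Wikipedia article text.
sample_texts = [
    "Alpine skiing is a winter sport.",
    "Skiing is popular in alpine resorts.",
]
bigram_counts = count_ngrams(sample_texts, n=2)
for ngram, count in bigram_counts.most_common(3):
    print(" ".join(ngram), count)
```

On a full Wikipedia dump the counts won’t fit comfortably in one in-memory `Counter`, so a real run would probably count in chunks and merge the partial results.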

Read more and download here

Posted in downloads, word frequency | Comments Off