[Download] Wikipedia n-grams (word frequency)

I really liked the idea of Google releasing their n-grams (frequency counts of words and phrases across the web; an “n-gram” is just “n words” in a row). But their terms are TERRIBLE! The “do not use it” license is kind of… dumb. No, really - their license amounts to “pay $180, but you can’t use it for anything more than quoting a few phrases from it”. All that for $180? I’d rather buy about 100 beers with that money (hmm.. better to spend only $100 on beer, I need to keep something for a new liver too :) ). Beer is cheap and terrible in Russia :)
Well, since I had about 40 hours of free computer time - here are word counts for the English Wikipedia.
(Small sample)

freedom involves	3
freedom iron works	2
freedom iron	2
freedom is a	3
freedom is achieved	2
freedom is allowed	2
freedom is armed	2
freedom is attained	4

Download: (via torrent)
File size: ~180MB in .zip file / ~700MB unpacked
n-grams count: 36.3 million
File format:
Plain text, tab-separated, with some garbage (for example, the {{disambig}} tag is counted as a word/unigram “disambig”).
License: FREE as in “Do whatever you want with those, but don’t do what I wouldn’t” :)
Filtered from file: n-grams that occur only once
Included: unigrams, bigrams, trigrams (in plain English: from 1 to 3 words in phrase)
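
If you want to poke at the file from code, here is a minimal Python sketch of reading the format described above. It assumes each line is "phrase, a tab, then the count", as in the small sample; the helper names (read_ngrams, count_by_order) and the file name in the usage note are made up for illustration, and the UTF-8 encoding is my guess, not something I have verified against the file. Lines that don't fit the pattern are skipped, since the file contains some garbage.

```python
import io
from collections import Counter


def read_ngrams(lines):
    """Yield (words, count) pairs from 'phrase<TAB>count' lines.

    Lines that do not match the assumed format are silently skipped.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        # Split on the LAST tab, so the count is whatever follows it.
        phrase, sep, count = line.rpartition("\t")
        words = tuple(phrase.split())
        if not sep or not words or not count.strip().isdigit():
            continue  # malformed or empty line
        yield words, int(count.strip())


def count_by_order(lines):
    """Count how many n-grams there are of each length (1, 2 or 3 words)."""
    orders = Counter()
    for words, _ in read_ngrams(lines):
        orders[len(words)] += 1
    return orders


# A few lines from the sample above, standing in for the real file.
sample = (
    "freedom involves\t3\n"
    "freedom iron works\t2\n"
    "freedom iron\t2\n"
    "freedom is a\t3\n"
)
orders = count_by_order(io.StringIO(sample))
for n in sorted(orders):
    print(f"{n}-grams: {orders[n]}")
```

To run it on the real thing, pass an open file instead of the sample, e.g. open("ngrams.txt", encoding="utf-8", errors="replace"), where "ngrams.txt" is a placeholder for whatever the unpacked file is called. Reading line by line keeps memory use low, which matters with ~700MB of text.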

The download requires a BitTorrent client. If you don’t have one - just get uTorrent. Install it, press “File” -> “Add torrent from URL” and paste the URL from above. As much as I love Opera, avoid Opera’s BitTorrent client - it’s too slow.

IMPORTANT. Please link ONLY to this page [ http://rarestblog.com/download-wikipedia-n-grams-word-frequency/ ] (don’t link directly to the file/torrent), as the torrent location/tracker might change in the future if the distribution costs me too much.
