[Download] Wikipedia n-grams (word frequency)
I really liked the idea of Google releasing their n-grams (global web word and phrase frequency counts; an “n-gram” is just “n words”). But their terms are TERRIBLE! The “do not use it” license is kind of… dumb. No, really: their license is “pay $180, but you can’t use it for more than quoting a few phrases from there”. All that for $180? I’d rather get about 100 beers for that money (hmm, better spend only $100 on beer, I need to keep something for a new liver too). Beer is cheap and terrible in Russia.
Well, since I had about 40 hours of free computer time, here are the English Wikipedia word counts.
(Small sample)
freedom involves	3
freedom iron works	2
freedom iron	2
freedom is a	3
freedom is achieved	2
freedom is allowed	2
freedom is armed	2
freedom is attained	4
Download: (via torrent)
File size: ~180MB in .zip file / ~700MB unpacked
n-grams count: 36.3 million
File format: Plain text, tab-separated, with some garbage (for example, the {{disambig}} tag is counted as the word/unigram “disambig”).
License: FREE as in “Do whatever you want to with those, but don’t do what I wouldn’t”
Filtered from file: n-grams that occur only once
Included: unigrams, bigrams, trigrams (in plain English: from 1 to 3 words in phrase)
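For anyone who wants to load the file in Python, here is a minimal sketch. It assumes each line is the n-gram, then a tab, then the count (the order shown in the sample above). The function names are made up for illustration, and the file name and UTF-8 encoding mentioned in the comment are guesses, not something this page confirms.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def parse_ngrams(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (ngram, count) pairs, skipping lines that don't fit the assumed layout."""
    for line in lines:
        # Assumed layout: "<ngram><TAB><count>". rpartition splits on the LAST tab.
        ngram, sep, count = line.rstrip("\r\n").rpartition("\t")
        if not sep or not ngram or not count.strip().isdigit():
            continue  # no tab, empty n-gram, or non-numeric count: skip the line
        yield ngram, int(count)


def count_by_length(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Tally how many distinct n-grams have 1, 2 or 3 words."""
    lengths = Counter()
    for ngram, _ in pairs:
        lengths[len(ngram.split())] += 1
    return lengths


# A few lines from the sample above, written in the assumed tab-separated form.
sample = [
    "freedom involves\t3",
    "freedom iron works\t2",
    "freedom iron\t2",
    "freedom is a\t3",
]

pairs = list(parse_ngrams(sample))
lengths = count_by_length(pairs)

# For the real file, pass an open file object instead of the list, e.g.
# (hypothetical file name, encoding assumed):
#   with open("wikipedia-ngrams.txt", encoding="utf-8") as f:
#       for ngram, count in parse_ngrams(f):
#           ...
```

Reading line by line keeps memory use low, which matters with ~700MB of text. Note that this only skips malformed lines; garbage tokens like “disambig” still look like valid n-grams and have to be filtered separately.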
Download requires a BitTorrent client. If you don’t have one, just get uTorrent. Install it, press “File” -> “Add torrent from URL” and paste the URL from above. Much as I love Opera, avoid Opera’s BitTorrent client: it’s too slow.
IMPORTANT. Please link ONLY to this page [ http://rarestblog.com/download-wikipedia-n-grams-word-frequency/ ] (don’t link directly to the file/torrent), as the torrent location/tracker might change in the future if the distribution costs me too much.