22nd May 2008
How to split glued words in domains into parts in PHP/Python
Well, I’ve finally worked out an algorithm to split things like belfastjobs.com into Belfast Jobs.com - i.e. to detect words glued together. So, as soon as the algo finishes crunching 70 million domains (which, as you can see, is going to take more than 24hrs), you’re going to have a title on your site and all the links to related sites will finally become readable.
If you’re going to do that yourself - here’s how I did it:
First of all you need a function that generates every way of cutting a word into subwords, like this: “menu”: m-enu, me-nu, men-u, m-e-nu, m-en-u, me-n-u, m-e-n-u. Get it? Here’s the source for it in PHP and Python.
Call the PHP one like this:
<?php
wordsTriage($txt, 'myFunction');
function myFunction($wordsToTestArray) {
    // called with one array of subwords per split variant - see below
}
?>
And the Python one:
def myFunction(wordsList):
    pass  # see below

wordsTriage(txt, myFunction)
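The linked source isn't reproduced here, so here is a minimal sketch of what such a function could look like in Python. The name words_triage and the "call the callback once per split variant" behaviour are my assumptions based on the calls above, not the original code. Note that a word of n letters has 2^(n-1) - 1 such variants, so long names get expensive quickly.

```python
def words_triage(txt, callback):
    """Call callback(parts) once for every way of cutting txt into 2+ pieces.

    Hypothetical stand-in for the wordsTriage function mentioned in the post.
    """
    def walk(rest, parts):
        if not rest:
            if len(parts) > 1:  # skip the uncut word itself
                callback(parts)
            return
        # try every possible length for the next piece
        for i in range(1, len(rest) + 1):
            walk(rest[i:], parts + [rest[:i]])

    walk(txt, [])


# Collect the variants for "menu" as dash-joined strings
found = []
words_triage("menu", lambda parts: found.append("-".join(parts)))
print(found)
```

For "menu" this produces the same seven variants listed earlier, just in a different order.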
Now we only have to test whether the words in each array are real words. You’d need a word list for that (suggestion: try here), or if you want to go my way, get a list of domains from sites like who.is. Now split all domains by dashes (“-”), yes, dashes! You can build a big word list from that, counting how often each word appears. Filter out the entries that appear only once.
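Building that word list could look roughly like the sketch below. The function name build_word_counts, the sample domains, and the way the TLD is dropped (everything after the first dot) are assumptions for illustration only.

```python
from collections import Counter


def build_word_counts(domains):
    """Count dash-separated words across domain names (hypothetical helper)."""
    counts = Counter()
    for domain in domains:
        # keep only the part before the first dot, e.g. "belfast-jobs"
        name = domain.lower().split(".")[0]
        counts.update(part for part in name.split("-") if part)
    # drop the entries that appear only once
    return {word: n for word, n in counts.items() if n > 1}


# Made-up sample input
sample = [
    "belfast-jobs.com",
    "jobs-online.net",
    "belfast-hotels.com",
    "cheap-hotels.org",
]
word_counts = build_word_counts(sample)
print(word_counts)
```

With real data you would feed in millions of domains; the filter at the end is what keeps one-off junk out of the list.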
Next, use the function above to test whether each word in a combination is really a word (against the word list), and compute a rating, like so:
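import math

cnts = 0
for word in words:
    # non-words get a count of 2; log(log(2)) is about -0.37, so they pull the rating down
    cnt1 = wordcount(word) if isword(word) else 2
    # longer and more frequent words weigh more
    cnts += (len(word) ** 2) * math.log(math.log(cnt1))
cnts = cnts / len(words)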
Now the variant with the biggest cnts is the best-fit split for the dissected domain. Since the function is called once per variant, you’d have to use some global variable or class member to keep track of the best one so far.
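As a rough illustration of the rating and the "pick the biggest" step, here is a small Python sketch. The names score and best_split and the word counts are made up, and it takes a ready list of variants instead of a callback with a global, which is my simplification rather than the original approach. It assumes every stored count is at least 2 (singletons were filtered out), since a count of 1 would make log(log(1)) blow up.

```python
import math


def score(words, counts):
    """Rating of one split variant, following the formula above."""
    total = 0.0
    for word in words:
        cnt = counts.get(word, 2)  # non-words count as 2, so their term is negative
        total += (len(word) ** 2) * math.log(math.log(cnt))
    return total / len(words)


def best_split(candidates, counts):
    """Return the variant with the highest rating (hypothetical helper)."""
    return max(candidates, key=lambda words: score(words, counts))


# Made-up counts and a few hand-picked variants of "belfastjobs"
counts = {"belfast": 120, "jobs": 5000, "bel": 40, "fast": 900}
candidates = [
    ["belfast", "jobs"],
    ["bel", "fast", "jobs"],
    ["belfa", "stjobs"],
]
print(best_split(candidates, counts))
```

The squared length is what lets "belfast" + "jobs" beat "bel" + "fast" + "jobs" here, even though all five pieces are known words.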
Oh, and it’s best to pre-load wordcount(word) into some sort of in-memory list. I tried using memcache for it, but all those lookups of non-words (enu, en, u, m, etc.) end up taking all the time. Use unserialize/file_get_contents in PHP and cPickle in Python.
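A sketch of the pre-loading idea in today's Python, where the plain pickle module replaces cPickle (in Python 3 it uses the C implementation automatically). The file name, the helper names and the counts are assumptions for illustration.

```python
import os
import pickle
import tempfile


def save_counts(counts, path):
    """Dump the word-count dict to disk once, after building it."""
    with open(path, "wb") as fh:
        pickle.dump(counts, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_counts(path):
    """Load the whole dict back into memory at startup."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Hypothetical file location and made-up counts
path = os.path.join(tempfile.mkdtemp(), "wordcounts.pkl")
save_counts({"belfast": 120, "jobs": 5000}, path)

WORD_COUNTS = load_counts(path)
# Lookups (including misses on non-words) are now plain dict hits
print(WORD_COUNTS.get("jobs", 2), WORD_COUNTS.get("enu", 2))
```

The point is that a miss on a non-word costs one in-process dict lookup instead of a round trip to a cache server.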
How fast? Well, my system (Core2Duo) does about 500-1000 splits per second in Python (psyco really helps here).
How accurate?
domain            split               maxCnt
justiceworldwide  justice_worldwide   134.620576268
justiceworx       justice_worx        61.6600788394
justicewow        justice_wow         56.6534190319
justicewrangler   justice_wrangler    93.3487077578
justicewriters    justice_writers     92.0577660222
justicexchange    justice_xchange     92.162314499
justicex          justice_x           48.3281603023
justicexcreation  justice_x_creation  76.5891593625
justicexhange     justice_xh_ange     43.0450404039
a1babyclothing    a1_babyclothing     64.2535247012
a1baby            a1_baby             21.7238067684
a1babygifts       a1_babygifts        55.2439748556
a1backcare        a1_backcare         33.330711631
a1backhoe         a1_backhoe          37.094683974
a1backman         a1_backman          25.630677525
a1backoffice      a1_backoffice       77.1811859749
a1backpack        a1_backpack         57.406368702
a1backpackers     a1_backpackers      102.624920562
a1backup          a1_backup           39.7820319434
a1backups         a1_backups          41.5113724545
So, there are misses (justicexhange became justice_xh_ange, for example), not too common though.
