

22nd May 2008

How to split glued words in domains into parts in PHP/Python

Well, I’ve finally got an idea for an algorithm to split things like belfastjobs.com into Belfast Jobs.com, i.e. to detect words glued together. So as soon as the algo finishes crunching 70 million domains (which, as you can see, is going to take more than 24 hours), you’re going to have a title on your site, and all the links to related sites will finally become readable.

If you’re going to do that yourself - here’s how I did it:

First of all you need a function that tries every way of cutting a string into subwords, like this for “menu”: m-enu, me-nu, men-u, m-e-nu, m-en-u, me-n-u, m-e-n-u. Get it? Here’s the source for it in PHP and Python.

Call the PHP one like this:

<?
wordsTriage($txt, 'myFunction');
function myFunction($wordsToTestArray) {
 ## see below
};
?>

And the Python one:

def myFunction(wordsList):
    pass  # see below

wordsTriage(txt, myFunction)
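
In case the linked source isn't handy, here is a minimal sketch of what such a function could look like in Python. It is not the original wordsTriage code: the name words_triage and the approach (choosing cut positions with itertools.combinations) are my own assumptions. Unlike the “menu” list above, it also passes the uncut string as a candidate.

```python
from itertools import combinations

def words_triage(txt, callback):
    """Call callback(parts) once for every way of cutting txt into pieces.

    Hypothetical stand-in for the post's wordsTriage, not the original code.
    """
    n = len(txt)
    gaps = range(1, n)  # positions between characters where a cut may go
    for k in range(n):  # k = number of cuts, from 0 up to n - 1
        for cuts in combinations(gaps, k):
            bounds = (0,) + cuts + (n,)
            parts = [txt[a:b] for a, b in zip(bounds, bounds[1:])]
            callback(parts)

# Usage: collect every candidate split of "menu"
found = []
words_triage("menu", found.append)
```

A string of n characters has 2**(n-1) candidate splits, so long domain names get expensive quickly.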

Now we only have to test whether the pieces in each array are real words. You’d need a word list for that (suggestion: try here), or, if you want to go my way, get a list of domains from sites like who.is and split every domain on dashes (“-”). Yes, dashes! People who register hyphenated domains have already marked the word boundaries for you, so you can build a big list from that. Count how often each piece appears and filter out the entries that appear only once.
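
A rough sketch of that word-list step in Python might look like the following. The function name build_word_counts and the sample domains are made up for illustration, and skipping domains without a dash is my own assumption (an unhyphenated name tells you nothing about where the words end).

```python
from collections import Counter

def build_word_counts(domains):
    """Count the dash-separated pieces of hyphenated domain names."""
    counts = Counter()
    for domain in domains:
        name = domain.lower().split(".")[0]  # drop the TLD; assumes no subdomains
        if "-" not in name:
            continue  # assumption: only hyphenated names are useful here
        counts.update(part for part in name.split("-") if part)
    # Drop the pieces that were seen only once
    return {word: n for word, n in counts.items() if n > 1}

# Invented sample input, just to show the shape of the result
sample = ["belfast-jobs.com", "dublin-jobs.com", "belfast-hotels.net", "belfastjobs.com"]
word_counts = build_word_counts(sample)
```

With this sample only "belfast" and "jobs" survive the filter, since "dublin" and "hotels" each appear once.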

Next, use the wordsTriage function to test whether each word in a combination really is a word (against the word list), and compute a rating for the combination, like so:

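import math

def rate(words):
    cnts = 0.0
    for word in words:
        # non-words count as 2; log(log(2)) is negative, so they are penalised
        cnt1 = wordcount(word) if isword(word) else 2
        cnts += (len(word) ** 2) * math.log(math.log(cnt1))
    return cnts / len(words)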

Now the variant with the biggest cnts is the best-fit split for the dissected domain. Since your function gets called once per variant, you’d have to use some global variable or class member to remember the best one so far.
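
To show how the pieces fit together, here is a small self-contained sketch. The word counts are invented for the example, and all_splits, rate and best_split are names I made up. It uses a generator plus max() instead of a callback with a global variable, which is just a different way of tracking the best variant.

```python
import math
from itertools import combinations

# Toy counts, invented for illustration only; real ones come from your word list.
# Every count must be at least 2, otherwise log(log(count)) is undefined.
WORD_COUNTS = {"belfast": 120, "jobs": 900, "job": 700, "fast": 300, "bel": 5}

def all_splits(txt):
    """Yield every way of cutting txt into contiguous pieces."""
    n = len(txt)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            yield [txt[a:b] for a, b in zip(bounds, bounds[1:])]

def rate(words):
    """Rating from the post: length squared times log(log(count)), averaged."""
    total = 0.0
    for word in words:
        cnt = WORD_COUNTS.get(word, 2)  # unknown pieces count as 2
        total += (len(word) ** 2) * math.log(math.log(cnt))
    return total / len(words)

def best_split(txt):
    """Return the candidate split with the highest rating."""
    return max(all_splits(txt), key=rate)

result = best_split("belfastjobs")
```

With these toy counts the winner is belfast + jobs: long known words score high, while every unknown piece drags the average down.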

Oh, and it’s best to pre-load wordcount(word) into some sort of in-memory list. I tried using memcache for it, but all those lookups of non-words (enu, en, u, m, etc.) end up taking all the time. Use unserialize/file_get_contents in PHP and cPickle in Python.
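
For the Python side, the pre-loading could be sketched roughly like this. The file name and the sample counts are placeholders, and save_counts/load_counts are hypothetical helpers. In Python 3 the plain pickle module already uses the C implementation, so there is no separate cPickle to import.

```python
import os
import pickle
import tempfile

def save_counts(counts, path):
    """Write the word -> count dict to disk once, after building it."""
    with open(path, "wb") as fh:
        pickle.dump(counts, fh, protocol=pickle.HIGHEST_PROTOCOL)

def load_counts(path):
    """Read the whole dict back into memory at startup."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

# Placeholder path and data, for demonstration only
path = os.path.join(tempfile.gettempdir(), "word_counts_demo.pkl")
save_counts({"justice": 40, "worldwide": 12}, path)
WORD_COUNTS = load_counts(path)

def isword(word):
    return word in WORD_COUNTS  # plain dict lookup, no network round trip

def wordcount(word):
    return WORD_COUNTS.get(word, 0)
```

Misses on non-words like "enu" are then just failed dict lookups, which cost next to nothing compared to a cache server round trip.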

How fast? Well, my system (Core2Duo) does about 500-1000 splits per second in Python (psyco really helps here).

How accurate?

domain              split                 maxCnt
justiceworldwide    justice_worldwide     134.620576268
justiceworx         justice_worx          61.6600788394
justicewow          justice_wow           56.6534190319
justicewrangler     justice_wrangler      93.3487077578
justicewriters      justice_writers       92.0577660222
justicexchange      justice_xchange       92.162314499
justicex            justice_x             48.3281603023
justicexcreation    justice_x_creation    76.5891593625
justicexhange       justice_xh_ange       43.0450404039
a1babyclothing      a1_babyclothing       64.2535247012
a1baby              a1_baby               21.7238067684
a1babygifts         a1_babygifts          55.2439748556
a1backcare          a1_backcare           33.330711631
a1backhoe           a1_backhoe            37.094683974
a1backman           a1_backman            25.630677525
a1backoffice        a1_backoffice         77.1811859749
a1backpack          a1_backpack           57.406368702
a1backpackers       a1_backpackers        102.624920562
a1backup            a1_backup             39.7820319434
a1backups           a1_backups            41.5113724545

So there are misses (justicexhange above, for example), but they’re not too common.

This entry was posted on Thursday, May 22nd, 2008 at 5:58 pm and is filed under php, python.
