18th May 2008
Organizing a high-performance queue in PHP/MySQL in 6 steps or how to walk 70 million domains without repeating yourself
So, there was the problem of walking through the Internet, hopefully without repeats. The first solution was a plain-text file with a list of domains: read from a random point (fseek) until a non-empty string is found, then overwrite that domain with new-line characters so it is never read again. It was high-performance, since the seek time didn't rise with the size of the file, but it led to multiple HDD deaths (random reads a hundred times per second are something to avoid). MySQL did not scale the same way: even with 10 million domains it started to choke, and selecting a random entry from it is a pain too, with the time rising with each record added.
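For illustration, the flat-file approach could look roughly like the sketch below. The function name take_random_domain and the wrap-around to the start of the file are my assumptions, not the original code; the idea it shows is only that a taken domain is blanked with new-line characters, so the file never changes size.

```php
<?php
// Hypothetical helper (not the original code): take one domain from a
// newline-separated file by seeking to a random offset, then blank it out
// in place with "\n" characters so the file size never changes.
function take_random_domain(string $path): ?string
{
    clearstatcache(true, $path);
    $size = filesize($path);
    if ($size === false || $size === 0) {
        return null;
    }
    $fp = fopen($path, 'r+b');
    if ($fp === false) {
        return null;
    }

    $start = random_int(0, $size - 1);
    fseek($fp, $start);
    if ($start > 0) {
        fgets($fp);                 // discard the partial line we landed in
    }

    $wrapped = false;
    while (true) {
        $pos  = ftell($fp);         // remember where this line begins
        $line = fgets($fp);
        if ($line === false) {      // end of file
            if ($wrapped) {
                break;              // the whole file was scanned: nothing left
            }
            $wrapped = true;        // assumption: wrap around once to the start
            rewind($fp);
            continue;
        }
        $domain = rtrim($line, "\r\n");
        if ($domain === '') {
            continue;               // already taken (blanked) entry
        }
        fseek($fp, $pos);           // go back and overwrite the domain itself
        fwrite($fp, str_repeat("\n", strlen($domain)));
        fclose($fp);
        return $domain;
    }

    fclose($fp);
    return null;
}
```

Every call costs at least one random seek, which is exactly the access pattern that wore out the disks.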
This post describes version 2 of the high-performance queue. Version 3 is not as portable, as it involves Python, mod_python and editing httpd.conf on the server.
So the better solution came later.
- Group your entries into packs of a hundred or so, like this: "google.com,yahoo.com,something.com,…". Let's call this the "item" field.
- Add the items to MySQL. Performance is no problem here once the 1st step is done, since you cut the number of entries by at least two zeroes at the end. If performance begins to be an issue, just group "items" into packs of thousands.
- Retrieve either the first entry or a random entry, like this: $let=chr(65+rand(0,25)); mysql_query("SELECT … WHERE item LIKE '$let%'"); // Don't forget that for random seeks the "item" field should have a key, and that this $let expression is only good for the A to Z range. If you have items starting with other characters, maybe use $let=chr(rand(0,255)); (in that case escape $let first, since it may be a quote or a LIKE wildcard such as % or _).
- Call, via HTTP or RPC, something that will act as a subqueue, like file_get_contents("http://myserver/cgi-bin/queue.pl?items=google.com,yahoo.com,something.com,…");
- queue.pl is a subqueue processing script, which takes 100 items and works with them. It's best to have a Perl script here, since it can multithread itself (or fork) into 100 processes that will work with the data in parallel (search Google for ForkManager). PHP is really bad at forking. And you need to work with sites in parallel, since some of them just hang and you'd have to wait for 2 minutes before moving on to the next responsive one. That time is better spent working with some other site than waiting.
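- So use Parallel::ForkManager (see the manual for a basic example). Combined with that basic example, the script looks something like this:
use CGI qw(:all); use Parallel::ForkManager; my $pm = Parallel::ForkManager->new(100); my @items = split(/,/, param('items')); for my $item (@items) { $pm->start and next; system('php', 'myscript.php', $item); $pm->finish; } $pm->wait_all_children;
It forks once per item, up to 100 at a time, and runs myscript.php with the item as a parameter, so up to 100 processes run in parallel. You can read the parameter from PHP (search Google for "php argv"; the argument will be something like $GLOBALS['argv'][1]).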
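The PHP side of steps 1, 3 and 4 can be sketched as below. The function names pack_domains, random_prefix and subqueue_url are hypothetical, and the a-z plus 0-9 alphabet is my assumption about what the domains start with; the database calls are left out because the table is not named in the post.

```php
<?php
// Step 1 (sketch): group domains into comma-separated "items"
// of at most $packSize domains each.
function pack_domains(array $domains, int $packSize = 100): array
{
    $items = [];
    foreach (array_chunk($domains, $packSize) as $chunk) {
        $items[] = implode(',', $chunk);
    }
    return $items;
}

// Step 3 (sketch): pick a random first character for a LIKE 'x%' lookup.
// Assumption: items start with a lowercase letter or a digit.
function random_prefix(string $alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789'): string
{
    return $alphabet[random_int(0, strlen($alphabet) - 1)];
}

// Step 4 (sketch): build the subqueue URL for one item. http_build_query
// URL-encodes the commas, and CGI's param() decodes them again.
function subqueue_url(string $item, string $base = 'http://myserver/cgi-bin/queue.pl'): string
{
    return $base . '?' . http_build_query(['items' => $item]);
}
```

With these, the dispatch in step 4 becomes file_get_contents(subqueue_url($item)), and random_prefix() supplies the character that goes in front of the % in the LIKE pattern.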
There you go: a high-performance queue in 6 steps, which can process hundreds of millions of items on commodity hardware with scripts widely available on hostings/servers. Watch out when using this on a shared server, as it could easily eat all the processor time of the whole server, which could get you in trouble.
It's best to watch the load average reported by 'uptime', which you can get in PHP like this (remember that WordPress breaks quotes, so you may have to replace them with usual single quotes):
function uptime() { $fp = @popen('uptime', 'r'); if (!$fp) { return null; } $s = fgets($fp); pclose($fp); /* popen handles are closed with pclose */ return preg_match('#load averages?: ([0-9.]+)#', (string)$s, $m) ? (float)$m[1] : null; }
If it gets higher than 5, you're in trouble. So you'd better check that value before making the call at the 4th step. If it's high, let your server cool down.
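A minimal sketch of that check, assuming the threshold of 5 from above. The names parse_load_average, current_load and can_dispatch are hypothetical. current_load uses PHP's built-in sys_getloadavg() instead of spawning uptime, which is a swap on my part and is not available on Windows.

```php
<?php
// Parse the 1-minute load average out of a line printed by `uptime`.
// Returns null when the line does not contain a load average.
function parse_load_average(string $uptimeLine): ?float
{
    if (preg_match('#load averages?: ([0-9.]+)#', $uptimeLine, $m)) {
        return (float) $m[1];
    }
    return null;
}

// Alternative to calling `uptime`: ask PHP directly.
// Returns null where sys_getloadavg() is missing or fails.
function current_load(): ?float
{
    $avg = function_exists('sys_getloadavg') ? sys_getloadavg() : false;
    return $avg === false ? null : (float) $avg[0];
}

// Decide whether another pack may be sent to the subqueue.
// An unknown load (null) is treated as "do not dispatch".
function can_dispatch(?float $load, float $maxLoad = 5.0): bool
{
    return $load !== null && $load < $maxLoad;
}
```

Before the call at the 4th step, something like while (!can_dispatch(current_load())) { sleep(30); } lets the server cool down; the 30 seconds is an arbitrary choice.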