Question

I just accepted a similar question (PHP + MySQL Queue), but I realized that it wasn't the correct question for my problem, but was the correct answer for my question :)

I have a MySQL (MyISAM type) table of sites to be scraped by workers.

CREATE TABLE `site` (
  `id` int(11) NOT NULL auto_increment,
  `url` text,
  `last_pop` int(13) default NULL,
  `md5` varchar(32) default NULL,
  `disabled` tinyint(1) default '0',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `md5` (`md5`),
) ENGINE=MyISAM  DEFAULT CHARSET=latin1;

What I need is to scrap one site per worker without repeat. So, if I have 3 sites and 2 workers the system needs to work like this:

      ID URL   LAST_POP
t4    1  site1 t1         <- worker1 scrap site1
t4    2  site2 t2         <- worker2 scrap site2
t5    3  site3 t3         <- worker1 scrap site3
t6    1  site2 t4         <- worker2 scrap site2
t6    2  site1 t4         <- worker1 scrap site1
t7    3  site3 t5         <- worker2 scrap site3
....

It's like a cycled queue orderer by last_pop ASC.

How can I do that?

Was it helpful?

Solution

You'll probably want to keep track of two pieces of information for each site: when it was last scraped and if it is currently being scraped.

Using the answer for your other question, set the scraping field to the id of the worker to lock it from other workers. When the worker is finished with the task set the scraping field back to null and the last_scrape date to the current time.

CREATE TABLE `site` (
  `id` int(11) NOT NULL auto_increment,
  `url` text,
  `last_scrape` TIMESTAMP,
  `scraping` tinyint(1) default NULL,
  `md5` varchar(32) default NULL,
  `disabled` tinyint(1) default '0',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `md5` (`md5`),
) ENGINE=MyISAM  DEFAULT CHARSET=latin1;

Lock and retrieve the next job (the site that was last scraped the longest time ago):

Update site
  set `scraping` = '$worker_id' 
  where `scraping` is null 
  order by `last_scrape` ASC limit 1;

$job = 
  Select * from site
  where `scraping` = '$worker_id'

Release job back to queue:

Update site
  set `scraping` = NULL,
  `last_scrape` = NOW()
  where `scraping` = '$worker_id';

OTHER TIPS

Why not add an extra boolean column STATUS so that you can order by LAST_POP. So when a site is selected by a worker for scrapping run a second query with UPDATE site SET status = '1'. And when the next worker selects the next site run a query with SELECT * FROM site WHERE status = '0' ORDER BY last_pop ASC.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top