Question

I have a MySQL database table that I need to process. It takes about 1 second to process 3 rows (due to the cURL connections I need to make for each row), so I need to fork the PHP script in order to finish in a reasonable time (since I will process up to 10,000 rows in one batch).

I'm going to run 10-30 processes at once, and obviously I need some way to make sure that processes are not overlapping (in terms of which rows they are retrieving and modifying).

From what I've read, there are three ways to accomplish this. I'm trying to decide which method is best for this situation.

Option 1: Begin a transaction and use SELECT ... FOR UPDATE and limit the # of rows for each process. Save the data to an array. Update the selected rows with a status flag of "processing". Commit the transaction and then update the selected rows to a status of "finished".
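If I went with Option 1, I imagine it would look roughly like this (just a sketch; the `jobs` table, its columns, and the PDO connection details are placeholders for my real schema):

```php
<?php
// Sketch of Option 1: lock a batch with SELECT ... FOR UPDATE, claim it, then work on it
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->beginTransaction();

// Lock up to 500 unclaimed rows; other transactions block on these rows until commit
$rows = $pdo->query(
    "SELECT id, payload FROM jobs WHERE status = 'pending' LIMIT 500 FOR UPDATE"
)->fetchAll(PDO::FETCH_ASSOC);

// Flag the locked rows as 'processing' so no other worker picks them up
$ids = array_column($rows, 'id');
if ($ids) {
    $in = implode(',', array_map('intval', $ids));
    $pdo->exec("UPDATE jobs SET status = 'processing' WHERE id IN ($in)");
}

$pdo->commit(); // releases the row locks

// ... make the cURL calls for $rows here ...

if ($ids) {
    $pdo->exec("UPDATE jobs SET status = 'finished' WHERE id IN ($in)");
}
```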

Option 2: Update a certain number of rows with a status flag of "processing" and the process ID. Select all rows for that process ID and flag. Work with the data like normal. Update those rows and set the flag to "finished".
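Option 2 would presumably look something like this (again only a sketch with made-up table and column names):

```php
<?php
// Sketch of Option 2: claim rows by tagging them with this worker's process ID
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$pid = getmypid();

// Atomically claim up to 500 pending rows for this worker
$claim = $pdo->prepare(
    "UPDATE jobs SET status = 'processing', worker_pid = ? WHERE status = 'pending' LIMIT 500"
);
$claim->execute([$pid]);

// Fetch exactly the rows this worker claimed
$select = $pdo->prepare(
    "SELECT id, payload FROM jobs WHERE status = 'processing' AND worker_pid = ?"
);
$select->execute([$pid]);
$rows = $select->fetchAll(PDO::FETCH_ASSOC);

// ... make the cURL calls for $rows here ...

$finish = $pdo->prepare(
    "UPDATE jobs SET status = 'finished' WHERE status = 'processing' AND worker_pid = ?"
);
$finish->execute([$pid]);
```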

Option 3: Set a LIMIT ... OFFSET ... clause for each process's SELECT query, so that each process gets unique rows to work with. Then store the row IDs and perform an UPDATE when done.
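And Option 3 is essentially just a different SELECT per process (sketch only; `$workerNumber` would be passed to each child):

```php
<?php
// Sketch of Option 3: worker N reads only its own slice of the table
$pdo       = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$batchSize = 500;
$offset    = $workerNumber * $batchSize; // $workerNumber = 0, 1, 2, ... per child

$stmt = $pdo->prepare("SELECT id, payload FROM jobs ORDER BY id LIMIT ? OFFSET ?");
$stmt->bindValue(1, $batchSize, PDO::PARAM_INT);
$stmt->bindValue(2, $offset, PDO::PARAM_INT);
$stmt->execute();
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
```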

I'm not sure which option is the safest. Option 3 seems simple enough, but I wonder whether there is any way it could fail. Option 2 also seems very simple, but I'm not sure whether the locking caused by the UPDATE would slow everything down. Option 1 seems like the best bet, but I'm not very familiar with FOR UPDATE and transactions, and could use some help.

UPDATE: For clarity, I currently have just one file, process.php, which selects all the rows and posts the data to a third party via cURL one by one. I'd like to fork within this file so the 10,000 rows can be split among 10-30 child processes.
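Roughly what I have in mind (sketch only, assuming the pcntl extension is available; `processRows()` stands in for the existing per-row cURL code):

```php
<?php
// Sketch: split the selected rows across N children with pcntl_fork()
$workers = 10;
$chunks  = array_chunk($allRows, (int) ceil(count($allRows) / $workers));

foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die('could not fork');
    }
    if ($pid === 0) {
        processRows($chunk); // child: handle only its own chunk (the existing cURL loop)
        exit(0);
    }
    // parent: continue the loop and fork the next child
}

// Parent waits for every child to exit
while (pcntl_waitpid(0, $status) !== -1) {
    // optionally inspect $status here
}
```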


Solution 2

I ended up using the curl_multi functions (as proposed by Brad) to accomplish this task. I divided the array of rows into groups of 100 using array_chunk() and then configured a curl_multi task to process each group. I started out using ParallelCurl, but it did not end up working correctly, so I just coded the curl_multi handling myself.

It went from taking almost 2 hours to process 10,000 curl connections to taking just a few minutes.
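Roughly, the batching looks like this (simplified sketch; the request setup and the column names are placeholders for what process.php actually sends):

```php
<?php
// Sketch: send the requests in batches of 100 using the curl_multi_* functions
foreach (array_chunk($rows, 100) as $batch) {
    $mh      = curl_multi_init();
    $handles = [];

    foreach ($batch as $row) {
        $ch = curl_init($row['url']); // however each request is actually built
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $row['payload']);
        curl_multi_add_handle($mh, $ch);
        $handles[$row['id']] = $ch;
    }

    // Run the whole batch concurrently
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);

    foreach ($handles as $id => $ch) {
        $response = curl_multi_getcontent($ch);
        // ... mark row $id as finished based on $response ...
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
}
```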

OTHER TIPS

Another way of handling this is to put the IDs that you need to process into a Redis queue (list). You can then push items onto the list and pop them off. When the list is empty, you know that there is nothing left to process.
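For example, with the phpredis extension (key and variable names here are just placeholders):

```php
<?php
// Sketch: producer pushes row IDs onto a Redis list; each worker pops until it is empty
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Producer: enqueue every row ID once
foreach ($rowIds as $id) {
    $redis->rPush('rows_to_process', $id);
}

// Worker (run 10-30 of these): pop an ID, process it, repeat until the list is empty
while (($id = $redis->lPop('rows_to_process')) !== false) {
    // ... load row $id, make the cURL call, mark it finished ...
}
```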

There is also the php-resque project, which implements some of the job queuing you want to do.

https://github.com/chrisboulton/php-resque
