Question

Using fgetcsv, can I somehow do a destructive read where rows I've read and processed would be discarded, so that if I don't make it through the whole file in the first pass, I can come back and pick up where I left off before the script timed out?

Additional Details:

I'm getting a daily product feed from a vendor that comes across as a 200 MB .gz file. When I unpack the file, it turns into a 1.5 GB .csv with nearly 500,000 rows and 20-25 fields. I need to read this information into a MySQL db, ideally with PHP so I can schedule a cron job to run the script at my web hosting provider every day.

I have a hard timeout on the server set to 180 seconds by the hosting provider, and a maximum memory limit of 128 MB for any single script. These limits cannot be changed by me.

My idea was to grab the information from the .csv using the fgetcsv function, but I'm expecting to have to take multiple passes at the file because of the 3-minute timeout. I was thinking it would be nice to whittle away at the file as I process it, so I wouldn't need to spend cycles skipping over rows that were already processed in a previous pass.


Solution

From your problem description it really sounds like you need to switch hosts. Processing a 2 GB file with a hard time limit is not a very constructive environment. Having said that, deleting read lines from the file is even less constructive, since you would have to rewrite the entire 2 GB to disk minus the part you have already read, which is incredibly expensive.

Assuming you save how many rows you have already processed, you can skip rows like this:

$alreadyProcessed = 42; // for example

$i = 0;
while (($row = fgetcsv($fileHandle)) !== false) {
    if ($i++ < $alreadyProcessed) {
        continue; // skip rows handled in a previous run
    }

    // process $row here
}

However, this means you're reading the entire 2 GB file from the beginning each time you go through it, which already takes a while in itself, and you'll be able to process fewer and fewer rows each time you start again.

The best solution here is to remember the current position of the file pointer, for which ftell is the function you're looking for:

$lastPosition = (int) file_get_contents('last_position.txt');
$fh = fopen('my.csv', 'r');
fseek($fh, $lastPosition);

while (($row = fgetcsv($fh)) !== false) {
    // process $row here

    file_put_contents('last_position.txt', ftell($fh));
}

This allows you to jump right back to the last position you were at and continue reading. You obviously want to add a lot of error handling here, so you're never in an inconsistent state no matter at which point your script is interrupted.
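As a rough sketch of what that might look like (the file names, the 150-second budget, and the processRow() helper are placeholders of mine, not part of the original answer), you could also keep an eye on the clock and write the checkpoint atomically, so a killed script never leaves a half-written position file:

// Sketch only: resume from a saved byte offset and stop well before the
// host's 180-second hard limit. All file names here are placeholders.

const POSITION_FILE = 'last_position.txt';
const TIME_BUDGET   = 150; // seconds, leaving headroom under the 180 s cap

function processRow(array $row): void
{
    // placeholder: insert the row into MySQL here
}

$start = time();

$lastPosition = is_file(POSITION_FILE) ? (int) file_get_contents(POSITION_FILE) : 0;

$fh = fopen('my.csv', 'r');
if ($fh === false) {
    exit(1);
}
fseek($fh, $lastPosition);

while (($row = fgetcsv($fh)) !== false) {
    processRow($row);

    // Save the offset only after the row is fully processed; write to a
    // temp file and rename so an interrupted write can't corrupt the checkpoint.
    file_put_contents(POSITION_FILE . '.tmp', (string) ftell($fh));
    rename(POSITION_FILE . '.tmp', POSITION_FILE);

    if (time() - $start > TIME_BUDGET) {
        break; // stop cleanly; the next run resumes from the saved offset
    }
}

fclose($fh);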

OTHER TIPS

You can avoid the timeout and memory errors to some extent by reading the file like a stream: read it line by line and insert each line into the database (or process it accordingly), so that only a single line is held in memory on each iteration. Please note, don't try to load the whole CSV file into an array; that really would consume a lot of memory.

if(($handle = fopen("yourHugeCSV.csv", 'r')) !== false)
{
    // Get the first row (Header)
    $header = fgetcsv($handle);

    // loop through the file line-by-line
    while(($data = fgetcsv($handle)) !== false)
    {
        // Process Your Data
        unset($data);
    }
    fclose($handle);
}

I think a better solution (it would be phenomenally inefficient to continuously rewind and rewrite an open file stream) would be to track the file position of each record read (using ftell) and store it with the data you've read; then, if you have to resume, just fseek to the last position.
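One way to read that suggestion, as a minimal sketch assuming a PDO connection and two hypothetical tables (a products table for the rows and a one-row import_progress table for the offset; none of these names come from the answer), is to commit each row and its offset in the same transaction, so the data and the resume position can never drift apart:

// Assumptions: InnoDB tables, a products(sku, name, price) table and an
// import_progress(id, offset) table containing a single row with id = 1.
$pdo = new PDO('mysql:host=localhost;dbname=feed', 'user', 'secret', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$offset = (int) $pdo->query('SELECT offset FROM import_progress WHERE id = 1')->fetchColumn();

$fh = fopen('my.csv', 'r');
fseek($fh, $offset);

$insert = $pdo->prepare('INSERT INTO products (sku, name, price) VALUES (?, ?, ?)');
$save   = $pdo->prepare('UPDATE import_progress SET offset = ? WHERE id = 1');

while (($row = fgetcsv($fh)) !== false) {
    $pdo->beginTransaction();
    $insert->execute([$row[0], $row[1], $row[2]]);
    $save->execute([ftell($fh)]);   // byte offset of the next unread row
    $pdo->commit();
}

fclose($fh);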

You could try loading the file directly using MySQL's LOAD DATA INFILE (which will likely be a lot faster), although I've had problems with this in the past and ended up writing my own PHP code.
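For reference, a LOAD DATA LOCAL INFILE call from PHP might look roughly like this; the path, table, columns, and credentials are placeholders, and local_infile has to be allowed on both the MySQL server and the client for it to work:

// Sketch only: let MySQL parse and bulk-load the CSV itself.
$pdo = new PDO('mysql:host=localhost;dbname=feed', 'user', 'secret', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,
    PDO::ATTR_ERRMODE            => PDO::ERRMODE_EXCEPTION,
]);

$pdo->exec("
    LOAD DATA LOCAL INFILE '/path/to/yourHugeCSV.csv'
    INTO TABLE products
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
    (sku, name, price)
");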

"I have a hard timeout on the server set to 180 seconds by the hosting provider, and max memory utilization limit of 128mb for any single script. These limits cannot be changed by me."

What have you tried?

Memory can be limited by means other than the php.ini file, but I can't imagine how anyone could actually prevent you from using a different execution time (even if ini_set is disabled, from the command line you could run php -d max_execution_time=3000 /your/script.php or php -c /path/to/custom/inifile /your/script.php).

Unless you are trying to fit the entire data file into memory, there should be no issue with a memory limit of 128 MB.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow