Question

I have n CSV files that I need to compare against each other and then modify. The problem is that each CSV file has around 800,000 lines.

To read the CSV files I use fgetcsv and it works well. I get some memory spikes, but in the end it is fast enough. The real problem is comparing the resulting arrays against each other: that takes ages.

Another problem is that, because there are n files, I have to use a foreach to collect the CSV data with fgetcsv. I end up with one huge array that I can't compare with array_diff, so I have to compare it with nested foreach loops, and that takes ages.

A code snippet for better understanding:

foreach( $files as $value ) {
    $data[] = $csv->read( $value['path'] );
}

My CSV class uses fgetcsv to add the output to the array:

fgetcsv( $this->_fh, $this->_lengthToRead, $this->_delimiter, $this->_enclosure )

All the data from all the CSV files is stored in the $data array. Using only one array is probably the first big mistake, but I have no clue how to stay flexible with the files without using a foreach. I tried using dynamically named variables, but I got stuck there as well :)

Now I have this big array. Normally, to compare the values against each other and find out whether the data from file one exists in file two and so on, I would use array_diff or array_intersect. But in this case I only have this one big array, and as I said, running a foreach over it takes ages.

Also, after only 3 files I have an array with 3 * 800,000 entries. I guess that by 10 files at the latest, my memory will explode.

So, is there any better way to use PHP to compare n very large CSV files?


Solution

Use SQL

  • Create a table with the same columns as your CSV files.
  • Insert the data from the first CSV file.
  • Add indexes to speed up queries.
  • Compare with other CSV files by reading a line and issuing a SELECT.

You did not describe how you compare the n files, and there are several ways to do so. If you just want to find the lines that are in A1 but not in A2, ..., An, then you'll just have to add a boolean diff column to your table. If you want to know in which files a line is repeated, you'll need a text column, or a new table if a line can appear in several files.
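For illustration, here is a minimal sketch of such a table, created from PHP with PDO. The column names a and b, the table name t1, and the connection details are assumptions, since the question does not show the actual CSV structure:

// Sketch only: columns a and b stand in for whatever the CSV files actually contain.
$pdo = new PDO( 'mysql:host=localhost;dbname=csvdiff', 'user', 'pass' );
$pdo->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );

// One column per CSV field, plus the boolean diff flag described above,
// and an index on the compared columns to speed up lookups and joins.
$pdo->exec( '
    CREATE TABLE t1 (
        a VARCHAR(255) NOT NULL,
        b VARCHAR(255) NOT NULL,
        diff TINYINT(1) NOT NULL DEFAULT 0,
        INDEX idx_ab (a, b)
    ) ENGINE=InnoDB
' );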

Edit: a few words on performance if you're using MySQL (I do not know much about other RDBMSs).

Inserting lines one by one would be too slow. You probably can't use LOAD DATA INFILE unless you can put the CSV files directly onto the DB server's filesystem. So I guess the best solution is to read a few hundred lines from the CSV, then send a multi-row insert query: INSERT INTO mytable VALUES (..1..), (..2..).
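A rough sketch of that batching pattern, assuming the hypothetical t1 table from above with two columns; insertBatch is an illustrative helper, not part of your CSV class, and the batch size of 500 is arbitrary:

$batchSize = 500;
$rows      = array();

$fh = fopen( $files[0]['path'], 'r' );           // first CSV file
while ( ( $line = fgetcsv( $fh ) ) !== false ) { // default comma delimiter; adjust as needed
    $rows[] = $line;
    if ( count( $rows ) >= $batchSize ) {
        insertBatch( $pdo, $rows );
        $rows = array();
    }
}
if ( $rows ) {
    insertBatch( $pdo, $rows );                  // remaining lines
}
fclose( $fh );

// Build one multi-row INSERT per batch: INSERT INTO t1 (a, b) VALUES (?, ?), (?, ?), ...
function insertBatch( PDO $pdo, array $rows ) {
    $placeholders = implode( ', ', array_fill( 0, count( $rows ), '(?, ?)' ) );
    $stmt   = $pdo->prepare( "INSERT INTO t1 (a, b) VALUES $placeholders" );
    $params = array();
    foreach ( $rows as $row ) {
        $params[] = $row[0];
        $params[] = $row[1];
    }
    $stmt->execute( $params );
}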

Similarly, issuing a SELECT for each line you read from the other files would be too slow, so you'd better load them into another table as well. Then issue a multiple-table UPDATE to mark the rows that are identical in tables t1 and t2: UPDATE t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b) SET t1.diff = 1.
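Continuing the sketch above, once the second file has been loaded into a hypothetical t2 table of the same structure (using the same batching helper), marking and extracting the differences might look like this:

// Mark every line of t1 that also appears in t2.
$pdo->exec( '
    UPDATE t1
    JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b)
    SET t1.diff = 1
' );

// Lines that exist in file 1 but not in file 2 (the array_diff case from the question).
$stmt = $pdo->query( 'SELECT a, b FROM t1 WHERE diff = 0' );
foreach ( $stmt as $row ) {
    // process or export the differing line
}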

Maybe you could also try SQLite. There are no concurrency problems here, it could be faster than MySQL's client/server model, and you don't need to set up much to use it.
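If you go that route, the main change in the sketches above is the PDO connection, for example (the file path is an assumption). Note that the MySQL-specific bits (ENGINE=InnoDB, the inline INDEX clause, the multiple-table UPDATE ... JOIN syntax) would need SQLite equivalents:

// A single file on disk, no server to install or configure.
$pdo = new PDO( 'sqlite:/tmp/csvdiff.sqlite' );
$pdo->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );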

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow