Question

I have k CSV files (5 CSV files, for example); each file has m fields that together form a key, plus n value columns. I need to produce a single CSV file with the aggregated data.

I'm looking for the most efficient solution to this problem, mainly in terms of speed. I don't think we will have memory issues. I would also like to know whether hashing is really a good solution, because we would have to use a 64-bit hash to keep the chance of a collision below 1% (we have around 30,000,000 rows per aggregation).

For example

file 1: f1,f2,f3,v1,v2,v3,v4
        a1,b1,c1,50,60,70,80
        a3,b2,c4,60,60,80,90 

file 2: f1,f2,f3,v1,v2,v3,v4
        a1,b1,c1,30,50,90,40
        a3,b2,c4,30,70,50,90

result: f1,f2,f3,v1,v2,v3,v4  
        a1,b1,c1,80,110,160,120
        a3,b2,c4,90,130,130,180

Algorithms we have considered so far:

  1. hashing (using a concurrent hash table, e.g. ConcurrentHashMap)

  2. merge-sorting the files

  3. a database: MySQL, Hadoop, or Redis

The solution needs to handle a huge amount of data (each file has more than two million rows).
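For reference, option 1 can be sketched as below, assuming Java and the example layout (class name, `KEY_COLS`, and `VAL_COLS` are illustrative). Note that keeping the full key string as the map key sidesteps the hash-collision concern entirely: the map compares complete keys on lookup, so no 64-bit digest is needed.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HashAggregate {
    // Illustrative layout: the first KEY_COLS columns form the key,
    // the remaining VAL_COLS columns are summed per key.
    static final int KEY_COLS = 3;
    static final int VAL_COLS = 4;

    static Map<String, long[]> aggregate(String[] rows) {
        Map<String, long[]> totals = new ConcurrentHashMap<>();
        for (String row : rows) {
            String[] cols = row.split(",");
            // keep the full key string as the map key: exact matches, no collision risk
            String key = String.join(",", Arrays.copyOfRange(cols, 0, KEY_COLS));
            long[] vals = new long[VAL_COLS];
            for (int i = 0; i < VAL_COLS; i++)
                vals[i] = Long.parseLong(cols[KEY_COLS + i]);
            // merge() updates each key atomically, so several threads can feed rows
            // in parallel; the remapping function returns a fresh array because it
            // may be re-invoked under contention
            totals.merge(key, vals, (a, b) -> {
                long[] s = new long[a.length];
                for (int i = 0; i < a.length; i++) s[i] = a[i] + b[i];
                return s;
            });
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] rows = {            // rows from all input files; order is irrelevant
            "a1,b1,c1,50,60,70,80", "a3,b2,c4,60,60,80,90",
            "a1,b1,c1,30,50,90,40", "a3,b2,c4,30,70,50,90"
        };
        // prints one aggregated line per key (iteration order is unspecified)
        aggregate(rows).forEach((k, v) -> {
            StringBuilder sb = new StringBuilder(k);
            for (long x : v) sb.append(',').append(x);
            System.out.println(sb);
        });
    }
}
```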

a better example: file 1

country,city,peopleNum
england,london,1000000
england,coventry,500000

file 2:

country,city,peopleNum
england,london,500000
england,coventry,500000
england,manchester,500000

merged file:

country,city,peopleNum
england,london,1500000
england,coventry,1000000
england,manchester,500000

The key is: country,city. This is just an example; my real key has 6 columns and there are 8 data columns, for a total of 14 columns.

We would like the solution that processes the data fastest.


Solution

Since the key fields are always the first columns, I would sort the source rows (without the header) by the keys and then advance through both files line by line, similar to the merge step used in merge sort.

But instead of merging the two CSV lists into one sorted list, you sum the value columns of rows whose key appears in both lists. Rows whose key appears in only one list are simply copied.

The algorithm looks similar to this:

While (NOT EndOfFile(left-item) OR NOT EndOfFile(right-item))
  if (EndOfFile(right-item) OR (NOT EndOfFile(left-item) AND left-item.key < right-item.key))
    store(left-item); advance left-item;
  else if (EndOfFile(left-item) OR left-item.key > right-item.key)
    store(right-item); advance right-item;
  else // left-item.key = right-item.key
    store(sum(left-item, right-item)); advance left-item; advance right-item;
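The merge step above could look like this in Java. This is a sketch that holds both inputs in memory as sorted lists for clarity; `KEY_COLS` and all names are illustrative, and for files of millions of rows the same loop would read from buffered line readers instead:

```java
import java.util.ArrayList;
import java.util.List;

public class MergeSum {
    static final int KEY_COLS = 3; // illustrative; the real key spans 6 columns

    // extract the key prefix (the first KEY_COLS comma-separated columns)
    static String key(String row) {
        int idx = 0;
        for (int i = 0; i < KEY_COLS; i++) idx = row.indexOf(',', idx) + 1;
        return row.substring(0, idx - 1);
    }

    // sum the value columns of two rows that share the same key
    static String sum(String left, String right) {
        String[] a = left.split(","), b = right.split(",");
        StringBuilder sb = new StringBuilder(key(left));
        for (int i = KEY_COLS; i < a.length; i++)
            sb.append(',').append(Long.parseLong(a[i]) + Long.parseLong(b[i]));
        return sb.toString();
    }

    // both inputs must already be sorted by key
    static List<String> merge(List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() || j < right.size()) {
            if (j == right.size()
                    || (i < left.size() && key(left.get(i)).compareTo(key(right.get(j))) < 0))
                out.add(left.get(i++));                        // key only on the left
            else if (i == left.size()
                    || key(left.get(i)).compareTo(key(right.get(j))) > 0)
                out.add(right.get(j++));                       // key only on the right
            else
                out.add(sum(left.get(i++), right.get(j++)));   // key in both: sum values
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> f1 = List.of("a1,b1,c1,50,60,70,80", "a3,b2,c4,60,60,80,90");
        List<String> f2 = List.of("a1,b1,c1,30,50,90,40", "a3,b2,c4,30,70,50,90");
        // prints:
        // a1,b1,c1,80,110,160,120
        // a3,b2,c4,90,130,130,180
        merge(f1, f2).forEach(System.out::println);
    }
}
```

With more than two files, the merge can be repeated pairwise, or generalized to a k-way merge over all sorted inputs at once.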

Other tips

I would use a database to store the results (and handle the transactions) and then write a small script which reads a file and inserts and/or updates the data accordingly, line by line. You can then run the script in parallel on all your input files, forking off as many workers as your machine can take, or even run it on multiple machines.

The database will be the bottleneck, though.
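With MySQL specifically, the per-line insert-or-update can be a single `INSERT ... ON DUPLICATE KEY UPDATE` statement, provided the table has a unique key over the key columns. A sketch that builds such a statement for use with a prepared statement; the table and column names are assumptions:

```java
public class UpsertSql {
    // Builds a MySQL upsert for a table whose unique key spans the key columns.
    // On a duplicate key, each value column is incremented by the incoming value
    // via MySQL's VALUES() function in the UPDATE clause.
    static String upsert(String table, String[] keyCols, String[] valCols) {
        StringBuilder sql = new StringBuilder("INSERT INTO " + table + " (");
        sql.append(String.join(",", keyCols)).append(',').append(String.join(",", valCols));
        sql.append(") VALUES (")
           .append("?,".repeat(keyCols.length + valCols.length - 1)).append("?)");
        sql.append(" ON DUPLICATE KEY UPDATE ");
        for (int i = 0; i < valCols.length; i++) {
            if (i > 0) sql.append(", ");
            sql.append(valCols[i]).append('=')
               .append(valCols[i]).append("+VALUES(").append(valCols[i]).append(')');
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        // prints:
        // INSERT INTO agg (country,city,peopleNum) VALUES (?,?,?)
        //   ON DUPLICATE KEY UPDATE peopleNum=peopleNum+VALUES(peopleNum)
        System.out.println(upsert("agg",
                new String[]{"country", "city"}, new String[]{"peopleNum"}));
    }
}
```

Batching many such statements per transaction reduces the per-row overhead, which is where the bottleneck mentioned above tends to show up.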

Licensed under: CC-BY-SA with attribution