Question


I have about 350,000 one-column CSV files, each essentially 200 to 2,000 numbers printed one under another. The numbers are formatted like this: "-1.32%" (no quotes). I want to merge the files into one monster of a CSV file in which each original file is a separate column. The merged file will have at most 2,000 rows (each column may have a different length) and 350,000 columns.

I thought of doing it with MySQL, but there is a 30,000-column limit. An awk or sed script could do the job, but I don't know them all that well and I am afraid it would take a very long time. I could use a server if the solution requires it. Any suggestions?


Solution

This Python script will do what you want:

#!/usr/bin/env python3

import sys
from itertools import zip_longest

# Open all the files named on the command line; each becomes one column.
files = [open(name, 'r', encoding='utf-8') for name in sys.argv[1:]]

# zip_longest reads one line from each file per output row and pads
# shorter files with '', so columns of different lengths merge correctly.
for row in zip_longest(*files, fillvalue=''):
    sys.stdout.write(','.join(cell.rstrip() for cell in row) + '\n')

for fh in files:
    fh.close()

Call it with all the CSV files you want to merge, and it will print the merged file to stdout.
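To see what the merge does with columns of different lengths, here is a minimal demonstration on in-memory columns (the sample values are illustrative stand-ins for the per-file data):

```python
from itertools import zip_longest

# Two columns, as they would be read from two input files.
col_a = ['-1.32%', '0.50%', '2.10%']
col_b = ['0.00%', '-0.75%']  # shorter column: padded with ''

# One output row per line position; missing cells become empty fields.
rows = [','.join(cells) for cells in zip_longest(col_a, col_b, fillvalue='')]
print('\n'.join(rows))
```

This prints three rows; the last one is `2.10%,` because the second column has run out of values.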

Note that you can't merge all the files at once: for one thing, a single command line with 350,000 file names would exceed the system's argument-length limit (ARG_MAX), and for another, a process is typically limited to 1024 open files by default (see `ulimit -n`).

So you'll have to do it in several passes: merge files 1-1000, then 1001-2000, and so on. Afterwards you should be able to merge the 350 resulting intermediate files in a single pass.

Or you could write a wrapper script which uses os.listdir() to get the names of all the files and calls this script several times.
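A rough sketch of such a wrapper, assuming the merge script above is saved as merge.py, the inputs live in a csvs directory, and batches of 1000 files are used (all three names are illustrative, not fixed by the answer):

```python
import os
import subprocess

BATCH_SIZE = 1000  # safely below the typical 1024 open-file limit


def batches(names, size):
    """Split a list of file names into chunks of at most `size` names."""
    return [names[i:i + size] for i in range(0, len(names), size)]


def merge_all(src_dir, out_prefix):
    """Run merge.py once per batch, writing one intermediate file each."""
    names = sorted(os.path.join(src_dir, n) for n in os.listdir(src_dir))
    for i, batch in enumerate(batches(names, BATCH_SIZE)):
        out_name = '%s_%04d.csv' % (out_prefix, i)
        with open(out_name, 'w') as out:
            subprocess.check_call(['python3', 'merge.py'] + batch, stdout=out)
```

The intermediate files can then be merged with one final run of merge.py.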

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow