Question

I have an unknown number of measurement-data CSV files in a folder (the number may and will change over time) on which I would like to perform statistics. Every file has the same 5 columns of data. I want to do statistical analysis on each line separately (average over multiple measurements, standard deviation, etc.). So far I've managed to list the files in the folder, stash the names in a list, and try to open the files from that list; it gets very confusing when trying to iterate over lines across files. Right now I'm just trying to append the contents to a list and write them out to another file, with no luck. The code may not be very clean, as I'm a beginner in programming, but here we go:

import re
import os

lines_to_skip = 25
workingdir = os.path.dirname(os.path.realpath(__file__))
file_list = []
templine = []
lineNo = 0

print ("Working in %s" %workingdir)
os.chdir(workingdir)
for file in os.listdir(workingdir):
    if file.endswith('.csv'):
        #list only file name without extension (to be able to use filename as variable later)
        file_list.append(file[0:-4])
#open all files in the folder
print (file_list)
for i, value in enumerate(file_list):
    exec "%s = open (file_list[i] + '.csv', 'r')" % (value)

#open output stats file
fileout = open ('zoutput.csv', 'w')

#assuming that all files are of equal length (as they should be)
exec "for x in len(%s + '.csv'):" % (file_list[0])
for i in xrange(lines_to_skip):
    exec "%s.next()" % (file_list[0])
    for j, value in enumerate(file_list):
        templine[:] = []
        #exec "filename%s=value" % (j)
        exec "line = %s.readline(x)" % (value)
        templine.extend(line)
    fileout.write(templine)

fileout.close()
#close all files in the folder
for i, value in enumerate(file_list):
    #exec "filename%s=value" % (i)
    exec "%s.close()" % (value)

Any suggestions on how I could do this another way, or improve the existing approach? The first 25 lines are just info fields, which are useless for my purpose. I could strip the first 25 lines from each file beforehand (instead of trying to skip them), but I guess it doesn't matter much. Please don't recommend spreadsheets or other statistical software; none of the ones I've tried so far can chew through the amount of data I have. Thanks

Solution

If I understand your question correctly, you want to paste the columns of each file next to one another so that, from N files with C columns and R rows each, you process one row at a time, where each combined row has N*C columns?

$ cat rowproc.py
import sys

for l in sys.stdin:
    row = map(float, l.split())
    # process row

$ paste *.csv | tail -n +26 | python rowproc.py

(Note that skipping the first 25 lines means starting output at line 26, hence tail -n +26.)
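The # process row stub is where the per-line statistics from the question would go. A minimal sketch, in the same style (the name rowstats.py, the CSV output format, and treating all values on a pasted line as one series are my assumptions, not part of the original answer):

$ cat rowstats.py
import sys
import math

for l in sys.stdin:
    # one pasted line = all measurements for this row, across every file
    row = map(float, l.split())
    if not row:
        continue   # skip blank lines
    n = len(row)
    mean = sum(row) / n
    # population standard deviation of all values on this line
    stdev = math.sqrt(sum((x - mean) ** 2 for x in row) / n)
    print "%f,%f" % (mean, stdev)

$ paste *.csv | tail -n +26 | python rowstats.py > zoutput.csv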

Or, if you're unlucky enough not to have a Unix-like environment handy and have to do everything in Python:

import sys
from itertools import izip

# open every file named on the command line
filehandles = [open(fn) for fn in sys.argv[1:]]
for i, rows in enumerate(izip(*filehandles)):
    # skip the first 25 header lines of every file
    if i < 25:
        continue

    cols = [map(float, row.split()) for row in rows]
    print cols

Result:

[[150.0, 26.0], [6.0, 8.0], [14.0, 10.0]]
[[160.0, 27.0], [7.0, 9.0], [16.0, 11.0]]
[[170.0, 28.0], [8.0, 10.0], [18.0, 12.0]]
...

As long as you're able to open enough files simultaneously, both of these methods will handle arbitrarily large amounts of data.
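Each cols produced by that loop is a list of per-file sublists, as shown above; to treat a line as the single flat series of N*C values the question wants, the sublists can be chained together. A minimal sketch, using one row of the printed result as stand-in data:

from itertools import chain

# hypothetical stand-in for one `cols` value from the loop above
cols = [[150.0, 26.0], [6.0, 8.0], [14.0, 10.0]]
flat = list(chain.from_iterable(cols))
print sum(flat) / len(flat)   # mean of all six values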

If you can't pass the filenames through argv, use the glob module to collect them instead.
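For example, a sketch of that substitution (assuming the files sit in the current directory, as in the question):

import glob

# collect every CSV in the current directory instead of reading argv
filehandles = [open(fn) for fn in sorted(glob.glob('*.csv'))]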

Licensed under: CC-BY-SA with attribution