Question

I'm running a script to restore some header columns to a CSV file. It reads the original file that has the header columns into a dictionary and stitches those columns back into the file which has lost its header columns.

The issue is that it is incredibly slow. Both files are moderately large (~50 MB), at 200,000 rows by 96 columns. At the moment the output file looks correct when I preview it, but it is only growing by about 200 KB every 10 minutes.

I'm an absolute noob at coding, so any help figuring out why the script is so slow would be appreciated.

hapinfile = file('file_with_header_columns', 'r')
hapoutfile = file('file_missing_header_columns.csv', 'r')
o = file('filescombined.txt', 'w')

dictoutfile={}

for line in hapoutfile:
    a=line.rstrip('\n').rstrip('\r').split('\t')
    dictoutfile[a[0]]=a[1:]

hapoutfile.close()

for line in hapinfile:
    q=line.rstrip('\n').rstrip('\r').split('\t')
    g=q[0:11]
    for key, value in dictoutfile.items():
        if g[0] == key:
            g.extend(value)
            o.write(str('\t'.join(g)+'\n'))


hapinfile.close()
o.close()

Solution

It's taking forever because of the nested for loop uselessly trudging through the dict again and again. Try this:

for line in hapinfile:
    q=line.rstrip('\n').rstrip('\r').split('\t')
    g=q[0:11]
    if g[0] in dictoutfile:
        g.extend( dictoutfile[g[0]] )
        o.write(str('\t'.join(g)+'\n'))
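
If it helps to see it all in one piece, here is a minimal sketch of the full script with that change applied, using with blocks so the files are closed automatically (the filenames are the ones from the question; note that a single with statement managing two files needs Python 2.7 or newer):

dictoutfile = {}

# build the lookup table from the header-less file: key -> remaining columns
with open('file_missing_header_columns.csv', 'r') as hapoutfile:
    for line in hapoutfile:
        a = line.rstrip('\n').rstrip('\r').split('\t')
        dictoutfile[a[0]] = a[1:]

# stitch the rows together with a single dictionary lookup per line
with open('file_with_header_columns', 'r') as hapinfile, \
     open('filescombined.txt', 'w') as o:
    for line in hapinfile:
        q = line.rstrip('\n').rstrip('\r').split('\t')
        g = q[0:11]
        if g[0] in dictoutfile:
            g.extend(dictoutfile[g[0]])
            o.write('\t'.join(g) + '\n')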

Other tips

For starters, you don't need the internal loop in the second part. That's a dictionary you're looping over; you should just access the value using g[0] as the key. That saves you a full scan of a huge dictionary for every line in the header-less file. If needed, you can check whether g[0] is in the dictionary first to avoid KeyErrors.
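
As a small variation on that membership check, dict.get does the lookup and the "is it there?" test in one call. A minimal sketch, reusing the names from the question:

for line in hapinfile:
    q = line.rstrip('\n').rstrip('\r').split('\t')
    g = q[0:11]
    extra = dictoutfile.get(g[0])   # None if the key is missing, no KeyError
    if extra is not None:
        g.extend(extra)
        o.write('\t'.join(g) + '\n')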

from __future__ import with_statement   # if you need it

import csv

with open('file_with_header_columns', 'r') as hapinfile, \
         open('file_missing_header_columns', 'r') as hapoutfile, \
         open('filescombined.txt', 'w') as outfile:
    good_data = csv.reader(hapoutfile, delimiter='\t')
    bad_data = csv.reader(hapinfile, delimiter='\t')
    out_data = csv.writer(outfile, delimiter='\t')
    for data_row in good_data:
        for header_row in bad_data:
            if header_row[0] == data_row[0]:
                # stitch the first 11 header columns onto the matching data row
                out_data.writerow(header_row[:11] + data_row[1:])
                break   # stop looking through headers
You seem to have a really unfortunate problem here in that you have to do nested loops to find your data. If you could sort both CSV files by the key field (the first column), you could get more efficiency. As it is, take advantage of the csv module and condense everything. You can make use of break which, while a bit odd in a for loop, will at least "short-circuit" you out of the search through the second file once you've found the matching row.
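
For what it's worth, here is a rough sketch of that sort-then-merge idea. It assumes both files have already been sorted by their first column (and that the keys are unique), and it walks the two files in a single pass instead of re-scanning anything; the filenames are the ones from the question:

import csv

with open('file_with_header_columns', 'r') as hapinfile, \
     open('file_missing_header_columns.csv', 'r') as hapoutfile, \
     open('filescombined.txt', 'w') as outfile:
    headers = csv.reader(hapinfile, delimiter='\t')
    data = csv.reader(hapoutfile, delimiter='\t')
    out = csv.writer(outfile, delimiter='\t')
    data_row = next(data, None)
    for header_row in headers:
        key = header_row[0]
        # advance the data file until its key catches up with this one
        while data_row is not None and data_row[0] < key:
            data_row = next(data, None)
        if data_row is not None and data_row[0] == key:
            out.writerow(header_row[:11] + data_row[1:])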

Licensed under: CC-BY-SA with attribution