Question

I am running the following script to append files to one another, cycling through months and years and including each file if it exists. I have just tested it with a larger dataset, where I would expect the output file to be roughly 600 MB in size, but I am running into memory issues. First, is it normal to run out of memory here (my PC has 8 GB of RAM)? I am not sure how I am using up all of this memory.

The code I am running:

import datetime,  os
import StringIO

stored_data = StringIO.StringIO()

start_year = "2011"
start_month = "November"
first_run = False

current_month = datetime.date.today().replace(day=1)
possible_month = datetime.datetime.strptime('%s %s' % (start_month, start_year), '%B %Y').date()
while possible_month <= current_month:
    csv_filename = possible_month.strftime('%B %Y') + ' MRG.csv'
    if os.path.exists(csv_filename):
        with open(csv_filename, 'rb') as current_csv:
            if first_run != False:
                next(current_csv)
            else:
                first_run = True
            stored_data.writelines(current_csv)
    possible_month = (possible_month + datetime.timedelta(days=31)).replace(day=1)
if stored_data:
    contents = stored_data.getvalue()
    with open('FullMergedData.csv', 'wb') as output_csv:
        output_csv.write(contents)

The traceback I receive:

Traceback (most recent call last):
  File "C:\code snippets\FullMerger.py", line 23, in <module>
    contents = stored_output.getvalue()
  File "C:\Python27\lib\StringIO.py", line 271, in getvalue
    self.buf += ''.join(self.buflist)
MemoryError

Any ideas on how to work around this, or how to make the code more efficient so that it avoids the issue? Many thanks,
AEA

Edit1

Upon running the code supplied by alKid, I received the following traceback.

Traceback (most recent call last):
  File "C:\FullMerger.py", line 22, in <module>
    output_csv.writeline(line)
AttributeError: 'file' object has no attribute 'writeline'

I fixed the above by changing it to writelines; however, I still received the following traceback.

Traceback (most recent call last):
  File "C:\FullMerger.py", line 19, in <module>
    next(current_csv)
StopIteration

Solution

stored_data holds the contents of every input file in memory, and getvalue() then joins all of it into one more string (the ''.join(self.buflist) line in your traceback). That is too much data to hold at once, which is why you get the MemoryError.

One solution is to write the output line by line. This is far more memory-efficient, since only one line of data is held at a time instead of the whole 600 MB.

In short, the structure can be something like this:

with open('FullMergedData.csv', 'ab') as output_csv:  # 'ab' appends the result
# to the file in binary mode, matching the 'rb' used for reading.
    with open(csv_filename, 'rb') as current_csv:
        if first_run != False:
            # Not the first file: skip its header line once, before the loop.
            # The None default avoids StopIteration on an empty file.
            next(current_csv, None)
        else:
            first_run = True  # The first file keeps its header.
        for line in current_csv:     # loop through the remaining lines
            output_csv.write(line)   # write one line at a time

This should fix your problem. Hope this helps!
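Putting the line-by-line idea together with the month loop from the question, a complete version might look like the sketch below. The function name merge_monthly_csvs, its parameters, and the header_written flag are illustrative choices rather than anything from the original script, and the sketch assumes every input file ends with a newline and starts with the same header row.

```python
import datetime
import os


def merge_monthly_csvs(start_month, end_month, output_path, directory='.'):
    """Stream every existing '<Month> <Year> MRG.csv' into output_path."""
    header_written = False  # hypothetical flag: has any file been copied yet?
    month = start_month.replace(day=1)
    # 'wb' starts a fresh output file, so re-running does not duplicate rows.
    with open(output_path, 'wb') as output_csv:
        while month <= end_month:
            csv_path = os.path.join(directory,
                                    month.strftime('%B %Y') + ' MRG.csv')
            if os.path.exists(csv_path):
                with open(csv_path, 'rb') as current_csv:
                    if header_written:
                        # Skip the repeated header; the None default avoids
                        # StopIteration if the file is empty.
                        next(current_csv, None)
                    else:
                        header_written = True
                    for line in current_csv:
                        output_csv.write(line)  # only one line in memory
            # Adding 31 days always lands in the next month; snap to day 1.
            month = (month + datetime.timedelta(days=31)).replace(day=1)
```

It could be called as merge_monthly_csvs(datetime.date(2011, 11, 1), datetime.date.today(), 'FullMergedData.csv'). Opening the output once, outside the loop, also avoids reopening it for every month.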

Other tips

Your memory error occurs because you store all the data in a buffer before writing it. Consider using something like shutil.copyfileobj to copy directly from one open file object to another; it only buffers a small amount of data at a time. You could also do it line by line, which will have much the same effect.
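To see why this keeps memory use flat, here is a rough sketch of the kind of loop a chunked copy performs. The name copy_in_chunks and the chunk_size default are illustrative and not part of shutil; the point is only that at most one chunk is held in memory at any moment.

```python
import io


def copy_in_chunks(source, destination, chunk_size=16 * 1024):
    """Copy source to destination, reading at most chunk_size bytes at a time."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means end of file
            break
        destination.write(chunk)


# Small in-memory demonstration; real use would pass open file objects.
source = io.BytesIO(b"header\n" + b"row\n" * 5)
destination = io.BytesIO()
copy_in_chunks(source, destination, chunk_size=8)
```

After the call, destination holds the same bytes as source, even though no single read was larger than 8 bytes.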

Update

Using copyfileobj should be much faster than writing the file line by line. Here is an example of how to use copyfileobj. This code opens two files, skips the first line of the input file if skip_first_line is True and then copies the rest of that file to the output file.

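import shutil

skip_first_line = True

# 'ab' appends in binary mode, matching the 'rb' used for reading
with open('FullMergedData.csv', 'ab') as output_csv:
    with open(csv_filename, 'rb') as current_csv:
        if skip_first_line:
            current_csv.readline()  # discard the header line
        shutil.copyfileobj(current_csv, output_csv)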

Note that with copyfileobj you should use current_csv.readline() rather than next(current_csv) to skip the header. Iterating over a file object reads ahead into an internal buffer, which is normally very useful, but here the data already pulled into that buffer would be missed by the copy that follows.
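Applied to several files in a row, the same pattern could look like the following sketch. The function concat_csvs and its csv_paths argument are hypothetical names; the list of paths is assumed to contain only files that exist, each starting with the same header row and ending with a newline.

```python
import shutil


def concat_csvs(csv_paths, output_path):
    """Concatenate CSV files, keeping only the first file's header row."""
    header_written = False  # hypothetical flag: has any file been copied yet?
    with open(output_path, 'wb') as output_csv:
        for csv_path in csv_paths:
            with open(csv_path, 'rb') as current_csv:
                if header_written:
                    # readline() rather than next(), so the header is
                    # dropped without iterator read-ahead.
                    current_csv.readline()
                else:
                    header_written = True
                # Copy the rest of the file in fixed-size chunks.
                shutil.copyfileobj(current_csv, output_csv)
```

The list of paths can be built with the month loop from the question, appending csv_filename whenever os.path.exists(csv_filename) is true.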

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow