Python: when writing to a large file, keep the file open, or open it and append to it as needed?

StackOverflow https://stackoverflow.com/questions/23679495


Question

I am wondering how best to handle writing to a big file in Python.

My Python code loops many times, running an external program (ancient Fortran with a weird input file format), reading its output (a one-line file), doing some very simple processing, and writing to a compiled output file. The external program executes fast (well under 1 second).

import subprocess as sp

f_compiled_out = open("compiled.output", "w")

for i in range(N_loops):
    # Write the input file the legacy program expects, then run it.
    prepare_input()
    sp.call(["legacy.program"])

    # The legacy program writes its result to a one-line file.
    with open("legacy.output", "r") as f:
        line = f.readline()

    output = process(line)
    f_compiled_out.write(output)

f_compiled_out.close()

There are three options that I can think of for producing the compiled output file.

  1. What I am already doing.

  2. Open "compiled.output" in append mode on each cycle of the main loop, i.e. with open("compiled.output", "a") as f: f.write(output) (a sketch follows this list).

  3. Use awk to do the simple processing and cat the output onto the end of "compiled.output".
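
For concreteness, here is a minimal sketch of option 2, reusing the hypothetical prepare_input() and process() helpers from the question:

import subprocess as sp

for i in range(N_loops):
    prepare_input()
    sp.call(["legacy.program"])

    with open("legacy.output", "r") as f:
        line = f.readline()

    # Re-open the compiled file in append mode on every iteration,
    # so the open/close cost is paid once per loop cycle.
    with open("compiled.output", "a") as f_out:
        f_out.write(process(line))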

So what is the overhead of (1) keeping a large file open and writing to the end of it, vs. (2) opening and appending to it for each write, vs. (3) using awk to do the processing and cat to build up "compiled.output"?

At no stage does the whole output need to be in memory.

P.S. If anyone can see any other obvious things that would be slowing this down as N_loops gets large, that would be awesome too!

Solution

Opening and closing the files definitely has a cost. However, if your legacy program takes a second or more to respond, you probably won't notice it.

def func1():
    # Option 2: re-open the file in append mode for every single write.
    for x in range(1000):
        with open("test1.txt", "a") as k:
            k.write(str(x))

1 loops, best of 3: 2.47 s per loop

def func2():
    # Option 1: open the file once and keep it open across all writes.
    with open("test2.txt", "a") as k:
        for x in range(1000):
            k.write(str(x))

100 loops, best of 3: 6.66 ms per loop
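
(The timings above look like IPython %timeit output; a rough standard-library equivalent, assuming func1 and func2 as defined above, would be:)

import timeit

# One run of func1 (open/close per write) vs. the best of three runs of
# func2 (single open). Absolute numbers vary by machine and filesystem.
print(timeit.timeit(func1, number=1))
print(min(timeit.repeat(func2, number=100, repeat=3)))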

However, if your file gets really big (800+ MB), it becomes slower:

def func3(file):
    # Append a handful of writes, re-opening the target file each time;
    # used here to compare a small file against an 800+ MB one.
    for x in range(10):
        with open(file, "a") as k:
            k.write(str(x))

12 KB file:

10 loops, best of 3: 33.4 ms per loop

800+ MB file:

1 loops, best of 3: 24.5 s per loop

Keeping the file open, by contrast, will mainly cost you memory (the file object and its write buffer).
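
As an illustration of that trade-off, a minimal sketch: open() accepts a buffering argument (standard Python I/O, not specific to this question), so the memory spent is essentially the write buffer you choose.

# Open once with an explicit 1 MiB buffer; writes are batched in memory
# and flushed to disk when the buffer fills (or when the file is closed).
with open("compiled.output", "w", buffering=1024 * 1024) as f_out:
    for i in range(1000):
        f_out.write(str(i) + "\n")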

I would suggest using SQLite to store your data.
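
A minimal sketch of that suggestion with the standard-library sqlite3 module (the database file, table, and column names below are made up for illustration):

import sqlite3

conn = sqlite3.connect("compiled.db")
conn.execute("CREATE TABLE IF NOT EXISTS results (i INTEGER, output TEXT)")

for i in range(1000):
    output = str(i)  # stand-in for process(...) from the question
    conn.execute("INSERT INTO results VALUES (?, ?)", (i, output))

conn.commit()  # a single commit at the end keeps per-row overhead low
conn.close()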

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow