Question

I need to find all lines of a text file containing a particular string and write each such line to a separate text file. How can I improve my code so that reading the first 5,000,000,000 lines of a big text file (6 GB) does not crash my system? After I run the code my PC slows down and eventually freezes. Even if I stop the process, the memory stays occupied and the same problem comes back. My IDE is Spyder and I use Python 2.7. Thank you!

My code is:

import fileinput

ot = 'N'
j = 1
i = 1
string = "ABCD"

for line in fileinput.input(['/../myfile.txt']):
    if i<=5000000000:
        if string in line:
            output = open(ot + str(j) + '.txt', 'w')
            output.write(line)
            output.close()
            j += 1
        i += 1

Solution

You can try this code:

file_input = open('myfile.txt', 'r')
for line in file_input:
    # Your code here; each iteration reads one line without loading the whole file

The for line in file_input: loop reads the file line by line. However, I tested fileinput.input() on my Linux system and it does not use extra memory either, so you should give more information about your problem.

One possible problem is that you are writing too many files to disk, and that is what brings the system down. You could instead write the selected lines into one single file, marking each with its line number j, as in the sketch below.
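
A minimal sketch of that idea (the file names myfile.txt and matches.txt are placeholders, not taken from the question):

string = "ABCD"
# Write every matching line into one output file, tagged with its line number j,
# instead of creating a separate file per match.
with open('myfile.txt') as file_input, open('matches.txt', 'w') as out:
    for j, line in enumerate(file_input, 1):
        if string in line:
            out.write('%d: %s' % (j, line))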

Other tips

from itertools import izip
ot = 'N%d.txt'
j = 1
lim = 5*10**9
with open('myfile.txt') as f:
    #the xrange part replaces the i < 5e9 thing you had.
    for line, _ in izip(f,xrange(lim)):
        if 'ABCD' in line:
            output = open(ot % j, 'w')
            output.write(line)
            output.close()
            j += 1

This should run fine, but it might take a while if your file is huge, though it shouldn't take up much memory.

EDIT
I added izip to avoid eating up tons of memory. izip is like zip, except that it returns an iterator instead of building the whole list in memory.
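
A tiny illustration of the difference (myfile.txt is a placeholder, and slicing to ten lines is just for the demo):

from itertools import izip

with open('myfile.txt') as f:
    # zip(f, xrange(10)) would build the full list of pairs up front;
    # izip(f, xrange(10)) yields one (line, index) pair at a time.
    for line, i in izip(f, xrange(10)):
        print i, line.rstrip()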

The canonical way to choose just the first limit items of an iterator is to use islice from itertools - islice(my_file, limit) is similar to my_file.readlines()[:limit], except that it avoids reading the whole file into memory. Counting just the lines with a given string in them is only a little bit more complex: use a generator expression to get just those lines, then islice those.

from itertools import islice
ot = 'N%d.txt'
limit = 5000000000

with open('myfile.txt') as f:
    # Keep only the lines containing the search string, lazily
    lines = (line for line in f if 'ABCD' in line)
    for j, line in enumerate(islice(lines, limit), start=1):
        with open(ot % j, 'w') as out:
            out.write(line)

Try this:

file_num = 1
line_count = 0

with open('myfile.txt', 'r') as input_file:
    for line in input_file:
        line_count += 1
        if line_count > 5000000000:
            break
        if 'ABCD' in line:
            with open('N' + str(file_num) + '.txt', 'w') as write_file:
                write_file.write(line)
            file_num += 1

Not sure how well it will help with the crashing, but it is much cleaner. Ask questions below.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow