Question

I am currently working on my first Python project and I need to parse a 2 GB file. I found that reading it line by line was very slow, so I tried a buffered approach:

f = open(filename)
lines = 0
buf_size = 1024 * 1024
read_f = f.read
buf = read_f(buf_size)
while buf:
    for line in buf:
        # code for string search
        print line
    buf = read_f(buf_size)

Here print line doesn't print a "line"; it prints one character at a time, so I can't do a substring find on it. Please help!


Solution

print line prints a character because buf is a string, and iterating over a string yields the characters of the string as 1-character strings.

When you say that reading line-by-line was slow, how did you implement the read? If you were using readlines(), that would explain the slowness (see http://stupidpythonideas.blogspot.com/2013/06/readlines-considered-silly.html).

Files are iterable over their lines, and Python will pick a buffer size when iterating, so this might suit your needs:

for line in f:
    # do search stuff
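For the substring search the question describes, that loop might be fleshed out like this (a minimal sketch: the search term, file name, and sample contents are placeholders, not from the question):

```python
import os
import tempfile

# Placeholder search term standing in for the question's string search.
needle = "needle"

# Small sample file standing in for the 2 GB input.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as sample:
    sample.write("first line\nhas needle here\nlast line\n")

matches = 0
with open(path) as f:
    for line in f:  # Python buffers the underlying reads internally
        if needle in line:
            matches += 1

print(matches)  # 1
os.remove(path)
```

Because the file object yields complete lines, the `in` test works as expected instead of comparing single characters.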

If you want to control the buffer size manually, you can pass a size hint to readlines:

buf = f.readlines(buffersize)
while buf:
    for line in buf:
        # do search stuff
    buf = f.readlines(buffersize)
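The hint passed to readlines is approximate: Python stops collecting lines once their total size reaches it, so each call returns a batch of whole lines. A self-contained sketch, using io.StringIO as a stand-in for the real file (the contents and the hint value are made up):

```python
import io

# Stand-in for the open file; real code would use open(filename).
f = io.StringIO("one\ntwo\nthree\nfour\n")

seen = []
buf = f.readlines(8)  # stop collecting once ~8 characters are buffered
while buf:
    seen.extend(line.rstrip("\n") for line in buf)
    buf = f.readlines(8)

print(seen)  # ['one', 'two', 'three', 'four']
```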

That said, the first of the two is usually better.

OTHER TIPS

The problem is that buf is a string...

Say buf = "abcd".

That means buf[0] == "a", buf[1] == "b", and so on.

for line in buf:
    print line

would print a, b, c, and d, each on its own line.

That means that in your for-loop you do not loop over "lines", but over the individual characters of the buf string. You can use readlines, or split your buffer into lines by looking for "\n".
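If you do split the buffer yourself, there is one pitfall worth spelling out: a line can straddle two chunks, so the trailing partial line must be carried over to the next read. A sketch of a generator that handles this (the function name and sample data are my own, not from the answers):

```python
import io


def iter_lines_chunked(f, buf_size=1024 * 1024):
    """Yield lines from f, reading fixed-size chunks and carrying
    any trailing partial line over into the next chunk."""
    leftover = ""
    while True:
        chunk = f.read(buf_size)
        if not chunk:
            break
        chunk = leftover + chunk
        pieces = chunk.split("\n")
        leftover = pieces.pop()  # last piece may be an incomplete line
        for line in pieces:
            yield line
    if leftover:
        yield leftover


# Tiny buffer size so the sample lines actually span chunk boundaries.
sample = io.StringIO("alpha\nbeta\ngamma")
result = list(iter_lines_chunked(sample, buf_size=4))
print(result)  # ['alpha', 'beta', 'gamma']
```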

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow