Question

I have a problem using Python to process a trace file (it contains on the order of a billion lines of data).

What I want is for the program to find one specific line in the file (say line x), and then search the file from line x onward for another symbol. Once it finds that line, it starts from line x again to search for another one.

What I do now is shown below, but the problem is that it always has to reopen the file and read from the beginning to find the matching lines (line number > x, and containing the symbol I want). For one big trace file, this takes too long.

1.

    i = 0
    for line in file.readlines():
        i += 1  # update the line number
        if i > x:
            if line.find(symbol) != -1:  # symbol: the text being searched for
                ...

or:

    for i, line in enumerate(open(file)):
        if i > x:
            if ....

Can anyone give me a hint on a better approach?

Thanks


Solution

If the file is otherwise stable, use fileobj.tell() to remember your position in the file, then next time use fileobj.seek(pos) to return to that same position in the file.

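This only works if you do not use the file object as an iterator (no for line in fileobject, and no next(fileobject)), because iteration uses a read-ahead buffer that obscures the exact position. In Python 3, calling tell() on a text file during such iteration raises an OSError instead.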

Instead, use:

for line in iter(fileobj.readline, ''):

to still use fileobj in an iteration context.
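As a minimal sketch of the tell()/seek() approach described above: the helper below seeks to a remembered offset, scans forward with readline(), and reports where the matching line starts and where the next line begins. The name find_from and its return shape are hypothetical, chosen only for illustration. It assumes the file is opened in binary mode so that offsets are plain byte counts, which means the symbol must be passed as bytes.

```python
def find_from(fileobj, symbol, start=0):
    """Find the first line at or after byte offset `start` containing `symbol`.

    Hypothetical helper: returns (line_start, next_pos, line), or None
    if no line matches. `fileobj` is assumed to be opened in binary mode.
    """
    fileobj.seek(start)
    line_start = start
    # iter(readline, b"") calls readline() until it returns b"" (end of file),
    # so tell() reports the exact position after each line.
    for line in iter(fileobj.readline, b""):
        next_pos = fileobj.tell()
        if symbol in line:
            return line_start, next_pos, line
        line_start = next_pos
    return None
```

To use it for the question's case, find line x once, keep the returned next_pos, and pass it as start for every later search, so each search begins just after line x rather than at the top of the file.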

OTHER TIPS

I suggest you use random access, and record where your line started. Something like:

index = []

fh = open('gash.txt')

# readline() keeps fh.tell() accurate; "for line in fh" reads ahead
for line in iter(fh.readline, ''):
    if target in line:
        index.append(fh.tell() - len(line))

Then, when you want to recall the contents, use fh.seek(index[n]).

A couple of "gotchas":

  1. Notice that the index position will not be the same as the line number. If you need the line number then maybe use a dictionary, with the line number as the key.

  2. On Windows, you will have to adjust the file position by -1. In text mode each "\r\n" line ending is read as "\n", so the "\r" does not appear in len(line).
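One way to sidestep both gotchas, sketched here under the assumption that the file can be opened in binary mode: record the position before each line is read instead of subtracting len(line) afterwards, and key the index by line number. The names build_index and read_line_at are hypothetical, and target must be bytes in this variant.

```python
def build_index(fileobj, target):
    """Map 0-based line number -> byte offset for each line containing target.

    Hypothetical helper; `fileobj` is assumed to be opened in binary mode,
    so no newline translation happens and offsets are exact byte counts.
    """
    index = {}
    fileobj.seek(0)
    pos = 0  # offset where the line about to be read starts
    for lineno, line in enumerate(iter(fileobj.readline, b"")):
        if target in line:
            index[lineno] = pos
        pos = fileobj.tell()  # start of the next line
    return index


def read_line_at(fileobj, offset):
    """Jump to a recorded offset and return the line that starts there."""
    fileobj.seek(offset)
    return fileobj.readline()
```

Because the offset is taken from tell() rather than computed from the line length, the same code gives the same offsets for "\n" and "\r\n" line endings.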

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow