Question

In Python 2.5, I am reading a structured text data file (~30 MB in size) using a file pointer:

fp = open('myfile.txt', 'r')
line = fp.readline()
# ... many other fp.readline() processing steps, which
# are used in different contexts to read the structures

But then, while parsing the file, I hit something interesting that I want to report the line number of, so I can investigate the file in a text editor. I can use fp.tell() to tell me where the byte offset is (e.g. 16548974L), but there is no "fp.tell_line_number()" to help me translate this to a line number.

Is there either a Python built-in or extension to easily track and "tell" what line number a text file pointer is on?

Note: I'm not asking to use a line_number += 1 style counter, as I call fp.readline() in different contexts and that approach would require more debugging than it is worth to insert the counter in the right corners of the code.

Was it helpful?

Solution

A typical solution to this problem is to define a new class that wraps an existing instance of a file, which automatically counts the numbers. Something like this (just off the top of my head, I haven't tested this):

class FileLineWrapper(object):
    def __init__(self, f):
        self.f = f
        self.line = 0
    def close(self):
        return self.f.close()
    def readline(self):
        self.line += 1
        return self.f.readline()
    # to allow using in 'with' statements 
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()

Use it like this:

f = FileLineWrapper(open("myfile.txt", "r"))
f.readline()
print(f.line)

It looks like the standard module fileinput does much the same thing (and some other things as well); you could use that instead if you like.

OTHER TIPS

You might find the fileinput module useful. It provides a generalized interface for iterating over an arbitrary number of files. Some relevant highlights from the docs:

fileinput.lineno()

Return the cumulative line number of the line that has just been read. Before the first line has been read, returns 0. After the last line of the last file has been read, returns the line number of that line.

fileinput.filelineno()

Return the line number in the current file. Before the first line has been read, returns 0. After the last line of the last file has been read, returns the line number of that line within the file.

The following code will print the line number(where the pointer is currently on) while traversing through the file('testfile')

file=open("testfile", "r")
for line_no, line in enumerate(file):
    print line_no     # The content of the line is in variable 'line'
file.close()

output:

1
2
3
...

I don't think so, not in the way you desire (as in a standard built-in feature of Python file handles returned by open).

If you're not amenable to tracking the line number manually as you read the lines or the use of a wrapper class (excellent suggestions by GregH and senderle, by the way), then I think you'll have to simply use the fp.tell() figure and go back to the start of the file, reading until you get there.

That's not too bad an option since I'm assuming error conditions will be less likely than everything working swimmingly. If everything works okay, there's no impact.

If there's an error, then you have the extra effort of rescanning the file. If the file is big, that may impact your perceived performance - you should take that into account if it's a problem.

One way might be to iterate over the line and keep an explicit count of the number of lines already seen:

>>> f=open('text.txt','r')
>>> from itertools import izip
>>> from itertools import count
>>> f=open('test.java','r')
>>> for line_no,line in izip(count(),f):
...     print line_no,line

The following code creates a function Which_Line_for_Position(pos) that gives the number of the line for the position pos, that is to say the number of line in which lies the character situated at position pos in the file.

This function can be used with any position as argument, independantly from the value of the file's pointer's current position and from the historic of the movements of this pointer before the function is called.

So, with this function, one isn't limited to determine the number of the current line only during an uninterrupted iteration on the lines, as it is the case with Greg Hewgill's solution.

with open(filepath,'rb') as f:
    GIVE_NO_FOR_END = {}
    end = 0
    for i,line in enumerate(f):
        end += len(line)
        GIVE_NO_FOR_END[end] = i
    if line[-1]=='\n':
        GIVE_NO_FOR_END[end+1] = i+1
    end_positions = GIVE_NO_FOR_END.keys()
    end_positions.sort()

def Which_Line_for_Position(pos,
                            dic = GIVE_NO_FOR_END,
                            keys = end_positions,
                            kmax = end_positions[-1]):
    return dic[(k for k in keys if pos < k).next()] if pos<kmax else None

.

The same solution can be written with the help of the module fileinput:

import fileinput

GIVE_NO_FOR_END = {}
end = 0
for line in fileinput.input(filepath,'rb'):
    end += len(line)
    GIVE_NO_FOR_END[end] = fileinput.filelineno()
if line[-1]=='\n':
    GIVE_NO_FOR_END[end+1] = fileinput.filelineno()+1
fileinput.close()

end_positions = GIVE_NO_FOR_END.keys()
end_positions.sort()

def Which_Line_for_Position(pos,
                            dic = GIVE_NO_FOR_END,
                            keys = end_positions,
                            kmax = end_positions[-1]):
    return dic[(k for k in keys if pos < k).next()] if pos<kmax else None

But this solution has some inconveniences:

  • it needs to import the module fileinput
  • it deletes all the content of the file !! There must be something wrong in my code but I don't know fileinput enough to find it. Or is it a normal behaviour of fileinput.input() function ?
  • it seems that the file is first entirely read before any iteration can be launched. If so, for a file very big, the size of the file may exceed the capacity of the RAM. I am not sure of this point: I tried to test with a file of 1,5 GB but it's rather long and I dropped this point for the moment. If this point is right, it constitutes an argument to use the other solution with enumerate()

.

exemple:

text = '''Harold Acton (1904–1994)
Gilbert Adair (born 1944)
Helen Adam (1909–1993)
Arthur Henry Adams (1872–1936)
Robert Adamson (1852–1902)
Fleur Adcock (born 1934)
Joseph Addison (1672–1719)
Mark Akenside (1721–1770)
James Alexander Allan (1889–1956)
Leslie Holdsworthy Allen (1879–1964)
William Allingham (1824/28-1889)
Kingsley Amis (1922–1995)
Ethel Anderson (1883–1958)
Bruce Andrews (born 1948)
Maya Angelou (born 1928)
Rae Armantrout (born 1947)
Simon Armitage (born 1963)
Matthew Arnold (1822–1888)
John Ashbery (born 1927)
Thomas Ashe (1836–1889)
Thea Astley (1925–2004)
Edwin Atherstone (1788–1872)'''


#with open('alao.txt','rb') as f:

f = text.splitlines(True)
# argument True in splitlines() makes the newlines kept

GIVE_NO_FOR_END = {}
end = 0
for i,line in enumerate(f):
    end += len(line)
    GIVE_NO_FOR_END[end] = i
if line[-1]=='\n':
    GIVE_NO_FOR_END[end+1] = i+1
end_positions = GIVE_NO_FOR_END.keys()
end_positions.sort()


print '\n'.join('line %-3s  ending at position %s' % (str(GIVE_NO_FOR_END[end]),str(end))
                for end in end_positions)

def Which_Line_for_Position(pos,
                            dic = GIVE_NO_FOR_END,
                            keys = end_positions,
                            kmax = end_positions[-1]):
    return dic[(k for k in keys if pos < k).next()] if pos<kmax else None

print
for x in (2,450,320,104,105,599,600):
    print 'pos=%-6s   line %s' % (x,Which_Line_for_Position(x))

result

line 0    ending at position 25
line 1    ending at position 51
line 2    ending at position 74
line 3    ending at position 105
line 4    ending at position 132
line 5    ending at position 157
line 6    ending at position 184
line 7    ending at position 210
line 8    ending at position 244
line 9    ending at position 281
line 10   ending at position 314
line 11   ending at position 340
line 12   ending at position 367
line 13   ending at position 393
line 14   ending at position 418
line 15   ending at position 445
line 16   ending at position 472
line 17   ending at position 499
line 18   ending at position 524
line 19   ending at position 548
line 20   ending at position 572
line 21   ending at position 600

pos=2        line 0
pos=450      line 16
pos=320      line 11
pos=104      line 3
pos=105      line 4
pos=599      line 21
pos=600      line None

.

Then, having function Which_Line_for_Position() , it is easy to obtain the number of a current line : just passing f.tell() as argument to the function

But WARNING: when using f.tell() and doing movements of the file's pointer in the file, it is absolutely necessary that the file is opened in binary mode: 'rb' or 'rb+' or 'ab' or ....

Messing around with a similar problem recently and came up with this class based solution.

class TextFileProcessor(object):

    def __init__(self, path_to_file):
        self.print_line_mod_number = 0
        self.__path_to_file = path_to_file
        self.__line_number = 0

    def __printLineNumberMod(self):
        if self.print_line_mod_number != 0:
            if self.__line_number % self.print_line_mod_number == 0:
                print(self.__line_number)

    def processFile(self):
        with open(self.__path_to_file, 'r', encoding='utf-8') as text_file:
            for self.__line_number, line in enumerate(text_file, start=1):
                self.__printLineNumberMod()

                # do some stuff with line here.

Set the print_line_mod_number property to the cadence you want logged and then call processFile.

For example... if you want feedback every 100 lines it would look like this.

tfp = TextFileProcessor('C:\\myfile.txt')
tfp.print_line_mod_number = 100
tfp.processFile()

The console output would be

100
200
300
400
etc...

Regarding the solution by @eyquem, I suggest using mode='r' with the fileinput module and fileinput.lineno() option and it has worked for me.

Here is how I am implementing these options in my code.

    table=fileinput.input('largefile.txt',mode="r")
    if fileinput.lineno() >= stop : # you can disregard the IF condition but I am posting to illustrate the approach from my code.
           temp_out.close()
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top