I am trying to work out how far through a CSV file I am, as a percentage, while reading it. I know how to do this using tell() on a file object, but when I pass that file object to csv.reader and then loop over the rows in the reader object, tell() always returns the end-of-file position, no matter where I am in the loop. How can I find where I am?

Current code:

with open(FILE_PERSON, 'rb') as csvfile:
    spamreader = csv.reader(csvfile)
    justtesting = csvfile.tell()
    size = os.fstat(csvfile.fileno()).st_size
    for row in spamreader:
        pos = csvfile.tell()
        print pos, "of", size, "|", justtesting

I threw "justtesting" in there just to prove that tell() does return 0 until I start my for loop.

This will return the same thing for every row in my csv file: 579 of 579 | 0

What am I doing wrong?


Solution

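A read-ahead buffer is used when your file is read, so the file pointer jumps in larger blocks rather than advancing line by line. csv.reader pulls its input by iterating over the file object, and in Python 2 a file object's next() method fills a hidden read-ahead buffer. tell() reports how far that buffer has read, not which row the reader has reached. A small file like yours fits entirely in the buffer, so tell() returns the file size from the first row onwards.

Note also that newlines can be embedded in quoted values, so a single CSV row may span several lines of the file.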

If you have to give a progress report, then you need to pre-count the number of lines. The following will only work if your input CSV file does not embed newlines in column values:

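import csv

with open(FILE_PERSON, 'rb') as csvfile:
    # First pass: count the physical lines, then rewind for the csv reader.
    linecount = sum(1 for _ in csvfile)
    csvfile.seek(0)
    spamreader = csv.reader(csvfile)
    # Start counting at 1 so the last row reports "linecount of linecount".
    for line, row in enumerate(spamreader, 1):
        print '{} of {}'.format(line, linecount)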

There are other methods to count the number of lines (see How to get line count cheaply in Python?), but since you'll be reading the file anyway to process it as a CSV, you may as well make use of the open file you already have. I'm not certain that opening the file as a memory map and then reading it again as a normal file is going to perform any better.
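If column values may contain embedded newlines, one option is to let csv.reader do the counting, so that the total is a number of records rather than a number of physical lines. The following is only a minimal sketch, written for Python 3 (an assumption; the code above is Python 2). The helper name count_rows and the in-memory sample data are made up for illustration.

```python
import csv
import io


def count_rows(csvfile):
    """Count CSV records (not physical lines), then rewind the file."""
    total = sum(1 for _ in csv.reader(csvfile))
    csvfile.seek(0)
    return total


# Hypothetical sample data: the second record has a newline inside quotes,
# so the data has 4 physical lines but only 3 CSV records.
sample = io.StringIO('name,note\n"Ann","line one\nline two"\nBob,ok\n')

total = count_rows(sample)
progress = []
for number, row in enumerate(csv.reader(sample), 1):
    progress.append('{} of {}'.format(number, total))
    print(progress[-1])
```

This parses the file twice, so it trades extra work for an accurate record count.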

Other tips

The csv.reader docs say:

... csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called ...

Therefore a small change to the OP's original code:

import csv
import os
filename = "tar.data"
with open(filename, 'rb') as csvfile:
    spamreader = csv.reader(csvfile)
    justtesting = csvfile.tell()
    size = os.fstat(csvfile.fileno()).st_size
    for row in spamreader:
        pos = csvfile.tell()
        print pos, "of", size, "|", justtesting
###############################################
def generator(csvfile):
    # readline seems to be the key
    while True:
        line = csvfile.readline()
        if not line:
            break
        yield line
###############################################
print
with open(filename, 'rb', 0) as csvfile:
    spamreader = csv.reader(generator(csvfile))
    justtesting = csvfile.tell()
    size = os.fstat(csvfile.fileno()).st_size
    for row in spamreader:
        pos = csvfile.tell()
        print pos, "of", size, "-", justtesting

Running this against my test data gives the following, showing that the two different approaches produce different results.

224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0

16 of 224 - 0
32 of 224 - 0
48 of 224 - 0
64 of 224 - 0
80 of 224 - 0
96 of 224 - 0
112 of 224 - 0
128 of 224 - 0
144 of 224 - 0
160 of 224 - 0
176 of 224 - 0
192 of 224 - 0
208 of 224 - 0
224 of 224 - 0

I set zero buffering on the open() call, but it made no difference. What matters is using readline() in the generator: it reads one line at a time, so tell() advances line by line as well.
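The same readline idea can be sketched for Python 3, where csv.reader expects text rather than bytes. This is only an illustration under stated assumptions: the file is opened in binary mode, each line is decoded as UTF-8, and the consumed byte count is tracked by hand instead of calling tell(). The names lines_with_progress and state, and the temporary sample file, are hypothetical.

```python
import csv
import os
import tempfile


def lines_with_progress(binfile, state, encoding='utf-8'):
    """Yield decoded lines one at a time, recording the bytes consumed so far."""
    for raw in iter(binfile.readline, b''):
        state['pos'] += len(raw)
        yield raw.decode(encoding)


# Hypothetical sample file so the sketch is self-contained.
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b'id,name\r\n1,Ann\r\n2,Bob\r\n')
    path = tmp.name

percents = []
state = {'pos': 0}
with open(path, 'rb') as csvfile:
    size = os.fstat(csvfile.fileno()).st_size
    for row in csv.reader(lines_with_progress(csvfile, state)):
        # Integer percentage of the file's bytes consumed after this row.
        percents.append(100 * state['pos'] // size)
        print(row, '{}%'.format(percents[-1]))
os.remove(path)
```

Because the generator hands the reader one line per request, the byte count is updated just before each row is parsed, and a row that spans several lines simply advances the count by all of them.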

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow