Question

I am trying to split up a large XML file into smaller chunks. I write to the output file and then check its size to see if it has passed a threshold, but I don't think the getsize() method is working as expected.

What would be a good way to get the size of a file that is still changing in size?

I've done something like this:

import string
import os

f1 = open('VSERVICE.xml', 'r')
f2 = open('split.xml', 'w')

for line in f1:
  if str(line) == '</Service>\n':
    break
  else:
    f2.write(line)
    size = os.path.getsize('split.xml')
    print('size = ' + str(size))

Running this prints 0 as the file size for about 80 iterations and then 4176. Does Python store the output in a buffer before actually writing it out?


Solution

Yes, Python is buffering your output. You'd be better off tracking the size yourself, something like this:

size = 0
for line in f1:
  if str(line) == '</Service>\n':
    break
  else:
    f2.write(line)
    size += len(line)
    print('size = ' + str(size))

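(That might not be 100% accurate. For example, on Windows each line gains a byte on disk because '\n' is written as '\r\n' in text mode, and in Python 3 len(line) counts characters rather than bytes, so non-ASCII text is undercounted. It should still be good enough for simple chunking.)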
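
As a rough sketch of how such a counter could drive the actual chunking, the function below starts a new output file whenever the running total passes a threshold. The name split_by_size, the max_bytes parameter and the chunk naming scheme are all made up for illustration. It counts encoded bytes instead of characters and opens the files with newline='' so that no line-ending translation changes the count. It only splits on line boundaries, so it makes no attempt to keep each chunk well-formed XML.

```python
def split_by_size(src_path, out_prefix, max_bytes, encoding='utf-8'):
    """Copy src_path into numbered chunk files of roughly max_bytes each.

    Hypothetical helper: splits on line boundaries only and returns the
    list of chunk paths it wrote.
    """
    paths = []
    out = None
    size = 0
    # newline='' disables newline translation, so the bytes counted here
    # are the bytes that end up in the output files.
    with open(src_path, 'r', encoding=encoding, newline='') as src:
        for line in src:
            # Start a new chunk for the first line, or once the current
            # chunk has reached the threshold.
            if out is None or size >= max_bytes:
                if out is not None:
                    out.close()
                path = '%s_%03d.xml' % (out_prefix, len(paths))
                out = open(path, 'w', encoding=encoding, newline='')
                paths.append(path)
                size = 0
            out.write(line)
            size += len(line.encode(encoding))
    if out is not None:
        out.close()
    return paths
```

Because the threshold is checked before each write, a chunk can exceed max_bytes by at most one line. Something like split_by_size('VSERVICE.xml', 'split', 1000000) would then produce split_000.xml, split_001.xml and so on.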

OTHER TIPS

File size is different from file position. For example,

os.path.getsize('sample.txt') 

This returns the file size in bytes.

But

f = open('sample.txt')
print(f.readline())
print(f.tell())

Here f.tell() returns the current position of the file object, i.e. where the next read or write will take place. Since it is aware of the buffering, it should be accurate as long as you are simply appending to the output file.

Have you tried replacing os.path.getsize with the file object's tell() method, like this:

f2.write(line)
size = f2.tell()
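
A small experiment can make the difference between the two numbers visible. The sketch below is only an illustration: the function name and its arguments are invented, the file is opened in binary mode so that positions are plain byte counts, and it assumes the data is small enough to stay in Python's write buffer.

```python
import os


def compare_size_and_position(path, data=b'x' * 100):
    """Return (on-disk size, tell() position) right after a buffered write.

    Hypothetical helper for illustration; assumes data is small enough
    to stay in Python's write buffer.
    """
    with open(path, 'wb') as f:
        f.write(data)
        on_disk = os.path.getsize(path)  # buffered bytes are not counted yet
        position = f.tell()              # includes the buffered bytes
    return on_disk, position
```

With the default 100 bytes of data this should return 0 for the size and 100 for the position. Once the file is closed, os.path.getsize reports 100 as well.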

Tracking the size yourself will be fine for your case. A different way would be to flush the file buffers just before you check the size:

f2.write(line)
f2.flush()  # <-- Python's buffer is handed to the operating system
size = os.path.getsize('split.xml')

Doing that too often will slow down file I/O, of course.
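
One way to limit that cost, sketched here under made-up names (write_until_limit, max_bytes, check_every), is to flush and measure only every so many lines. The trade-off is that the file can overshoot the limit by up to check_every lines before the check notices.

```python
import os


def write_until_limit(lines, path, max_bytes, check_every=100):
    """Write lines to path, stopping once the on-disk size reaches max_bytes.

    Hypothetical helper: the size is only checked every check_every lines,
    so the file can overshoot max_bytes. Returns the number of lines written.
    """
    written = 0
    with open(path, 'w', encoding='utf-8', newline='') as f:
        for line in lines:
            f.write(line)
            written += 1
            if written % check_every == 0:
                f.flush()  # push Python's buffer out before measuring
                if os.path.getsize(path) >= max_bytes:
                    break
    return written
```

A larger check_every means fewer flushes and a coarser limit, so the right value depends on how exact the chunk size needs to be.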

To find the offset to the end of a file:

file.seek(0,2)
print(file.tell())

Real-world example: read updates to a file and print them as they happen:

import time

# binary mode, so offsets are plain byte counts
file = open('log.txt', 'rb')
# find the initial end-of-file offset
file.seek(0, 2)
eof = file.tell()
while True:
    # measure the file size again
    file.seek(0, 2)
    neweof = file.tell()
    # if the file is larger...
    if neweof > eof:
        # go back to the last position...
        file.seek(eof)
        # print from the last position to the current one
        print(file.read(neweof - eof).decode(errors='replace'), end='')
        eof = neweof
    # pause briefly so the loop does not spin the CPU
    time.sleep(0.5)
Licensed under: CC-BY-SA with attribution