Question

The Python 2 docs for filecmp.cmp() say:

Unless shallow is given and is false, files with identical os.stat() signatures are taken to be equal.

That sounds as if two files that are identical except for their os.stat() signatures will be considered unequal. However, this does not seem to be the case, as the following code snippet illustrates:

import filecmp
import os
import shutil
import time

with open('test_file_1', 'w') as f:
    f.write('file contents')
shutil.copy('test_file_1', 'test_file_2')
time.sleep(5)  # pause to get a different time-stamp
os.utime('test_file_2', None)  # change copied file's time-stamp

print 'test_file_1:', os.stat('test_file_1')
print 'test_file_2:', os.stat('test_file_2')
print 'filecmp.cmp():', filecmp.cmp('test_file_1', 'test_file_2')

Output:

test_file_1: nt.stat_result(st_mode=33206, st_ino=0L, st_dev=0, st_nlink=0,
  st_uid=0, st_gid=0, st_size=13L, st_atime=1320719522L, st_mtime=1320720444L, 
  st_ctime=1320719522L)
test_file_2: nt.stat_result(st_mode=33206, st_ino=0L, st_dev=0, st_nlink=0, 
  st_uid=0, st_gid=0, st_size=13L, st_atime=1320720504L, st_mtime=1320720504L, 
  st_ctime=1320719539L)
filecmp.cmp(): True

As you can see, the two files' time stamps (st_atime, st_mtime, and st_ctime) are clearly not the same, yet filecmp.cmp() indicates that the two are identical. Am I misunderstanding something, or is there a bug in either filecmp.cmp()'s implementation or its documentation?

Update

The Python 3 documentation has been rephrased and currently says the following, which IMHO is an improvement only in the sense that it better implies that files with different time stamps might still be considered equal even when shallow is True.

If shallow is true, files with identical os.stat() signatures are taken to be equal. Otherwise, the contents of the files are compared.

FWIW I think it would have been better to simply have said something like this:

If shallow is true, file content is compared only when os.stat() signatures are unequal.


Solution

You're misunderstanding the documentation. Line #2 says:

Unless shallow is given and is false, files with identical os.stat() signatures are taken to be equal.

Files with identical os.stat() signatures are taken to be equal, but the logical inverse is not true: files with unequal os.stat() signatures are not necessarily taken to be unequal. Rather, they may or may not be equal, so the actual file contents are compared. Since the file contents are found to be identical, filecmp.cmp() returns True.
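
To make the order of the checks concrete, here is a minimal sketch of the decision logic as I understand it from the standard library's filecmp module. The names signature and sketch_cmp are hypothetical, the cache is left out, and the real implementation reads the files in chunks rather than all at once, so treat this as an approximation and not the actual code.

```python
import filecmp
import os
import stat
import tempfile


def signature(path):
    """(file type, size, mtime): the kind of tuple a shallow comparison looks at."""
    st = os.stat(path)
    return (stat.S_IFMT(st.st_mode), st.st_size, st.st_mtime)


def sketch_cmp(a, b, shallow=True):
    """Hypothetical restatement of the decision order; no caching."""
    sig_a, sig_b = signature(a), signature(b)
    if sig_a[0] != stat.S_IFREG or sig_b[0] != stat.S_IFREG:
        return False  # only regular files are compared
    if shallow and sig_a == sig_b:
        return True   # identical signatures: assumed equal, contents not read
    if sig_a[1] != sig_b[1]:
        return False  # different sizes: the files cannot be equal
    with open(a, 'rb') as fa, open(b, 'rb') as fb:
        return fa.read() == fb.read()  # otherwise the contents decide


# Two files with the same contents but different time stamps.
tmp_dir = tempfile.mkdtemp()
path_1 = os.path.join(tmp_dir, 'test_file_1')
path_2 = os.path.join(tmp_dir, 'test_file_2')
for path in (path_1, path_2):
    with open(path, 'w') as f:
        f.write('file contents')
os.utime(path_2, (1320720504, 1320720504))  # force an older time stamp

same_signature = signature(path_1) == signature(path_2)
sketch_result = sketch_cmp(path_1, path_2)
library_result = filecmp.cmp(path_1, path_2)
```

Here the signatures differ, so the shortcut is skipped, the sizes match, and the contents are read and found equal. Both the sketch and filecmp.cmp() end up returning True, which is the behaviour seen in the question.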

As per the third clause, once it determines that the files are equal, it will cache that result and not bother re-reading the file contents if you ask it to compare the same files again, so long as those files' os.stat structures don't change.
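
The flip side of trusting os.stat() is worth seeing once: if two files have the same size and modification time but different contents, the default shallow comparison reports them as equal without reading them. The snippet below is only an illustration under contrived conditions (the file names and the time stamp value are made up, and the time stamps are forced to match with os.utime()).

```python
import filecmp
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path_a = os.path.join(tmp_dir, 'a.txt')
path_b = os.path.join(tmp_dir, 'b.txt')
with open(path_a, 'w') as f:
    f.write('file contents')
with open(path_b, 'w') as f:
    f.write('FILE CONTENTS')  # same length, different bytes

# Give both files the same access and modification times.
stamp = 1320720504
os.utime(path_a, (stamp, stamp))
os.utime(path_b, (stamp, stamp))

shallow_result = filecmp.cmp(path_a, path_b)              # looks at os.stat() only
deep_result = filecmp.cmp(path_a, path_b, shallow=False)  # reads both files
```

With matching signatures the shallow call returns True, while shallow=False compares the contents and returns False.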

OTHER TIPS

It seems that 'rolling your own' is indeed what is required to produce a desirable result. It would simply be nice if the documentation were clear enough to make a casual reader reach that conclusion.

Here's the function I am presently using:

import os

def cmp_stat_weak(a, b):
    """Return True if a and b have the same size and modification time."""
    sa = os.stat(a)
    sb = os.stat(b)
    return sa.st_size == sb.st_size and sa.st_mtime == sb.st_mtime
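
For comparison, a short usage sketch that reproduces the scenario from the question in a temporary directory. The function is repeated so the snippet runs on its own; the paths and the time stamp value are assumptions chosen for illustration.

```python
import filecmp
import os
import shutil
import tempfile


def cmp_stat_weak(a, b):
    """Return True if a and b have the same size and modification time."""
    sa = os.stat(a)
    sb = os.stat(b)
    return sa.st_size == sb.st_size and sa.st_mtime == sb.st_mtime


tmp_dir = tempfile.mkdtemp()
original = os.path.join(tmp_dir, 'test_file_1')
copy = os.path.join(tmp_dir, 'test_file_2')
with open(original, 'w') as f:
    f.write('file contents')
shutil.copy(original, copy)
os.utime(copy, (1320720504, 1320720504))  # give the copy a different time stamp

weak_result = cmp_stat_weak(original, copy)    # stat-only check
library_result = filecmp.cmp(original, copy)   # falls back to reading contents
```

The stat-only check returns False because the modification times differ, whereas filecmp.cmp() returns True because the contents match.
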
Licensed under: CC-BY-SA with attribution