Question

I'm trying to keep track of a large number of files, referenced in a database, that might have been transferred or might not yet exist.

I'm finding that looping over thousands of files over a slow network, checking whether each one exists with isfile, is quite slow:

os.path.isfile(filepath)
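
In other words, something like the following sketch, where filepaths_from_db is a hypothetical stand-in for whatever my database query returns:

import os

# One stat call (one network round trip) per file.
missing = [p for p in filepaths_from_db if not os.path.isfile(p)]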

Would it be faster to use files = os.listdir(path) to get a list of files and then compare that against my database? Is there another way of doing this?


Solution

Would it be faster to use files=os.listdir(path) to get a list of files instead?

It depends.

If you're looking for 1000 files out of 3000 across 30 directories, listing the 30 directories will be faster on almost any filesystem.

If you're looking for 1000 files out of 100000 across 1000 directories, it will obviously be slower to list 1000 directories than to just stat 1000 files.
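To illustrate the listing approach, here is a minimal sketch (the find_existing name is my own; expected_paths stands in for whatever your database gives you) that lists each directory once and then checks filenames in memory:

import os

def find_existing(expected_paths):
    # Group the expected paths by directory so each directory is
    # listed over the network only once.
    by_dir = {}
    for path in expected_paths:
        by_dir.setdefault(os.path.dirname(path), []).append(path)

    existing = set()
    for directory, paths in by_dir.items():
        try:
            names = set(os.listdir(directory))  # one round trip per directory
        except OSError:
            continue  # the directory itself doesn't exist (yet)
        existing.update(p for p in paths if os.path.basename(p) in names)
    return existing

The trade-off is exactly the one described above: this does one round trip per directory but transfers every filename in it, while isfile does one round trip per file but transfers almost nothing.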

As a rough guide, on a typical *nix system such as OS X or Linux, listing a directory takes about as long per few dozen filenames as stat-ing a single file. On some network filesystems, however, latency can be a much bigger problem than bandwidth, in which case that ratio can go way up.

For your real-life use case, if it's not obvious which will be faster, try them both (maybe for a smaller subset) and compare.
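A minimal benchmark sketch for such a comparison might look like this (the sample_paths values are hypothetical placeholders; fill in a representative subset of absolute paths from your database):

import os
import time

sample_paths = ["/remote/share/dir1/a.dat", "/remote/share/dir2/b.dat"]

# Approach 1: one stat call per file.
start = time.perf_counter()
found_stat = sum(os.path.isfile(p) for p in sample_paths)
print("isfile: %d found in %.3fs" % (found_stat, time.perf_counter() - start))

# Approach 2: list each directory once, then check names in memory.
start = time.perf_counter()
dirs = {os.path.dirname(p) for p in sample_paths}
listings = {d: set(os.listdir(d)) if os.path.isdir(d) else set() for d in dirs}
found_list = sum(os.path.basename(p) in listings[os.path.dirname(p)] for p in sample_paths)
print("listdir: %d found in %.3fs" % (found_list, time.perf_counter() - start))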

One last thing: if you're trying to "keep track" over a long, continuous period, and you're currently re-checking all the files periodically, there may be a way to avoid that. Depending on your platform and your sharing protocol, you may be able to set up a filesystem watch on the files or directories and react when a change happens instead.
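For example, using the third-party watchdog package (an assumption on my part; it isn't in the standard library, not every network filesystem delivers change events, and the /data/incoming path is a hypothetical placeholder), a watch might look like this:

import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class ArrivalHandler(FileSystemEventHandler):
    def on_created(self, event):
        # Called whenever something new appears under the watched tree.
        if not event.is_directory:
            print("new file arrived:", event.src_path)
            # e.g. mark this file as present in your database here

observer = Observer()
observer.schedule(ArrivalHandler(), path="/data/incoming", recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()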

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow