Question

I'm using glob to feed file names to a loop like so:

inputcsvfiles = glob.iglob('NCCCSM*.csv')

for x in inputcsvfiles:
    csvfilename = x
    # do stuff here

The toy example that I used to prototype this script works fine with 2, 10, or even 100 input csv files, but I actually need it to loop through 10,959 files. When using that many files, the script stops working after the first iteration and fails to find the second input file.

Given that the script works absolutely fine with a "reasonable" number of entries (2-100), but not with what I need (10,959), is there a better way to handle this situation, or some sort of parameter that I can set to allow for a high number of iterations?

PS: initially I was using glob.glob, but glob.iglob fares no better.

Edit:

An expansion of the above for more context...

    # typical input file looks like this: "NCCCSM20110101.csv", "NCCCSM20110102.csv", etc.
    inputcsvfiles = glob.iglob('NCCCSM*.csv')

    # loop over individual input files
    for x in inputcsvfiles:

        csvfile = x
        modelname = x[0:5]

        # ArcPy
        arcpy.AddJoin_management(inputshape, "CLIMATEID", csvfile, "CLIMATEID", "KEEP_COMMON")

        # do more stuff after

The script fails at the ArcPy line, where the "csvfile" variable gets passed into the command. The error reported is that it can't find a specified CSV file (e.g., "NCCSM20110101.csv"), when in fact the CSV is definitely in the directory. Could it be that you can't reuse a declared variable (x) multiple times as I have above? Again, this works fine if the directory being glob'd only has 100 or so files, but if there is a whole lot (e.g., 10,959), it fails seemingly arbitrarily somewhere down the list.


Solution

One issue that arose was not with Python per se, but rather with ArcPy and/or Microsoft's handling of CSV files (more the latter, I think). As the loop iterates, a schema.ini file is created, and information about each CSV file processed in the loop gets appended to it. Over time the schema.ini grows rather large, and I believe that is when the performance issues arise.

My solution, although perhaps inelegant, was to delete the schema.ini file on each iteration of the loop to avoid the issue. Doing so allowed me to process the 10k+ CSV files, although rather slowly. Truth be told, we wound up using GRASS and BASH scripting in the end.
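A minimal sketch of that workaround, assuming the CSV files (and the schema.ini that gets written alongside them) sit in the current working directory; inputshape is the feature layer placeholder from the question, set up earlier in the script:

    import glob
    import os

    import arcpy

    # inputshape is assumed to be a feature layer created earlier in the script
    for csvfile in glob.iglob('NCCCSM*.csv'):
        # delete the schema.ini that the CSV text driver keeps appending to,
        # so it cannot grow without bound over thousands of files
        try:
            os.remove('schema.ini')
        except OSError:
            pass  # not there yet on the first iteration

        arcpy.AddJoin_management(inputshape, "CLIMATEID", csvfile, "CLIMATEID", "KEEP_COMMON")
        # ... further per-file processing here ...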

OTHER TIPS

If it works for 100 files but fails for 10,000, then check that arcpy.AddJoin_management closes the CSV file after it is done with it.

There is a limit on the number of files that a process may have open at any one time (on Linux or macOS you can check it by running ulimit -n).
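On Unix-like systems that same limit can also be inspected, and raised up to the hard ceiling, from inside Python with the standard resource module; a quick sketch:

    import resource

    # soft limit is what currently applies to this process;
    # hard limit is the ceiling it may be raised to without extra privileges
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("open-file limit: soft=%d hard=%d" % (soft, hard))

    # raise the soft limit to the hard limit if many files must stay open
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))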

Try running ls * in a shell on those 10,000 entries and the shell may fail too (the expanded argument list can exceed the command-line length limit). How about walking the directory and yielding those files one by one for your purpose?

# credit - @dabeaz - generators tutorial

import os
import fnmatch

def gen_find(filepat, top):
    """Lazily yield paths of files matching shell pattern filepat, walking down from top."""
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

# Example use

if __name__ == '__main__':
    lognames = gen_find("NCCCSM*.csv", ".")
    for name in lognames:
        print(name)
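On Python 3, pathlib gives the same lazy, recursive iteration in a couple of lines, if that fits the environment better:

    from pathlib import Path

    # Path.rglob walks subdirectories and yields matches lazily,
    # much like the generator above
    for csv_path in Path(".").rglob("NCCCSM*.csv"):
        print(csv_path)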
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow