Question

Quick question regarding globbing in Python.

I have a directory of files named 'sync_0001.tif', 'sync_0002.tif', ... , 'sync_2400.tif'. I'd like to obtain 3 subset lists of those files: one for the first 800 files, one for the second 800 files, and one for the last 800 files. The only problem is the 0's before the numbers. I can't figure out the right way to glob and obtain those lists. The third list is easy because there are no 0's padding any of those files (s3 = glob.glob('sync_[1601-2400].tif')). The other two are trickier because the number of 0's out front varies.

I tried this, but got 'bad character range,' I'm guessing because of the 0's:

s1 = glob.glob('sync_' + '{[0001-0009], [0010-0099], [0100-0800]}' + '.tif')
s2 = glob.glob('sync_' + '{[0801-0999], [1000-1600]}' + '.tif')

I then tried moving the 0's out front like so, but got an empty list:

s1 = glob.glob('sync_' + '{000[1-9], 00[10-99], 0[100-800]}' + '.tif')

What's the best way to achieve these three lists? I'm starting to think I have the whole glob thing wrong, so if someone could shed some light that would be great. Thanks!

Solution

The fnmatch module underpinning the glob.glob() function is not nearly sophisticated enough for your task.
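
To see why the patterns in the question fail: a bracket expression matches exactly one character from a set, and braces are not alternation in fnmatch, so they are compared literally. A minimal sketch using fnmatch.fnmatchcase directly (no files needed; the variable names are only for illustration):

```python
from fnmatch import fnmatchcase

# Four single-digit classes match the zero-padded number.
four_digits = fnmatchcase('sync_0001.tif', 'sync_[0-9][0-9][0-9][0-9].tif')

# '[1601-2400]' is a set of single characters (0, 1, 2, 4, 6), not the
# numeric range 1601..2400, so it can only match a one-character "number".
one_char = fnmatchcase('sync_1.tif', 'sync_[1601-2400].tif')
full_number = fnmatchcase('sync_1601.tif', 'sync_[1601-2400].tif')

# Braces are not alternation here; they are treated as literal characters.
braces = fnmatchcase('sync_0001.tif', 'sync_{000[1-9]}.tif')

print(four_digits, one_char, full_number, braces)
```

So the s3 pattern from the question does not actually select files 1601 to 2400; it only looks plausible.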

Just grab all filenames and partition them after sorting:

filenames = sorted(glob.glob('sync_[0-9][0-9][0-9][0-9].tif'))

This works because your numbers are padded and can thus be sorted lexicographically. Then partition them:

s1 = [f for f in filenames if 0 < int(f[5:9]) <= 800]
s2 = [f for f in filenames if 800 < int(f[5:9]) <= 1600]
s3 = [f for f in filenames if 1600 < int(f[5:9]) <= 2400]
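
Note that the slice f[5:9] assumes the glob pattern has no directory part, so every name starts with 'sync_'. If you glob with a path prefix, something like the following sketch may be safer. The file_number helper and the 'data' directory are my own placeholders, not part of the question, and the list of names stands in for the real glob result:

```python
import os
import re

def file_number(path):
    # Hypothetical helper: pull the 4-digit number out of 'sync_NNNN.tif',
    # ignoring any leading directory in the path.
    match = re.fullmatch(r'sync_(\d{4})\.tif', os.path.basename(path))
    if match is None:
        raise ValueError('unexpected filename: %r' % path)
    return int(match.group(1))

# Stand-in for sorted(glob.glob(...)), here with a directory prefix.
filenames = [os.path.join('data', 'sync_%04d.tif' % n) for n in range(1, 2401)]

s1 = [f for f in filenames if 0 < file_number(f) <= 800]
s2 = [f for f in filenames if 800 < file_number(f) <= 1600]
s3 = [f for f in filenames if 1600 < file_number(f) <= 2400]
```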

The directory I/O will be the slowest here anyway. You can make this all a little more efficient by looping just once and swapping what you append to:

target = s1 = []
s2 = []
s3 = []
for f in filenames:
    num = int(f[5:9])
    # Test the higher threshold first; otherwise this branch is never reached.
    if num > 1600:
        target = s3
    elif num > 800:
        target = s2
    target.append(f)

but for a task like this sticking to the simpler list comprehensions is just fine too.
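
Since the three ranges are all the same size, the bucket can also be computed arithmetically instead of with if/elif. A minimal sketch, assuming the numbers start at 1 and each batch holds 800 files (the batches and batch_size names are mine, and the generated names stand in for the real glob result):

```python
# Stand-in for sorted(glob.glob('sync_[0-9][0-9][0-9][0-9].tif')).
filenames = ['sync_%04d.tif' % n for n in range(1, 2401)]

batch_size = 800
batches = [[], [], []]  # will become s1, s2, s3
for f in filenames:
    num = int(f[5:9])
    # 1..800 -> index 0, 801..1600 -> index 1, 1601..2400 -> index 2
    batches[(num - 1) // batch_size].append(f)

s1, s2, s3 = batches
```

This raises an IndexError if a file number above 2400 shows up, which may or may not be what you want.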

OTHER TIPS

The best way to do it is simply:

  1. Glob all the files that start with sync
  2. Sort the list by the number component
  3. Split it into chunks of 800

Since you already know globbing, the rest is:

import glob
import re
from itertools import zip_longest  # named izip_longest in Python 2

# https://docs.python.org/2/library/itertools.html#recipes
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)


def sorter(x):
    # Raw string so that \d is passed to the regex engine unchanged.
    return int(re.search(r'(\d+)', x).group(1))

files = glob.glob('sync*.tif')
sorted_files = sorted(files, key=sorter)
in_batches = list(grouper(sorted_files, 800))

As the pattern is always sync_ (after your edit), you can simplify the code above to the following:

files = glob.glob('sync_*.tif')
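# Strip the '.tif' extension before converting; int('0001.tif') would raise ValueError.
sorted_files = sorted(files, key=lambda x: int(x.split('_')[1].split('.')[0]))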
in_batches = list(grouper(sorted_files, 800))
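
One thing to be aware of: grouper pads the last chunk with the fill value (None by default) when the number of files is not a multiple of 800. If that matters, plain slicing is an alternative. A minimal sketch, where chunks is a hypothetical helper of mine and the 2000 generated names stand in for the sorted glob result:

```python
def chunks(items, size):
    # Hypothetical helper: consecutive slices; the last one may be shorter.
    return [items[i:i + size] for i in range(0, len(items), size)]

# Stand-in for sorted_files, deliberately not a multiple of 800.
sorted_files = ['sync_%04d.tif' % n for n in range(1, 2001)]
in_batches = chunks(sorted_files, 800)
```

Here the last batch simply holds the remaining 400 names, with no padding.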
Licensed under: CC-BY-SA with attribution