Question

Given a directory with a large number of small files (more than 1 million), what's a fast way to remember which files were already processed (for a database import)?

The first solution I tried was a bash script:

#find all gz files
for f in $(find "$rawdatapath" -name '*.gz'); do
    filename=$(basename "$f")

    #check whether the filename is already contained in the processed list
    onlist=$(grep "$filename" "$processed_files")
    if [[ -z "$onlist" ]]; then
        echo "processing, new: $filename"
        #unzip file and import into mongodb

        #write filename into processed list
        echo "$filename" #>> $processed_files
    fi
done

For a smaller sample (160k files), this took about 8 minutes (without doing any actual processing).

Next I tried a python script:

import os

path = "/home/b2blogin/webapps/mongodb/rawdata/segment_slideproof_testing"
processed_files_file = os.path.join(path,"processed_files.txt")
processed_files = [line.strip() for line in open(processed_files_file)]

with open(processed_files_file, "a") as pff:
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(".gz"):
                if file not in processed_files:
                    pff.write("%s\n" % file)

This runs in less than 2 mins.

Is there a significantly faster way that I'm overlooking?

Other solutions:

  • Moving processed files to a different location is not convenient, since I use s3sync to download new files
  • Since the files have a timestamp as part of their name, I might rely on processing them in order and only compare the name to a "last processed" date (see the sketch after this list)
  • Alternatively, I could keep track of the last time processing ran, and only process files that have been modified since.
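
A minimal, untested sketch of the name-comparison idea, assuming the timestamp embedded in each filename makes names sort chronologically, and that a file called last_processed (a hypothetical name) holds the name of the most recently processed file:

#!/usr/bin/env bash
# Sketch: relies on filenames sorting in chronological order (GNU find assumed for -printf)
last=$(cat last_processed 2>/dev/null)

find "$rawdatapath" -name '*.gz' -printf '%f\n' | sort | while read -r filename; do
    # lexical comparison stands in for a "newer than last processed" date check
    if [[ "$filename" > "$last" ]]; then
        echo "processing, new: $filename"
        #unzip file and import into mongodb
        echo "$filename" > last_processed
    fi
done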

Solution

Just use a set. Membership tests on a set are O(1) hash lookups on average, whereas "file not in processed_files" on a list rescans the whole list for every file:

import os

path = "/home/b2blogin/webapps/mongodb/rawdata/segment_slideproof_testing"
processed_files_file = os.path.join(path,"processed_files.txt")
processed_files = set(line.strip() for line in open(processed_files_file))

with open(processed_files_file, "a") as pff:
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(".gz"):
                if file not in processed_files:
                    pff.write("%s\n" % file)

Other tips

Alternative approach using standard command line utilities:

Just diff a file containing a listing of all files with a file containing a listing of processed files.

Easy to try, and should be quite fast.

If you include full timestamps in the listing, you can pick up 'changed' files this way too.
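
A minimal sketch of that idea, assuming GNU find (for -printf) and that the processed names live in processed_files.txt; comm -23 on two sorted listings prints only the names missing from the processed list, which is the same comparison diff would give in a more script-friendly form:

# list every .gz filename (names only), sorted
find "$rawdatapath" -name '*.gz' -printf '%f\n' | sort > all_files.txt
# keep the names that are not yet in the processed list
sort processed_files.txt | comm -23 all_files.txt - > new_files.txt

Each name in new_files.txt can then be unzipped, imported, and appended to processed_files.txt.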

If the files are not modified after they are processed, one option is to remember the latest processed file and then use find's -newer option to retrieve not-yet-processed files.

find "$rawdatapath" -name '*.gz' -newer "$(<latest_file)" -exec process.sh {} \;

where process.sh looks like

#!/usr/bin/env bash
echo "processing, new: $1"
#unzip file and import into mongodb
echo "$1" > latest_file

This is untested. Look out for unwanted side effects before adopting this strategy.

If a hacky/quick'n'dirty solution is acceptable, one funky alternative is to encode the state (processed or not processed) in the file permissions, for instance in the group read permission bit. Assuming your umask is 022, so that any newly created file has permissions 644, change the permission to 600 after processing a file and use find's -perm option to retrieve not-yet-processed files.

find "$rawdatapath" -name '*.gz' -perm 644 -exec process.sh {} \;

where process.sh looks like

#!/usr/bin/env bash
echo "processing, new: $1"
#unzip file and import into mongodb
chmod 600 "$1"

Again, this is untested. Look out for unwanted side effects before adopting this strategy.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow