Question

Given a directory with a large number of small files (more than 1 million), what's a fast way to remember which files were already processed (for a database import)?

The first solution I tried was a bash script:

#find all gz files
for f in $(find "$rawdatapath" -name '*.gz'); do
    filename=$(basename "$f")

    #check whether the filename is already contained in the processed list
    onlist=$(grep "$filename" "$processed_files")
    if [[ -z "$onlist" ]]; then
        echo "processing, new: $filename"
        #unzip file and import into mongodb

        #write filename into processed list
        echo "$filename" #>> $processed_files
    fi
done

For a smaller sample (160k files), this took about 8 minutes (without doing any actual processing).

Next I tried a python script:

import os

path = "/home/b2blogin/webapps/mongodb/rawdata/segment_slideproof_testing"
processed_files_file = os.path.join(path,"processed_files.txt")
processed_files = [line.strip() for line in open(processed_files_file)]

with open(processed_files_file, "a") as pff:
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(".gz"):
                if file not in processed_files:
                    pff.write("%s\n" % file)

This runs in less than 2 mins.

Is there a significantly faster way that I'm overlooking?

Other solutions:

  • Moving processed files to a different location is not convenient, since I use s3sync to download new files
  • Since the files have a timestamp as part of their name, I might rely on processing them in order and only compare the name to a "last processed" date (see the sketch after this list)
  • Alternatively, I could keep track of the last time processing ran, and only process files that have been modified since.
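
A minimal, untested sketch of the name-comparison idea, assuming the timestamp embedded in each filename makes names sort chronologically, and that a file called last_processed (a hypothetical name) holds the name of the most recently processed file:

#!/usr/bin/env bash
# Sketch: relies on filenames sorting in chronological order (GNU find assumed for -printf)
last=$(cat last_processed 2>/dev/null)

find "$rawdatapath" -name '*.gz' -printf '%f\n' | sort | while read -r filename; do
    # lexical comparison stands in for a "newer than last processed" date check
    if [[ "$filename" > "$last" ]]; then
        echo "processing, new: $filename"
        #unzip file and import into mongodb
        echo "$filename" > last_processed
    fi
done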

Solution

Just use a set. Membership tests on a set are O(1) hash lookups on average, whereas "file not in processed_files" on a list rescans the whole list for every file:

import os

path = "/home/b2blogin/webapps/mongodb/rawdata/segment_slideproof_testing"
processed_files_file = os.path.join(path,"processed_files.txt")
processed_files = set(line.strip() for line in open(processed_files_file))

with open(processed_files_file, "a") as pff:
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(".gz"):
                if file not in processed_files:
                    pff.write("%s\n" % file)

Other tips

Alternative approach using standard command line utilities:

Just diff a file containing a listing of all files with a file containing a listing of processed files.

Easy to try, and should be quite fast.

If you include full timestamps in the listing, you can pick up 'changed' files this way too.
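
A minimal sketch of that idea, assuming GNU find (for -printf) and that the processed names live in processed_files.txt; comm -23 on two sorted listings prints only the names missing from the processed list, which is the same comparison diff would give in a more script-friendly form:

# list every .gz filename (names only), sorted
find "$rawdatapath" -name '*.gz' -printf '%f\n' | sort > all_files.txt
# keep the names that are not yet in the processed list
sort processed_files.txt | comm -23 all_files.txt - > new_files.txt

Each name in new_files.txt can then be unzipped, imported, and appended to processed_files.txt.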

If the files are not modified after they are processed, one option is to remember the latest processed file and then use find's -newer option to retrieve not-yet-processed files.

find "$rawdatapath" -name '*.gz' -newer "$(<latest_file)" -exec process.sh {} \;

where process.sh looks like

#!/usr/bin/env bash
echo "processing, new: $1"
#unzip file and import into mongodb
echo "$1" > latest_file

This is untested. Look out for unwanted side effects before adopting this strategy.

If a hacky/quick'n'dirty solution is acceptable, one funky alternative is to encode the state (processed or not processed) in the file permissions, for instance in the group read permission bit. Assuming your umask is 022, so that any newly created file has permissions 644, change the permission to 600 after processing a file and use find's -perm option to retrieve not-yet-processed files.

find "$rawdatapath" -name '*.gz' -perm 644 -exec process.sh {} \;

where process.sh looks like

#!/usr/bin/env bash
echo "processing, new: $1"
#unzip file and import into mongodb
chmod 600 "$1"

Again, this is untested. Look out for unwanted side effects before adopting this strategy.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow