If the files are not modified after they are processed, one option is to remember the latest processed file and then use find's -newer option to retrieve not-yet-processed files.
find "$rawdatapath" -name '*.gz' -newer "$(<latest_file)" -exec process.sh {} \;
where process.sh looks like
#!/usr/bin/env bash
echo "processing, new: $1"
# unzip file and import into mongodb
echo "$1" > latest_file
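This is untested. Look out for unwanted side effects before implementing this strategy. In particular, find does not visit files in modification-time order, so the file recorded in latest_file is simply the last one processed and may not be the newest.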
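A variation on the same idea compares against the timestamp of a marker file rather than the last processed file, so the result does not depend on the order in which find visits files. The following is only a sketch under a few assumptions: the function name process_new_files, the marker path, and the handler argument are hypothetical, the handler must be an executable (such as process.sh), and the marker should not itself match *.gz.

```shell
#!/usr/bin/env bash
# Sketch: process files modified after the previous run started, then advance
# the marker. process_new_files and its arguments are hypothetical names.

process_new_files() {
    local dir=$1 marker=$2 handler=$3
    local next="${marker}.next"

    # Stamp the start of this run; files modified later are left for the next run.
    touch "$next"

    if [ -e "$marker" ]; then
        # Files strictly newer than the old marker, but not newer than the new one.
        find "$dir" -name '*.gz' -newer "$marker" ! -newer "$next" -exec "$handler" {} \;
    else
        # First run: no marker yet, so every file up to the new stamp counts as new.
        find "$dir" -name '*.gz' ! -newer "$next" -exec "$handler" {} \;
    fi

    # The stamp of this run becomes the reference point for the next one.
    mv "$next" "$marker"
}
```

It could be called as process_new_files "$rawdatapath" ./marker ./process.sh, in which case process.sh would no longer need to write latest_file. On filesystems with coarse timestamp resolution, a file that arrives in the same tick as the marker could still be missed.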
If a hacky/quick'n'dirty solution is acceptable, one funky alternative is to encode the state (processed or not processed) in the file permissions, for instance in the group read permission bit. Assuming your umask is 022, so that any newly created file has permissions 644, change the permission to 600 after processing a file and use find's -perm option to retrieve not-yet-processed files.
find "$rawdatapath" -name '*.gz' -perm 644 -exec process.sh {} \;
where process.sh looks like
#!/usr/bin/env bash
echo "processing, new: $1"
# unzip file and import into mongodb
chmod 600 "$1"
Again, this is untested. Look out for unwanted side effects before implementing this strategy.
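Since the state lives in a single bit, a slightly more tolerant sketch can test and clear only the group read bit instead of matching the full mode 644, which would also cope with files whose other permission bits differ. The helper names list_unprocessed and mark_processed are hypothetical, and the approach still assumes that new files arrive with group read set (as they would with umask 022).

```shell
#!/usr/bin/env bash
# Sketch: use the group-read bit as a "not yet processed" flag.
# list_unprocessed and mark_processed are hypothetical helper names.

list_unprocessed() {
    # -perm -g=r matches files that have the group-read bit set,
    # regardless of the remaining permission bits.
    find "$1" -name '*.gz' -perm -g=r
}

mark_processed() {
    # Clear only the group-read bit; all other bits are left untouched.
    chmod g-r "$1"
}
```

In process.sh, mark_processed "$1" would then take the place of chmod 600 "$1".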