Question

In Hadoop, how do I handle daily increasing data?

For example:

On the 1st day I may have 1 million files in some input folder (e.g. hadoop/demo).

On the 2nd day, another 1 million new files may be added to the same folder, so the existing 1 million files grow to 2 million in total.

Likewise it keeps going on the 3rd, 4th days, and so on.

My constraint is -> the 1st day's files should not be processed again on the next day.

(i.e.) Files that have already been processed should not be processed again when new files are added alongside them. More specifically, only the newly added files should be processed and the older files should be ignored.

Please help me find a way to solve this issue.

If you still don't understand the constraint, kindly say where it's unclear so that I can elaborate on it!

Solution

This is not something supported by Hadoop itself, since it is part of the application logic. I would suggest an HDFS-based solution: keep a directory (or better, a hierarchy of directories with a subdirectory per day) holding the data that is yet to be processed.
Your daily job should take all the data there, process it, and move it to a "processed" folder.
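
For illustration, here is a minimal sketch of that move step using the HDFS FileSystem API. The /hadoop/demo/incoming and /hadoop/demo/processed paths are assumptions for the example, not anything Hadoop prescribes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DailyMove {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path incoming  = new Path("/hadoop/demo/incoming");  // hypothetical input folder
        Path processed = new Path("/hadoop/demo/processed"); // hypothetical archive folder
        fs.mkdirs(processed);

        // Only files still sitting in "incoming" are new; everything
        // already handled was moved out by a previous run.
        for (FileStatus status : fs.listStatus(incoming)) {
            Path src = status.getPath();

            // ... run your job / processing on src here ...

            // Move the file out of the input folder only after it was
            // processed, so nothing unprocessed ever disappears.
            fs.rename(src, new Path(processed, src.getName()));
        }
        fs.close();
    }
}
```

With this layout the input folder itself acts as the "to do" list, so the job never has to remember which files it has already seen.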
The usual trade-off that makes sense is to design the logic so that accidental double processing of a file causes no problems. In that case, a crash of the job after processing but before the move will not cause any harm.
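
One way to get that idempotency, continuing the sketch above (same imports), is to derive each result path from the input file name, so a re-run simply overwrites the earlier output. The /hadoop/demo/results layout is again an assumption:

```java
// One result per input file, named after it, so a second run
// replaces the earlier output instead of adding a duplicate.
static void writeIdempotently(FileSystem fs, Path src) throws java.io.IOException {
    Path out = new Path("/hadoop/demo/results", src.getName()); // hypothetical layout
    if (fs.exists(out)) {
        fs.delete(out, true); // stale result from a crashed run: replace it
    }
    // ... write the output for src to 'out', then move src as before ...
}
```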
Instead of daily scheduling you might use a workflow tool like Oozie, which can trigger jobs by data availability, although I personally haven't used it yet.
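
For reference, an Oozie coordinator is written in XML and can hold a job back until a dataset instance appears. A rough sketch, assuming daily date-partitioned input directories and a _SUCCESS marker written when a day's upload finishes (all paths and names here are made up):

```xml
<coordinator-app name="daily-ingest" frequency="${coord:days(1)}"
                 start="2015-01-01T00:00Z" end="2016-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="input" frequency="${coord:days(1)}"
             initial-instance="2015-01-01T00:00Z" timezone="UTC">
      <!-- one directory per day; the job waits until it exists -->
      <uri-template>hdfs:///hadoop/demo/incoming/${YEAR}${MONTH}${DAY}</uri-template>
      <!-- and until this marker file shows up inside it -->
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="today" dataset="input">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///apps/daily-ingest-wf</app-path>
    </workflow>
  </action>
</coordinator-app>
```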

Licensed under: CC-BY-SA with attribution