Just a thought on using a single job: total_count can be calculated from the map phase of the first job. In fact, it is already counted as MAP_OUTPUT_RECORDS, which is the sum of all map output (key, value) pairs. So, if you always emit 1 as the value, this sum is exactly what you want: the total number of words in your document (with repetition).
Now, I am not sure whether this counter is accessible from within the reducers. If it is, you could simply output, for each word, the pair (word, wordCount / MAP_OUTPUT_RECORDS). I think you can read the counter through:
New API:
context.getCounter("org.apache.hadoop.mapred.Task$Counter", "MAP_OUTPUT_RECORDS").getValue();
Old API:
reporter.getCounter("org.apache.hadoop.mapred.Task$Counter", "MAP_OUTPUT_RECORDS").getValue();
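To make the arithmetic concrete, here is a plain-Java sketch of the reduce step described above. It has no Hadoop dependencies, so it only simulates what a reducer would do: the class name, the grouped input, and the totalMapOutputRecords variable are all illustrative. In a real reducer you would obtain the total via the context.getCounter(...) call shown above instead of passing it in.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simulation of reduce(word, [1, 1, ...]) followed by division by the
// total number of map output records (what MAP_OUTPUT_RECORDS reports).
public class RelativeFrequencySketch {

    // Sum the 1s for one word and divide by the overall word total.
    static double relativeFrequency(Iterable<Integer> ones, long totalMapOutputRecords) {
        long sum = 0;
        for (int one : ones) {
            sum += one;  // each map output record contributed a 1
        }
        return (double) sum / totalMapOutputRecords;
    }

    public static void main(String[] args) {
        // Simulated grouped map output: 6 (word, 1) pairs in total.
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        grouped.put("hadoop", List.of(1, 1, 1));
        grouped.put("map", List.of(1, 1));
        grouped.put("reduce", List.of(1));
        long totalMapOutputRecords = 6;  // stand-in for the counter value

        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + "\t"
                    + relativeFrequency(e.getValue(), totalMapOutputRecords));
        }
    }
}
```

With six map output records in total, "hadoop" (3 occurrences) comes out as 0.5, exactly the wordCount / MAP_OUTPUT_RECORDS ratio described above.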