Question

I'm working on a project that analyzes the words in books using Hadoop. I have a program similar to the standard word count example (see here). The program currently counts all of the words in all of the files in a directory and combines it all together. However, I also need to get a word count and individual output for each file processed, or at least something that will give me per-file word counts. How can I do this?


Solution

It seems you are looking for MultipleOutputFormat.

There is already implementation code here (link1), and a complete explanation with example code here (link2). Just map your output file name to the input filename, or whatever you wish. The output files will be written as "/outputfolder/part-nnnnn" for each group, where nnnnn is the partition ID associated with the reduce task (the "part" prefix can be changed). This can't be avoided; you have to rename or merge these files on HDFS afterwards.
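As a rough sketch of the idea, here is a word count that tags each word with the name of the file it came from (taken from the mapper's `FileSplit`) and then routes each file's counts to its own output via `MultipleOutputs` (the newer-API counterpart of `MultipleOutputFormat`). The class and variable names are illustrative, not from the original program, and this assumes a plain `TextInputFormat`-style job where the input split is a `FileSplit`:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class PerFileWordCount {

  // Mapper: emit (filename \t word, 1) so counts are grouped per file.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();
    private String filename;

    @Override
    protected void setup(Context context) {
      // Name of the file this split came from (assumes a FileSplit input).
      filename = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String word : value.toString().split("\\s+")) {
        if (!word.isEmpty()) {
          outKey.set(filename + "\t" + word);
          context.write(outKey, ONE);
        }
      }
    }
  }

  // Reducer: sum the counts and write each file's words to its own output.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
      mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      String[] parts = key.toString().split("\t", 2);
      // baseOutputPath is derived from the filename; sanitized because
      // MultipleOutputs rejects some characters and the reserved name "part".
      String base = parts[0].replaceAll("[^A-Za-z0-9]", "");
      mos.write(new Text(parts[1]), new IntWritable(sum), base);
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      mos.close(); // required, or buffered output may be lost
    }
  }
}
```

With this, each input file produces outputs like `/outputfolder/book1-r-00000` alongside the usual `part-r-nnnnn` files; wrapping the job's output format in `LazyOutputFormat` keeps Hadoop from also creating empty default `part` files.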

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow