If the entire file must go to a single mapper, override isSplitable() in your input format so that it returns false. The mapper then receives the whole file as one input, so you can compute its MD5 and emit that as the key.
WholeFileInputFormat (not part of the Hadoop codebase) can be used here. You can find an implementation online, or take the one from the book Hadoop: The Definitive Guide.
The value can be the file name. Calling getInputSplit() on the Context instance gives you the input split, which can be cast to FileSplit. Then fileSplit.getPath().getName() yields the file name, which can be emitted as the value.
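For the key side, here is a minimal sketch of the MD5 step in plain Java, using only java.security.MessageDigest from the JDK. The class name Md5Key and the method md5Hex are hypothetical names chosen for illustration. In the mapper, the byte array would be the whole-file content delivered by WholeFileInputFormat, and the returned hex string would be emitted as the key.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not part of Hadoop): turns whole-file bytes into an MD5 hex key.
public class Md5Key {

    /** Returns the MD5 digest of the given bytes as a 32-character lowercase hex string. */
    public static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Format each byte as two lowercase hex digits
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to support MD5
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the file content a mapper would receive
        byte[] content = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(md5Hex(content)); // prints 5d41402abc4b2a76b9719d911017c592
    }
}
```

Files with identical content produce the same key, so they end up grouped together in the reducer, with the file names as the values.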
I have not worked with org.apache.hadoop.hdfs.util.MD5FileUtils myself, but its Javadoc suggests it may be a good fit for what you need.
Textbook source links for WholeFileInputFormat and its associated RecordReader are included for reference.
The grepcode link to MD5FileUtils is included as well.