HDFS blocks and MapReduce Splits are two different things. A block is a physical division of the data, while a Split is just a logical division made during an MR job. It is the duty of the InputFormat to create the Splits from a given set of input data, and the number of Splits decides the number of Mappers. When you use setMaxInputSplitSize, you override this behavior and impose a Split size of your own. But giving a very small value to setMaxInputSplitSize is overkill: you'll get a lot of very small Splits and end up with a lot of unnecessary Map tasks.
Actually I don't see any need for you to use FileInputFormat.setMaxInputSplitSize(job, 2);
in your WC program. Also, it looks like you have misunderstood the 2
here. It is not the number of lines in a file. It is the maximum Split size, as a long number of bytes, which you would like to have for your MR job. The file you use as your MR input can have any number of lines.
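To make the arithmetic concrete, here is a minimal sketch of how the Split size and Split count work out. The formula mirrors the one in Hadoop's FileInputFormat (splitSize = max(minSize, min(maxSize, blockSize))), but this standalone class and its method names are just for illustration, and the split count ignores Hadoop's last-split slack factor:

```java
// Standalone sketch, NOT the actual Hadoop class. Mirrors the simplified
// formula used by FileInputFormat: splitSize = max(minSize, min(maxSize, blockSize)).
public class SplitSizeDemo {

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Approximate number of Splits (and hence Map tasks) for a file,
    // using ceiling division; Hadoop's real logic allows a slightly
    // larger final split, which this sketch ignores.
    static long numSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // a typical 128 MB HDFS block

        // With setMaxInputSplitSize(job, 2), maxSize is 2 *bytes*:
        long splitSize = computeSplitSize(blockSize, 1L, 2L);
        System.out.println(splitSize);                  // 2

        // Even a tiny 1 KB input file then explodes into many Map tasks:
        System.out.println(numSplits(1024L, splitSize)); // 512
    }
}
```

So a 2-byte maximum Split size turns a 1 KB file into roughly 512 Splits, which is exactly the flood of unnecessary Map tasks described above.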
Does this sound OK?