HDFS blocks and MapReduce Splits are two different things. A block is a physical division of the data, while a Split is just a logical division made during an MR job. It is the duty of the InputFormat to create the Splits from a given set of input data, and the number of Splits decides the number of Mappers. When you use setMaxInputSplitSize, you override this behavior and impose a Split size of your own. But giving a very small value to setMaxInputSplitSize is overkill: you'll get a lot of very small Splits and end up with a lot of unnecessary Map tasks.
Actually I don't see any need for you to use FileInputFormat.setMaxInputSplitSize(job, 2);
in your WC program. Also, it looks like you have misunderstood the 2
here. It is not the number of lines in a file. It is the maximum Split size, as a long number of bytes, which you would like to have for your MR job. The file you use as your MR input can have any number of lines.
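To make the arithmetic concrete, here is a minimal sketch of how the Split size and Split count work out. The formula mirrors the one in Hadoop's FileInputFormat (splitSize = max(minSize, min(maxSize, blockSize))), but this standalone class and its method names are just for illustration, and the split count ignores Hadoop's last-split slack factor:

```java
// Standalone sketch, NOT the actual Hadoop class. Mirrors the simplified
// formula used by FileInputFormat: splitSize = max(minSize, min(maxSize, blockSize)).
public class SplitSizeDemo {

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Approximate number of Splits (and hence Map tasks) for a file,
    // using ceiling division; Hadoop's real logic allows a slightly
    // larger final split, which this sketch ignores.
    static long numSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // a typical 128 MB HDFS block

        // With setMaxInputSplitSize(job, 2), maxSize is 2 *bytes*:
        long splitSize = computeSplitSize(blockSize, 1L, 2L);
        System.out.println(splitSize);                  // 2

        // Even a tiny 1 KB input file then explodes into many Map tasks:
        System.out.println(numSplits(1024L, splitSize)); // 512
    }
}
```

So a 2-byte maximum Split size turns a 1 KB file into roughly 512 Splits, which is exactly the flood of unnecessary Map tasks described above.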
Does this sound OK?