Question

If yes, how does HDFS split an input file into groups of N lines for each mapper to read?

I believe it's impossible!

When the splitter works on offsets or byte counts, it is possible to split without processing the whole input file.

But when the number of '\n' (newline) characters matters, it seems necessary to process the entire input file before splitting, in order to count the newline characters.


Solution

For NLineInputFormat to work, each split needs to know where every Nth line starts. As you note in your comment on Tariq's answer, the mapper can't just know where the 3rd line (banana) starts; it acquires this information from the map task's InputSplit.

This is actually taken care of in the input format's getSplitsForFile method, which opens each input file up, and discovers the byte offsets where each Nth line starts (and generates an InputSplit to be processed by a Map task).
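Conceptually, the boundary discovery works like this sketch: scan the file's bytes, and every N newlines emit a (start, length) pair that becomes one InputSplit. This is an illustrative simplification, not Hadoop's actual `getSplitsForFile` source; the class and method names here are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class NLineSplitSketch {
    // Sketch of per-N-line split boundary discovery: returns one
    // {startOffset, length} pair for every group of n lines.
    static List<long[]> splitBoundaries(byte[] data, int n) {
        List<long[]> splits = new ArrayList<>();
        long start = 0;
        int linesInSplit = 0;
        for (int pos = 0; pos < data.length; pos++) {
            if (data[pos] == '\n') {
                linesInSplit++;
                if (linesInSplit == n) {
                    // Close off a split covering the last n lines.
                    splits.add(new long[]{start, pos + 1 - start});
                    start = pos + 1;
                    linesInSplit = 0;
                }
            }
        }
        if (start < data.length) { // trailing group with fewer than n lines
            splits.add(new long[]{start, data.length - start});
        }
        return splits;
    }

    public static void main(String[] args) {
        byte[] input = "apple\nbanana\ncherry\ndate\nelderberry\n".getBytes();
        for (long[] s : splitBoundaries(input, 2)) {
            System.out.println(s[0] + "," + s[1]);
        }
        // prints: 0,13  then  13,12  then  25,11
    }
}
```

The key point the scan makes visible: finding these offsets requires reading every byte of every input file up front, which is exactly why the answer below notes it scales poorly.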

As you can imagine, this doesn't scale well for large input files (or for huge sets of input files) as the InputFormat needs to open up and read every single file to discover the split boundaries.

I've never used this input format myself, but I imagine it's probably best used when you have a lot of CPU-intensive work to do for every line of a smallish input file - so rather than one mapper doing all the work for a 100-record file, you can partition the load across many mappers (say, 10 lines each across 10 mappers).

OTHER TIPS

Yes.

It's possible!

Reason :

The mechanism is still the same and works on the raw data. The N in NLineInputFormat refers to the number of lines of input that each mapper receives - the number of records, to be precise. Since NLineInputFormat uses LineRecordReader, each line is one record. It doesn't change the way the data is stored in HDFS, which is still in fixed-size blocks; only the logical input splits are computed differently (remember, NLineInputFormat is a member of the FileInputFormat family).
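For reference, enabling this on the driver side is a small configuration change. This is a sketch of a job setup, assuming the new (`org.apache.hadoop.mapreduce`) API; the input path is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "nline-example");

// Use NLineInputFormat and hand each mapper 10 lines (records).
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 10);
// Equivalently, set "mapreduce.input.lineinputformat.linespermap" directly.

NLineInputFormat.addInputPath(job, new Path("/input")); // placeholder path
```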

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow