For `NLineInputFormat` to work, each split needs to know where its first line (that is, every Nth line of the file) starts. As you note in your comment on Tariq's answer, the mapper can't simply know where the 3rd line (banana) starts; it gets this information from the map task's `InputSplit`.
This is taken care of in the input format's `getSplitsForFile` method, which opens each input file, discovers the byte offset at which every Nth line starts, and generates an `InputSplit` for each block of N lines, to be processed by a map task.
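The scan described above can be sketched in plain Java, without any Hadoop dependency. This is a simplified illustration of the idea, not the actual Hadoop source: the `NLineSplitSketch` class, its `Split` type, and the sample fruit lines other than banana are made up for the example, and details such as how the real implementation adjusts split boundaries for its record reader are ignored.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class NLineSplitSketch {

    /** Hypothetical stand-in for an InputSplit: a byte range within one file. */
    public static final class Split {
        public final long start;
        public final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return "[start=" + start + ", length=" + length + "]";
        }
    }

    /** Reads the whole stream once and emits one byte range per block of N lines. */
    public static List<Split> splitsFor(InputStream in, int linesPerSplit) throws IOException {
        List<Split> splits = new ArrayList<>();
        long pos = 0;          // number of bytes consumed so far
        long splitStart = 0;   // byte offset where the current block of lines began
        int linesInSplit = 0;  // complete lines seen in the current block
        int b;
        while ((b = in.read()) != -1) {
            pos++;
            if (b == '\n') {
                linesInSplit++;
                if (linesInSplit == linesPerSplit) {
                    splits.add(new Split(splitStart, pos - splitStart));
                    splitStart = pos;  // the next block starts right after this newline
                    linesInSplit = 0;
                }
            }
        }
        // Leftover lines (fewer than N, or a final line without a newline) form the last split.
        if (pos > splitStart) {
            splits.add(new Split(splitStart, pos - splitStart));
        }
        return splits;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample input; "banana" is the 3rd line and begins at byte offset 13.
        byte[] data = "apple\norange\nbanana\ncherry\nmango\n".getBytes(StandardCharsets.US_ASCII);
        for (Split s : splitsFor(new ByteArrayInputStream(data), 2)) {
            System.out.println(s);
        }
    }
}
```

With two lines per split, the second split starts at byte 13, which is exactly where the banana line begins. The mapper handling that split never has to count lines itself; it just seeks to the offset it was given.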
As you can imagine, this doesn't scale well for large input files (or for huge sets of input files), as the `InputFormat` needs to open and read every single file to discover the split boundaries.
I've never used this input format myself, but I imagine it's probably best used when you have a lot of CPU-intensive work to do for every line in a smallish input file. Rather than one mapper doing all the work for a 100-record file, you can partition the load across many mappers (say, 10 lines to each of 10 mappers).
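For the 100-record case, a driver along these lines should ask for 10 lines per split, giving roughly 10 map tasks. Treat it as a sketch using the newer `org.apache.hadoop.mapreduce` API: the `NLineDriver` class name and the job name are made up, the input and output paths come from the command line, and the stock identity `Mapper` is only a placeholder for your own CPU-heavy mapper. It needs the Hadoop client libraries on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver class; the name is illustrative only.
public class NLineDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "nline-example");
        job.setJarByClass(NLineDriver.class);

        // Each split (and therefore each map task) receives 10 input lines.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        NLineInputFormat.setNumLinesPerSplit(job, 10);

        // Placeholder: the base Mapper passes records through unchanged.
        job.setMapperClass(Mapper.class);
        job.setNumReduceTasks(0); // map-only job
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```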