Question

If yes, how does HDFS split an input file into groups of N lines for each mapper to read?

I believe it's impossible!

When the splitter works on offsets or byte counts, it is possible to split without processing the whole input file.

But when the number of '\n' (newline) characters matters, it seems necessary to process the entire input file before splitting, in order to count the newline characters.


Solution

For NLineInputFormat to work, each split needs to know where every Nth line starts. As you note in your comment on Tariq's answer, the mapper can't just know where the 3rd line (banana) starts; it acquires this information from the map task's InputSplit.

This is actually taken care of in the input format's getSplitsForFile method, which opens each input file up, and discovers the byte offsets where each Nth line starts (and generates an InputSplit to be processed by a Map task).
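Conceptually, the boundary discovery works like this sketch: scan the file's bytes, and every N newlines emit a (start, length) pair that becomes one InputSplit. This is an illustrative simplification, not Hadoop's actual `getSplitsForFile` source; the class and method names here are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class NLineSplitSketch {
    // Sketch of per-N-line split boundary discovery: returns one
    // {startOffset, length} pair for every group of n lines.
    static List<long[]> splitBoundaries(byte[] data, int n) {
        List<long[]> splits = new ArrayList<>();
        long start = 0;
        int linesInSplit = 0;
        for (int pos = 0; pos < data.length; pos++) {
            if (data[pos] == '\n') {
                linesInSplit++;
                if (linesInSplit == n) {
                    // Close off a split covering the last n lines.
                    splits.add(new long[]{start, pos + 1 - start});
                    start = pos + 1;
                    linesInSplit = 0;
                }
            }
        }
        if (start < data.length) { // trailing group with fewer than n lines
            splits.add(new long[]{start, data.length - start});
        }
        return splits;
    }

    public static void main(String[] args) {
        byte[] input = "apple\nbanana\ncherry\ndate\nelderberry\n".getBytes();
        for (long[] s : splitBoundaries(input, 2)) {
            System.out.println(s[0] + "," + s[1]);
        }
        // prints: 0,13  then  13,12  then  25,11
    }
}
```

The key point the scan makes visible: finding these offsets requires reading every byte of every input file up front, which is exactly why the answer below notes it scales poorly.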

As you can imagine, this doesn't scale well for large input files (or for huge sets of input files) as the InputFormat needs to open up and read every single file to discover the split boundaries.

I've never used this input format myself, but I imagine it's probably best used when you have a lot of CPU-intensive work to do for every line of a smallish input file - so rather than one mapper doing all the work for a 100-record file, you can partition the load across many mappers (say, 10 lines each across 10 mappers).

OTHER TIPS

Yes.

It's possible!

Reason :

The mechanism is still the same and works on the raw data. The N in NLineInputFormat refers to the number of lines of input that each mapper receives - the number of records, to be precise. Since NLineInputFormat uses LineRecordReader, each line is one record. It doesn't change the way the data is stored in HDFS, which is still in fixed-size blocks; only the logical input splits are computed differently (remember, NLineInputFormat is a member of the FileInputFormat family).
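For reference, enabling this on the driver side is a small configuration change. This is a sketch of a job setup, assuming the new (`org.apache.hadoop.mapreduce`) API; the input path is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "nline-example");

// Use NLineInputFormat and hand each mapper 10 lines (records).
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 10);
// Equivalently, set "mapreduce.input.lineinputformat.linespermap" directly.

NLineInputFormat.addInputPath(job, new Path("/input")); // placeholder path
```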

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow