Question

First, I know this runs counter to the whole purpose of Hadoop, parallelism, and MapReduce. That said, I have a very specific use case.

I want to send the contents of an entire sequence file, no matter how big, to a single mapper instance, but I can't figure out how to do this.

I know I could do this in a reducer by using an identity mapper, but I don't want the overhead of the sort/shuffle phase just to get all the data into one reducer.

I also know that I can just read a sequence file locally without mappers or reducers, but that doesn't fit my use case either.


Solution

Increase the HDFS block size for that file so it is somewhat larger than the file itself. A file that fits inside a single block produces a single input split, and therefore goes to a single mapper. The block size has to be set when the file is written into HDFS.
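A minimal sketch of setting a per-file block size at write time. The path names are assumptions for illustration; `dfs.blocksize` must be a multiple of 512 bytes (the checksum chunk size):

```shell
# Write the file into HDFS with a 256 MB block size (larger than the file),
# so it occupies one block and yields one input split / one mapper.
hadoop fs -D dfs.blocksize=268435456 -put part-r-00000.seq /data/single-mapper/
```

On older Hadoop versions the same property is spelled `dfs.block.size`.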

Since, as you indicated, these files are generated by another MapReduce job, an alternative is:

Create your own InputFormat and override the getSplits() method.

getSplits() returns the list of InputSplits for the job. Return a single split covering the whole file instead of letting the framework break it into one split per block.
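A sketch of the InputFormat approach, assuming the new (`org.apache.hadoop.mapreduce`) API and a hypothetical class name. Rather than reimplementing getSplits() wholesale, the usual shortcut is to override isSplitable() to return false: FileInputFormat.getSplits() consults it and will then emit exactly one split per file, which achieves the same effect:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

// A SequenceFileInputFormat that refuses to split files, so every
// sequence file is read in its entirety by exactly one mapper.
// Key/value types are assumptions; substitute your file's actual types.
public class WholeFileSequenceInputFormat
        extends SequenceFileInputFormat<BytesWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // one InputSplit per file, regardless of block count
    }
}
```

Wire it in with `job.setInputFormatClass(WholeFileSequenceInputFormat.class)`. Note the mapper processing a multi-block file this way loses data locality for all but one of the blocks.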

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow