Question

Follow-up to "Pig: Force UDF to occur in Reducer or set number of mappers". I have a UDF that runs as a map step in my Pig workflow. It takes a list of X files, one per reducer that saved it from a prior step. I want X mappers (one per input file) to run this UDF because it is very time-consuming, and Pig isn't running it with as much parallelism as I want. Based on "Hadoop streaming: single file or multi file per map. Don't Split", I figured the solution was to prevent splitting, so I wrote a Pig LoadFunc like this:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat;
import org.apache.pig.builtin.PigStorage;

public class ForceMapperPerInputFile extends PigStorage {
    @Override
    public InputFormat getInputFormat() {
        return new MapperPerFileInputFormat();
    }
}

// Marks every file as unsplittable so no single file is divided across mappers.
class MapperPerFileInputFormat extends PigTextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

When I used this, it had the exact opposite effect of what I wanted: the number of mapper tasks decreased by nearly half.

How can I actually force exactly one mapper per input file?


Solution

SET pig.noSplitCombination true;

(or pass -Dpig.noSplitCombination=true on the command line when running the script)
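
By default, Pig combines small input splits into larger ones (governed by pig.maxCombinedSplitSize), which is likely why the unsplittable input format backfired: each whole file became one small split, and Pig then packed several of them into a single mapper. With split combination disabled, each unsplittable file yields exactly one map task. Here is a minimal sketch of how the pieces fit together in a script; the paths, jar name, and SlowUdf UDF are hypothetical stand-ins, not from the original question:

-- Disable Pig's combining of small splits: one split (and one mapper) per file
SET pig.noSplitCombination true;
REGISTER myudfs.jar;

-- With the custom loader above, each part file from the prior step
-- becomes exactly one map task running the expensive UDF.
inputs  = LOAD '/data/prior_step/part-*' USING ForceMapperPerInputFile()
          AS (line:chararray);
results = FOREACH inputs GENERATE com.example.SlowUdf(line);
STORE results INTO '/data/udf_output';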
