Question
Follow-up to Pig: Force UDF to occur in Reducer or set number of mappers. I have a UDF that runs as a map step in my Pig workflow. It takes a list of X files, one saved by each reducer in a prior step. I want X mappers (one per input file) to run this UDF, because it is very time-consuming and Pig isn't running it with as much parallelism as I want. Based on Hadoop streaming: single file or multi file per map. Don't Split, I figured the solution was to prevent splitting, so I wrote a Pig LoadFunc like:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat;
import org.apache.pig.builtin.PigStorage;

public class ForceMapperPerInputFile extends PigStorage {
    @Override
    public InputFormat getInputFormat() {
        return new MapperPerFileInputFormat();
    }
}

class MapperPerFileInputFormat extends PigTextInputFormat {
    // Mark every file as non-splittable so each file produces a single split.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
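For reference, the loader would be registered and invoked from a Pig script roughly like this (the jar name and input path are placeholders, not from the original post):

```pig
-- Hypothetical jar name and path; adjust to your build and data layout.
REGISTER my-udfs.jar;
raw = LOAD '/data/step1-output/part-*' USING ForceMapperPerInputFile() AS (line:chararray);
```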
When I used this, it had the exact opposite effect of what I wanted: the number of mapper tasks decreased by nearly half.
How can I actually force exactly one mapper per input file?
Solution
SET pig.noSplitCombination true;

(or pass -Dpig.noSplitCombination=true as one of the command-line options when running the script)

By default, Pig combines small input splits into larger ones (up to pig.maxCombinedSplitSize), which is why making the files non-splittable actually reduced the mapper count. With split combination disabled, each split, and therefore each non-splittable file, gets its own mapper.
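Putting the two pieces together, a minimal script sketch (the jar name, paths, and UDF name are placeholders) that aims for exactly one mapper per input file:

```pig
-- Keep one split per file: disable split combining, and use the
-- non-splittable loader from the question so large files aren't split.
SET pig.noSplitCombination true;
REGISTER my-udfs.jar;
raw = LOAD '/data/step1-output/part-*' USING ForceMapperPerInputFile() AS (line:chararray);
-- processLine is a stand-in for the time-consuming UDF from the question.
result = FOREACH raw GENERATE processLine(line);
STORE result INTO '/data/step2-output';
```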