Question

Follow-up to Pig: Force UDF to occur in Reducer or set number of mappers. I have a UDF that runs as a map step in my Pig workflow. It takes a list of X files, one per reducer that saved it from a prior step. I want X mappers (one per input file) to run this UDF, because it is very time consuming and Pig isn't running it with as much parallelism as I want. Based on Hadoop streaming: single file or multi file per map. Don't Split, I figured the solution was to prevent splitting, so I made a Pig LoadFunc like this:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat;
import org.apache.pig.builtin.PigStorage;

// PigStorage subclass that swaps in a non-splittable input format.
public class ForceMapperPerInputFile extends PigStorage {
    @Override
    public InputFormat getInputFormat() {
        return new MapperPerFileInputFormat();
    }
}

// Input format that refuses to split files, so each file is read
// by at most one mapper.
class MapperPerFileInputFormat extends PigTextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
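
For reference, a minimal sketch of how this loader would be used in the script (the jar and input path names here are hypothetical):

-- hypothetical jar and input path, for illustration only
REGISTER myloader.jar;
inputs = LOAD '/path/to/input' USING ForceMapperPerInputFile();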

When I used this, it had the exact opposite effect of what I wanted: the number of mapper tasks decreased by nearly half.

How can I actually force exactly one mapper per input file?


Solution

SET pig.noSplitCombination true;

(or pass -Dpig.noSplitCombination=true as one of the command-line options when running the script)

By default Pig combines small input splits into larger ones to reduce the number of mappers, which is why making the files unsplittable lowered the mapper count instead of raising it. Disabling split combination leaves one mapper per (unsplittable) input file.
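
Putting it together, a minimal sketch of a script combining this setting with the non-splittable loader from the question (jar and path names are hypothetical):

-- disable Pig's combining of small splits
SET pig.noSplitCombination true;

-- with the non-splittable loader from the question, each input file
-- now gets exactly one mapper (jar/path names are illustrative)
REGISTER myloader.jar;
inputs = LOAD '/path/to/input' USING ForceMapperPerInputFile();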

License: CC-BY-SA with attribution