Question

We use Elastic MapReduce quite extensively and are processing more and more data with it. Sometimes our jobs fail because the data is malformed. We've constantly revised our map scripts to handle all sorts of exceptions, but occasionally there is still some malformed data that manages to break our scripts.

  1. Is it possible to configure Elastic MapReduce to "continue on error" even when some of the map or reduce tasks fail?

  2. At the least, is it possible to raise the threshold of failed tasks at which the entire cluster fails? Sometimes we have only 1 failed task out of 500 or so, and we would like to obtain at least those results and have the cluster continue running.

  3. In addition, while we can revise our map script to handle new exceptions, we use the default Hadoop "aggregate" reducer, and when that fails we have no way to catch an exception. Is there any special way to handle errors in the "aggregate" reducer, or do we have to rely on whatever is available to us in question #2 above (raising the number of tolerated task failures)?


Solution

You may catch Exception in both the mapper and the reducer, and inside the catch block increment a counter, like the following:

catch (Exception ex){
    context.getCounter("CUSTOM_COUNTER", ex.getMessage()).increment(1);
    System.err.println(GENERIC_INPUT_ERROR_MESSAGE + key + "," + value); // also log the payload that caused the exception
    ex.printStackTrace();
}

If the exception message is something you would have expected and the counter's value is acceptable, then you can go ahead with the results; otherwise, investigate the logs. I know catching Exception isn't advised, but if you want to "continue on error" it amounts to the same thing. Since the cost of the cluster is at stake, I think we are better off catching Exception rather than only specific exceptions.

There may be side effects, though: your code might run on entirely wrong input that, but for the catch, would have failed much earlier. But the chances of something like this happening are small.
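To put the snippet in context, below is a minimal sketch of a mapper that swallows bad records this way, plus a driver-side helper that dumps the error counters once the job finishes so you can decide whether the failure count is acceptable. The class name, the tab-separated record format, and the value of GENERIC_INPUT_ERROR_MESSAGE are illustrative assumptions, not part of the original code:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.CounterGroup;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class TolerantMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final String GENERIC_INPUT_ERROR_MESSAGE = "Malformed record: ";

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // Normal parsing and emission; assumes tab-separated "word<TAB>count" records.
            String[] fields = value.toString().split("\t");
            context.write(new Text(fields[0]), new LongWritable(Long.parseLong(fields[1])));
        } catch (Exception ex) {
            // Skip the bad record, but leave a trail in the counters and the task logs.
            // Guard against exceptions whose message is null before using it as a counter name.
            String reason = ex.getMessage() == null ? ex.getClass().getSimpleName() : ex.getMessage();
            context.getCounter("CUSTOM_COUNTER", reason).increment(1);
            System.err.println(GENERIC_INPUT_ERROR_MESSAGE + key + "," + value);
            ex.printStackTrace();
        }
    }

    // Call this from the driver after job.waitForCompletion(true) to inspect
    // how many records were skipped and why, before trusting the output.
    public static void dumpErrorCounters(Job job) throws IOException {
        CounterGroup group = job.getCounters().getGroup("CUSTOM_COUNTER");
        for (Counter counter : group) {
            System.out.println(counter.getName() + " = " + counter.getValue());
        }
    }
}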

EDIT:

For your point #2, you may set the maximum number of allowed task failures per tracker using the following:

        conf.setMaxTaskFailuresPerTracker(noFailures);

OR

Alternatively, set the equivalent property, mapred.max.tracker.failures; as you may know, the default is 4. For all other mapred configuration properties, see the Hadoop mapred-default documentation.
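As a rough sketch of where these settings would go, assuming the old mapred JobConf API (the class name and the value of 8 are only examples):

import org.apache.hadoop.mapred.JobConf;

public class FailureTolerantJobConfig {

    public static JobConf configure() {
        JobConf conf = new JobConf();

        // Programmatic form: tolerate up to 8 task failures on a single
        // TaskTracker before it is blacklisted for this job (default is 4).
        conf.setMaxTaskFailuresPerTracker(8);

        // Equivalent property form.
        conf.set("mapred.max.tracker.failures", "8");

        return conf;
    }
}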

OTHER TIPS

If I am reading your question right, you can have your cluster continue past a failed step to the next step defined in the elastic-mapreduce call (the Ruby-based command-line tool for EMR):

--jar s3://elasticmapreduce/libs/script-runner/script-runner.jar --args "s3://bucket/scripts/script.sh" --step-name "do something using bash" --step-action CONTINUE  \
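For completeness, here is a rough sketch of how that step might appear in a full invocation of the legacy Ruby CLI; the --create, --alive, and --name flags, the job flow name, and the bucket paths are assumptions for illustration, not something from your setup:

elastic-mapreduce --create --alive --name "tolerant job flow" \
  --jar s3://elasticmapreduce/libs/script-runner/script-runner.jar \
  --args "s3://bucket/scripts/script.sh" \
  --step-name "do something using bash" \
  --step-action CONTINUE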