Question

We use Elastic MapReduce quite extensively and are processing more and more data with it. Sometimes our jobs fail because the data is malformed. We've constantly revised our map scripts to handle all sorts of exceptions, but occasionally there is still some malformed data that manages to break our scripts.

  1. Is it possible to configure Elastic MapReduce to "continue on error" even when some of the map or reduce tasks fail?

  2. At the least, is it possible to raise the threshold of failed tasks at which the entire cluster fails? Sometimes we have only 1 failed task out of 500 or so, and we would like to obtain at least those results and have the cluster continue running.

  3. In addition, while we can revise our map script to handle new exceptions, we use the default Hadoop "aggregate" reducer, and when that fails we have no way to catch an exception. Is there any special way to handle errors in the "aggregate" reducer, or do we have to rely on whatever is available to us in question #2 above (raising the number of tolerated task failures)?


Solution

You may catch Exception in both the mapper and the reducer, and inside the catch block increment a counter, like the following:

catch (Exception ex){
    context.getCounter("CUSTOM_COUNTER", ex.getMessage()).increment(1);
    System.err.println(GENERIC_INPUT_ERROR_MESSAGE + key + "," + value); // also log the payload that caused the exception
    ex.printStackTrace();
}

If the exception message is something you would have expected and the counter's value is acceptable, then you can go ahead with the results; otherwise, investigate the logs. I know catching Exception isn't advised, but if you want to "continue on error" it amounts to the same thing. Since the cost of the cluster is at stake, I think we are better off catching Exception rather than only specific exceptions.

There may be side effects, though: your code might run on entirely wrong input that, but for the catch, would have failed much earlier. But the chances of something like this happening are small.
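To put the snippet in context, below is a minimal sketch of a mapper that swallows bad records this way, plus a driver-side helper that dumps the error counters once the job finishes so you can decide whether the failure count is acceptable. The class name, the tab-separated record format, and the value of GENERIC_INPUT_ERROR_MESSAGE are illustrative assumptions, not part of the original code:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.CounterGroup;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class TolerantMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final String GENERIC_INPUT_ERROR_MESSAGE = "Malformed record: ";

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // Normal parsing and emission; assumes tab-separated "word<TAB>count" records.
            String[] fields = value.toString().split("\t");
            context.write(new Text(fields[0]), new LongWritable(Long.parseLong(fields[1])));
        } catch (Exception ex) {
            // Skip the bad record, but leave a trail in the counters and the task logs.
            // Guard against exceptions whose message is null before using it as a counter name.
            String reason = ex.getMessage() == null ? ex.getClass().getSimpleName() : ex.getMessage();
            context.getCounter("CUSTOM_COUNTER", reason).increment(1);
            System.err.println(GENERIC_INPUT_ERROR_MESSAGE + key + "," + value);
            ex.printStackTrace();
        }
    }

    // Call this from the driver after job.waitForCompletion(true) to inspect
    // how many records were skipped and why, before trusting the output.
    public static void dumpErrorCounters(Job job) throws IOException {
        CounterGroup group = job.getCounters().getGroup("CUSTOM_COUNTER");
        for (Counter counter : group) {
            System.out.println(counter.getName() + " = " + counter.getValue());
        }
    }
}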

EDIT:

For your point #2, you may set the maximum number of allowed task failures per tracker using the following:

        conf.setMaxTaskFailuresPerTracker(noFailures);

OR

Alternatively, set the equivalent property, mapred.max.tracker.failures; as you may know, the default is 4. For all other mapred configuration properties, see the Hadoop mapred-default documentation.
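As a rough sketch of where these settings would go, assuming the old mapred JobConf API (the class name and the value of 8 are only examples):

import org.apache.hadoop.mapred.JobConf;

public class FailureTolerantJobConfig {

    public static JobConf configure() {
        JobConf conf = new JobConf();

        // Programmatic form: tolerate up to 8 task failures on a single
        // TaskTracker before it is blacklisted for this job (default is 4).
        conf.setMaxTaskFailuresPerTracker(8);

        // Equivalent property form.
        conf.set("mapred.max.tracker.failures", "8");

        return conf;
    }
}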

OTHER TIPS

If I am reading your question right, you can have your cluster continue past a failed step to the next step defined in the elastic-mapreduce call (the Ruby-based command-line tool for EMR):

--jar s3://elasticmapreduce/libs/script-runner/script-runner.jar --args "s3://bucket/scripts/script.sh" --step-name "do something using bash" --step-action CONTINUE  \
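For completeness, here is a rough sketch of how that step might appear in a full invocation of the legacy Ruby CLI; the --create, --alive, and --name flags, the job flow name, and the bucket paths are assumptions for illustration, not something from your setup:

elastic-mapreduce --create --alive --name "tolerant job flow" \
  --jar s3://elasticmapreduce/libs/script-runner/script-runner.jar \
  --args "s3://bucket/scripts/script.sh" \
  --step-name "do something using bash" \
  --step-action CONTINUE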