Question

In an Elastic MapReduce streaming job, what happens if a mapper suddenly dies? Will the data that was already processed be replayed? If so, is there any option to disable that?

I am asking because I am using EMR to insert data into a third-party database. Every mapper sends the incoming data over HTTP. If a mapper crashes, I don't want to replay the HTTP requests; I need to continue from where I left off.

Solution

MapReduce (MR) is a fault-tolerant framework. When a map task fails, the behavior is the same whether the job uses the streaming API or the Java API.

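Once the JobTracker is notified that the task has failed, it tries to reschedule the task. The temporary output generated by the failed attempt is deleted, and the new attempt reads its input split again from the beginning, so records the failed attempt had already handled are replayed. The number of attempts per map task is configurable (mapred.map.max.attempts in classic Hadoop, default 4); setting it to 1 prevents the retry, but the job then fails as soon as that map task fails.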

A more detailed discussion on how failures are handled in MR can be seen here

For your particular case, I think you need to query the external source in your setup() method to find out which records have already been processed, then use this information in your map() method to decide whether a particular record should be processed or not.
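A streaming mapper has no setup() method, so the same idea would live in the script's startup code. The following is only a minimal sketch of that approach in Python, under several assumptions that are not in the original question: each input line starts with a unique record ID followed by a tab, the database can report which IDs it has already received (load_processed_keys is a hypothetical placeholder), and the endpoint URL is invented for illustration.

```python
import sys
import urllib.request
from typing import Callable, Iterable, Set

# Hypothetical endpoint of the third-party database; replace with the real one.
ENDPOINT = "http://example.invalid/ingest"


def record_key(line: str) -> str:
    """Assumption: the first tab-separated field uniquely identifies a record."""
    return line.rstrip("\n").split("\t", 1)[0]


def load_processed_keys() -> Set[str]:
    """Hypothetical placeholder: ask the external source which record IDs it
    already holds. How to do that depends entirely on the database."""
    return set()


def send_record(record: str) -> None:
    """Send one record over HTTP (assumed to be a plain POST of the raw line)."""
    request = urllib.request.Request(
        ENDPOINT, data=record.encode("utf-8"), method="POST"
    )
    with urllib.request.urlopen(request, timeout=10):
        pass


def run_mapper(
    lines: Iterable[str],
    already_processed: Set[str],
    send: Callable[[str], None],
) -> int:
    """Send every record whose key is not in already_processed.

    Returns the number of records actually sent.
    """
    sent = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        key = record_key(line)
        if key in already_processed:
            continue  # handled by an earlier (possibly failed) attempt
        send(line.rstrip("\n"))
        already_processed.add(key)
        sent += 1
    return sent


def main() -> None:
    # Startup work that setup() would do in the Java API.
    processed = load_processed_keys()
    run_mapper(sys.stdin, processed, send_record)

# The real mapper script would call main() here.
```

With this structure a rescheduled attempt still reads the whole split again, but records the database already holds are skipped instead of being sent a second time. A crash between the HTTP request and the database commit can still leave a gap, so the sketch works best when the database side treats duplicate IDs as harmless.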

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow