Question

Is the cleanup() method called for failed map tasks? If so, how does it ensure 'atomicity'?

In my case, I am computing some statistics in the mapper and writing them to a database in the cleanup() method. If a mapper fails partway through its input split, will cleanup() write the data processed so far to the database? That would produce incorrect statistics, because the replacement mapper attempt would write the same data again.

Solution

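Whether cleanup() is called depends on where your mapper fails and on how run() is implemented in your Hadoop version. If run() is a plain setup/map/cleanup sequence, an exception thrown in the map method means cleanup() is never invoked. If run() wraps the map loop in try/finally, as the default Mapper.run() does in newer Hadoop releases, cleanup() still runs after map() throws. If your mapper fails inside the cleanup method itself, cleanup() has already been called. In none of these cases does Hadoop make the work done in cleanup() atomic.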
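
The difference comes down to how the task's run loop is written. Below is a minimal sketch of the two shapes that loop can take. It uses a hypothetical LifecycleSketch class as a stand-in, not Hadoop's own Mapper, so check the run() method of your Hadoop version to see which shape applies to you.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for a mapper lifecycle; this is NOT the Hadoop Mapper class. */
class LifecycleSketch {
    final List<String> events = new ArrayList<>();

    void setup() { events.add("setup"); }

    void map(String record) {
        // Simulate a failure in the middle of the input split.
        if (record.equals("bad")) {
            throw new IllegalStateException("map failed on: " + record);
        }
        events.add("map:" + record);
    }

    void cleanup() { events.add("cleanup"); }

    /** Plain loop: an exception in map() skips cleanup(). */
    void runPlain(List<String> split) {
        setup();
        for (String record : split) {
            map(record);
        }
        cleanup();
    }

    /** try/finally loop: cleanup() runs even when map() throws. */
    void runWithFinally(List<String> split) {
        setup();
        try {
            for (String record : split) {
                map(record);
            }
        } finally {
            cleanup();
        }
    }

    /** Runs one simulated attempt and returns the ordered list of lifecycle events. */
    static List<String> trace(boolean useFinally, List<String> split) {
        LifecycleSketch m = new LifecycleSketch();
        try {
            if (useFinally) {
                m.runWithFinally(split);
            } else {
                m.runPlain(split);
            }
        } catch (IllegalStateException e) {
            m.events.add("task failed");
        }
        return m.events;
    }
}
```

With the try/finally shape, a cleanup() that writes to a database would write the partial statistics of the failed attempt, which is exactly the double-counting problem described in the question.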

If a mapper fails, Hadoop will usually relaunch the task, often on another machine. So you need to make sure that running your mappers or reducers several times always produces the same result; otherwise the job will be hard to debug.
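
One way to get that property for an external write is to make the write an upsert keyed by the task rather than an append. This is only a sketch under assumptions: IdempotentStatsStore, writeStats, and the task ID strings are hypothetical, and an in-memory map stands in for the database table.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a stats table with one row per task (not per attempt). */
class IdempotentStatsStore {
    private final Map<String, Long> rowsByTask = new HashMap<>();

    /** Upsert: a retry of the same task replaces the earlier row instead of adding to it. */
    void writeStats(String taskId, long recordCount) {
        rowsByTask.put(taskId, recordCount);
    }

    /** Sum of the record counts over all task rows. */
    long total() {
        long sum = 0;
        for (long count : rowsByTask.values()) {
            sum += count;
        }
        return sum;
    }
}
```

If a failed attempt of a task writes a partial count and the retry later writes the full count under the same task key, the second write overwrites the first, so the total reflects each task only once.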

For your situation, you can use Counters to collect the statistics and read them after your job succeeds. If a mapper attempt fails, its partial counter values are discarded, so the counters you read after a successful job are guaranteed to be correct.
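
The counter behavior can be pictured with a small simulation. This is a sketch of the semantics only, not Hadoop code: AttemptCounterSketch, runAttempt, and the counter names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Simulates per-attempt counters that reach the job totals only on success. */
class AttemptCounterSketch {
    /** Job-level totals, built only from attempts that succeeded. */
    private final Map<String, Long> jobTotals = new HashMap<>();

    /**
     * Runs one simulated attempt over a split.
     * failAtIndex is the record index at which the attempt fails, or -1 for no failure.
     * Returns true if the attempt succeeded.
     */
    boolean runAttempt(String[] split, int failAtIndex) {
        Map<String, Long> attemptCounters = new HashMap<>();
        for (int i = 0; i < split.length; i++) {
            if (i == failAtIndex) {
                return false; // attempt failed: its partial counters are dropped
            }
            attemptCounters.merge("RECORDS", 1L, Long::sum);
            if (split[i].isEmpty()) {
                attemptCounters.merge("EMPTY_RECORDS", 1L, Long::sum);
            }
        }
        // Only a successful attempt contributes to the job-level totals.
        attemptCounters.forEach((name, value) -> jobTotals.merge(name, value, Long::sum));
        return true;
    }

    long total(String name) {
        return jobTotals.getOrDefault(name, 0L);
    }
}
```

In a real job, the corresponding calls are context.getCounter(group, name).increment(1) inside the mapper and job.getCounters().findCounter(group, name).getValue() in the driver, read once job.waitForCompletion(true) has returned true.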

Licensed under: CC-BY-SA with attribution