How to debug Python MapReduce programs written in mrjob from Eclipse

Question 1

Debugging mrjob jobs can be quite a challenge. I started by wrapping the code inside mappers and reducers in try...except clauses and yielding the exceptions (formatted with the traceback module) into the results instead of letting them break the job flow. That first approach was time-consuming: you often have to wait several minutes for the job to finish, and in the end most errors turned out to be undefined variables or syntax errors. So I then started feeding the jobs small test logs, which significantly reduced the time spent running a job just to see what the problem was. Another approach is to test the mappers and reducers outside of Hadoop. This is very convenient because you can use pdb and track down problems quickly.
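A minimal sketch of the try...except idea, assuming generator-style mappers like the ones mrjob uses. The decorator name yield_errors, the reserved key "__error__", and the "<user> <bytes>" log format are all made up for illustration. The code does not import mrjob, so the mapper can be called directly outside Hadoop:

```python
import functools
import traceback


def yield_errors(step):
    """Wrap a generator-style mapper or reducer so that an exception
    becomes an output record instead of aborting the job."""
    @functools.wraps(step)
    def wrapper(*args, **kwargs):
        try:
            for pair in step(*args, **kwargs):
                yield pair
        except Exception:
            # "__error__" is a hypothetical reserved key; pick one that
            # cannot collide with your real keys.
            yield "__error__", traceback.format_exc()
    return wrapper


@yield_errors
def mapper(_, line):
    # Hypothetical log format: "<user> <bytes>"
    user, size = line.split()
    yield user, int(size)


if __name__ == "__main__":
    # Feed a tiny hand-written sample instead of a full log file.
    for line in ["alice 10", "malformed"]:
        for key, value in mapper(None, line):
            print(key, value)
```

Because the mapper here is an ordinary function, you can also put `import pdb; pdb.set_trace()` inside it and step through a single input line from your IDE or a terminal.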

Finally, check mrjob's documentation, which explains how to run a job locally. This comes in very handy: http://packages.python.org/mrjob/runners-inline.html

Question 2

The key is to have as much test coverage as possible. Even if you run the jobs locally, rerunning them repeatedly can quickly eat up your day. What worked for me was to break the map and reduce steps down into a sequence of smaller functions and write unit tests for each small step.
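As a rough sketch of that approach, the map and reduce steps below are split into small plain functions with one unit test each. The function names, the "<user> <bytes>" log format, and the 1000-byte threshold are invented for the example; in a real job the mapper and reducer methods would simply delegate to functions like these:

```python
import unittest


def parse_line(line):
    """Split a hypothetical '<user> <bytes>' log line into (user, bytes)."""
    user, size = line.split()
    return user, int(size)


def is_large(size, threshold=1000):
    """Decide whether a record is worth emitting (threshold is arbitrary)."""
    return size >= threshold


def map_line(line):
    """Mapper body composed of the small steps above."""
    user, size = parse_line(line)
    if is_large(size):
        yield user, size


def reduce_user(user, sizes):
    """Reducer body: total bytes per user."""
    yield user, sum(sizes)


class StepTests(unittest.TestCase):
    def test_parse_line(self):
        self.assertEqual(parse_line("alice 2048"), ("alice", 2048))

    def test_small_records_are_dropped(self):
        self.assertEqual(list(map_line("bob 10")), [])

    def test_large_records_are_emitted(self):
        self.assertEqual(list(map_line("alice 2048")), [("alice", 2048)])

    def test_reduce_sums_sizes(self):
        self.assertEqual(list(reduce_user("alice", [2048, 4096])),
                         [("alice", 6144)])
```

Saved as a module, these tests run in well under a second with `python -m unittest <module name>`, with no Hadoop or mrjob involved.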

Also, watch out for differences between your local Python version and the one installed on the Hadoop instances (the latest EMR instances use Python 2.6). I have listed a few debugging tips here.
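One cheap way to notice a version mismatch early is to compare the local interpreter against the version you expect on the cluster. This is only a sketch: the constant CLUSTER_PYTHON and the helper matches_cluster are hypothetical names, and the 2.6 value is taken from the remark above, so adjust it to whatever your cluster actually runs:

```python
import sys

# Assumption: the cluster runs Python 2.6; change this to match your cluster.
CLUSTER_PYTHON = (2, 6)


def matches_cluster(version_info=None, expected=CLUSTER_PYTHON):
    """Return True if the (major, minor) version equals the expected one."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(expected)


if __name__ == "__main__":
    if not matches_cluster():
        # Warn rather than fail: local runs are still useful for debugging.
        sys.stderr.write(
            "warning: local Python %d.%d differs from cluster Python %d.%d\n"
            % (sys.version_info[0], sys.version_info[1],
               CLUSTER_PYTHON[0], CLUSTER_PYTHON[1]))
```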