Question

I'm very new to Map/Reduce principles and to the Python mrjob framework. I wrote this sample code, and it works fine, but I would like to know what I can change in it to make it "perfect" / more efficient.

from mrjob.job import MRJob
import operator
import re

# append result from each reducer 
output_words = []

class MRSudo(MRJob):

    def init_mapper(self):
        # move list of tuples across mapper
        self.words = []

    def mapper(self, _, line):
        command = line.split()[-1]
        self.words.append((command, 1))

    def final_mapper(self):
        for word_pair in self.words:
            yield word_pair

    def reducer(self, command, count): 
        # append tuples to the list
        output_words.append((command, sum(count)))

    def final_reducer(self):
        # Sort tuples in the list by occurrence
        map(operator.itemgetter(1), output_words)
        sorted_words = sorted(output_words, key=operator.itemgetter(1), reverse=True)
        for result in sorted_words:
            yield result

    def steps(self):
        return [self.mr(mapper_init=self.init_mapper,
                        mapper=self.mapper,
                        mapper_final=self.final_mapper,
                        reducer=self.reducer,
                        reducer_final=self.final_reducer)]

if __name__ == '__main__':
    MRSudo.run()

Solution

There are two directions you can take.

1. Improve your process

You are doing a distributed word count. This operation is algebraic, but you are not taking advantage of that property.

For every word of your input you are sending a record to the reducers. These bytes have to be partitioned, sent over the network, and then sorted by the reducer. This is neither efficient nor scalable; the amount of data sent from the mappers to the reducers is usually the bottleneck.

You should add a combiner to your job. It will do exactly the same thing as your current reducer. The combiner runs just after the mapper, in the same address space. This means the amount of data you send over the network is no longer linear in the number of words in your input, but is bounded by the number of unique words, which is usually several orders of magnitude lower.
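As a minimal sketch of this idea (the class name and method bodies are illustrative, not taken from your code), the same command count with a combiner and without any module-level state could look like this:

from mrjob.job import MRJob

class MRCommandCount(MRJob):

    def mapper(self, _, line):
        # emit one (command, 1) record per input line
        yield line.split()[-1], 1

    def combiner(self, command, counts):
        # runs in the mapper's address space and pre-aggregates the
        # local counts, so far fewer records cross the network
        yield command, sum(counts)

    def reducer(self, command, counts):
        # final aggregation of the pre-summed partial counts
        yield command, sum(counts)

if __name__ == '__main__':
    MRCommandCount.run()

With the default method names, no steps() override is needed; mrjob wires the mapper, combiner and reducer together itself.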

Since the distributed word count example is so widely used, you will easily find more information by searching for "distributed word count combiner". Every algebraic operation should have a combiner.

2. Use more efficient tools

MRJob is a great tool for quickly writing map reduce jobs. It is usually faster to write a Python job than a Java one. However, it has a runtime cost:

  1. Python is usually slower than Java
  2. MRJob is slower than most of the Python frameworks because it does not, yet, use typedbytes

You have to decide whether it is worth rewriting some of your jobs in Java using the regular API. If you are writing long-lived batch jobs, it could make sense to invest some development time to decrease the runtime cost.

In the long term, writing a Java job usually does not take much longer than writing it in Python, but you have to make some up-front investments: create a project with a build system, package it, deploy it, etc. With MRJob you just have to execute your Python file.
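For example (the file and path names below are placeholders), a local test run and a Hadoop run differ only in the runner option:

python mr_command_count.py input.log
python mr_command_count.py -r hadoop hdfs:///logs/input.log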

Cloudera benchmarked the Hadoop Python frameworks a few months ago. MRJob was much slower (5 to 7 times) than the equivalent Java jobs. MRJob's performance should improve once typedbytes becomes available, but Java jobs will still be 2 to 3 times faster.

Licensed under: CC-BY-SA with attribution