Pai's solution is technically correct, but in practice this will give you a lot of strife, as setting the partitioning can be a big pain (see https://groups.google.com/forum/#!topic/mrjob/aV7bNn0sJ2k).
You can achieve this more easily with mrjob.step, chaining two reduce steps as in this example: https://github.com/Yelp/mrjob/blob/master/mrjob/examples/mr_next_word_stats.py
To do it in the vein you're describing:
from collections import defaultdict
import re

from mrjob.job import MRJob

wordRe = re.compile(r"[\w]+")


class MRComplaintFrequencyCount(MRJob):

    def mapper(self, _, line):
        self.increment_counter('group', 'num_mapper_calls', 1)
        # The issue text is the fourth field (index 3) of the CSV.
        # Note: a naive split breaks on quoted fields; use the csv
        # module if your data contains embedded commas.
        issue = line.split(",")[3]
        for word in wordRe.findall(issue):
            yield word.lower(), 1

    def combiner(self, key, values):
        self.increment_counter('group', 'num_combiner_calls', 1)
        # Re-key every (word, count) pair under None so all of them
        # end up at the same reducer
        yield None, (key, sum(values))

    def reducer(self, key, values):
        self.increment_counter('group', 'num_reducer_calls', 1)
        wordCounts = defaultdict(int)
        total = 0
        for word, count in values:
            total += count
            wordCounts[word] += count
        for k, v in wordCounts.items():  # iteritems() on Python 2
            # word -> (frequency, relative frequency)
            yield k, (v, float(v) / total)


if __name__ == '__main__':
    MRComplaintFrequencyCount.run()
This does a standard word count, aggregating mostly in the combiner, and the combiner re-emits everything under the single key None, so every word indirectly reaches the same reducer. That reducer sees all the per-word counts at once, so it can compute the total word count and the relative frequency of each word. One caveat: Hadoop treats combiners as an optimization and doesn't guarantee they run, so for a production job you'd do this re-keying in a second reduce step via MRStep instead.
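As a sanity check, here's a plain-Python simulation of the same map → combine → reduce flow, with no Hadoop or mrjob needed. The sample CSV lines are made up for illustration; only the column layout (issue text at index 3) matches the job above:

```python
import re
from collections import defaultdict

word_re = re.compile(r"[\w]+")

# Hypothetical CSV lines; the issue text sits at index 3
lines = [
    "1,2020,XYZ,billing dispute",
    "2,2020,ABC,billing error",
]

# Map phase: one (word, 1) pair per word in the issue column
mapped = []
for line in lines:
    issue = line.split(",")[3]
    for word in word_re.findall(issue):
        mapped.append((word.lower(), 1))

# Combine phase: sum counts per word, then re-key under None
combined = defaultdict(int)
for word, one in mapped:
    combined[word] += one
rekeyed = [(None, (word, count)) for word, count in combined.items()]

# Reduce phase: every pair shares the key None, so a single
# reducer sees all counts and can compute relative frequencies
total = sum(count for _, (word, count) in rekeyed)
result = {word: (count, count / total) for _, (word, count) in rekeyed}

print(result)
# → {'billing': (2, 0.5), 'dispute': (1, 0.25), 'error': (1, 0.25)}
```

This is exactly why the None key matters: without it, each word's counts would land on different reducers and no single reducer could see the grand total.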