Question

I'm trying to understand the example for mrjob better

from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()

I run it by

$ python word_count.py my_file.txt

and it works as expected, but I don't get how it automatically knows that it's going to read a text file and split it by each line. I'm also not sure what the _ does.

From what I understand, the mapper() generates the three key/value pairs for each line, correct? What if I want to work with each file in a folder?

And does the reducer() automatically know how to add up each key's values?

What if I want to run unit tests via MapReduce? What would the mapper and reducer look like, and is it even necessary?


Solution

The mapper method receives a key-value pair already parsed out of the input text. mrjob uses Hadoop Streaming: each input file is divided at newline characters, and each line is then split into a key-value pair according to the input protocol in use. The framework takes care of that for you, so you don't have to do any heavy lifting; you can assume you will receive a proper key and value.

However, you do need to specify what kind of input files you have. For example, if the key and/or value are not plain text (as in the original question) but serialized JSON, then you use JSONProtocol/JSONValueProtocol, etc., instead of RawValueProtocol, which is the default.
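For illustration, here is roughly what a JSON value protocol does with each input line. This is a plain-Python sketch of the idea, not mrjob's actual implementation; the helper name is made up:

```python
import json

def read_json_value_line(raw_line):
    # Sketch of JSONValueProtocol's behavior: the key is None,
    # and the whole line is decoded as a single JSON value.
    return None, json.loads(raw_line)

key, value = read_json_value_line('{"user": "alice", "score": 42}')
# key is None; value is the dict {"user": "alice", "score": 42}
```

With RawValueProtocol, by contrast, the value would just be the raw line as a string, with no parsing at all.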

For the initial mapper, each line is read into value (by RawValueProtocol), so there is no key to receive. Using _ is just a Python convention for an unused dummy variable. (However, _ is actually a valid name for a Python variable. You can do something like a = 3; _ = 2; b = a + _. Blasphemy, isn't it?)

mrjob can take multiple input files. You can do for example

$ python wordcount.py text1.txt text2.txt

If you want all text files as input to an mrjob job, you can do things like

$ python wordcount.py inputdir/*.txt

or just simply

$ python wordcount.py inputdir

and all the files selected are used as input.

What a reducer receives is a key and an iterator over all the values associated with that key. So in your example, the variable values in the reducer method is an iterator. If you want to do something over all the values, you need to actually iterate over them. In the specific example in the question, the built-in function sum can take an iterator as an argument, which is why you can do it in one shot; it is effectively the same as sum([value for value in values]).
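To make that concrete, here is the same reduction done three equivalent ways on a plain Python iterator (no mrjob needed; the values are made up):

```python
def reducer_values():
    # Simulate the iterator a reducer might receive for one key
    return iter([3, 5, 2])

# One shot: sum() consumes the iterator directly
total_direct = sum(reducer_values())

# Equivalent: materialize into a list, then sum
total_list = sum([v for v in reducer_values()])

# Equivalent: explicit loop
total_loop = 0
for v in reducer_values():
    total_loop += v
```

One gotcha worth knowing: an iterator can only be consumed once, so if you need two passes over the values (say, a count and then an average), convert it to a list first.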

I actually don't know how you would unit test mrjob scripts. I have usually just tested on a small chunk of test data before production run.
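One lightweight approach (my suggestion, not an mrjob feature per se) is to factor the per-line logic into plain functions and unit-test those directly, without running Hadoop at all. The helper names below are hypothetical; they mirror the word-count job from the question:

```python
def count_line(line):
    # Mirrors MRWordFrequencyCount.mapper for a single line
    return [("chars", len(line)),
            ("words", len(line.split())),
            ("lines", 1)]

def reduce_counts(pairs):
    # Mirrors the reducer: sum the values for each key
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals

# A miniature test over two lines of input
pairs = []
for line in ["hello world", "foo bar baz"]:
    pairs.extend(count_line(line))
totals = reduce_counts(pairs)
assert totals == {"chars": 22, "words": 5, "lines": 2}
```

mrjob also ships a local/inline runner (the -r inline option) that runs a whole job in-process, which is handy for end-to-end checks on a small file before a production run.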

OTHER TIPS

I don't know much about mrjob, so I'm going to make a few assumptions. First, the _ means to disregard the key (verified after a Google search). Second, I would assume it would work on a comma-separated list of files or a directory. Next, this code has no setup, probably because these are default method names; I'm sure that if you named your mapper or reducer something different, mrjob couldn't pick it up automatically.

I found some examples here.

from mrjob.job import MRJob

class MRRatingCounter(MRJob):

    def mapper(self, key, line):
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield rating, 1

    def reducer(self, rating, occurrences):
        yield rating, sum(occurrences)

if __name__ == '__main__':
    MRRatingCounter.run()
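Simulating that mapper by hand on a few tab-separated u.data-style lines shows what it emits (the sample rows below are made up for illustration):

```python
def rating_mapper(line):
    # Each u.data line is: userID \t movieID \t rating \t timestamp
    user_id, movie_id, rating, timestamp = line.split('\t')
    return rating, 1

sample = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
]
pairs = [rating_mapper(line) for line in sample]
# pairs is [("3", 1), ("3", 1), ("1", 1)]; the reducer then sums
# the 1s per rating, giving {"3": 2, "1": 1}
```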

So please correct me based on the above discussion:

Please correct me if I am wrong: in this case, is key taking the input file? Can I read it in my mind like this:

def mapper(self, key, line): self is the object/instance; key is (I think?) the input text file, in this case the name of the file, e.g. ml-100k/u.data; and line is the single line of data that gets passed to mapper() each time from the data file.

Another piece of code from Udemy, where I am trying to learn, as asked in the original question:

from mrjob.job import MRJob

class MRFriendsByAge(MRJob):

    def mapper(self, _, line):
        (ID, name, age, numFriends) = line.split(',')
        yield age, float(numFriends)

    def reducer(self, age, numFriends):
        total = 0
        numElements = 0
        for x in numFriends:
            total += x
            numElements += 1
        yield age, total / numElements

if __name__ == '__main__':
    MRFriendsByAge.run()
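Running that logic by hand on a few made-up CSV rows shows what the reducer computes, a plain-Python simulation of the map, shuffle, and reduce steps rather than an actual mrjob run:

```python
from collections import defaultdict

def friends_mapper(line):
    # Each row: ID,name,age,numFriends
    ID, name, age, num_friends = line.split(',')
    return age, float(num_friends)

rows = ["0,Will,33,385", "1,Jean-Luc,33,2", "2,Hugh,55,221"]

# Group values by key, as Hadoop's shuffle step would
grouped = defaultdict(list)
for line in rows:
    age, n = friends_mapper(line)
    grouped[age].append(n)

# Reducer: average number of friends per age
averages = {age: sum(ns) / len(ns) for age, ns in grouped.items()}
# For these rows: age "33" -> (385 + 2) / 2 = 193.5, age "55" -> 221.0
```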

I found an mrjob book and am trying to see if it makes sense, but I am still struggling.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow