Question

I have a mapper that gives me words like,

apple 10 12012013
apple 20 12022013
apple 30 12042013
apple 40 12032013

where the first value is the word, followed by the number of occurrences of that word on the date given in the third column.

I have written a reducer that picks up the key apple and computes its total count.

So the output looks like this:

apple 100

But I need the output to be:

100 apple 12012013:10 12022013:20 12032013:30 12042013:40 

Any idea how I should modify my mapper?

I am running this MapReduce job on Amazon EMR with Hadoop Streaming.

EDIT: The code below runs, but my output looks like this:

4   apple   20130601
:1  20130602
:1  20130601
:1  20130602
:1  

Any idea?


Solution

This should do it:

>>> with open('filename') as f:
...     dic = {}  # word -> list of (date, count) pairs
...     for line in f:
...         name, quan, dt = line.split()
...         dic.setdefault(name, []).append((dt, quan))
...

>>> for k, v in dic.items():
...     total = sum(int(x[1]) for x in v)
...     print('{} {} {}'.format(total, k, ' '.join('{}:{}'.format(x, y) for x, y in v)))
...
100 apple 12012013:10 12022013:20 12042013:30 12032013:40

If the lines for the same word are always grouped together, then you can also use itertools.groupby:

>>> from itertools import groupby
>>> with open('filename') as f:
...     for k, g in groupby(f, key=lambda x: x.split()[0]):
...         data = [x.split()[1:] for x in g]  # each item is [count, date]
...         total = sum(int(x[0]) for x in data)
...         print('{} {} {}'.format(total, k, ' '.join('{}:{}'.format(y, x) for x, y in data)))
...
100 apple 12012013:10 12022013:20 12042013:30 12032013:40
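
Note that groupby only merges consecutive lines with the same key, so unsorted input would produce several partial groups for one word. A minimal sketch of one way to guard against that is to sort the rows before grouping. The function name summarize is made up for this illustration, and it assumes every non-empty line has exactly three whitespace-separated fields:

```python
from itertools import groupby


def summarize(lines):
    """Return one 'total word date:count ...' string per word (hypothetical helper)."""
    # split() with no argument also discards the trailing newline.
    rows = [line.split() for line in lines if line.strip()]
    # groupby only merges adjacent items, so sort by word (then by date string) first.
    rows.sort(key=lambda r: (r[0], r[2]))
    out = []
    for word, group in groupby(rows, key=lambda r: r[0]):
        pairs = [(date, int(count)) for _, count, date in group]
        total = sum(count for _, count in pairs)
        detail = ' '.join('{}:{}'.format(d, c) for d, c in pairs)
        out.append('{} {} {}'.format(total, word, detail))
    return out
```

The dates are sorted as plain strings here, which is only chronological within a single year for the MMDDYYYY layout used in the question.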

Update:

If the input comes from standard input, as it does in Hadoop Streaming, you can iterate over sys.stdin instead:

import sys
from itertools import groupby
for k, g in groupby(sys.stdin, key=lambda x:x.split()[0]):
    ...
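
For reference, here is one possible way the loop could be written out as a complete streaming reducer. This is a sketch, not necessarily what the asker ran. It assumes the mapper emits three whitespace- or tab-separated fields per line and that Hadoop Streaming delivers the lines sorted by key, and the names reduce_stream and main are invented for the example. The broken output in the question's EDIT looks like what happens when the date field keeps its trailing newline (for example after split('\t') without stripping), so the sketch strips the line ending first:

```python
import sys
from itertools import groupby


def reduce_stream(lines):
    """Yield 'total<TAB>word<TAB>date:count ...' for each run of equal words."""
    # Remove the line ending, then split on any whitespace (tabs or spaces).
    rows = (line.rstrip('\n').split() for line in lines)
    # Skip blank or malformed lines instead of failing on them.
    rows = (r for r in rows if len(r) == 3)
    for word, group in groupby(rows, key=lambda r: r[0]):
        pairs = [(date, int(count)) for _, count, date in group]
        total = sum(count for _, count in pairs)
        detail = ' '.join('{}:{}'.format(d, c) for d, c in pairs)
        yield '{}\t{}\t{}'.format(total, word, detail)


def main():
    # Hadoop Streaming feeds the reducer through standard input.
    for record in reduce_stream(sys.stdin):
        print(record)

# A real reducer script would end with a call to main().
```

Keeping the grouping logic in a generator that accepts any iterable of lines makes it easy to try on a small list locally before submitting the job.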
Licensed under: CC-BY-SA with attribution