Question

I am trying to understand the map-reduce concept by implementing small programs with mincemeat.py, an open-source library for Python.

I have obtained a simple word count for a bag of words using a mapper and a reducer. However, I would like to compute tf-idf scores for all words across documents. As a first step, I want to build a dictionary of the form {[word,docID]->count}. For this I have written the following code:

def mapfn(k, v):
    for line in v.splitlines():
        for word in line.split():
            l = [word.lower(), k]
            yield l, 1

However, when I run the program, I get the following error:

error: uncaptured python exception, closing channel <__main__.Client connected at 0x8a434ac> 
(<type 'exceptions.TypeError'>:unhashable type: 'list'
 [/usr/lib/python2.7/asyncore.py|read|83]
 [/usr/lib/python2.7/asyncore.py|handle_read_event|444] 
 [/usr/lib/python2.7/asynchat.py|handle_read|140] 
 [mincemeat.py|found_terminator|96] 
 [mincemeat.py|process_command|194] 
 [mincemeat.py|call_mapfn|171])

My understanding is that we cannot yield a list from the map function when using mincemeat.py, because the error says a list is not expected while reducing. Am I correct? If so, is there a way to work around this, or do I need to look at a library other than mincemeat?


Solution

I don't know mincemeat, but it's pretty clear that it's trying to use the list as the key of a dictionary or set, which isn't possible: lists are mutable and therefore unhashable. Try yielding a tuple instead. In other words, change [word.lower(), k] to (word.lower(), k).
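
To illustrate the tuple suggestion, here is a minimal sketch that runs without mincemeat. It assumes the library groups mapper output by key in a dictionary, which is what the traceback suggests. The `run_locally` helper and the sample documents are hypothetical stand-ins for that grouping step, not part of mincemeat's API. `reducefn` is just a plain sum, on the assumption that the reducer receives a key and a list of values.

```python
from collections import defaultdict

def mapfn(k, v):
    # k is the document ID, v is the document text.
    for line in v.splitlines():
        for word in line.split():
            # A tuple is hashable, so it can be used as a dictionary key.
            yield (word.lower(), k), 1

def reducefn(k, vs):
    # Sum the per-occurrence counts for one (word, docID) key.
    return sum(vs)

def run_locally(datasource):
    # Hypothetical stand-in for the framework: group mapper output by key,
    # then reduce each group. This is not mincemeat's actual implementation.
    grouped = defaultdict(list)
    for doc_id, text in datasource.items():
        for key, value in mapfn(doc_id, text):
            grouped[key].append(value)
    return {key: reducefn(key, values) for key, values in grouped.items()}

# Made-up sample documents, for illustration only.
docs = {"doc1": "the cat sat on the mat", "doc2": "The dog"}
counts = run_locally(docs)

# Using a list as a key reproduces the TypeError from the question.
try:
    {}[["the", "doc1"]] = 1
except TypeError as exc:
    error_message = str(exc)
```

With the tuple key, `counts` maps each (word, docID) pair to its count, which is the per-document term frequency needed as the first step toward tf-idf.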

Licensed under: CC-BY-SA with attribution