Before this question is flagged as a duplicate, please read through.

This has been asked many times with no clear answer. Let's say my task is to compute the unigram probability of every word across millions of files. I can emit word counts from the mappers, and the reducers can aggregate the counts for each word. However, to compute probabilities we need the total number of words. One way to do this would be to send the number of words from each mapper to every reducer under a special key, and sort the keys so that these counts arrive before the individual word counts. A reducer can then simply add up the counts received from the mappers to obtain the grand total number of words. A sketch of the mapper side of this idea is shown below.
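A minimal sketch of that mapper, under assumptions of my own: the class name and the sentinel key "#TOTAL#" are hypothetical, chosen because '#' sorts before letters under the default Text ordering, so the total would reach a reducer before any real word counts.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: emits (word, 1) per token, tracks a per-mapper
// total, and emits that total under a sentinel key in cleanup().
public class UnigramMapper extends Mapper<Object, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private long localTotal = 0;

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            context.write(new Text(token), ONE);
            localTotal++;
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // The open question: this single write lands on only one reducer,
        // so how do we get the total to all of them?
        context.write(new Text("#TOTAL#"), new LongWritable(localTotal));
    }
}
```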

But how do I make mappers send counts to all reducers?

I can find out the total number of reducers from the job properties; say it is n. Is there a way to call Context.write() n times from each mapper, specifying the partition number from 0 to n-1 in turn, so that the data reaches all the reducers?


Solution

You could use a custom Partitioner for this purpose.

Given the number of reducers n, you can emit your word count n times from your mapper, with the keys 0, 1, ..., n-1. A custom Partitioner class will ensure that reducer i gets only the values with key i.
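A minimal sketch of that answer, under the same assumptions as above (the "#TOTAL#" sentinel and class names are hypothetical): here the reducer index is embedded in a sentinel string key, so the same job can still emit ordinary word keys, and the partitioner routes each tagged record to the reducer named in its key.

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes "#TOTAL#i" records to reducer i; every other key is hashed
// the same way the default HashPartitioner would hash it.
public class BroadcastPartitioner extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text key, LongWritable value, int numPartitions) {
        String k = key.toString();
        if (k.startsWith("#TOTAL#")) {
            return Integer.parseInt(k.substring("#TOTAL#".length()));
        }
        return (k.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

The mapper's cleanup() then broadcasts the local total once per reducer instead of writing it once:

```java
// Emit "#TOTAL#0" ... "#TOTAL#n-1"; the partitioner above sends
// each one to its matching reducer.
int n = context.getNumReduceTasks();
for (int i = 0; i < n; i++) {
    context.write(new Text("#TOTAL#" + i), new LongWritable(localTotal));
}
```

Register the partitioner on the job with job.setPartitionerClass(BroadcastPartitioner.class). Because '#' sorts before letters, each reducer sees its "#TOTAL#i" records (one per mapper, which it sums into the grand total) before any word counts.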
