Question

When writing a MapReduce job (specifically in Hadoop, if relevant), one must define a map() and a reduce() function, both yielding a sequence of key/value pairs. The data types of the key and value are free to be defined by the application.

In the canonical example of word counting, both functions yield pairs of type (string, int), with the key being a word and the value a count of occurrences. Here, as well as in all other examples I have seen, the output key and value types are consistent between the two functions.

Must/should the type of the key/value pair yielded by map() and reduce() be the same within any application of MapReduce? If yes: why?

Solution

No, not necessarily. The types of the pairs output by map and the pairs input to reduce must of course be identical, since one is passed into the other.

It is quite possible, however, for the reduce task to output a different type than the map pairs that come in. For instance, if the map task counted words in a document but the reduce task calculated an average word frequency, map would emit integers while reduce emitted floating-point numbers.
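To make the type change concrete, here is a minimal plain-Java sketch of that scenario (deliberately not using the Hadoop API, so it runs standalone; all names are illustrative): the map phase emits (String, Integer) pairs, while the reduce phase emits (String, Double) pairs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TypeChangeSketch {
    // "map" phase: emit (word, 1) for every word -- key String, value Integer
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : document.toLowerCase().split("\\s+")) {
            out.add(Map.entry(word, 1));
        }
        return out;
    }

    // "reduce" phase: turn the grouped counts into a relative word frequency --
    // the key is still a String, but the value type is now Double, not Integer
    static Map.Entry<String, Double> reduce(String word, List<Integer> counts,
                                            int totalWords) {
        int sum = counts.stream().mapToInt(Integer::intValue).sum();
        return Map.entry(word, (double) sum / totalWords);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapped = map("the cat sat on the mat");

        // shuffle/group by key, as the framework would between the two phases
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> e : mapped) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                   .add(e.getValue());
        }

        int total = mapped.size();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            Map.Entry<String, Double> result =
                reduce(e.getKey(), e.getValue(), total);
            System.out.println(result.getKey() + "\t" + result.getValue());
        }
    }
}
```

The point of the sketch is only that the value type flips from Integer to Double between the two phases; in real Hadoop code the same change would appear as IntWritable versus DoubleWritable.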

OTHER TIPS

Map and reduce output types can be different, but you need to tell the framework that they are. Here is how:

The setOutputKeyClass() and setOutputValueClass() methods control the output types for both the map and the reduce functions, which are often the same.

If they are different, you can set the map output key and value types with setMapOutputKeyClass() and setMapOutputValueClass() respectively.
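For the word-count-to-average-frequency scenario above, a driver configuration fragment using the org.apache.hadoop.mapreduce.Job API might look like this (a sketch, not a complete driver; the job name is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

Job job = Job.getInstance(new Configuration(), "word frequency");

// Types of the map output (which are also the reduce input types):
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);

// Types of the final job output, i.e. what reduce emits:
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);
```

If the setMapOutput* calls are omitted, Hadoop assumes the map output types match the final output types and will fail at runtime with a type mismatch when they do not.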

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow