Question

I am new to Map/Reduce and Hadoop framework. I am running a Hadoop program on single machine (for trying it out). I have n input files and I want some summary of words from those files. I know map function returns key value pair, but how map is called? Once on each file or Once on each line of every file? Can I configure it?

Is it correct to assume, "reduce" is called for each key?

Was it helpful?

Solution 2

In Hadoop MR framework,the job tracker creates a map task for each InputSplit as determined by the InputFormat specified by your job. Each Inputsplit assigned to map task is further processed by RecordReader to generate input key/value pairs for map function. The map function is called for each key/value pair that is generated by RecordReader.

For default InputFormat i.e TextInputFormat the input split will be single HDFS block which will be processed by single map task and RecordReader will process one line at a time within block and generate key/value pair where key is byte offset of starting of line in file and value is contents of line which will be passed to map function.

The number of reducers is dependent on job configuration by user and all key/value pairs with same key are grouped and will be sent to single reducer sorted by key but at same time single reducer can process multiple keys too.

For more details on InputFormat and customizing it, refer this YDN documentation:

http://developer.yahoo.com/hadoop/tutorial/module5.html#inputformat

OTHER TIPS

A map is called for one InputSplit(or split in short) and it is the duty of the InputFormat, you are using in your MR job, to crate these splits. It could be one line, multiple lines, one whole file and so on, based on the logic inside your InputFormat. For example, the default InputFormat, i.e TextInputFormat crates of splits which consist of a single line.

Yes you can configure it by altering the InputFormat you are using.

All the values corresponding to a particular key are clubbed together and multiple keys are partitioned into partitions and an entire partition goes to a reducer for further processing. So, all the values corresponding to a particular key get processed by a single reducer, but a single reducer can get multiple keys.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top