A Mapper
doesn't really know where to send the data, it focuses on 2 things:
- Writes the data to disk. Initially the map output is buffered in memory, and once it hits a certain threshold it gets flushed to disk. But right before going to disk, the data is partitioned by taking a hash of the output key which corresponds to which
Reducer
it will be sent to. - Once a map task is done it will notify the parent task tracker to say it's done, which will then notify the job tracker itself. So the job tracker has the complete mapping between map outputs and task trackers.
From there, when a Reducer
starts, it will keep asking the job tracker for the map outputs corresponding to his partition until it has retrieved them all. Whenever a map output is available, the reduce task will start copying it, and gradually merge as it copies.
If this is still unclear, I will advise looking at the reference book on Hadoop which has a whole chapter describing this part, here is a schema extracted from it that could help you visualize what happens in the shuffle step: