Question

I have tried to understand the anatomy of MapReduce from various books and blogs, but I am not getting a clear idea.

What happens when I submit a job to the cluster using this command:

(The input files are already loaded into HDFS.)

bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

Can anyone explain the sequence of operations that happens, from the client all the way through the cluster?


Solution 2

The process goes like this:

1- The client configures and sets up the job via the Job API and submits it to the JobTracker (a driver sketch follows this list).

2- Once the job has been submitted, the JobTracker assigns a job ID to it.

3- Then the output specification of the job is verified. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.

4- Once this is done, the InputSplits for the job are computed (based on the InputFormat you are using). If the splits cannot be computed, for example because the input paths don't exist, the job is not submitted and an error is thrown to the MapReduce program.

5- One map task is created per InputSplit, so each InputSplit gets processed by exactly one map task.

6- The resources required to run the job are then copied across the cluster: the job JAR file, the configuration file, etc. The job JAR is copied with a high replication factor (which defaults to 10) so that there are plenty of copies across the cluster for the TaskTrackers to access when they run tasks for the job (a configuration sketch follows this list).

7- Then, based on the location of the data blocks to be processed, the JobTracker directs TaskTrackers to run map tasks on the very DataNodes where those blocks reside (data locality). If there are no free map slots on such a node, the task is scheduled on a nearby node with free slots (ideally in the same rack) and the block is streamed to it over the network, so processing continues without having to wait.

8- Once the map phase starts, the individual records (key-value pairs) of each InputSplit are processed by the Mapper one by one until the entire InputSplit has been consumed (a Mapper sketch follows this list).

9- Once the map phase is over, its output undergoes shuffle, sort and (optionally) combine. After this, the reduce phase starts, giving you the final output (a Reducer sketch follows this list).
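To make steps 1 to 4 concrete, here is a minimal sketch of the driver class behind the command above, written against the classic org.apache.hadoop.mapreduce API. The TokenizerMapper and IntSumReducer classes it wires in are assumptions of this sketch (they are spelled out further below), not something fixed by the question:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");     // step 1: configure the job
            job.setJarByClass(WordCount.class);        // locates the JAR shipped in step 6
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // the optional combine in step 9
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Step 4: InputSplits are computed from these paths by the
            // InputFormat (TextInputFormat by default).
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // Step 3: this directory must not already exist, or submission fails.
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Submits the job to the JobTracker and polls until it finishes.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Running the bin/hadoop jar command from the question with this driver triggers exactly the submission sequence in steps 1 to 6.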
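The replication factor mentioned in step 6 comes from the classic mapred.submit.replication property. A small sketch of how it could be tuned from the driver, assuming an MRv1 (JobTracker-era) cluster; the class name here is purely illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class HighReplicationJob {
        public static Job create() throws Exception {
            Configuration conf = new Configuration();
            // Number of HDFS replicas written for the job JAR and config
            // at submit time; 10 is the MRv1 default mentioned in step 6.
            conf.setInt("mapred.submit.replication", 10);
            return new Job(conf, "word count");
        }
    }

The default of 10 is deliberately far above the usual HDFS replication of 3, so that many TaskTrackers can fetch the JAR without overloading a handful of DataNodes.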
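And for steps 8 and 9, a sketch of the Mapper and Reducer referenced above, modeled on the stock WordCount example (class and field names are illustrative):

    // TokenizerMapper.java
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Step 8: map() is invoked once per record (key-value pair) in the
    // task's InputSplit, until the whole split has been consumed.
    public class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(line.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1) for every token
            }
        }
    }

    // IntSumReducer.java
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Step 9: after the shuffle and sort, reduce() sees each key together
    // with all the values emitted for it across every map task.
    public class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result); // final (word, count) pair
        }
    }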

Below is a pictorial representation of the entire process: [image: MapReduce job flow diagram, not shown here]

Also, I would suggest going through this link.

HTH

OTHER TIPS

Read Chapter 6 ("How MapReduce Works") of "Hadoop: The Definitive Guide"; it explains the process clearly. For a quick read, see this and this.

Licensed under: CC-BY-SA with attribution