Hadoop gen1 vs Hadoop gen2

Question 1

In YARN (the new execution framework in Hadoop 2), MapReduce doesn't exist in the way it did before.

YARN is a more general-purpose way to allocate resources on the cluster. The ResourceManager, ApplicationMaster, and NodeManager now make up the new YARN execution framework. The NodeManager is the daemon on every node, so I guess you could say that it replaced the TaskTracker. But now it launches general processes (containers) instead of just map tasks and reduce tasks.

MapReduce is still there, but it is now an "application" of YARN.

Here is an introduction to YARN, which will go into much more depth: http://hortonworks.com/blog/introducing-apache-hadoop-yarn/

Question 2

Yes, the JobTracker was split into the ResourceManager and the ApplicationMaster. ApplicationMasters run on one or more NodeManager instances, depending on the number of jobs submitted (one ApplicationMaster per job). So when a job is submitted, the ResourceManager asks one of the free NodeManagers to host the ApplicationMaster. That ApplicationMaster then plays the role the JobTracker used to play for that job, and the other NodeManagers play the role of TaskTrackers, executing the tasks as YarnChild processes. Correct me if I'm wrong.

Question 3

What I understood after reading the link above:

YARN addresses the shortcomings of classic MR by splitting up the functionality of the JobTracker.

The JobTracker's functions in 1.x, i.e. resource management and job scheduling/monitoring, are divided into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM).

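ResourceManager - runs on a master node (often alongside the NameNode, though not necessarily on the same machine)

it distributes resources among all applications

it has 2 main components: Scheduler and ApplicationsManager.
The Scheduler is a pure scheduler: it performs no monitoring or tracking of application status
The ApplicationsManager is responsible for accepting job submissions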

NodeManager - runs on the worker (slave) nodes, typically alongside the DataNode

it is the per-machine framework agent
it is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.

The central ResourceManager and the per-node NodeManagers together are called YARN

Question 4

The JobTracker's responsibilities have been split up in the Hadoop YARN architecture between the ResourceManager (with its Scheduler and ApplicationsManager components) and the per-application ApplicationMaster.

The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.

The ResourceManager has two main components: Scheduler and ApplicationsManager.

The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints of capacities, queues etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application.
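To make the "pure scheduler" idea concrete, here is a minimal sketch in Python. It is not YARN's actual scheduler code: `Queue`, `ToyScheduler`, and the idea of counting capacity in whole containers are simplifications assumed for illustration. It only hands out containers within each queue's capacity and keeps no record of what the applications do with them.

```python
from dataclasses import dataclass, field


@dataclass
class Queue:
    """Hypothetical queue whose capacity is counted in containers."""
    name: str
    capacity: int   # maximum containers the queue may hold at once
    used: int = 0   # containers currently handed out


@dataclass
class ToyScheduler:
    """Allocates containers per queue; tracks no application status."""
    queues: dict = field(default_factory=dict)

    def add_queue(self, name: str, capacity: int) -> None:
        self.queues[name] = Queue(name, capacity)

    def allocate(self, queue_name: str, requested: int) -> int:
        """Grant as many containers as the queue still has room for."""
        queue = self.queues[queue_name]
        granted = min(requested, queue.capacity - queue.used)
        queue.used += granted
        return granted

    def release(self, queue_name: str, count: int) -> None:
        """Return containers to the queue when an application is done."""
        queue = self.queues[queue_name]
        queue.used = max(0, queue.used - count)


scheduler = ToyScheduler()
scheduler.add_queue("prod", capacity=6)
scheduler.add_queue("dev", capacity=2)

print(scheduler.allocate("prod", 4))  # 4
print(scheduler.allocate("dev", 5))   # 2 (capped by the queue's capacity)
print(scheduler.allocate("prod", 4))  # 2 (only 2 left in "prod")
```

Note that nothing here knows whether a task succeeded or failed; in YARN that tracking is the ApplicationMaster's job.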

The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.

The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.
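The division of labour described above can be sketched as a toy model. This is only an illustration in Python, not the real YARN API: the class and method names (`submit`, `request_containers`, and so on) and the container ID format are assumptions made up for the example.

```python
import itertools


class Scheduler:
    """Pure allocator: counts free containers, tracks no application state."""

    def __init__(self, total_containers):
        self.free = total_containers
        self._ids = itertools.count(1)

    def allocate(self, count):
        granted = min(count, self.free)
        self.free -= granted
        return [f"container_{next(self._ids)}" for _ in range(granted)]


class ApplicationMaster:
    """Per-application: negotiates containers and tracks their status."""

    def __init__(self, app_id, own_container, scheduler):
        self.app_id = app_id
        self.own_container = own_container
        self.scheduler = scheduler
        self.task_containers = {}  # container id -> status

    def request_containers(self, count):
        for container in self.scheduler.allocate(count):
            self.task_containers[container] = "RUNNING"
        return list(self.task_containers)


class ApplicationsManager:
    """Accepts submissions and negotiates the first container for each AM."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.masters = {}  # app id -> ApplicationMaster

    def submit(self, app_id):
        first = self.scheduler.allocate(1)
        if not first:
            raise RuntimeError("no free container for the ApplicationMaster")
        master = ApplicationMaster(app_id, first[0], self.scheduler)
        self.masters[app_id] = master
        return master


scheduler = Scheduler(total_containers=5)
apps = ApplicationsManager(scheduler)

am = apps.submit("app_1")
print(am.own_container)          # container_1
print(am.request_containers(3))  # ['container_2', 'container_3', 'container_4']
print(scheduler.free)            # 1
```

The point of the sketch is that the status dictionary lives in the ApplicationMaster, not in the Scheduler, which mirrors the split described in the text.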

Have a look at documentation link

Have a look at this SE question for more details: What additional benefit does Yarn bring to the existing map reduce?

Question 5

Yes, the JobTracker was split into the ResourceManager and the ApplicationMaster. ApplicationMasters run on one or more NodeManager instances, depending on the number of jobs submitted (one ApplicationMaster per job). So when a job is submitted, the ResourceManager asks one of the free NodeManagers to host the ApplicationMaster. That ApplicationMaster then plays the role the JobTracker used to play for that job, and the other NodeManagers play the role of TaskTrackers, executing the tasks as YarnChild processes. Find details here: http://ercoppa.github.io/HadoopInternals/HadoopArchitectureOverview.html

Question 6

namenode, datanode, resourcemanager, applicationmaster

You missed another daemon in Hadoop 2.x from the above list: the NodeManager. This daemon runs on the individual nodes, like the TaskTracker did. On startup, this component registers with the RM and sends information about the resources available on the node. Subsequent NM-RM communication provides updates on container statuses: new containers running on the node, completed containers, etc.
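The register-then-update pattern between NM and RM can be sketched like this. It is a toy Python model, not the real YARN protocol: the method names, the status strings, and the choice of memory and vcores as the advertised resources are assumptions for illustration.

```python
class ResourceManager:
    """Toy RM: remembers registered nodes and the latest container statuses."""

    def __init__(self):
        self.nodes = {}       # node id -> advertised resources
        self.containers = {}  # container id -> (node id, status)

    def register_node(self, node_id, memory_mb, vcores):
        self.nodes[node_id] = {"memory_mb": memory_mb, "vcores": vcores}

    def heartbeat(self, node_id, container_statuses):
        if node_id not in self.nodes:
            raise KeyError(f"unregistered node: {node_id}")
        for container_id, status in container_statuses.items():
            self.containers[container_id] = (node_id, status)


class NodeManager:
    """Toy NM: registers once on startup, then reports container statuses."""

    def __init__(self, node_id, memory_mb, vcores, rm):
        self.node_id = node_id
        self.rm = rm
        self.statuses = {}  # container id -> status
        rm.register_node(node_id, memory_mb, vcores)  # startup registration

    def start_container(self, container_id):
        self.statuses[container_id] = "RUNNING"

    def complete_container(self, container_id):
        self.statuses[container_id] = "COMPLETE"

    def send_heartbeat(self):
        self.rm.heartbeat(self.node_id, dict(self.statuses))


rm = ResourceManager()
nm = NodeManager("node1", memory_mb=8192, vcores=4, rm=rm)

nm.start_container("c1")
nm.start_container("c2")
nm.send_heartbeat()

nm.complete_container("c1")
nm.send_heartbeat()

print(rm.nodes["node1"])  # {'memory_mb': 8192, 'vcores': 4}
print(rm.containers)      # {'c1': ('node1', 'COMPLETE'), 'c2': ('node1', 'RUNNING')}
```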

So here is what happens. The RM allocates resources to the job. One of the allocated nodes hosts the ApplicationMaster, which communicates with the other nodes. In simple terms, you can think of the ApplicationMaster as the JobTracker and all the other nodes as TaskTracker nodes. The RM is then free to serve other users and more jobs. That is the beauty of MR v2: you can run multiple MR jobs as well as other applications, such as Spark jobs, on the same cluster. The ResourceManager is responsible for managing the cluster and allocating resources (nodes) to jobs, and one of the allocated nodes hosts the ApplicationMaster.

Shahzad

Question 7

Just remember the comparisons below:

JobTracker = ResourceManager's scheduler (FIFO, Fair Scheduler or Capacity Scheduler) + ApplicationMaster (which runs in what is known as container 0)

TaskTracker = NodeManager

Initially, when a job was submitted in Hadoop v1, the JobTracker had the responsibility of calculating the mappers and reducers for the job, monitoring dead/live TaskTrackers, and re-spawning mappers and reducers if they failed.

Now in Hadoop v2, when we submit a job, the ResourceManager Java process (the same Java process acts as the scheduler) first spawns an ApplicationMaster (AM) on any node, in what is also known as container 0. The AM then reads the job code, calculates the resources required by that job, and asks the scheduler for them (the scheduler also tracks how many resources the job's queue has). The scheduler works out the allocation and gives the AM the names of the nodes where it can spawn containers. The AM then spawns containers on those nodes and monitors them. If any container dies, it is the AM that goes back to the scheduler and negotiates for more resources. Hence the work of the JobTracker is divided between the AM and the scheduler of YARN. Also note that each submitted job gets a new AM, so there can be multiple AMs running but only one scheduler on the cluster. The AMs are spawned on NodeManagers, and the scheduler is started on the RM node.
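The "container dies, AM goes back to the scheduler" part can be sketched roughly as follows. This is a toy Python model under stated assumptions: the round-robin node choice, the task names, and the method names (`launch`, `on_container_failed`) are invented for the example and are not how YARN actually picks nodes.

```python
class Scheduler:
    """Toy scheduler: hands out node names round-robin (an assumption)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._next = 0

    def allocate(self, count):
        """Return the names of nodes where the AM may spawn containers."""
        chosen = []
        for _ in range(count):
            chosen.append(self.nodes[self._next % len(self.nodes)])
            self._next += 1
        return chosen


class ApplicationMaster:
    """One per job: spawns containers, monitors them, re-negotiates on failure."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.running = {}  # task name -> node hosting its container

    def launch(self, tasks):
        nodes = self.scheduler.allocate(len(tasks))
        for task, node in zip(tasks, nodes):
            self.running[task] = node

    def on_container_failed(self, task):
        """Go back to the scheduler for a replacement container."""
        (node,) = self.scheduler.allocate(1)
        self.running[task] = node
        return node


scheduler = Scheduler(["node1", "node2", "node3"])
am = ApplicationMaster(scheduler)

am.launch(["map_0", "map_1", "reduce_0"])
print(am.running)  # {'map_0': 'node1', 'map_1': 'node2', 'reduce_0': 'node3'}

print(am.on_container_failed("map_1"))  # node1
```

Several `ApplicationMaster` objects could share the one `Scheduler`, which matches the "multiple AMs, one scheduler" point above.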

Question 8

In Hadoop v2, the YARN framework replaces the older execution framework. YARN has a central ResourceManager component, which manages resources and allocates them to applications. Multiple applications can run on Hadoop via YARN, and all of them share common resource management.

http://saphanatutorial.com/how-yarn-overcomes-mapreduce-limitations-in-hadoop-2-0/

Question 9

   Hadoop 1                                   Hadoop 2
1. It is MapReduce 1                          1. It is YARN MapReduce
2. It has a JobTracker and TaskTrackers       2. It has a ResourceManager and NodeManagers
3. It can send another task tracker           3. It can send resource manager, timeline server,
                                                 which stores application history