Question

I've got a four node YARN cluster set up and running. I recently had to format the namenode due to a minor problem.

Later I ran Hadoop's PI example to verify that every node was still taking part in the calculation, which they all did. However, when I run my own job now, one of the nodes is not being used at all.

I figured this might be because this node doesn't have any data to work on, so I tried to balance the cluster using the balancer. That doesn't help: the balancer reports that the cluster is already balanced.

What am I missing?


Solution

While processing, your ApplicationMaster negotiates containers with the ResourceManager, whose scheduler tries to place each container on a node close to its data; the NodeManager on that node then launches it. Since your replication factor is 3, HDFS places the first replica of each block on a single datanode (the writer's own node, if it runs a datanode) and distributes the remaining replicas across the other datanodes.
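One way to verify where the blocks of your input actually landed is hdfs fsck. The sketch below assumes a made-up input path and a hypothetical helper function that extracts the per-block replica counts from fsck's output:

```shell
# Show every block of a file and the datanodes holding its replicas.
# /user/hadoop/input is an example path; substitute your job's input:
#   hdfs fsck /user/hadoop/input -files -blocks -locations
#
# Hypothetical helper: extract just the replica count of each block
# from that fsck output (block lines contain "repl=N").
replica_counts() {
  grep -o 'repl=[0-9]*' | cut -d= -f2
}
```

If every block's location list names the same few datanodes, the placement, not YARN, is what leaves a node idle.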

1) Change the replication factor to 1 (Since you are only trying to benchmark, reducing replication should not be a big issue).

2) Make sure your client (the machine from which you issue the -copyFromLocal command) does not have a datanode running on it. If it does, HDFS will tend to place most of the data on that node, since writing locally has the lowest latency.

3) Control the file distribution using the dfs.blocksize property: for a small input, a smaller block size splits the data into more blocks and therefore spreads it across more datanodes.

4) Check the status of your datanodes using hdfs dfsadmin -report.
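Steps 1 and 4 can be sketched on the command line as follows; the path is an example, and the report-summarizing helper is hypothetical:

```shell
# Step 1: lower the replication of existing benchmark data to 1.
# /benchmarks/input is an example path; -w waits until done.
#   hdfs dfs -setrep -w 1 /benchmarks/input

# Step 4: condense `hdfs dfsadmin -report` output to one line per
# datanode, showing each node's DFS usage percentage.
summarize_report() {
  awk '/^Name:/ {name = $2} /^DFS Used%:/ {print name, $3}'
}
```

Pipe the report through the helper (hdfs dfsadmin -report | summarize_report); a node sitting at 0% while the others carry data would explain why the scheduler never sends it work.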

OTHER TIPS

  1. Make sure your node is joining the resourcemanager. Look into the nodemanager log on the problem node and see if there are errors. Look into the resourcemanager Web UI (port 8088 by default) and make sure the node is listed there.

  2. Make sure the node is bringing enough resources to the pool to be able to run a job. Check yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb in yarn-site.xml on the node. The memory should be more than the minimum memory requested by a container (see yarn.scheduler.minimum-allocation-mb).
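Both tips can also be checked from the command line. The yarn node subcommands below are standard; the small parsing helper is a made-up convenience:

```shell
# Tip 1: list every node the ResourceManager knows about, in any state
# (RUNNING, LOST, UNHEALTHY, ...):
#   yarn node -list -all
# Tip 2: show the memory and vcores a particular NodeManager offers:
#   yarn node -status <node-id>

# Hypothetical helper: condense `yarn node -list -all` output to
# "node-id state" pairs, skipping the two header lines it prints.
node_states() {
  awk 'NR > 2 {print $1, $2}'
}
```

A node missing from the list, or stuck in LOST or UNHEALTHY, points at tip 1; a node listed as RUNNING but never used points at the resource settings in tip 2.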

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow