Question

I am running three MapReduce jobs in sequence (the output of one job is the input to the next) on a Hadoop cluster with 3 nodes (1 master and 2 slaves).
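For reference, the jobs are chained by pointing each stage's -input at the previous stage's -output. A minimal sketch of such a driver (the script names, HDFS paths, and jar location below are placeholders, not the actual setup) might look like this:

    import subprocess

    # Location of the streaming jar varies by distribution; placeholder path.
    STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"

    # Each stage reads the previous stage's output directory on HDFS.
    jobs = [
        ("input/raw",  "out/stage1", "mapper1.py", "reducer1.py"),
        ("out/stage1", "out/stage2", "mapper2.py", "reducer2.py"),
        ("out/stage2", "out/final",  "mapper3.py", "reducer3.py"),
    ]

    for hdfs_in, hdfs_out, mapper, reducer in jobs:
        subprocess.check_call([
            "hadoop", "jar", STREAMING_JAR,
            "-input", hdfs_in, "-output", hdfs_out,
            "-mapper", mapper, "-reducer", reducer,
            "-file", mapper, "-file", reducer,
        ])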

However, the same jobs take noticeably less total time to finish on a single-node cluster than on the three-node cluster. What could be the possible reasons? Is it network latency? The cluster is connected over 100 Mbps Ethernet. Would increasing the number of nodes help?

I am using Hadoop Streaming, and my code is written in Python 2.7.
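Each stage follows the usual Streaming contract: the mapper reads lines from stdin and emits tab-separated key/value pairs, and the reducer receives those pairs sorted by key. A minimal word-count-style sketch of that contract (illustrative only, not the actual job logic):

    #!/usr/bin/env python2.7
    # mapper.py -- emits one (word, 1) pair per token; Streaming treats
    # everything before the first tab on a line as the key.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print "%s\t%d" % (word, 1)

    #!/usr/bin/env python2.7
    # reducer.py -- Streaming delivers lines grouped and sorted by key,
    # so counts can be accumulated until the key changes.
    import sys

    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                print "%s\t%d" % (current_key, count)
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print "%s\t%d" % (current_key, count)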

Solution

MapReduce isn't really meant for input datasets that small. The framework has to decide which nodes will run tasks and then spin up a separate JVM for each individual map and reduce task (the number of tasks depends on the size of your dataset). That startup typically costs on the order of tens of seconds per job. Shipping non-local data between nodes is also expensive, since it means sending data over the wire.

For such a small dataset, the overhead of setting up a MapReduce job on a distributed cluster is likely higher than the runtime of the job itself. On a single node you only pay the cost of starting tasks on the local machine and never copy data over the network, which is why the job finishes faster there. With multi-gigabyte files, you would see better performance from several machines.
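A rough back-of-envelope makes the imbalance concrete. Every number below is an illustrative assumption, not a measurement, but the orders of magnitude are typical:

    # Back-of-envelope estimate; all figures are assumed for illustration.
    per_task_startup_s = 20.0      # JVM spin-up + scheduling, tens of seconds
    num_tasks = 3 * (2 + 1)        # e.g. 3 chained jobs x (2 map + 1 reduce)
    data_mb = 50.0                 # a "small" dataset
    wire_mb_per_s = 100 / 8.0      # 100 Mbps Ethernet ~ 12.5 MB/s
    process_mb_per_s = 50.0        # assumed per-node processing rate

    overhead_s = num_tasks * per_task_startup_s + data_mb / wire_mb_per_s
    compute_s = data_mb / process_mb_per_s
    print "fixed overhead: ~%.0f s" % overhead_s   # ~184 s
    print "actual compute: ~%.0f s" % compute_s    # ~1 s

With numbers like these, adding nodes mostly adds startup and shuffle cost; more machines only start paying off once per-task compute time dwarfs the fixed overhead.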

Licensed under: CC-BY-SA with attribution