Question

We are seeing a strange issue while trying to push a large amount of data with distcp to another cluster.

Some numbers:

- Data size: 320 GB
- Number of mappers: ~70
- Total nodes in our cluster: 82
- Source cluster: Hadoop 0.20
- Destination cluster: Hadoop 2.0.2

When we kick off this job, all the mappers complete successfully, but the last one takes too long. When it finally completes (or fails), it freezes the JobTracker for close to 15 minutes, after which all the TaskTrackers restart, thereby restarting every job that was running in the cluster at the time.

We have multiple distcp jobs transferring data to S3 as well as to other clusters with the same Hadoop setup, and we have not faced this issue there. The only difference between this job and the others is that this one uses webhdfs. Is webhdfs the bottleneck?

We also tried reducing the file size and checked network bandwidth for saturation and the load on the machines, but still could not get around this issue.

Also, is there any other way to transfer data, instead of using webhdfs, when the two Hadoop versions are not the same?
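For copies between clusters running different Hadoop versions, one common alternative (a sketch based on standard Hadoop practice, not something confirmed in this thread) is to run distcp on the destination cluster and read from the source over the read-only, version-independent hftp protocol. Hostnames, ports, and paths below are placeholders:

```shell
# Run on the DESTINATION (Hadoop 2.x) cluster; hftp is read-only,
# so the copy must be pulled, not pushed.
# 50070 is the source NameNode's HTTP port; 8020 is the destination's RPC port.
hadoop distcp \
  hftp://src-namenode:50070/data/path \
  hdfs://dst-namenode:8020/data/path
```

Like webhdfs, hftp goes over HTTP, so it avoids the RPC wire-format mismatch between 0.20 and 2.x; whether it changes the JobTracker behavior described above would need testing.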

Thanks


Solution

After setting up Ganglia to collect stats and using jconsole, we narrowed the problem down to blocked threads.

The resolution was to increase mapred.job.tracker.handler.count from its current value of 64 to 128.
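As a sketch, that change would go into mapred-site.xml on the JobTracker node (the property name is from this thread; the file layout is the standard Hadoop configuration format), followed by a JobTracker restart:

```xml
<property>
  <name>mapred.job.tracker.handler.count</name>
  <value>128</value>
  <description>Number of RPC handler threads in the JobTracker,
  raised from 64 to absorb bursts of task-completion RPCs.</description>
</property>
```

A common rule of thumb is to scale this with cluster size, since every TaskTracker heartbeat and task-status update competes for the same handler pool.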

Our best guess at the cause: all the distcp mappers tend to complete at roughly the same time, tying up a large number of JobTracker handler threads in the cleanup process and freezing the JobTracker for a good 10-15 minutes.

Licensed under: CC-BY-SA with attribution