Question

I have two Hadoop clusters. My goal is to use hadoop fs -cp to copy all the HDFS files from cluster1 to cluster2.

Cluster1: Hadoop 0.20.2-cdh3u4

Cluster2: Hadoop 2.0.0-cdh4.1.1

However, even a simple fs -ls against cluster1, run remotely from cluster2 as below, fails:

hadoop fs -ls hdfs://cluster1-namenode:8020/hbase

I am getting the exception:

ls: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "cluster2-namenode/10.21.xxx.xxx"; destination host is: "cluster1-namenode":8020;

I think this is due to the Hadoop version difference. Is there any workaround? My old environment (cluster1) doesn't have MapReduce deployed, which rules out distcp and HBase CopyTable when run from that side. HBase replication is not available on cluster1 either. I am struggling to find a way to migrate the HDFS data from cluster1 to cluster2, since everybody talks about upgrading CDH3 to CDH4 in place rather than migrating data from a CDH3 cluster to a CDH4 cluster.

Solution

This was discussed further in a thread on the Cloudera CDH user mailing list:

https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/GeT1RTbRVcw

In summary,

  1. Regular DFS commands such as dfs -cp do not work between CDH3 and CDH4, because the two use different protocol versions that are incompatible with one another over regular RPC calls.

  2. Distcp can be used to copy HDFS data across clusters, even from CDH3 to CDH4, with a few prerequisites: you need to run the distcp command on the CDH4 cluster, and the CDH4 cluster needs to have MapReduce deployed and available. The CDH3 cluster doesn't necessarily need MapReduce.

  3. When running the distcp command, do not use hdfs for the source path. Use hftp for the source path and hdfs for the destination path (hftp is READ-ONLY, and you need write access to the destination path), so the command looks like:

    hadoop distcp hftp://hadoop-namenode.cluster1/hbase hdfs://hadoop-namenode.cluster2/hbase

OTHER TIPS

A few notes from my experience, adding to Shengjie's answer:

  1. It can be run from either CDH4 or CDH3; what matters is that the command is run on the destination cluster.

  2. When running the distcp command, use hftp as the source and hdfs as the destination (since hftp is read-only!), so the command looks like:

hadoop distcp hftp://source.cluster1/path hdfs://destination.cluster1/path
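As a rough sketch of the command shape above, the following helper only prints a distcp command line with explicit ports, so it can be reviewed before being run on the destination cluster. The function name build_distcp_cmd is made up for illustration. The port defaults are assumptions: hftp talks to the NameNode's HTTP port (commonly 50070 in these Hadoop versions), and 8020 is the RPC port that appears in the question. Check both against your own cluster configuration.

```shell
#!/bin/sh
# Hypothetical helper: builds (but does not run) a distcp command line
# for copying between clusters with incompatible RPC versions.
# Host names and ports here are placeholders, not values from the thread.

build_distcp_cmd() {
    src_host=$1                 # source NameNode host (read via hftp)
    dst_host=$2                 # destination NameNode host (written via hdfs)
    path=$3                     # absolute HDFS path to copy, e.g. /hbase
    src_http_port=${4:-50070}   # assumed NameNode HTTP port used by hftp
    dst_rpc_port=${5:-8020}     # assumed NameNode RPC port used by hdfs

    # hftp is read-only, so it can only ever appear on the source side.
    printf 'hadoop distcp hftp://%s:%s%s hdfs://%s:%s%s\n' \
        "$src_host" "$src_http_port" "$path" \
        "$dst_host" "$dst_rpc_port" "$path"
}

# Dry run: print the command instead of executing it.
build_distcp_cmd cluster1-namenode cluster2-namenode /hbase
```

Once the printed line looks right, it can be pasted into a shell on the destination cluster, where MapReduce has to be available for distcp to run.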

Licensed under: CC-BY-SA with attribution