Question

I want to copy data from CDH3 to CDH4 (on a different server). My CDH4 server is set up such that it cannot see the CDH3 cluster, so I have to push the data upstream from CDH3 to CDH4 (which means I cannot run the distcp command on the CDH4 side to pull the data). How can I get my data over to CDH4's HDFS by running a command on the lower-version CDH3 Hadoop, or is this not possible?

Solution 2

When transferring between two different versions of HDFS, you will have to use distcp with the following command (note the hftp:// scheme on the source):

hadoop distcp hftp://Source-namenode:50070/user/ hdfs://destination-namenode:8020/user/

OTHER TIPS

Ideally, you should be able to use distcp to copy the data from one HDFS cluster to another.

hadoop distcp -p -update "hdfs://A:8020/user/foo/bar" "hdfs://B:8020/user/foo/baz"

-p to preserve status, -update to overwrite data if a file is already present but has a different size.

In practice, depending on the exact versions of Cloudera you're using, you may run into incompatibility issues such as CRC mismatch errors. In this case, you can try using HFTP instead of HDFS for the source, or upgrade your cluster to the latest version of CDH4 and check the release notes for any relevant known issues and workarounds.
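
For example, a variant of the earlier command that reads over HFTP instead of HDFS (A and B are placeholder namenode hostnames, as before) avoids speaking the incompatible HDFS RPC protocol to the source cluster:

hadoop distcp -p -update "hftp://A:50070/user/foo/bar" "hdfs://B:8020/user/foo/baz"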

If you still have issues using distcp, feel free to create a new Stack Overflow question with the exact error message, the CDH3 and CDH4 versions, and the exact command.

DistCp over plain hdfs:// URIs only works between clusters running compatible Hadoop versions.

The only way I know of otherwise is hadoop fs -get followed by hadoop fs -put for every subset of the data that can fit on local disk.
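
A minimal sketch of that workaround, assuming a directory /user/foo/bar small enough to fit on local disk (hostnames and paths here are placeholders; in practice the -get may need to run with a CDH3 client and the -put with a CDH4 client, since one client may not speak both RPC versions):

hadoop fs -get hdfs://cdh3-namenode:8020/user/foo/bar /tmp/bar
hadoop fs -put /tmp/bar hdfs://cdh4-namenode:8020/user/foo/bar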

For copying between two different versions of Hadoop, one will usually use HftpFileSystem. This is a read-only FileSystem, so DistCp must be run on the destination cluster (more specifically, on TaskTrackers that can write to the destination cluster). Each source is specified as hftp://<dfs.http.address>/<path> (the default dfs.http.address is <namenode>:50070).
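
Putting that together, a pull-style invocation run from the destination (CDH4) cluster would look like the following; cdh3-namenode and cdh4-namenode are placeholder hostnames:

hadoop distcp hftp://cdh3-namenode:50070/user/foo hdfs://cdh4-namenode:8020/user/foo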

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow