Hadoop distcp between two secured (Kerberos) clusters
21-12-2019
Question
I have two Hadoop clusters, both running the same Hadoop version. I also have a user "testuser" (example) in both clusters, so a testuser keytab is present in both.
Namenode#1 (source cluster): hdfs://nn1:8020
Namenode#2 (dest cluster): hdfs://nn2:8020
I want to copy some files from one cluster to the other using hadoop distcp. Example: in the source cluster I have a file at "/user/testuser/temp/file-r-0000", and in the destination cluster the destination directory is "/user/testuser/dest/". So what I want is to copy file-r-0000 from the source cluster to the destination cluster's "dest" directory.
I have tried these so far:
hadoop distcp hdfs://nn1:8020/user/testuser/temp/file-r-0000 hdfs://nn2:8020/user/testuser/dest
hadoop distcp hftp://nn1:8020/user/testuser/temp/file-r-0000 hdfs://nn2:8020/user/testuser/dest
I believe I do not need to use "hftp://" since both clusters run the same version of Hadoop. I tried both commands from each cluster, but all I get are security-related exceptions.
When running from destination cluster with hftp:
14/02/26 00:04:45 ERROR security.UserGroupInformation: PriviledgedActionException as:testuser@realm cause:java.net.SocketException: Unexpected end of file from server
14/02/26 00:04:45 ERROR security.UserGroupInformation: PriviledgedActionException as:testuser@realm cause:java.net.SocketException: Unexpected end of file from server
14/02/26 00:04:45 INFO fs.FileSystem: Couldn't get a delegation token from nn1ipaddress:8020
When running from source cluster:
14/02/26 00:05:43 ERROR security.UserGroupInformation: PriviledgedActionException as:testuser@realm1 cause:java.io.IOException: Couldn't setup connection for testuser@realm1 to nn/realm2
With failures, global counters are inaccurate; consider running with -i
Copy failed: java.io.IOException: Call to nn1ipaddress failed on local exception: java.io.IOException: Couldn't setup connection for testuser@realm1 to nn/realm2
Caused by: java.io.IOException: Couldn't setup connection for testuser@realm1 to nn/realm2
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:560)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:513)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:616)
at org.apache.hadoop.ipc.Client$Connection.access$2100(Client.java:203)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1254)
at org.apache.hadoop.ipc.Client.call(Client.java:1098)
... 26 more
It also reports that the host address is not present in the Kerberos database (I don't have the exact log for that).
So, do I need to configure Kerberos in a different way in order to use distcp between them? Or am I missing something here?
Any information will be highly appreciated. Thanks in advance.
Solution
Cross-realm authentication is required to use distcp between two secured clusters, and it had not been configured between these two. After setting up cross-realm authentication correctly, the copy worked.
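For anyone landing here, this is a rough sketch of what that setup usually involves, not the exact steps used on these clusters. It assumes MIT Kerberos, a two-way trust, and the placeholder realm names REALM1 and REALM2 (standing in for the realm1/realm2 seen in the logs). The script is a dry run: it only prints the commands, and the helper names cross_realm_principal and print_steps are made up for illustration.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of running them.
# REALM1 / REALM2 are placeholders (assumption) for the two Kerberos realms.
SRC_REALM="REALM1"
DST_REALM="REALM2"

# Cross-realm TGT principal that lets users of realm $1 request
# service tickets from realm $2: krbtgt/<realm 2>@<realm 1>.
cross_realm_principal() {
    printf 'krbtgt/%s@%s' "$2" "$1"
}

# Print the steps for a two-way trust followed by the copy itself.
print_steps() {
    echo "# On BOTH KDCs, using the same password and encryption types:"
    echo "kadmin.local -q \"addprinc $(cross_realm_principal "$SRC_REALM" "$DST_REALM")\""
    echo "kadmin.local -q \"addprinc $(cross_realm_principal "$DST_REALM" "$SRC_REALM")\""
    echo "# On the client, as testuser:"
    echo "kinit testuser@$SRC_REALM"
    echo "hadoop distcp hdfs://nn1:8020/user/testuser/temp/file-r-0000 hdfs://nn2:8020/user/testuser/dest/"
}

print_steps
```

Besides the krbtgt principals, krb5.conf on the participating hosts typically needs both realms listed under [realms] and the cluster hostnames mapped under [domain_realm]; a missing mapping is one common cause of "not present in Kerberos database" style errors, though I can't confirm that was the cause here. The hadoop.security.auth_to_local rules may also need an entry so that testuser from the other realm maps to the local testuser.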