Question

Is it possible to store the output of the hadoop dfs -getmerge command to another machine?

The reason is that there is not enough space on my local machine. The job output is 100GB and my local storage is 60GB.

Another possible reason is that I want to process the output with a program running on another machine, and I don't want to transfer it twice (HDFS -> local FS -> remote machine). I just want HDFS -> remote machine.

I am looking for something similar to how scp works, like:

hadoop dfs -getmerge /user/hduser/Job-output user@someIP:/home/user/

Alternatively, I would also like to get the HDFS data from a remote host to my local machine.

Could Unix pipes be used on this occasion?

For those who are not familiar with Hadoop, I am just looking for a way to replace the local destination directory parameter of this command with a directory on a remote machine.


Solution

This will do exactly what you need:

hadoop fs -cat /user/hduser/Job-output/* | ssh user@remotehost.com "cat >mergedOutput.txt"

hadoop fs -cat will read all the files in the directory in sequence and write them to stdout.

ssh will pipe that stream into a file on the remote machine (note that scp will not accept stdin as input).
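For the opposite direction mentioned in the question (pulling HDFS data from a remote cluster to the local machine), the same pipe idea can be run the other way around. This is a sketch that assumes the remote host has a Hadoop client installed and configured for that cluster; the paths and output file name are just placeholders:

ssh user@remotehost.com "hadoop fs -cat /user/hduser/Job-output/*" > mergedOutput.txt

If bandwidth is a concern, the stream can also be compressed on the fly by adding gzip on the sending side and gunzip on the receiving side of the pipe, for example:

hadoop fs -cat /user/hduser/Job-output/* | gzip | ssh user@remotehost.com "gunzip > mergedOutput.txt"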
