The move itself is efficient, since it happens only at the metadata (i.e., inode) level, not at the data level. In other words, issuing a move (which Hadoop's code internally calls a rename, not a move) is much faster than copying the data. You can take a look at the source code if you are interested in the details.
For this reason, you should not use distcp, since that would actually copy the data. If you want to parallelize the renames (since you are talking about millions of files), it should not be too hard with Hadoop Streaming:
- Write several files containing the list of files to rename (source and destination, one pair per line).
- Write a shell script that issues a rename (the `hdfs dfs -mv` command) for each line it reads on stdin.
- Use streaming: the files with the file lists are the input, and your shell script is the mapper.
Is there anything out there?
I am not aware of one, but there may be.
Do I really need to do this as mapred?
If you have millions of files, the latency of contacting the namenode will add up, even though the HDFS rename itself is efficient. But if this is a one-time thing, I would rather take the single-threaded approach and wait, since writing and debugging even simple code takes a while too. If you plan on doing this frequently (why?), then I would consider implementing the approach described above.
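The single-threaded approach is just a local loop over the same kind of list, with no streaming job involved. A sketch, assuming the same tab-separated format; `renames.txt` is a hypothetical file name, and the `echo` substitution is only there to preview the commands:

```shell
# One-off, single-threaded variant: loop over a local list of
# tab-separated "src dst" pairs and rename one pair at a time.
# (renames.txt is a hypothetical list; here we create a tiny one.)
printf '/data/old/a\t/data/new/a\n/data/old/b\t/data/new/b\n' > renames.txt

# HDFS_CMD=echo gives a dry run; use "hdfs dfs" on a real cluster.
HDFS_CMD=echo
while IFS=$'\t' read -r src dst; do
  $HDFS_CMD -mv "$src" "$dst"
done < renames.txt
```

This issues one namenode round-trip per file, which is exactly the latency that adds up at millions of files, but it is trivial to write and to verify.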