Question

I need to move (not copy) a large number of files from one HDFS directory to another within the same cluster.

I could use distcp, but that seems like overkill since it copies (duplicates) the data when I just want to move it. Two questions:

A) Is there anything out there?

I would like to use MapReduce to do this, since there are millions of files that need to be moved (renamed to a new path). I also want to integrate it with Oozie. I could write a MapReduce job myself, but I was wondering if there is something out there that already does the job.

B) Do I really need to do this as mapred?

Unfortunately, I don't know enough about the performance characteristics of HDFS rename; do you think I could get away with a single-threaded approach to renaming the files?


Solution

The move itself is efficient, since it happens only at the metadata (i.e., inode) level, not at the data level. In other words, issuing a move (which Hadoop's code internally calls a rename, not a move) is much faster than copying the data. You can take a look at the source code if you are interested in the details.

For this reason, you should not use distcp, since that would be an actual copy of the data. If you want to parallelize it (since you are talking about millions of files), it should not be too hard using Hadoop Streaming:

  1. Write several files containing the list of files to rename (src + destination), one per line.
  2. Write a shell script that issues a rename (the hdfs dfs -mv command) for each line it reads on stdin (see the sketch after this list).
  3. Use streaming: the files containing the rename list are the input, and your script is the mapper.
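
A minimal sketch of the mapper from step 2, written in Python rather than shell so it matches the rest of the code on this page. It assumes each input line is a source and a destination separated by a tab; the script name is a placeholder:

#!/usr/bin/env python
# rename_mapper.py (placeholder name): read "source<TAB>destination" lines
# from stdin and issue one HDFS rename per line.
import subprocess
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    src, dst = line.split('\t')
    result = subprocess.run(['hdfs', 'dfs', '-mv', src, dst],
                            capture_output=True, text=True)
    if result.returncode != 0:
        # Surface failures in the task logs instead of silently dropping them.
        print(f'FAILED\t{src}\t{dst}\t{result.stderr.strip()}', file=sys.stderr)

You would then run it as a map-only streaming job, roughly: hadoop jar <path-to-hadoop-streaming.jar> -D mapreduce.job.reduces=0 -files rename_mapper.py -input <dir with the list files> -output <some output dir> -mapper rename_mapper.py (the jar location varies by distribution).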

Is there anything out there?

I am not aware of any, but there may be.

Do I really need to do this as mapred?

If you have millions of files, the latency of contacting the NameNode will add up, even if the HDFS rename itself is efficient. BUT, if it is a one-time thing, I would rather use a single-threaded approach and wait, since writing and debugging even simple code takes a while too. If you plan on doing this frequently (why?), then I would consider implementing the approach described above.
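
If you do go single-threaded, one thing that helps is batching: hdfs dfs -mv accepts several sources when the destination is a directory, so you can move the files in chunks and pay the client JVM startup cost once per chunk instead of once per file. A rough sketch, where the list file, the destination and the chunk size are placeholders:

import subprocess

dest_dir = '/path/to/the/destination/'
batch_size = 500  # arbitrary; tune it

# One absolute HDFS path per line in files_to_move.txt (placeholder name).
with open('files_to_move.txt') as fh:
    sources = [line.strip() for line in fh if line.strip()]

for i in range(0, len(sources), batch_size):
    batch = sources[i:i + batch_size]
    # "hdfs dfs -mv src1 src2 ... destdir" moves every source into destdir.
    subprocess.run(['hdfs', 'dfs', '-mv', *batch, dest_dir], check=True)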

OTHER TIPS

I came up with this if you want to move a subset of files from one folder to another in HDFS:

import pandas as pd
from multiprocessing import Process
from subprocess import Popen, PIPE

hdfs_path_1 = '/path/to/the/origin/'
hdfs_path_2 = '/path/to/the/destination/'

# One file name (relative to hdfs_path_1) per row of the CSV.
df = pd.read_csv("list_of_files.csv")
to_do_list = list(df.tar)  # or any other list that you have
print(f'To go: {len(to_do_list)}')

def copyy(f):
    # Move a single file; print any output or error so failures are visible.
    process = Popen(f'hdfs dfs -mv {hdfs_path_1}{f} {hdfs_path_2}', shell=True, stdout=PIPE, stderr=PIPE)
    std_out, std_err = process.communicate()
    if std_out != b'':
        print(std_out)
    if std_err != b'':
        print(std_err)

# One process per file; see the note below about capping the concurrency.
ps = []
for f in to_do_list:
    p = Process(target=copyy, args=(f,))
    p.start()
    ps.append(p)
for p in ps:
    p.join()
print('done')
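
Note that the snippet above starts one OS process (and one hdfs client JVM) per file, which will not scale to millions of entries. A small variant that reuses copyy and to_do_list from above but caps the concurrency with a multiprocessing.Pool (the pool size is an arbitrary assumption; tune it for your cluster):

from multiprocessing import Pool

if __name__ == '__main__':
    # Run at most 16 renames at a time instead of one process per file.
    with Pool(processes=16) as pool:
        pool.map(copyy, to_do_list)
    print('done')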

Also, if you want to get a list of all files in a directory, use this:

from subprocess import Popen, PIPE

hdfs_path = '/path/to/the/designated/folder'
process = Popen(f'hdfs dfs -ls -h {hdfs_path}', shell=True, stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()

# Drop the "Found N items" header line, then take the last (path) column.
lines = std_out.decode().splitlines()[1:]
list_of_file_names = [fn.split(' ')[-1].split('/')[-1] for fn in lines]
list_of_file_names_with_full_address = [fn.split(' ')[-1] for fn in lines]
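
If you combine the two snippets, list_of_file_names from this one can be used as the to_do_list in the move script above instead of reading it from a CSV.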