Question

I need to move (not copy) a large number of files from one HDFS directory to another within the same cluster.

I could use distcp, but that seems like overkill since it copies (duplicates) the data when I just want to move it. Two questions:

A) Is there anything out there?

I would like to use MapReduce to do this, since there are millions of files that need to be moved (renamed to a new path). I also want to integrate it with Oozie. I could write a MapReduce job myself, but I was wondering if there is something out there that already does the job.

B) Do I really need to do this as mapred?

Unfortunately, I don't know enough about the performance characteristics of HDFS rename; do you think I could get away with a single-threaded approach to renaming the files?


Solution

The move itself is efficient, since it happens only at the metadata (i.e., inode) level, not at the data level. In other words, issuing a move (which Hadoop's code internally calls a rename, not a move) is much faster than copying the data. You can take a look at the source code if you are interested in the details.

For this reason, you should not use distcp, since that would be an actual copy of the data. If you want to parallelize it (since you are talking about millions of files), it should not be too hard using Hadoop Streaming:

  1. Write several files containing the list of files to rename (src + destination), one per line.
  2. Write a shell script that issues a rename (the hdfs dfs -mv command) for each line it reads on stdin (see the sketch after this list).
  3. Use streaming: the files containing the rename list are the input, and your script is the mapper.
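
A minimal sketch of the mapper from step 2, written in Python rather than shell so it matches the rest of the code on this page. It assumes each input line is a source and a destination separated by a tab; the script name is a placeholder:

#!/usr/bin/env python
# rename_mapper.py (placeholder name): read "source<TAB>destination" lines
# from stdin and issue one HDFS rename per line.
import subprocess
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    src, dst = line.split('\t')
    result = subprocess.run(['hdfs', 'dfs', '-mv', src, dst],
                            capture_output=True, text=True)
    if result.returncode != 0:
        # Surface failures in the task logs instead of silently dropping them.
        print(f'FAILED\t{src}\t{dst}\t{result.stderr.strip()}', file=sys.stderr)

You would then run it as a map-only streaming job, roughly: hadoop jar <path-to-hadoop-streaming.jar> -D mapreduce.job.reduces=0 -files rename_mapper.py -input <dir with the list files> -output <some output dir> -mapper rename_mapper.py (the jar location varies by distribution).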

Is there anything out there?

I am not aware of any, but there may be.

Do I really need to do this as mapred?

If you have millions of files, the latency of contacting the NameNode will add up, even if the HDFS rename itself is efficient. BUT, if it is a one-time thing, I would rather use a single-threaded approach and wait, since writing and debugging even simple code takes a while too. If you plan on doing this frequently (why?), then I would consider implementing the approach described above.
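
If you do go single-threaded, one thing that helps is batching: hdfs dfs -mv accepts several sources when the destination is a directory, so you can move the files in chunks and pay the client JVM startup cost once per chunk instead of once per file. A rough sketch, where the list file, the destination and the chunk size are placeholders:

import subprocess

dest_dir = '/path/to/the/destination/'
batch_size = 500  # arbitrary; tune it

# One absolute HDFS path per line in files_to_move.txt (placeholder name).
with open('files_to_move.txt') as fh:
    sources = [line.strip() for line in fh if line.strip()]

for i in range(0, len(sources), batch_size):
    batch = sources[i:i + batch_size]
    # "hdfs dfs -mv src1 src2 ... destdir" moves every source into destdir.
    subprocess.run(['hdfs', 'dfs', '-mv', *batch, dest_dir], check=True)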

OTHER TIPS

I came up with this if you want to move a subset of files from one folder to another in HDFS:

import pandas as pd
from multiprocessing import Process
from subprocess import Popen, PIPE

hdfs_path_1 = '/path/to/the/origin/'
hdfs_path_2 = '/path/to/the/destination/'

# One file name (relative to hdfs_path_1) per row of the CSV.
df = pd.read_csv("list_of_files.csv")
to_do_list = list(df.tar)  # or any other list that you have
print(f'To go: {len(to_do_list)}')

def copyy(f):
    # Move a single file; print any output or error so failures are visible.
    process = Popen(f'hdfs dfs -mv {hdfs_path_1}{f} {hdfs_path_2}', shell=True, stdout=PIPE, stderr=PIPE)
    std_out, std_err = process.communicate()
    if std_out != b'':
        print(std_out)
    if std_err != b'':
        print(std_err)

# One process per file; see the note below about capping the concurrency.
ps = []
for f in to_do_list:
    p = Process(target=copyy, args=(f,))
    p.start()
    ps.append(p)
for p in ps:
    p.join()
print('done')
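
Note that the snippet above starts one OS process (and one hdfs client JVM) per file, which will not scale to millions of entries. A small variant that reuses copyy and to_do_list from above but caps the concurrency with a multiprocessing.Pool (the pool size is an arbitrary assumption; tune it for your cluster):

from multiprocessing import Pool

if __name__ == '__main__':
    # Run at most 16 renames at a time instead of one process per file.
    with Pool(processes=16) as pool:
        pool.map(copyy, to_do_list)
    print('done')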

Also, if you want to get a list of all files in a directory, use this:

from subprocess import Popen, PIPE

hdfs_path = '/path/to/the/designated/folder'
process = Popen(f'hdfs dfs -ls -h {hdfs_path}', shell=True, stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()

# Drop the "Found N items" header line, then take the last (path) column.
lines = std_out.decode().splitlines()[1:]
list_of_file_names = [fn.split(' ')[-1].split('/')[-1] for fn in lines]
list_of_file_names_with_full_address = [fn.split(' ')[-1] for fn in lines]
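
If you combine the two snippets, list_of_file_names from this one can be used as the to_do_list in the move script above instead of reading it from a CSV.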