Question

I am trying to run a graph verifier application on a distributed system using Hadoop. I have the input in the following format:

Directory1

---file1.dot

---file2.dot

…..

---filen.dot

Directory2

---file1.dot

---file2.dot

…..

---filen.dot

Directory670

---file1.dot

---file2.dot

…..

---filen.dot

The .dot files store the graphs.

Is it enough for me to add the input directories' paths using FileInputFormat.addInputPath()?

I want Hadoop to process the contents of each directory on the same node, because the files in each directory contain data that depends on the presence of the other files in that directory.

Will the Hadoop framework take care of distributing the directories evenly across the nodes of the cluster (e.g. directory 1 to node 1, directory 2 to node 2, and so on) and processing them in parallel?

The files in each directory depend on each other for data (to be precise:

  • each directory contains a file, main.dot, which holds an acyclic graph whose vertices are the names of the rest of the files in that directory;

  • my verifier traverses each vertex of the graph in main.dot, searches the same directory for the file with the same name and, if it is found, processes the data in that file;

  • all the files are processed in the same way, and the combined output obtained after processing every file in the directory is displayed;

  • the same procedure applies to the rest of the directories; a rough sketch of this per-directory pass follows this list.)
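
For illustration, a rough non-Hadoop sketch of this per-directory pass. The class name, the assumption that main.dot lists edges as "A -> B;", and the convention that every vertex maps to a sibling file "<vertex>.dot" are placeholders, not something fixed above:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.LinkedHashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class DirectoryVerifierSketch {

        // Matches edges written as "A -> B" inside main.dot (assumed syntax).
        private static final Pattern EDGE = Pattern.compile("(\\w+)\\s*->\\s*(\\w+)");

        public static void verifyDirectory(Path directory) throws IOException {
            // Collect the vertex names listed in main.dot.
            Set<String> vertices = new LinkedHashSet<>();
            for (String line : Files.readAllLines(directory.resolve("main.dot"))) {
                Matcher m = EDGE.matcher(line);
                while (m.find()) {
                    vertices.add(m.group(1));
                    vertices.add(m.group(2));
                }
            }
            // For each vertex, look for the file of the same name and process it.
            for (String vertex : vertices) {
                Path vertexFile = directory.resolve(vertex + ".dot");
                if (Files.exists(vertexFile)) {
                    processFile(vertexFile); // placeholder for the real verification step
                }
            }
        }

        private static void processFile(Path file) {
            // The actual per-file verification logic would go here.
        }

        public static void main(String[] args) throws IOException {
            verifyDirectory(Paths.get(args[0]));
        }
    }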

To cut a long story short: as in the famous word count application (where the input is a single book), Hadoop splits the input and distributes the tasks to the nodes in the cluster, where each mapper processes its lines and counts the relevant words. How can I split the task here (do I need to split it at all, by the way)?

How can I leverage Hadoop's power for this scenario? Some sample code template would certainly help :)

Was it helpful?

Solution 2

You can create a file with a list of all the directories to process:

/path/to/directory1
/path/to/directory2
/path/to/directory3

Each mapper would process one directory, for example:

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input value is one line of the listing file, i.e. one directory path.
        FileSystem fs = FileSystem.get(context.getConfiguration());
        for (FileStatus status : fs.listStatus(new Path(value.toString()))) {
            // process the file at status.getPath()
        }
    }
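
For this to give one mapper per directory, the driver has to hand each map task exactly one line of the listing file. Below is a minimal driver sketch, assuming the listing file is passed as the first argument and the mapper above is named DirectoryMapper (a placeholder name); NLineInputFormat with one line per split produces the one-directory-per-mapper behaviour:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class GraphVerifierDriver {

        // The mapper shown above; "DirectoryMapper" is a placeholder name.
        public static class DirectoryMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                FileSystem fs = FileSystem.get(context.getConfiguration());
                for (FileStatus status : fs.listStatus(new Path(value.toString()))) {
                    // process the file at status.getPath() and write results via context
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "graph verifier");
            job.setJarByClass(GraphVerifierDriver.class);
            job.setMapperClass(DirectoryMapper.class);
            job.setNumReduceTasks(0);                 // map-only job
            job.setOutputKeyClass(Text.class);        // adjust to whatever the mapper emits
            job.setOutputValueClass(Text.class);

            // One line of the listing file per input split, so each map task
            // is handed exactly one directory path.
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, 1);
            NLineInputFormat.addInputPath(job, new Path(args[0]));   // the listing file
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }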

Other suggestions

The solution given by Alexey Shestakov will work, but it does not leverage MapReduce's distributed processing: most likely only one map task will read the listing file (the file containing the paths of all input directories) and then process all of the input data.

How can we allocate all the files in a directory to a single mapper, so that the number of mappers equals the number of directories? One solution could be the org.apache.hadoop.mapred.lib.MultipleInputs class: use MultipleInputs.addInputPath() to add each directory path along with a map class for that path. Each mapper then gets one directory and processes all the files within it.
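
A minimal sketch of that registration step, written against the newer org.apache.hadoop.mapreduce API rather than the old mapred one; the directory paths and the per-directory mapper classes (Directory1Mapper, Directory2Mapper) are placeholders:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class MultipleInputsSetup {

        // Placeholder mappers; each would hold the verification logic for its directory.
        static class Directory1Mapper extends Mapper<LongWritable, Text, Text, Text> { }
        static class Directory2Mapper extends Mapper<LongWritable, Text, Text, Text> { }

        // Registers each directory with its own mapper class.
        static void addDirectories(Job job) {
            MultipleInputs.addInputPath(job, new Path("/path/to/directory1"),
                    TextInputFormat.class, Directory1Mapper.class);
            MultipleInputs.addInputPath(job, new Path("/path/to/directory2"),
                    TextInputFormat.class, Directory2Mapper.class);
            // ...one addInputPath call per directory
        }
    }

Each addInputPath call ties one directory to its own mapper class, so the per-directory logic can differ if needed.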

Will the Hadoop framework take care of distributing the directories evenly across the nodes of the cluster (e.g. directory 1 to node 1, directory 2 to node 2, and so on) and processing them in parallel?

No, it won't. Files are not distributed to each node in the sense that the files are copied to the node to be processed. Instead, to put it simply, each node is given a set of file paths to process, with no guarantee on location or data locality. The node then pulls those files from HDFS and processes them.

There's no reason why you can't just open other files you may need directly from HDFS.
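
For instance, a minimal sketch of reading a sibling .dot file straight from HDFS inside a map task; the helper class and the "<vertexName>.dot" naming are assumptions:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSideRead {

        // Reads e.g. /path/to/directoryN/<vertexName>.dot and returns its contents.
        static String readDotFile(Configuration conf, String directory, String vertexName)
                throws IOException {
            FileSystem fs = FileSystem.get(conf);
            Path dotFile = new Path(directory, vertexName + ".dot");
            StringBuilder contents = new StringBuilder();
            try (FSDataInputStream in = fs.open(dotFile);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    contents.append(line).append('\n');
                }
            }
            return contents.toString();
        }
    }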
