Question

I am trying to run a graph verifier application on a distributed system using Hadoop. I have the input in the following format:

Directory1

---file1.dot

---file2.dot

…..

---filen.dot

Directory2

---file1.dot

---file2.dot

…..

---filen.dot

Directory670

---file1.dot

---file2.dot

…..

---filen.dot

The .dot files store the graphs.

Is it enough for me to add the input directories' paths using FileInputFormat.addInputPath()?

I want Hadoop to process the contents of each directory on the same node, because the files in each directory contain data that depends on the presence of the other files in that directory.

Will the Hadoop framework take care of distributing the directories evenly across the nodes of the cluster (e.g. directory 1 to node 1, directory 2 to node 2, and so on) and processing them in parallel?

The files in each directory depend on each other for data (to be precise:

  • each directory contains a file, main.dot, which holds an acyclic graph whose vertices are the names of the rest of the files in that directory;

  • my verifier traverses each vertex of the graph in main.dot, searches the same directory for the file with the same name and, if it is found, processes the data in that file;

  • all the files are processed in the same way, and the combined output obtained after processing every file in the directory is displayed;

  • the same procedure applies to the rest of the directories; a rough sketch of this per-directory pass follows this list.)
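
For illustration, a rough non-Hadoop sketch of this per-directory pass. The class name, the assumption that main.dot lists edges as "A -> B;", and the convention that every vertex maps to a sibling file "<vertex>.dot" are placeholders, not something fixed above:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.LinkedHashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class DirectoryVerifierSketch {

        // Matches edges written as "A -> B" inside main.dot (assumed syntax).
        private static final Pattern EDGE = Pattern.compile("(\\w+)\\s*->\\s*(\\w+)");

        public static void verifyDirectory(Path directory) throws IOException {
            // Collect the vertex names listed in main.dot.
            Set<String> vertices = new LinkedHashSet<>();
            for (String line : Files.readAllLines(directory.resolve("main.dot"))) {
                Matcher m = EDGE.matcher(line);
                while (m.find()) {
                    vertices.add(m.group(1));
                    vertices.add(m.group(2));
                }
            }
            // For each vertex, look for the file of the same name and process it.
            for (String vertex : vertices) {
                Path vertexFile = directory.resolve(vertex + ".dot");
                if (Files.exists(vertexFile)) {
                    processFile(vertexFile); // placeholder for the real verification step
                }
            }
        }

        private static void processFile(Path file) {
            // The actual per-file verification logic would go here.
        }

        public static void main(String[] args) throws IOException {
            verifyDirectory(Paths.get(args[0]));
        }
    }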

To cut a long story short: as in the famous word count application (where the input is a single book), Hadoop splits the input and distributes the tasks to the nodes in the cluster, where each mapper processes its lines and counts the relevant words. How can I split the task here (do I need to split it at all, by the way)?

How can I leverage Hadoop's power for this scenario? Some sample code template would certainly help :)

Was it helpful?

Solution 2

You can create a file with a list of all the directories to process:

/path/to/directory1
/path/to/directory2
/path/to/directory3

Each mapper would process one directory, for example:

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input value is one line of the listing file, i.e. one directory path.
        FileSystem fs = FileSystem.get(context.getConfiguration());
        for (FileStatus status : fs.listStatus(new Path(value.toString()))) {
            // process the file at status.getPath()
        }
    }
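
For this to give one mapper per directory, the driver has to hand each map task exactly one line of the listing file. Below is a minimal driver sketch, assuming the listing file is passed as the first argument and the mapper above is named DirectoryMapper (a placeholder name); NLineInputFormat with one line per split produces the one-directory-per-mapper behaviour:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class GraphVerifierDriver {

        // The mapper shown above; "DirectoryMapper" is a placeholder name.
        public static class DirectoryMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                FileSystem fs = FileSystem.get(context.getConfiguration());
                for (FileStatus status : fs.listStatus(new Path(value.toString()))) {
                    // process the file at status.getPath() and write results via context
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "graph verifier");
            job.setJarByClass(GraphVerifierDriver.class);
            job.setMapperClass(DirectoryMapper.class);
            job.setNumReduceTasks(0);                 // map-only job
            job.setOutputKeyClass(Text.class);        // adjust to whatever the mapper emits
            job.setOutputValueClass(Text.class);

            // One line of the listing file per input split, so each map task
            // is handed exactly one directory path.
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, 1);
            NLineInputFormat.addInputPath(job, new Path(args[0]));   // the listing file
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }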

Other suggestions

The solution given by Alexey Shestakov will work, but it does not leverage MapReduce's distributed processing: most likely only one map task will read the listing file (the file containing the paths of all input directories) and then process all of the input data.

How can we allocate all the files in a directory to a single mapper, so that the number of mappers equals the number of directories? One solution could be the org.apache.hadoop.mapred.lib.MultipleInputs class: use MultipleInputs.addInputPath() to add each directory path along with a map class for that path. Each mapper then gets one directory and processes all the files within it.
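
A minimal sketch of that registration step, written against the newer org.apache.hadoop.mapreduce API rather than the old mapred one; the directory paths and the per-directory mapper classes (Directory1Mapper, Directory2Mapper) are placeholders:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class MultipleInputsSetup {

        // Placeholder mappers; each would hold the verification logic for its directory.
        static class Directory1Mapper extends Mapper<LongWritable, Text, Text, Text> { }
        static class Directory2Mapper extends Mapper<LongWritable, Text, Text, Text> { }

        // Registers each directory with its own mapper class.
        static void addDirectories(Job job) {
            MultipleInputs.addInputPath(job, new Path("/path/to/directory1"),
                    TextInputFormat.class, Directory1Mapper.class);
            MultipleInputs.addInputPath(job, new Path("/path/to/directory2"),
                    TextInputFormat.class, Directory2Mapper.class);
            // ...one addInputPath call per directory
        }
    }

Each addInputPath call ties one directory to its own mapper class, so the per-directory logic can differ if needed.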

Will the Hadoop framework take care of distributing the directories evenly across the nodes of the cluster (e.g. directory 1 to node 1, directory 2 to node 2, and so on) and processing them in parallel?

No, it won't. Files are not distributed to each node in the sense that the files are copied to the node to be processed. Instead, to put it simply, each node is given a set of file paths to process, with no guarantee on location or data locality. The node then pulls those files from HDFS and processes them.

There's no reason why you can't just open other files you may need directly from HDFS.
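
For instance, a minimal sketch of reading a sibling .dot file straight from HDFS inside a map task; the helper class and the "<vertexName>.dot" naming are assumptions:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSideRead {

        // Reads e.g. /path/to/directoryN/<vertexName>.dot and returns its contents.
        static String readDotFile(Configuration conf, String directory, String vertexName)
                throws IOException {
            FileSystem fs = FileSystem.get(conf);
            Path dotFile = new Path(directory, vertexName + ".dot");
            StringBuilder contents = new StringBuilder();
            try (FSDataInputStream in = fs.open(dotFile);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    contents.append(line).append('\n');
                }
            }
            return contents.toString();
        }
    }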
