Question

I am new to Hadoop. I want to use 2, 4, or 6 nodes per run and split the dataset accordingly before it is sent to the mappers, but the code I have written does not work properly. It works for 2 nodes, but as the number of nodes increases, some data goes missing from the output file. Could you please help me? Thank you.

Here is the code:

public static void main(String[] args) throws Exception {

    System.out.println("MapReduce Started at:" + System.currentTimeMillis());
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    int numOfNodes = 2;

    Job job = new Job(conf, "calculateSAAM");
    job.setJarByClass(calculateSAAM.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path("/home/helsene/wordcount/input"));
    String outputFile = "/home/helsene/wordcount/output/";

    long dataLength = fs.getContentSummary(new Path(outputFile)).getLength();
    FileInputFormat.setMaxInputSplitSize(job, (dataLength / numOfNodes));

    job.setNumReduceTasks(numOfNodes / 2);
    Path outPath = new Path(outputFile);

    fs.delete(outPath, true);
    FileOutputFormat.setOutputPath(job, new Path(outputFile));

    job.waitForCompletion(true);
    System.out.println("MapReduce ends at:" + System.currentTimeMillis());
}

Solution

Each reducer produces one output file, named part-xxxxx by default (part-00000 for the first reducer, part-00001 for the second, and so on).

With your code, when you have more than 3 nodes you will have more than one reducer, so the output data will be split across multiple part files. Some word counts will end up in the first file (part-00000), some in the second file (part-00001), and so on. You can later merge these parts with the getmerge command, like:

hadoop dfs -getmerge /HADOOP/OUTPUT/PATH /local/path/

which gives you one file in the specified local path containing the merged results of all the part files. That file will hold the same results as the single output file you get with two nodes, since 2/2 = 1 reducer produces exactly one output file.
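
If you prefer to merge from Java instead of the shell, a rough sketch could look like the one below, assuming Hadoop 1.x/2.x where FileUtil.copyMerge is still available (it was removed in 3.x, where you would have to list and copy the part files yourself). The MergeOutput class name and the output-merged path are placeholders, not part of your job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Directory holding part-00000, part-00001, ... from the job
        Path outputDir = new Path("/home/helsene/wordcount/output");
        // Single merged file to create (placeholder path)
        Path mergedFile = new Path("/home/helsene/wordcount/output-merged/result.txt");

        // Concatenate every file under outputDir into mergedFile
        FileUtil.copyMerge(fs, outputDir, fs, mergedFile,
                false,  // keep the original part files
                conf,
                null);  // no separator string between files
    }
}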

By the way, setting the number of reducers to numOfNodes/2 may not be the best option. See this post for more details.
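
Alternatively, if what you really want is a single output file no matter how many nodes you use, you can simply force one reducer in your driver (at the cost of reduce-phase parallelism, which hardly matters for a small word count):

job.setNumReduceTasks(1);  // all keys go to one reducer, producing a single part-00000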
