Question

Explanations for the two questions I asked are at the end of this post.

I am trying to get a simple Wordcount program to run so I can play around and see what does what.

I currently have an implementation that seems to run perfectly fine to the very end. Then, after my last line in Main() (which is just a println saying so), I get output that looks like a summary of the Hadoop job, containing a single exception.

In my Mapper and Reducer functions I also have a line that simply prints arbitrary text to the screen, just so I know the line is reached, but throughout the run I never see either of these lines get hit. I believe this is causing the IOException mentioned above.

I have 2 questions:

  1. Why would the classes I set via setMapperClass(), setCombinerClass(), and setReducerClass() not get executed?
  2. What does the output after the line "I hit the end of Main()" mean? Is it printed as part of my Java application, or is it a system message from the Hadoop installation about a job being processed? Since it references the Hadoop API I use in my code, I would assume it is part of my Java application. If that is true, why is it executed after the last line of my main() function?

I've saved the output of running the job to a file:

Enter the Code to run the particular program.
Wordcount = 000:
Assignment 1 = 001:
Assignment 2 = 002:
000
/usr/dan/wordcount/
/usr/dan/wordcount/result.txt
May 04, 2014 2:22:28 PM org.apache.hadoop.metrics.jvm.JvmMetrics init
INFO: Initializing JVM Metrics with processName=JobTracker, sessionId=
May 04, 2014 2:22:29 PM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 2
May 04, 2014 2:22:29 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0001
May 04, 2014 2:22:29 PM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 2
May 04, 2014 2:22:29 PM org.apache.hadoop.mapred.MapTask <init>
INFO: io.sort.mb = 100
May 04, 2014 2:22:29 PM org.apache.hadoop.mapred.MapTask <init>
INFO: data buffer = 79691776/99614720
May 04, 2014 2:22:29 PM org.apache.hadoop.mapred.MapTask <init>
INFO: record buffer = 262144/327680
May 04, 2014 2:22:29 PM org.apache.hadoop.mapred.LocalJobRunner run
WARNING: job_local_0001
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:845)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:541)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
    at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

May 04, 2014 2:22:30 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 0% reduce 0%
May 04, 2014 2:22:30 PM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0001
May 04, 2014 2:22:30 PM org.apache.hadoop.mapred.JobClient log
INFO: Counters: 0
Not Fail!
I hit the end of wordcount!
I hit the end of Main()

The application is set up with a main class that, based on user input, dispatches to the respective class. In case it helps, I will post only the class I'm working on at the moment. If you need to see more, just ask.

package hadoop;

import java.io.IOException;
import java.util.Arrays;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 *
 * @author Dans Laptop
 */
public class Wordcount {

    public static class TokenizerMapper 
       extends org.apache.hadoop.mapreduce.Mapper<Object, Text, Text, IntWritable>{

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, org.apache.hadoop.mapreduce.Reducer.Context context
                        ) throws IOException, InterruptedException {
            System.out.println("mapper!");
            StringTokenizer itr = new StringTokenizer(value.toString());

            while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              context.write(word, one);
            }
        }
    }

    public static class IntSumReducer 
        extends org.apache.hadoop.mapreduce.Reducer<Text,IntWritable,Text,IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, 
                           org.apache.hadoop.mapreduce.Reducer.Context context
                           ) throws IOException, InterruptedException {
            System.out.println("Reducer!");
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public void wordcount(String[] args) throws IOException, InterruptedException, ClassNotFoundException{
        System.out.println(args[0]);// Prints arg 1
        System.out.println(args[1]);// Prints arg 2

        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

        if (otherArgs.length != 2) {
          System.err.println("Usage: wordcount <in> <out>");
          System.exit(2);
        }

        Job job = new Job(conf, "wordcount");

        job.setJarByClass(Wordcount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setMapOutputKeyClass(Text.class); 
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        try{
            job.waitForCompletion(true);
            System.out.println("Not Fail!");
        }catch(Exception e){
            System.out.println(e.getLocalizedMessage());
            System.out.println(e.getMessage());
            System.out.println(Arrays.toString(e.getStackTrace()));
            System.out.println(e.toString());
            System.out.println("Failed!");
        }

        System.out.println("I hit the end of wordcount!");//Proves I hit the end of wordcount.
    }
}

The command being used to run the jar is (from the location /usr/dan):

hadoop -jar ./hadoop.jar /usr/dan/wordcount/ /usr/dan/wordcount/result.txt
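For reference, the usual form of this command is hadoop jar <jarFile> [mainClass] <args>, with no dash before jar; a sketch, where hadoop.Main is only a guess at the main class name:

    hadoop jar ./hadoop.jar hadoop.Main /usr/dan/wordcount/ /usr/dan/wordcount/result.txt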

Note: I expect the program to look at all of the files in /usr/dan/wordcount and then create a file /usr/dan/wordcount/result.txt which lists each word and the number of times it occurs. I am not getting this behavior yet, but I want to figure out these two questions so I can troubleshoot the rest of the way.
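
A side note on that expectation: with the default FileOutputFormat, the second path is treated as an output directory that must not already exist, and each reducer writes a part file inside it. Once the job works, the counts would be read back with something like this (part-r-00000 is Hadoop's usual default name; it can vary by version):

    hadoop fs -cat /usr/dan/wordcount/result.txt/part-r-00000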

Response to @Alexey:

  1. I did not realize it is not possible to print directly to the console during MapReduce job execution in Hadoop. I had just assumed those lines were not being executed. Now I know where to look for any output during a job. However, following the directions in the question you linked did not show any jobs for me to look at, perhaps because none of my jobs has fully completed yet.

  2. I've switched from job.submit(); to job.waitForCompletion(true);, but I am still getting the output after it. I don't know whether this indicates something else is wrong, but I figured I'd document it.

  3. I've added the lines you suggested (these set the output types of the Map class only?):

    job.setMapOutputKeyClass(Text.class); 
    job.setMapOutputValueClass(IntWritable.class);
    

and left/removed the lines (these set the output types of both the Map and Reduce classes?):

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

I am still getting the same exception. From reading about it online, it seems the error has to do with the output types of the Map class not matching the input types of the Reduce class. It seems pretty explicit in my code that these two match. The only thing that has me confused is where the LongWritable is coming from; I do not have that type anywhere in my code. After looking at the question Hadoop type mismatch in key from map expected value Text received value LongWritable, it was the same situation, but the solution there was to specify the Mapper and Reducer classes, which I am already doing. The other thing I notice in my code is that the input key of my Mapper class is of type Object. Could this have any significance? The error says it is a mismatch in the key FROM map, though. (One possible explanation is sketched below.)
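
For reference, here is a sketch of the mapper with the Context parameter typed explicitly. One plausible reading of the trace: the posted map() takes an org.apache.hadoop.mapreduce.Reducer.Context parameter, so it never overrides Mapper.map; the stack trace above points at Mapper.map(Mapper.java:124), the base class, which is the identity mapper, and the identity mapper passes through the LongWritable byte-offset keys that the default TextInputFormat produces. That would account for both the LongWritable and the "mapper!" line never printing:

    public static class TokenizerMapper
            extends org.apache.hadoop.mapreduce.Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override // compilation fails if this signature does not override Mapper.map
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // Text key, IntWritable value, as declared
            }
        }
    }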

I've also gone ahead and updated my code/results above.

Thank you for your help so far; I've already picked up a lot of information from your response as it is.

EXPLANATIONS

  1. The Mapper and Reducer functions were being executed, but their output is not displayed directly on the screen. You have to view the Hadoop logs in the browser to see it (see the note after this list).
  2. The extra output is something Hadoop prints to give you a bit of information about how the job executed (if at all).
  3. The additional questions I asked @Alexey were outside the scope of the original question, but with determination I figured them out and learned a great deal more. Thank you, @Alexey.
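
Assuming a Hadoop 1.x setup where the job runs under a JobTracker rather than the LocalJobRunner shown in the log above, these are the typical places to look for that output; the exact paths depend on the hadoop.log.dir configuration:

    # per-attempt stdout/stderr/syslog written by each map and reduce task
    ls $HADOOP_HOME/logs/userlogs/
    # or browse the JobTracker web UI, typically at http://localhost:50030/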

Solution

  1. Your first question looks similar to this one: How to print on console during MapReduce job execution in hadoop.

  2. The line job.submit(); tells Hadoop to run the job but does not wait for its completion. I think you may want to replace this line with job.waitForCompletion(true);, so there would be no output after "I hit the end of wordcount!" (see the sketch after this list).

  3. To get rid of the exception, you should specify the output key and value classes of your mapper:

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
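
A minimal sketch of the difference the second point describes; the two calls are alternatives, and success is just an illustrative variable name:

    // Alternative 1: submit() is asynchronous. It hands the job to the
    // framework and returns immediately, so main() reaches its last line
    // while the job is still running and logging in the background.
    job.submit();

    // Alternative 2: waitForCompletion(true) blocks until the job finishes;
    // the boolean flag makes it print progress and the counter summary.
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);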

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow