Question

I'm using Hadoop 1.2.1, and for some reason my WordCount output looks strange:

input file:

this is sparta this was sparta hello world goodbye world

output in hdfs:

goodbye 1
hello   1
is  1
sparta  1
sparta  1
this    1
this    1
was 1
world   1
world   1

code:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}

And here's some relevant console output:

14/01/04 16:17:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/01/04 16:17:37 INFO input.FileInputFormat: Total input paths to process : 1
14/01/04 16:17:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/01/04 16:17:37 WARN snappy.LoadSnappy: Snappy native library not loaded
14/01/04 16:17:38 INFO mapred.JobClient: Running job: job_201401041506_0013
14/01/04 16:17:39 INFO mapred.JobClient:  map 0% reduce 0%
14/01/04 16:17:45 INFO mapred.JobClient:  map 100% reduce 0%
14/01/04 16:17:52 INFO mapred.JobClient:  map 100% reduce 33%
14/01/04 16:17:54 INFO mapred.JobClient:  map 100% reduce 100%
14/01/04 16:17:55 INFO mapred.JobClient: Job complete: job_201401041506_0013
14/01/04 16:17:55 INFO mapred.JobClient: Counters: 26
14/01/04 16:17:55 INFO mapred.JobClient:   Job Counters 
14/01/04 16:17:55 INFO mapred.JobClient:     Launched reduce tasks=1
14/01/04 16:17:55 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=6007
14/01/04 16:17:55 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/04 16:17:55 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/01/04 16:17:55 INFO mapred.JobClient:     Launched map tasks=1
14/01/04 16:17:55 INFO mapred.JobClient:     Data-local map tasks=1
14/01/04 16:17:55 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=9167
14/01/04 16:17:55 INFO mapred.JobClient:   File Output Format Counters 
14/01/04 16:17:55 INFO mapred.JobClient:     Bytes Written=77
14/01/04 16:17:55 INFO mapred.JobClient:   FileSystemCounters
14/01/04 16:17:55 INFO mapred.JobClient:     FILE_BYTES_READ=123
14/01/04 16:17:55 INFO mapred.JobClient:     HDFS_BYTES_READ=169
14/01/04 16:17:55 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=122037
14/01/04 16:17:55 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=77
14/01/04 16:17:55 INFO mapred.JobClient:   File Input Format Counters 
14/01/04 16:17:55 INFO mapred.JobClient:     Bytes Read=57
14/01/04 16:17:55 INFO mapred.JobClient:   Map-Reduce Framework
14/01/04 16:17:55 INFO mapred.JobClient:     Map output materialized bytes=123
14/01/04 16:17:55 INFO mapred.JobClient:     Map input records=10
14/01/04 16:17:55 INFO mapred.JobClient:     Reduce shuffle bytes=123
14/01/04 16:17:55 INFO mapred.JobClient:     Spilled Records=20
14/01/04 16:17:55 INFO mapred.JobClient:     Map output bytes=97
14/01/04 16:17:55 INFO mapred.JobClient:     Total committed heap usage (bytes)=269619200
14/01/04 16:17:55 INFO mapred.JobClient:     Combine input records=0
14/01/04 16:17:55 INFO mapred.JobClient:     SPLIT_RAW_BYTES=112
14/01/04 16:17:55 INFO mapred.JobClient:     Reduce input records=10
14/01/04 16:17:55 INFO mapred.JobClient:     Reduce input groups=7
14/01/04 16:17:55 INFO mapred.JobClient:     Combine output records=0
14/01/04 16:17:55 INFO mapred.JobClient:     Reduce output records=10
14/01/04 16:17:55 INFO mapred.JobClient:     Map output records=10

What could cause this? I'm very new to Hadoop, so I'm not sure where to look. Thanks!

Solution

You're using the old API's reduce signature. In the 1.x API, reduce takes an Iterable rather than an Iterator (the Iterator form comes from the old 0.x API, which is why you'll still see it in many books and examples on the web). Because the signatures don't match, your method never actually overrides Reducer.reduce; Hadoop silently falls back to the inherited default implementation, which writes every (key, value) pair through unchanged. That's why each word appears in your output once per occurrence instead of being summed.

http://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/mapreduce/Reducer.html#reduce%28KEYIN,%20java.lang.Iterable,%20org.apache.hadoop.mapreduce.Reducer.Context%29
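
For reference, the default reduce you're inheriting is essentially an identity pass-through (paraphrased from the Hadoop 1.x Reducer source, so the exact code may differ slightly), which is why every mapper output record shows up verbatim in your results:

// Paraphrased from the Hadoop 1.x Reducer source: since the Iterator-based
// method never overrides this, the framework runs this default instead,
// writing each (key, value) pair through unchanged.
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
        throws IOException, InterruptedException {
    for (VALUEIN value : values) {
        context.write((KEYOUT) key, (VALUEOUT) value);
    }
}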

Try this instead:

@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    // Sum the 1s the mapper emitted for this word.
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));
}
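
With that change, each distinct word becomes a single reduce group (your counters already hinted at the problem: Reduce input groups=7 against Reduce output records=10), so for your input the output should collapse to:

goodbye 1
hello   1
is      1
sparta  2
this    2
was     1
world   2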

The @Override annotation tells the compiler to check that your method really overrides a matching signature in the parent class. Had it been on the original Iterator version, the mistake would have surfaced as a compile error instead of a silently misbehaving job.
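
For example (a sketch; the exact wording of the error depends on your compiler):

// With @Override added, the Iterator-based signature no longer compiles;
// javac reports something like:
// "error: method does not override or implement a method from a supertype"
@Override
public void reduce(Text key, Iterator<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    // ...
}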

Licensed under: CC-BY-SA with attribution