Question

I need some help understanding the MapReduce flow, not theoretically, but in reference to a normal Java code snippet. I've been reading tutorials, APIs, forums and other material on processing in Hadoop, i.e., MapReduce. I'm perfectly comfortable with the flow and the processing methodology of Hadoop. The problem with my understanding is the MapReduce coding in Java. I've read and tried some of the code already available out there, e.g., wordcount. What I'm not able to understand is the flow of the raw data into the Mapper & Reducer classes (strictly in terms of the Java code, not the theoretical flow).

I'll try to present it in terms of code as well; maybe I can make myself clearer that way. Take the classic wordcount program; the declarations of the Map & Reduce classes I found somewhere are:

public static class Map extends MapReduceBase implements 
    Mapper<LongWritable, Text, Text, IntWritable>

public static class Reduce extends MapReduceBase implements 
    Reducer<Text, IntWritable, Text, IntWritable>

My queries for the above code snippet:

  • How would I decide what the types of all four arguments should be here?
  • The API definition states it's something like

    org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

  • If I'm passing the file to my program through the command line, into which argument does the whole content go? As per my understanding of the Hadoop flow, I'd say the first two arguments, i.e., KEYIN & VALUEIN, will be used. The keys will be the words while the values will be their counts (whatever comes out of the map phase).

  • How do I decide whether my key should be LongWritable or not? What if I declare the first argument as just an Integer type? (I'm not talking about the difference between the Integer & LongWritable types, but about the basic decision.)
  • How would I decide what part of my problem should be in my mapper class and what part should be in my reducer class?
  • Should the declared argument types in the mapper & reducer classes be the same or different? In the above declarations they are different; how does that work? The only answer I could guess is that the map phase outputs intermediate values which might not be of the same type as the input to the map class. (Not sure on this, so please excuse the explanation if it's absurd.)

For instance, I tried writing a small piece of code to find the greatest number in a small text file with comma-separated integer values. First of all I couldn't decide where I should perform the processing: in the mapper class or the reducer class? Looking at numerous examples around, I somehow reached the conclusion that the processing should be in the reducer class, while I can apply some basic checks to my input in the mapper class. This logic I just assumed all by myself, so you can have your fun time with this :) Can someone please tell me what's wrong in this code & maybe help me clear up my understanding?

My code:

public class mapTest {
    public static class Map extends MapReduceBase implements 
    Mapper<Text, Text, Text, Reporter>{

        @Override
        public void map(Text text, Text value,
                OutputCollector<Text, Reporter> output, Reporter arg0)
                throws IOException {
            // TODO Auto-generated method stub
            String val = value.toString();

            if(val.equals(null)){
                System.out.println("NULL value found");
            }
            output.collect(value, arg0);

        }

    }

    public static class Reduce extends MapReduceBase implements
    Reducer<Text, Text, Text, Reporter>{

        public void reduce(Text key, Iterator<Text> value,OutputCollector<Text, Reporter> output, Reporter arg0)
                throws IOException {
            // TODO Auto-generated method stub
            Text max = value.next();

            while(value.hasNext()){
                Text current = value.next();
                if(current.compareTo(max) > 0)
                    max = current;
            }
            output.collect(key, (Reporter) max);
        }

    }



    /**
     * @param args
     * @throws IOException 
     */
    public static void main(String[] args) throws IOException {
        // TODO Auto-generated method stub

        JobConf conf = new JobConf();

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }

}

PS: I picked the argument types more or less at random; I don't understand the importance of any of them except the Reporter type.

Before posting this question, I did all the research I could. Please help me understand this flow. I don't want to mess everything up by just picking up code from elsewhere and making cosmetic changes.

Thanks in advance-

Adil

Solution

  • KEYIN, VALUEIN for the mapper and KEYOUT, VALUEOUT for the reducer depend on your input and output formats respectively. The map output k/v pair must match the input k/v pair for the reducer (see the sketch after this list).
  • For a text file you probably want the TextInputFormat. The key is the byte offset of the file and the value is the line. From there you can parse out any data you want like a normal string.
  • LongWritable vs IntWritable is just like choosing an int vs a long. It all depends on your data.
  • Whether the bulk of the work should be done in the mapper or the reducer is up for debate. You generally use more mappers than reducers, so you can leverage more parallelism, and you also reduce the amount of data that needs to be shuffled & sorted before reducing. However, reducers have all values for a key collocated, so the processing may make sense there too. Personally I try to minimize the data as much as I can, as early as I can. MapReduce is most efficient when everything can fit in memory.
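
To tie these points back to your max-value example, here is a minimal sketch of what that job could look like with the old org.apache.hadoop.mapred API you're already using. It's only a sketch under my own assumptions (one comma-separated list of integers per line, a single constant key so every number reaches one reduce call), and the names MaxValue, MaxMapper and MaxReducer are just mine for illustration:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class MaxValue {

    // TextInputFormat fixes the mapper's input types: the key is the byte
    // offset into the file (LongWritable) and the value is the line (Text).
    public static class MaxMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static final Text KEY = new Text("max"); // one key, so all values meet in one reduce call

        public void map(LongWritable offset, Text line,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Parse the comma-separated integers and emit each one under the same key.
            for (String token : line.toString().split(",")) {
                token = token.trim();
                if (!token.isEmpty()) {
                    output.collect(KEY, new IntWritable(Integer.parseInt(token)));
                }
            }
        }
    }

    // The reducer's input pair (Text, IntWritable) must match the mapper's output pair.
    public static class MaxReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int max = Integer.MIN_VALUE;
            while (values.hasNext()) {
                max = Math.max(max, values.next().get());
            }
            output.collect(key, new IntWritable(max));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(MaxValue.class);
        conf.setJobName("max-value");

        conf.setMapperClass(MaxMapper.class);
        conf.setReducerClass(MaxReducer.class);

        // These describe the reducer's output; they also describe the map output
        // unless setMapOutputKeyClass/setMapOutputValueClass are set separately.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

Note how the types chain together: TextInputFormat fixes the mapper's input pair as LongWritable/Text, the mapper's output pair (Text, IntWritable) matches the reducer's input pair, and setOutputKeyClass/setOutputValueClass describe what the reducer emits. Funnelling everything under one key keeps the toy example simple but serialises the reduce step; with real data you would emit finer-grained keys or set a combiner via conf.setCombinerClass.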