Question

I've been trying to run a Hadoop 0.20.205.0 MapReduce job (single-threaded, locally) which was exhibiting all kinds of weird and unexpected behavior. I finally figured out why. This looks to me like a bug in Hadoop, but maybe there's something I don't understand. Could someone give me some advice?

The class I set with setMapOutputKeyClass implements Configurable. Its readFields method can't deserialize properly unless setConf is called first (I believe that's the point of the Configurable interface). But looking at the code for WritableComparator, I see that when the framework sorts keys, it instantiates its internal key objects with:

70      key1 = newKey();
71      key2 = newKey();

And newKey() uses a null Configuration to construct the keys:

83  public WritableComparable newKey() {
84    return ReflectionUtils.newInstance(keyClass, null);
85  }
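Passing null there matters because ReflectionUtils only calls setConf when it is handed a non-null Configuration. Here is a minimal sketch of the relevant logic (paraphrased from Hadoop's ReflectionUtils, so treat it as approximate rather than the verbatim source):

// Paraphrased from Hadoop's ReflectionUtils -- approximate, not the exact source:
public static <T> T newInstance(Class<T> theClass, Configuration conf) {
    T result;
    try {
        result = theClass.newInstance();
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
    setConf(result, conf);
    return result;
}

public static void setConf(Object theObject, Configuration conf) {
    if (conf != null) {  // a null conf skips this block entirely,
        if (theObject instanceof Configurable) {  // so setConf is never invoked
            ((Configurable) theObject).setConf(conf);
        }
    }
}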

Indeed, when I run under a debugger I find that at

91      key1.readFields(buffer);

the conf field inside key1 is null, so setConf has never been called.

Is this a bug in Hadoop, or am I supposed to be using something other than Configurable to configure the keys? And if it is a bug, does anybody know any workarounds?

EDIT: Here's a short (somewhat contrived) example of a job that fails for this reason:

// example/WrapperKey.java

package example;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.ByteWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;

/**
 * This class wraps a WritableComparable class to add one extra possible value
 * (namely null) to the range of values available for that class.
 */
public class WrapperKey<T extends WritableComparable> implements
        WritableComparable<WrapperKey<T>>, Configurable {
    private T myInstance;
    private boolean isNull;
    private Configuration conf;

    @SuppressWarnings("unchecked")
    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        // The wrapped type is only known at runtime, from the configuration:
        Class<T> heldClass = (Class<T>) conf.getClass("example.held.class",
                null, WritableComparable.class);
        myInstance = ReflectionUtils.newInstance(heldClass, conf);
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeBoolean(isNull);
        if (!isNull)
            myInstance.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        isNull = in.readBoolean();
        if (!isNull)
            myInstance.readFields(in);
    }

    @Override
    public int compareTo(WrapperKey<T> o) {
        if (isNull) {
            if (o.isNull)
                return 0;
            else
                return -1;
        } else if (o.isNull)
            return 1;
        else
            return myInstance.compareTo(o.myInstance);
    }

    public void clear() {
        isNull = true;
    }

    public T get() {
        return myInstance;
    }

    /**
     * Should sort the KV pairs (5,0), (3,0), and (null,0) to [(null,0), (3,0), (5,0)], but instead fails
     * with a NullPointerException because WritableComparator's internal keys
     * are not properly configured
     */
    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        conf.setClass("example.held.class", ByteWritable.class,
                WritableComparable.class);
        Path p = new Path("input");
        Path startFile = new Path(p, "inputFile");
        SequenceFile.Writer writer = new SequenceFile.Writer(
                p.getFileSystem(conf), conf, startFile, WrapperKey.class,
                ByteWritable.class);
        WrapperKey<ByteWritable> key = new WrapperKey<ByteWritable>();
        key.setConf(conf);
        ByteWritable value = new ByteWritable((byte) 0);
        key.get().set((byte) 5);
        writer.append(key, value);
        key.get().set((byte) 3);
        writer.append(key, value);
        key.clear();
        writer.append(key, value);
        writer.close();

        Job j = new Job(conf, "Example job");
        j.setInputFormatClass(SequenceFileInputFormat.class);
        j.setOutputKeyClass(WrapperKey.class);
        j.setOutputValueClass(ByteWritable.class);
        j.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.setInputPaths(j, p);
        FileOutputFormat.setOutputPath(j, new Path("output"));
        boolean completed = j.waitForCompletion(true);
        if (completed) {
            System.out
                    .println("Successfully sorted byte-pairs by key (putting all null pairs first)");
        } else {
            throw new RuntimeException("Failed to sort");
        }
    }
}

Solution

WrapperKey implements Configurable and provides a setConf method, but merely implementing an interface doesn't mean some other class is going to call it. The Hadoop framework does not call setConf on the key instances it creates internally, which is exactly what you observed in WritableComparator.newKey().

I don't think this is a bug. All the key types I have seen implement only WritableComparable, not Configurable. I'm not sure of a clean workaround; you may have to bake the concrete held type into the key itself rather than reading it from the configuration.
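That said, one avenue worth trying: the framework does configure the sort comparator itself (JobConf.getOutputKeyComparator instantiates it with the job Configuration via ReflectionUtils.newInstance(theClass, this)), so you could supply your own RawComparator that implements Configurable and deserializes into keys it configures itself, instead of relying on WritableComparator's unconfigured internal instances. Below is an untested sketch under that assumption; the class name ConfiguredKeyComparator is hypothetical:

// example/ConfiguredKeyComparator.java -- untested workaround sketch.

package example;

import java.io.IOException;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.util.ReflectionUtils;

@SuppressWarnings({ "rawtypes", "unchecked" })
public class ConfiguredKeyComparator implements RawComparator<WrapperKey>, Configurable {
    private final DataInputBuffer buffer = new DataInputBuffer();
    private Configuration conf;
    private WrapperKey key1;
    private WrapperKey key2;

    @Override
    public void setConf(Configuration conf) {
        // Unlike WritableComparator's internal keys, the sort comparator is
        // instantiated with the job Configuration, so setConf IS called here
        // and we can build properly configured key instances ourselves.
        this.conf = conf;
        key1 = ReflectionUtils.newInstance(WrapperKey.class, conf);
        key2 = ReflectionUtils.newInstance(WrapperKey.class, conf);
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // Deserialize into the configured keys, then compare logically.
        try {
            buffer.reset(b1, s1, l1);
            key1.readFields(buffer);
            buffer.reset(b2, s2, l2);
            key2.readFields(buffer);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return key1.compareTo(key2);
    }

    @Override
    public int compare(WrapperKey a, WrapperKey b) {
        return a.compareTo(b);
    }
}

You would register it with j.setSortComparatorClass(ConfiguredKeyComparator.class) before submitting the job. Note that this only covers the sort path; any other place where the framework deserializes your key with a null Configuration would need similar treatment.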
