HBase, Map/Reduce, and SequenceFiles: mapred.output.format.class is incompatible with new map API mode

https://stackoverflow.com/questions/22138664

19-10-2022
|

문제

I am trying to produce mahout vectors from an HBase table. Mahout requires sequence files of vectors as its input. I am getting the impression that I can't write to a sequence file from a map-reduce job that uses HBase as a source. Here goes nothing:

public void vectorize() throws IOException, ClassNotFoundException, InterruptedException {
    JobConf jobConf = new JobConf();
    jobConf.setMapOutputKeyClass(LongWritable.class);
    jobConf.setMapOutputValueClass(VectorWritable.class);

    // we want the vectors written straight to HDFS,
    // the order does not matter.
    jobConf.setNumReduceTasks(0);

    jobConf.setOutputFormat(SequenceFileOutputFormat.class);

    Path outputDir = new Path("/home/cloudera/house_vectors");
    FileSystem fs = FileSystem.get(configuration);
    if (fs.exists(outputDir)) {
        fs.delete(outputDir, true);
    }

    FileOutputFormat.setOutputPath(jobConf, outputDir);

    // I want the mappers to know the max and min value
    // so they can normalize the data.
    // I will add them as properties in the configuration,
    // by serializing them with avro.
    String minmax = HouseAvroUtil.toString(Arrays.asList(minimumHouse,
            maximumHouse));
    jobConf.set("minmax", minmax);

    Job job = Job.getInstance(jobConf);
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("data"));
    TableMapReduceUtil.initTableMapperJob("homes", scan,
            HouseVectorizingMapper.class, LongWritable.class,
            VectorWritable.class, job);

    job.waitForCompletion(true);
}

I have some test code to run it, but I get this:

java.io.IOException: mapred.output.format.class is incompatible with new map API mode.
    at org.apache.hadoop.mapreduce.Job.ensureNotSet(Job.java:1173)
    at org.apache.hadoop.mapreduce.Job.setUseNewAPI(Job.java:1204)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1262)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1287)
    at jinvestor.jhouse.mr.HouseVectorizer.vectorize(HouseVectorizer.java:90)
    at jinvestor.jhouse.mr.HouseVectorizerMT.vectorize(HouseVectorizerMT.java:23)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

So I think my problem is that I am using the import org.apache.hadoop.mapreduce.Job, and the setOutputFormat method expects an instance of org.apache.hadoop.mapreduce.OutputFormat, which is a class. That class only has four implementations, and none of them are for a sequence file. Here is its javadocs:

http://hadoop.apache.org/docs/r2.2.0/api/index.html?org/apache/hadoop/mapreduce/OutputFormat.html

I would use the old API version of the Job class if I could, but HBase's TableMapReduceUtil only accepts the new API's Job.

I suppose I could write my results out as text first, and then have a second map/reduce job that converts the output to sequence files, but that sounds very inefficient.

There is also the old org.apache.hadoop.hbase.mapred.TableMapReduceUtil, but it is deprecated for me.

My mahout jar is version 0.7-cdh4.5.0 My HBase jar is version 0.94.6-cdh4.5.0 all of my hadoop jars are 2.0.0-cdh4.5.0

Would somebody please tell me how to write to a SequenceFile from M/R in my situation?

해결책

Actually, SequenceFileOutputFormat is a descendant of the new OutputFormat. You have to look a bit further than just the direct subclasses in the javadoc to find it.

http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/lib/output/SequenceFileOutputFormat.html

You've probably imported the wrong (old) one in your driver class. It's not possible to tell that from your question for sure, as you have not include the imports in your code samples.

다른 팁

This was the missing piece for me with similar issues using Oozie. From braindump:

<!-- New API for map -->
<property>
    <name>mapred.mapper.new-api</name>
    <value>true</value>
</property>

<!-- New API for reducer -->
<property>
    <name>mapred.reducer.new-api</name>
    <value>true</value>
</property>

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow