I am trying to produce Mahout vectors from an HBase table. Mahout requires sequence files of vectors as its input, but I am getting the impression that I can't write to a sequence file from a map-reduce job that uses HBase as its source.
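To be clear about the target format: as I understand it, Mahout just wants a SequenceFile whose values are VectorWritable, and writing one by hand outside of map-reduce is easy. A minimal sketch (LongWritable keys are my own choice, matching the job below; imports omitted as in the rest of this post):

// Sketch: writing a SequenceFile of vectors by hand, outside map-reduce.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("/home/cloudera/house_vectors/part-m-00000");
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path,
        LongWritable.class, VectorWritable.class);
try {
    writer.append(new LongWritable(1L),
            new VectorWritable(new DenseVector(new double[] { 0.5, 0.25 })));
} finally {
    writer.close();
}

What I actually need is to produce that same file from a map-reduce job over the HBase table. Here is my attempt: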
public void vectorize() throws IOException, ClassNotFoundException, InterruptedException {
    JobConf jobConf = new JobConf();
    jobConf.setMapOutputKeyClass(LongWritable.class);
    jobConf.setMapOutputValueClass(VectorWritable.class);

    // We want the vectors written straight to HDFS;
    // the order does not matter.
    jobConf.setNumReduceTasks(0);
    jobConf.setOutputFormat(SequenceFileOutputFormat.class);

    Path outputDir = new Path("/home/cloudera/house_vectors");
    // 'configuration' is an instance field of this class (not shown).
    FileSystem fs = FileSystem.get(configuration);
    if (fs.exists(outputDir)) {
        fs.delete(outputDir, true);
    }
    FileOutputFormat.setOutputPath(jobConf, outputDir);

    // I want the mappers to know the max and min values so they can
    // normalize the data, so I add them as a property in the
    // configuration, serialized with Avro. 'minimumHouse' and
    // 'maximumHouse' are also instance fields.
    String minmax = HouseAvroUtil.toString(Arrays.asList(minimumHouse,
            maximumHouse));
    jobConf.set("minmax", minmax);

    Job job = Job.getInstance(jobConf);
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("data"));
    TableMapReduceUtil.initTableMapperJob("homes", scan,
            HouseVectorizingMapper.class, LongWritable.class,
            VectorWritable.class, job);
    job.waitForCompletion(true);
}
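For reference, HouseVectorizingMapper is a TableMapper. It looks roughly like this (a simplified sketch: the real feature extraction and normalization are more involved, HouseAvroUtil.fromString is my own helper that reverses the toString above, and FEATURE_COUNT is a constant of mine; imports omitted, TableMapper being the org.apache.hadoop.hbase.mapreduce one):

public class HouseVectorizingMapper
        extends TableMapper<LongWritable, VectorWritable> {

    private House minimumHouse; // House is my own model class
    private House maximumHouse;

    @Override
    protected void setup(Context context) throws IOException {
        // Recover the min/max houses that vectorize() serialized
        // into the configuration with Avro.
        List<House> minmax = HouseAvroUtil.fromString(
                context.getConfiguration().get("minmax"));
        minimumHouse = minmax.get(0);
        maximumHouse = minmax.get(1);
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result columns, Context context)
            throws IOException, InterruptedException {
        // Build a Mahout vector from the "data" family, normalizing each
        // feature against the min/max houses (details elided).
        Vector vector = new DenseVector(FEATURE_COUNT);
        // ... populate and normalize 'vector' from 'columns' ...
        // Assumes the row key encodes a long.
        context.write(new LongWritable(Bytes.toLong(row.get())),
                new VectorWritable(vector));
    }
}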
I have some test code to run it, but I get this:
java.io.IOException: mapred.output.format.class is incompatible with new map API mode.
at org.apache.hadoop.mapreduce.Job.ensureNotSet(Job.java:1173)
at org.apache.hadoop.mapreduce.Job.setUseNewAPI(Job.java:1204)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1262)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1287)
at jinvestor.jhouse.mr.HouseVectorizer.vectorize(HouseVectorizer.java:90)
at jinvestor.jhouse.mr.HouseVectorizerMT.vectorize(HouseVectorizerMT.java:23)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
So I think my problem is that I am mixing the two APIs: I build an org.apache.hadoop.mapreduce.Job, but I configure the output format through the old JobConf.setOutputFormat. The new Job wants its output format set via setOutputFormatClass, which takes a subclass of the abstract class org.apache.hadoop.mapreduce.OutputFormat, and that class's javadoc lists only four direct implementations, none of which is for a sequence file. Here is the javadoc:
http://hadoop.apache.org/docs/r2.2.0/api/index.html?org/apache/hadoop/mapreduce/OutputFormat.html
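As far as I can tell from the stack trace, Job.setUseNewAPI refuses to run because the old-style JobConf.setOutputFormat call stored the mapred.output.format.class property, and the new-API Job checks that this property is unset. A minimal illustration of the two styles (TextOutputFormat here is just for contrast, not what I want):

// Old API: org.apache.hadoop.mapred. This sets the
// mapred.output.format.class property, which is exactly
// what the exception complains about.
JobConf jobConf = new JobConf();
jobConf.setOutputFormat(org.apache.hadoop.mapred.SequenceFileOutputFormat.class);

// New API: org.apache.hadoop.mapreduce. The Job object wants the
// format set through setOutputFormatClass, with a class that
// extends org.apache.hadoop.mapreduce.OutputFormat.
Job job = Job.getInstance(new Configuration());
job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.class);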
I would use the old API's JobConf/JobClient route if I could, but HBase's TableMapReduceUtil (the org.apache.hadoop.hbase.mapreduce one) only accepts the new API's Job.
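For reference, the overload I am calling is this one (paraphrased from the HBase 0.94 javadocs), and its last parameter is the new API's Job:

// org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil
public static void initTableMapperJob(
        String table,
        Scan scan,
        Class<? extends TableMapper> mapper,
        Class<?> outputKeyClass,
        Class<?> outputValueClass,
        org.apache.hadoop.mapreduce.Job job) throws IOException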
I suppose I could write my results out as text first, and then have a second map/reduce job that converts the output to sequence files, but that sounds very inefficient.
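If I went that route, the second job would not touch HBase at all, so it could be written entirely against the old API, something like this (a sketch; LineToVectorMapper is a hypothetical mapper that parses each text line back into a LongWritable/VectorWritable pair):

// Hypothetical second job: convert the text output of the first job
// into the SequenceFile that Mahout wants. Old API throughout, so
// setOutputFormat(SequenceFileOutputFormat.class) is legal here.
JobConf convertConf = new JobConf();
convertConf.setJobName("house-text-to-seqfile");
convertConf.setMapperClass(LineToVectorMapper.class); // hypothetical
convertConf.setNumReduceTasks(0);
convertConf.setOutputKeyClass(LongWritable.class);
convertConf.setOutputValueClass(VectorWritable.class);
convertConf.setOutputFormat(SequenceFileOutputFormat.class);
FileInputFormat.setInputPaths(convertConf, new Path("/home/cloudera/house_text"));
FileOutputFormat.setOutputPath(convertConf, new Path("/home/cloudera/house_vectors"));
JobClient.runJob(convertConf);

But that reads and writes the whole dataset twice, which is what I would like to avoid.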
There is also the old org.apache.hadoop.hbase.mapred.TableMapReduceUtil, but it is deprecated in the HBase version I am using.
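Its entry point does take a JobConf, which is why it is tempting (again paraphrased from the javadocs, as I read them):

// org.apache.hadoop.hbase.mapred.TableMapReduceUtil (deprecated)
public static void initTableMapJob(
        String table,
        String columns,
        Class<? extends TableMap> mapper,
        Class<?> outputKeyClass,
        Class<?> outputValueClass,
        JobConf job)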
My Mahout jar is version 0.7-cdh4.5.0.
My HBase jar is version 0.94.6-cdh4.5.0.
All of my Hadoop jars are version 2.0.0-cdh4.5.0.
Would somebody please tell me how to write to a SequenceFile from M/R in my situation?