Question

I want to run a Hadoop job on a Wikipedia history-dump XML file with Rubydoop. So far I have managed to load Cloud⁹'s XMLInputFormat Java class and wrap it in a Ruby class:

module Cloud9
  require 'java'

  require File.expand_path('../../cloud9-1.5.0.jar', __FILE__)
  require File.expand_path('../../hadoop-core-1.2.1.jar', __FILE__)
  require File.expand_path('../../commons-logging-1.1.1.jar', __FILE__)

  java_import 'edu.umd.cloud9.collection.XMLInputFormat'
end

module Wikipedia
  class XmlInputFormat < ::Cloud9::XMLInputFormat

  end
end

and added XmlInputFormat to the job block of the Rubydoop configuration:

input input_path, format: Wikipedia::XmlInputFormat

When I run the job, I get the following error once the input has started being split on the <page> and </page> tags:

java.lang.Exception: java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)

Caused by: java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
at edu.umd.cloud9.collection.XMLInputFormat$XMLRecordReader.initialize(XMLInputFormat.java:102)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:521)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

I'm running Hadoop 2.1.2 locally with the cloud9-1.5.0.jar and Rubydoop 1.1.0.

So the question is: is this caused by incompatible Hadoop versions (old vs. new Hadoop API) between Cloud⁹, Rubydoop, and my local installation? How can it be fixed?


Solution

It is an incompatibility between Hadoop 1.2.1 and version 1.5.0 of the Cloud⁹ library: in newer Hadoop versions (2.x), TaskAttemptContext is an interface rather than a class, and cloud9-1.5.0 expects the interface.

It works for me now with the cloud9-1.4.0.jar.
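
To make the reasoning explicit, here is a minimal plain-Ruby sketch of how the IncompatibleClassChangeError message can be read. The helper `explain_class_change` and the constant `ICCE_PATTERN` are hypothetical names invented for this illustration; they are not part of Rubydoop, Cloud⁹, or Hadoop, and the pattern only assumes the message wording shown in the stack trace above.

```ruby
# Hypothetical helper, not part of any library: it only parses the text of a
# java.lang.IncompatibleClassChangeError message like the one in the question.
ICCE_PATTERN = /Found (class|interface) ([\w.$]+), but (class|interface) was expected/

def explain_class_change(message)
  match = ICCE_PATTERN.match(message)
  return nil unless match

  {
    type: match[2],                    # the Java type the JVM complained about
    on_classpath: match[1].to_sym,     # what the JVM actually loaded at runtime
    compiled_against: match[3].to_sym  # what the calling code was built to expect
  }
end

message = 'Found class org.apache.hadoop.mapreduce.TaskAttemptContext, ' \
          'but interface was expected'
info = explain_class_change(message)
puts "#{info[:type]} loaded as: #{info[:on_classpath]}, " \
     "expected by caller: #{info[:compiled_against]}"
```

Read this way, the message says that the TaskAttemptContext found on the classpath at runtime is a class (consistent with the hadoop-core-1.2.1.jar required in the question), while the caller, XMLInputFormat from cloud9-1.5.0, expects an interface.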

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow