Question

I'm trying to use an input format from Elephant Bird in my Hadoop Streaming script. In particular, I want to use the LzoInputFormat and eventually the LzoJsonInputFormat (working with Twitter data here). But when I try to do this, I keep getting an error that suggests that the Elephant Bird formats are not valid instances of the InputFormat class.

This is how I'm running the Streaming command:

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u5.jar \
    -libjars /project/hanna/src/elephant-bird/build/elephant-bird-2.2.0.jar \
    -D stream.map.output.field.separator=\t \
    -D stream.num.map.output.key.fields=2 \
    -D map.output.key.field.separator=\t \
    -D mapred.text.key.partitioner.options=-k1,2 \
    -file /home/a/ahanna/sandbox/hadoop-textual-analysis/streaming/filter/filterMap.py \
    -file /home/a/ahanna/sandbox/hadoop-textual-analysis/streaming/filter/filterReduce.py \
    -file /home/a/ahanna/sandbox/hadoop-textual-analysis/streaming/data/latinKeywords.txt \
    -inputformat com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat \
    -input /user/ahanna/lzotest \
    -output /user/ahanna/output \
    -mapper filterMap.py \
    -reducer filterReduce.py \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

And this is the error that I get:

Exception in thread "main" java.lang.RuntimeException: class com.hadoop.mapreduce.LzoTextInputFormat not org.apache.hadoop.mapred.InputFormat
    at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1078)
    at org.apache.hadoop.mapred.JobConf.setInputFormat(JobConf.java:633)
    at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:707)
    at org.apache.hadoop.streaming.StreamJob.run(StreamJob.java:122)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:50)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:197)

Solution

For the sake of compatibility, Hadoop supports two ways of writing map/reduce tasks in Java: the "old" way, via the interfaces in the org.apache.hadoop.mapred package, and the "new" way, via the abstract classes in the org.apache.hadoop.mapreduce package.

You need to know this even if you're using the Streaming API, because Streaming itself is written against the old API. So when you want to plug an external library into the streaming machinery, you have to make sure that library was also written against the old API.
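
To see the mismatch concretely, here is a minimal, hypothetical check (it assumes the Hadoop and Elephant Bird jars are on the classpath; the class names are the one from your command and the old-API replacement suggested below). It reproduces the test that Configuration.setClass() performs when the streaming driver calls JobConf.setInputFormat(), which is exactly where your stack trace originates:

import org.apache.hadoop.mapred.InputFormat;

public class InputFormatApiCheck {
    public static void main(String[] args) throws ClassNotFoundException {
        // The class passed via -inputformat in your command: written against the new API
        Class<?> newApiClass =
            Class.forName("com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat");
        // The old-API variant that Elephant Bird also ships
        Class<?> oldApiClass =
            Class.forName("com.twitter.elephantbird.mapred.input.DeprecatedLzoTextInputFormat");

        // false: this class extends org.apache.hadoop.mapreduce.InputFormat, not the old
        // interface, so JobConf.setInputFormat() rejects it with the RuntimeException you saw
        System.out.println(InputFormat.class.isAssignableFrom(newApiClass));

        // true: this one is written against org.apache.hadoop.mapred.InputFormat,
        // which is what Streaming expects
        System.out.println(InputFormat.class.isAssignableFrom(oldApiClass));
    }
}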

That is exactly what happened in your case. In general you would have to write a wrapper, but fortunately Elephant Bird already provides an old-style InputFormat, so all you need to do is replace com.twitter.elephantbird.mapreduce.input.LzoTextInputFormat with com.twitter.elephantbird.mapred.input.DeprecatedLzoTextInputFormat.
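
Concretely, the only line that changes in your command is the -inputformat argument:

    -inputformat com.twitter.elephantbird.mapred.input.DeprecatedLzoTextInputFormat \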

OTHER TIPS

In Hadoop 2.4 I managed to get it to run with:

-D org.apache.hadoop.mapreduce.lib.input.FileInputFormat=your.package.path.FileInputFormat

instead of the standard -inputformat option.
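
For example (an untested sketch that simply drops the -D flag from this tip into a streaming invocation; the streaming jar path and the FileInputFormat class name are placeholders):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar \
    -D org.apache.hadoop.mapreduce.lib.input.FileInputFormat=com.example.MyFileInputFormat \
    -input /user/ahanna/lzotest \
    -output /user/ahanna/output \
    -mapper filterMap.py \
    -reducer filterReduce.py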
