I'm trying to use the CombineFileInputFormat class with Yelp's MrJob tool on EMR. The job flow is created using Hadoop Streaming, and MrJob's documentation indicates that the CombineFileInputFormat class must be bundled in a customized hadoop-streaming.jar.
For context, please follow this question.
Specifically, my question is: where should the concrete class CombinedInputFormat.class be bundled or referenced within hadoop-streaming.jar?
I have tried bundling CombinedInputFormat.class by adding it to a directory org/apache/hadoop/streaming and executing:
jar uvf my-hadoop-streaming.jar org/apache/hadoop/streaming
If I do that, the streaming job flow starts with the option -inputformat CombinedInputFormat, but the job breaks during the first step with this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/streaming/CombinedInputFormat (wrong name: CombinedInputFormat)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
...
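My reading of the "wrong name" part, which may be off, is that the name recorded inside the class file does not match the jar path under which it was found. The sketch below only illustrates how a fully qualified class name maps to a jar entry path. JarEntryPath and entryPathFor are hypothetical names I made up for illustration; they are not part of Hadoop or MrJob.

```java
public class JarEntryPath {

    // Hypothetical helper: turn a fully qualified class name into the
    // entry path a class loader would look for inside a jar.
    public static String entryPathFor(String fullyQualifiedName) {
        return fullyQualifiedName.replace('.', '/') + ".class";
    }

    public static void main(String[] args) {
        // A class declared in package org.apache.hadoop.streaming
        System.out.println(entryPathFor("org.apache.hadoop.streaming.CombinedInputFormat"));
        // A class compiled without any package declaration
        System.out.println(entryPathFor("CombinedInputFormat"));
    }
}
```

If that reading is right, a class compiled without a package declaration would only match the root entry CombinedInputFormat.class, while a copy placed under org/apache/hadoop/streaming would need a matching package declaration in its source.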
If I instead add the class at the root of the jar with:
jar uvf my-hadoop-streaming.jar CombinedInputFormat.class
The error I get is:
-inputformat : class not found : CombinedInputFormat
Streaming Job Failed!
How should I bundle CombinedInputFormat.class so that it is picked up correctly and the NoClassDefFoundError is resolved?