Question

I'm trying to use the CombineFileInputFormat class with Yelp's MrJob tool on EMR. The jobflow is created using Hadoop Streaming, and MrJob's documentation indicates that the CombineFileInputFormat class must be bundled in a customized hadoop-streaming.jar.

For context, please follow this question.

Specifically my question is: Where should the concrete class CombinedInputFormat.class be bundled or referenced within the hadoop-streaming.jar?

I have tried bundling the CombinedInputFormat.class by adding it to a directory org/apache/hadoop/streaming and executing:

jar uvf my-hadoop-streaming.jar org/apache/hadoop/streaming

If I do that and run the streaming jobflow with the option -inputformat CombinedInputFormat, the job starts its first step and then breaks with this error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/streaming/CombinedInputFormat (wrong name: CombinedInputFormat)
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
        ...

If I just try to set it in the root path with:

jar uvf my-hadoop-streaming.jar CombinedInputFormat.class

The error I get is:

-inputformat : class not found : CombinedInputFormat
Streaming Job Failed!

How should I bundle CombinedInputFormat.class so that it is picked up correctly and the NoClassDefFoundError goes away?


Solution

The class CombinedInputFormat explained here extends CombineFileInputFormat and isn't shipped with Hadoop. So, in the same package where you have your mapper/reducer job class, you have to create a class containing the code given in the earlier question. Then build the jar and it should run normally.

So basically, you need to write your own implementation of CombineFileInputFormat (which I did for you), and you can name it anything you want, say ABCClass instead of CombinedInputFormat, as I had named it.
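
On the packaging side, the "wrong name" in the first error is the JVM's way of saying that the class file it found at org/apache/hadoop/streaming/ declares a different name (plain CombinedInputFormat, with no package) than the one it was asked to load. The path of a class inside a jar has to mirror the package declared in its source. As a rough illustration only (JarEntryPath and entryFor are names made up for this sketch, not part of Hadoop or MrJob), the mapping from class name to jar entry looks like this:

```java
public class JarEntryPath {

    // Hypothetical helper: returns the jar entry path at which a class
    // loader expects to find the class with the given fully qualified name.
    public static String entryFor(String fullyQualifiedName) {
        return fullyQualifiedName.replace('.', '/') + ".class";
    }

    public static void main(String[] args) {
        // A class declared in package org.apache.hadoop.streaming
        // belongs under the matching directory inside the jar.
        System.out.println(entryFor("org.apache.hadoop.streaming.CombinedInputFormat"));

        // A class with no package declaration belongs at the jar root.
        System.out.println(entryFor("CombinedInputFormat"));
    }
}
```

So if you keep the class file under org/apache/hadoop/streaming in the jar, the source should start with a matching package declaration, and the class should be recompiled before you add it.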

OTHER TIPS

This is another easy way I found to build a custom jar and run it in Hadoop locally or on EMR: http://www.applams.com/2014/05/using-custom-streaming-jar-using-custom.html

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow