Question

The current version of hadoop-streaming requires a Java class for the combiner, but I read somewhere that we can use a hack like the following:

hadoop jar ./contrib/streaming/hadoop-0.20.2-streaming.jar  -input /testinput -output /testoutput -mapper "python /code/triples-mapper.py | sort | python /code/triples-reducer.py" -reducer /code/triples-reducer.py 

However, this does not seem to work. What am I doing wrong?

Solution

I suspect that your problem lies somewhere in the following source:

http://svn.apache.org/viewvc/hadoop/common/tags/release-0.20.2/src/contrib/streaming/src/java/org/apache/hadoop/streaming/PipeMapRed.java?view=markup

The splitArgs() method (line 69) tokenizes the command you passed:

python /code/triples-mapper.py | sort | python /code/triples-reducer.py

into a program to run (the first token, python; see lines 131/132) and a set of arguments to pass to it. All the tokens are then handed to ProcessBuilder (line 164).
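To illustrate the effect, here is a rough Python analogue. It is only a sketch: shlex.split stands in for Hadoop's splitArgs(), and subprocess.run with an argument list stands in for ProcessBuilder, so the details are assumptions rather than the actual streaming code. The point is that when no shell is involved, "|" is just another argument.

```python
import shlex
import subprocess
import sys

command = "python /code/triples-mapper.py | sort | python /code/triples-reducer.py"

# Whitespace tokenization, standing in for splitArgs() (an approximation).
tokens = shlex.split(command)
program, arguments = tokens[0], tokens[1:]

# Launch a child from an argument list, with no shell in between.
# The child is a stand-in for the mapper and simply echoes what it was given.
child = [sys.executable, "-c", "import sys; print(sys.argv[1:])"] + arguments
result = subprocess.run(child, capture_output=True, text=True, check=True)
received = result.stdout.strip()

print("program:  ", program)
print("arguments:", arguments)
print("child saw:", received)
```

The child reports the "|" and "sort" tokens among its arguments, which is what the mapper would see as well if the tokens reach it unmodified.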

Java API for ProcessBuilder

So your pipes are not being interpreted by a shell. Instead, they are passed as literal arguments to your mapper (you should be able to confirm this by dumping the args received inside your mapper code).
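A minimal way to dump the args might look like the following. The helper name dump_args is made up for illustration, and this assumes the mapper's stderr ends up in the task logs, which is where I would look for the output.

```python
import sys


def dump_args(argv, stream):
    """Write each received argument on its own line, using repr() so
    that tokens such as '|' are easy to spot."""
    for index, arg in enumerate(argv):
        stream.write("argv[%d] = %r\n" % (index, arg))


# Call this before the mapper's normal processing begins.
dump_args(sys.argv, sys.stderr)
```

If the pipes are being passed through as arguments, you should see lines such as argv[1] = '|' in the output.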

This all applies to 0.20.2, so it may have been 'fixed' to suit your purposes in a later version of Hadoop.

Licensed under: CC-BY-SA with attribution