Managing dependencies with Hadoop Streaming?
30-09-2019
Question
I have a quick Hadoop Streaming question. If I'm using Python streaming and my mappers/reducers require Python packages that aren't installed by default, do I need to install those on all the Hadoop machines as well, or is there some sort of serialization that sends them to the remote machines?
Solution
If they're not installed on your task boxes, you can send them with -file. If you need a package or other directory structure, you can send a zipfile, which will be unpacked for you. Here's a Hadoop 0.17 invocation:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.17.0-streaming.jar \
    -mapper mapper.py \
    -reducer reducer.py \
    -input input/foo \
    -output output \
    -file /tmp/foo.py \
    -file /tmp/lib.zip
However, see this issue for a caveat:
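Separately from that caveat, here is a rough sketch of how mapper.py might pick up modules from the shipped lib.zip. It rests on assumptions not stated in the answer: that the zip ends up in the task's working directory under its original name, and that it contains only pure-Python modules. Python can import directly from a zip archive placed on sys.path, so this approach should not depend on whether the archive gets unpacked. The module name shipped_textutils, its normalize function, and the helper names below are hypothetical and exist only for illustration.

```python
import os
import sys
import tempfile
import zipfile


def add_zip_to_path(zip_name="lib.zip", search_dir=None):
    """Put a shipped zip archive on sys.path so its modules can be imported."""
    # Assumption: the archive sits in the task's working directory.
    zip_path = os.path.join(search_dir or os.getcwd(), zip_name)
    if not os.path.isfile(zip_path):
        raise IOError("expected to find %s" % zip_path)
    if zip_path not in sys.path:
        sys.path.insert(0, zip_path)
    return zip_path


def build_demo_zip(directory):
    """Create a stand-in lib.zip holding the hypothetical shipped_textutils module."""
    zip_path = os.path.join(directory, "lib.zip")
    source = "def normalize(word):\n    return word.strip().lower()\n"
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("shipped_textutils.py", source)
    return zip_path


def map_lines(lines):
    """Yield tab-separated (word, 1) records, one per word in the input lines."""
    import shipped_textutils  # resolved from lib.zip through sys.path
    for line in lines:
        for word in line.split():
            yield "%s\t1" % shipped_textutils.normalize(word)


if __name__ == "__main__":
    # Local demo only: on a cluster, lib.zip would have been sent with -file
    # and the mapper would read its records from sys.stdin instead.
    demo_dir = tempfile.mkdtemp()
    build_demo_zip(demo_dir)
    add_zip_to_path(search_dir=demo_dir)
    for record in map_lines(["Hello World"]):
        print(record)
```

In a real mapper, the call to add_zip_to_path() would go at the top of the script, before any import that relies on the archive, and map_lines(sys.stdin) would replace the demo input. Packages with compiled extensions generally cannot be imported from a zip this way and would still need to be installed on the task machines.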
OTHER TIPS
If you use Dumbo, you can use -libegg to distribute egg files and auto-configure the Python runtime:
https://github.com/klbostee/dumbo/wiki/Short-tutorial#wiki-eggs_and_jars
https://github.com/klbostee/dumbo/wiki/Configuration-files