Question

The map portion of my MapReduce job depends on NumPy, so I need NumPy installed as part of the bootstrap actions.

What I'm thinking of doing is building a custom NumPy package, storing it on S3, and fetching and installing it during the bootstrap actions.

Is there a better way?


Solution

NumPy now comes pre-installed on Amazon Elastic MapReduce instances. If you need other modules, you can zip them up, distribute the archive to your workers with the DistributedCache (via the streaming "-cacheFile" option), and then import them through Python's built-in "zipimport" mechanism. Note that this works for pure-Python modules; C extension modules cannot be loaded directly from a zip archive.
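As a minimal local sketch of the zipimport side: once a zip archive is on `sys.path`, Python's import machinery loads pure-Python modules straight from it. The module name `mymodule` and the archive name `deps.zip` below are placeholders; on EMR the archive would arrive via "-cacheFile" rather than being built in the task.

```python
import os
import sys
import tempfile
import zipfile

# Build a small zip containing a stand-in module (on EMR you would
# ship a pre-built archive to the workers with -cacheFile instead).
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "deps.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("mymodule.py", "def greet():\n    return 'hello from zip'\n")

# Putting the archive on sys.path activates the built-in zipimport
# machinery, so the module can be imported as usual.
sys.path.insert(0, zip_path)
import mymodule

print(mymodule.greet())  # -> hello from zip
```

In a streaming mapper you would do the `sys.path.insert` at the top of the script, pointing at the cached archive's local name.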

See: http://www.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow