Question

I've figured out how to install Python packages (NumPy and the like) at the bootstrapping step using boto, as well as how to copy files from S3 to my EC2 instances, also with boto.

What I haven't figured out is how to distribute Python scripts (or any other file) from S3 buckets to each EMR instance using boto. Any pointers?


Solution

If you are using boto, I recommend packaging all your Python files in an archive (.tar.gz format) and then using the cacheArchive directive in Hadoop/EMR to access it.

This is what I do:

  1. Put all necessary Python files in a sub-directory, say, "required/" and test it locally.
  2. Create an archive of this: cd required && tar czvf required.tgz *
  3. Upload this archive to S3: s3cmd put required.tgz s3://yourBucket/required.tgz (a boto-based alternative to steps 2 and 3 is sketched after this list)
  4. Add this command-line option to your steps: -cacheArchive s3://yourBucket/required.tgz#required
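
If you prefer to script steps 2 and 3 rather than shelling out to tar and s3cmd, here is a minimal sketch using the standard-library tarfile module and boto's S3 API. The bucket and file names are the same placeholders as above, and boto is assumed to already have credentials configured:

import tarfile
import boto

# Build the archive so its contents sit at the archive root,
# mirroring the "cd required && tar czvf required.tgz *" command above.
with tarfile.open('required.tgz', 'w:gz') as tar:
    tar.add('required', arcname='.')

# Upload with boto instead of s3cmd.
s3 = boto.connect_s3()
bucket = s3.get_bucket('yourBucket')
key = bucket.new_key('required.tgz')
key.set_contents_from_filename('required.tgz')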

The last step ensures that the archive containing your Python code is unpacked on each node with the same directory layout as on your local dev machine; the #required suffix is the name of the directory it gets extracted into.

To actually do step #4 in boto, here is the code:

from boto.emr.step import StreamingStep

step = StreamingStep(
    name=jobName,
    mapper='...',
    reducer='...',
    ...
    # Extract required.tgz into a directory named "required" in each
    # task's working directory.
    cache_archives=["s3://yourBucket/required.tgz#required"],
)
conn.add_jobflow_steps(jobID, [step])
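
The snippet assumes conn (a boto EMR connection) and jobID (an existing job flow) are already set up. A minimal sketch of how they might be created, assuming boto 2's EMR module, with the region, log URI and instance count as illustrative placeholders rather than values from the original answer:

import boto.emr

conn = boto.emr.connect_to_region('us-east-1')
jobID = conn.run_jobflow(
    name=jobName,
    log_uri='s3://yourBucket/logs/',
    num_instances=3,
    keep_alive=True,   # keep the cluster up so steps can be added later
)

With these in place, the add_jobflow_steps call above submits the step to the running cluster.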

And for the imported Python code to work properly in your mapper, make sure to reference it as you would a sub-directory:

import sys

# The cacheArchive directive extracts the archive into ./required
sys.path.append('./required')
import myCustomPythonClass

# Mapper: do something!
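
For context, a complete streaming mapper wraps this in the usual stdin/stdout loop. The process() call below is a hypothetical stand-in for whatever your packaged module actually exposes:

#!/usr/bin/env python
import sys

sys.path.append('./required')   # directory the cacheArchive was extracted to
import myCustomPythonClass

for line in sys.stdin:
    # process() is a hypothetical function; replace with your module's real API
    key, value = myCustomPythonClass.process(line.rstrip('\n'))
    sys.stdout.write('%s\t%s\n' % (key, value))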