Question

I've figured out how to install Python packages (NumPy and the like) at the bootstrapping step using boto, as well as how to copy files from S3 to my EC2 instances, also with boto.

What I haven't figured out is how to distribute Python scripts (or any other file) from S3 buckets to each EMR instance using boto. Any pointers?


Solution

If you are using boto, I recommend packaging all your Python files in an archive (.tar.gz format) and then using the cacheArchive directive in Hadoop/EMR to access it.

This is what I do:

  1. Put all necessary Python files in a sub-directory, say, "required/" and test it locally.
  2. Create an archive of this: cd required && tar czvf required.tgz *
  3. Upload this archive to S3: s3cmd put required.tgz s3://yourBucket/required.tgz (a boto-based alternative to steps 2 and 3 is sketched after this list)
  4. Add this command-line option to your steps: -cacheArchive s3://yourBucket/required.tgz#required
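
If you prefer to script steps 2 and 3 rather than shelling out to tar and s3cmd, here is a minimal sketch using the standard-library tarfile module and boto's S3 API. The bucket and file names are the same placeholders as above, and boto is assumed to already have credentials configured:

import tarfile
import boto

# Build the archive so its contents sit at the archive root,
# mirroring the "cd required && tar czvf required.tgz *" command above.
with tarfile.open('required.tgz', 'w:gz') as tar:
    tar.add('required', arcname='.')

# Upload with boto instead of s3cmd.
s3 = boto.connect_s3()
bucket = s3.get_bucket('yourBucket')
key = bucket.new_key('required.tgz')
key.set_contents_from_filename('required.tgz')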

The last step ensures that the archive containing your Python code is unpacked on each node with the same directory layout as on your local dev machine; the #required suffix is the name of the directory it gets extracted into.

To actually do step #4 in boto, here is the code:

from boto.emr.step import StreamingStep

step = StreamingStep(
    name=jobName,
    mapper='...',
    reducer='...',
    ...
    # Extract required.tgz into a directory named "required" in each
    # task's working directory.
    cache_archives=["s3://yourBucket/required.tgz#required"],
)
conn.add_jobflow_steps(jobID, [step])
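
The snippet assumes conn (a boto EMR connection) and jobID (an existing job flow) are already set up. A minimal sketch of how they might be created, assuming boto 2's EMR module, with the region, log URI and instance count as illustrative placeholders rather than values from the original answer:

import boto.emr

conn = boto.emr.connect_to_region('us-east-1')
jobID = conn.run_jobflow(
    name=jobName,
    log_uri='s3://yourBucket/logs/',
    num_instances=3,
    keep_alive=True,   # keep the cluster up so steps can be added later
)

With these in place, the add_jobflow_steps call above submits the step to the running cluster.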

And for the imported Python code to work properly in your mapper, make sure to reference it as you would a sub-directory:

import sys

# The cacheArchive directive extracts the archive into ./required
sys.path.append('./required')
import myCustomPythonClass

# Mapper: do something!
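
For context, a complete streaming mapper wraps this in the usual stdin/stdout loop. The process() call below is a hypothetical stand-in for whatever your packaged module actually exposes:

#!/usr/bin/env python
import sys

sys.path.append('./required')   # directory the cacheArchive was extracted to
import myCustomPythonClass

for line in sys.stdin:
    # process() is a hypothetical function; replace with your module's real API
    key, value = myCustomPythonClass.process(line.rstrip('\n'))
    sys.stdout.write('%s\t%s\n' % (key, value))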