Question

I'm sending code to Amazon's EMR via the mrjob/boto modules. I have some external Python dependencies (e.g. numpy, boto) and currently have to download the source of each package and send it over as a tarball in the "python_archives" field of the mrjob.conf file.

This makes dependency management messier than I would like, and I'm wondering whether I can use the same requirements.txt file I use for my virtualenv setup to bootstrap the EMR instances with my dependencies. Is it possible to set up virtualenvs on EMR instances and do something like:

pip install -r requirements.txt

as I would locally?

Solution

One way to accomplish this is with a bootstrap action, which lets you run shell scripts on the cluster nodes as they start up.

If you have a Python setup script that does something like:

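# Generate pip.sh, a shell script that installs each requirement with pip.
with open("requirements.txt") as requirements, open("pip.sh", "w") as shell_script:
    shell_script.write("#!/bin/bash\n")
    # -y keeps apt-get from waiting on a confirmation prompt.
    shell_script.write("sudo apt-get install -y python-pip\n")
    for line in requirements:
        line = line.strip()
        # Skip blank lines and comments; always end each command with a newline.
        if line and not line.startswith("#"):
            shell_script.write("sudo pip install -I " + line + "\n")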

Then you can run the generated pip.sh as the bootstrap action, without needing to upload your requirements.txt.

OTHER TIPS

If you're using mrjob, I've had some success putting the pip calls straight into my .mrjob.conf file as bootstrap commands. It's not as elegant as using a requirements.txt file, and it installs the same modules for all your jobs. For example, my conf file looks like:

runners:
  emr:
    aws_access_key_id: xx
    aws_secret_access_key: xx
    ec2_key_pair: xx
    ec2_key_pair_file: xx
    ssh_tunnel_to_job_tracker: true
    bootstrap_cmds:
      - sudo apt-get install -y python-pip
      - sudo pip install pgnparser
      - sudo pip install boto

That installs the pgnparser and boto modules so I can use them in my mrjob scripts.
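
To get closer to the requirements.txt workflow the question asks about, the bootstrap_cmds entries could be generated from that file instead of being written by hand. This is only a minimal sketch: the helper name requirements_to_bootstrap_cmds is hypothetical and not part of mrjob, and it assumes the file holds plain requirement specifiers, one per line, with optional # comments (no pip options such as -r or -e).

```python
import shlex


def requirements_to_bootstrap_cmds(lines):
    """Turn requirements.txt lines into shell commands for bootstrap_cmds.

    Hypothetical helper, not part of mrjob. Assumes one plain requirement
    specifier per line; blank lines and '#' comment lines are skipped.
    """
    cmds = ["sudo apt-get install -y python-pip"]
    for line in lines:
        requirement = line.strip()
        if not requirement or requirement.startswith("#"):
            continue
        # Quote the specifier so that characters such as '>' in "numpy>=1.8"
        # are not interpreted by the shell as redirection.
        cmds.append("sudo pip install " + shlex.quote(requirement))
    return cmds


if __name__ == "__main__":
    # Sample input standing in for the contents of a requirements.txt file.
    sample = ["# job dependencies\n", "boto\n", "\n", "numpy==1.8.0\n"]
    for cmd in requirements_to_bootstrap_cmds(sample):
        # Indented to match the bootstrap_cmds list in the conf file above.
        print("      - " + cmd)
```

The printed lines can be pasted under bootstrap_cmds in the conf file, so requirements.txt stays the single source of truth for both the local virtualenv and the EMR nodes.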

Licensed under: CC-BY-SA with attribution