Problem

I'm sending code to Amazon's EMR via the mrjob/boto modules. I have some external Python dependencies (e.g. numpy, boto) and currently have to download the source of each package and send it over as a tarball in the "python_archives" field of the mrjob.conf file.

This makes dependency management messier than I would like, and I'm wondering whether I can use the same requirements.txt file I use for my virtualenv setup to bootstrap the EMR instance with my dependencies. Is it possible to set up virtualenvs on EMR instances and do something like:

pip install -r requirements.txt

as I would locally?

Solution

One way to accomplish this is with a bootstrap action, which lets you run shell scripts on the EMR nodes when they start up.

For example, a Python setup script can turn your requirements.txt into such a shell script:

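# Generate pip.sh, a shell script that installs each entry of requirements.txt.
with open("requirements.txt") as requirements, open("pip.sh", "w") as shell_script:
    shell_script.write("sudo apt-get install -y python-pip\n")
    for line in requirements:
        package = line.strip()
        # Skip blank lines and comments; always end each command with a newline.
        if package and not package.startswith("#"):
            shell_script.write("sudo pip install -I " + package + "\n")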

You can then run the generated pip.sh as the bootstrap action, without needing to upload your requirements.txt.
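
One thing to watch for with this approach: version specifiers such as numpy>=1.6 contain characters the shell treats as redirection. Below is a minimal sketch of a variant that quotes each requirement. The function name build_bootstrap_script and the #!/bin/bash first line are assumptions for illustration, not something EMR or mrjob provides, and the sketch assumes every line is a plain requirement specifier (no -r or -e options).

```python
import shlex


def build_bootstrap_script(requirement_lines):
    """Return the text of a bootstrap shell script (hypothetical helper)."""
    # The shebang is an assumption about how the script will be invoked.
    commands = ["#!/bin/bash", "sudo apt-get install -y python-pip"]
    for line in requirement_lines:
        requirement = line.strip()
        # Skip blank lines and comments, which pip would also ignore.
        if not requirement or requirement.startswith("#"):
            continue
        # Quote so that characters such as ">" are not treated as redirection.
        commands.append("sudo pip install -I " + shlex.quote(requirement))
    return "\n".join(commands) + "\n"
```

Passing an open requirements.txt file object (or any list of lines) to the function and writing the returned text to pip.sh gives the same kind of script as the generator above.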

Other tips

If you're using mrjob, I've had some success putting the pip calls straight into my .mrjob.conf file as bootstrap commands. It's not as elegant as using a requirements.txt file, and it installs the same modules for all of your jobs. For example, my conf file looks like:

runners:
  emr:
    aws_access_key_id: xx
    aws_secret_access_key: xx
    ec2_key_pair: xx
    ec2_key_pair_file: xx
    ssh_tunnel_to_job_tracker: true
    bootstrap_cmds:
      - sudo apt-get install -y python-pip
      - sudo pip install pgnparser
      - sudo pip install boto

and that installs the pgnparser and boto modules for me to use in my mrjob scripts.
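
If you would rather not maintain that list by hand, the entries can be generated from requirements.txt and pasted into the conf file. This is only a rough sketch: bootstrap_cmds_yaml is a hypothetical helper, the default indent is assumed to match the conf file above, and each line is assumed to be a plain requirement specifier.

```python
import shlex


def bootstrap_cmds_yaml(requirement_lines, indent="      "):
    """Format pip installs as YAML list items for bootstrap_cmds (hypothetical helper)."""
    items = [indent + "- sudo apt-get install -y python-pip"]
    for line in requirement_lines:
        requirement = line.strip()
        # Skip blank lines and comments.
        if requirement and not requirement.startswith("#"):
            # Quote so shell metacharacters in version specifiers stay literal.
            items.append(indent + "- sudo pip install " + shlex.quote(requirement))
    return "\n".join(items)


# Example with the two modules from the conf file above.
print(bootstrap_cmds_yaml(["pgnparser\n", "boto\n"]))
```

The printed lines go under the bootstrap_cmds: key, in place of the hand-written entries.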

License: CC-BY-SA with attribution
Not affiliated with StackOverflow