Question

Problem Statement:

I am trying to run a MapReduce job on Amazon EMR using the Python mrjob library, and I am having trouble bootstrapping the nodes with the requisite libraries and packages.

Details:

my sample python mrjob code:

    import re
    from mrjob.job import MRJob
    from sentClassifier import sentClassify
    import nltk

    .. do something ..

Some third-party libraries, such as NLTK, need to be imported, and I am also importing some of my own local modules (for example, from sentClassifier import sentClassify).

I would like to know what's the best way to bootstrap the EMR nodes so that these methods and packages are available. The code works fine on my local machine.

my sample mrjob.conf file:

    runners:
      emr:
        aws_access_key_id: ***
        aws_secret_access_key: ***
        ec2_core_instance_type: m1.large
        ec2_key_pair: mykey
        ec2_key_pair_file: mykey.pem
        num_ec2_core_instances: 5
        pool_wait_minutes: 2
        pool_emr_job_flows: true
        ssh_tunnel_is_open: true
        ssh_tunnel_to_job_tracker: true
      hadoop:
        setup:
          - virtualenv venv
          - . venv/bin/activate
          - pip install mr3po simplejson
          - sudo easy_install https://code.google.com/p/nltk/downloads/detail?name=nltk-2.0b9-py2.6.egg&can=2&q=

But the job fails.

I have read through several references and tried their various approaches, but still no luck.

Error Log:

    Scanning SSH logs for probable cause of failure
    Probable cause of failure (from ssh://ec2-54-86-50-115.compute-1.amazonaws.com!172.31.19.60/mnt/var/log/hadoop/userlogs/job_201405030101_0006/attempt_201405030101_0006_m_000002_3/stderr):
    Traceback (most recent call last):
      File "obidroidMR.py", line 5, in <module>
        import nltk
    ImportError: No module named nltk
    (while reading from s3://mrjob-51b9493c1a467671/tmp/obidroidMR.shreyas.20140503.012933.336228/files/STDIN)
    Attempting to terminate job...
    Job appears to have already been terminated
    Killing our SSH tunnel (pid 12909)
    Traceback (most recent call last):
      File "obidroidMR.py", line 107, in <module>
        ObidroidReview.run()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/job.py", line 494, in run
        mr_job.execute()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/job.py", line 512, in execute
        super(MRJob, self).execute()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/launch.py", line 147, in execute
        self.run_job()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/launch.py", line 208, in run_job
        runner.run()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/runner.py", line 458, in run
        self._run()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/emr.py", line 809, in _run
        self._wait_for_job_to_complete()
      File "/Users/shreyas/anaconda/envs/obidroid/lib/python2.7/site-packages/mrjob/emr.py", line 1599, in _wait_for_job_to_complete
        raise Exception(msg)
    Exception: Job on job flow j-2R8G1Q3RIE9ED failed with status WAITING: Waiting after step failed
    Probable cause of failure (from ssh://ec2-54-86-50-115.compute-1.amazonaws.com!172.31.19.60/mnt/var/log/hadoop/userlogs/job_201405030101_0006/attempt_201405030101_0006_m_000002_3/stderr):
    Traceback (most recent call last):
      File "obidroidMR.py", line 5, in <module>
        import nltk
    ImportError: No module named nltk

Any help would be really appreciated

Solution

In your mrjob.conf, the lines that install the packages are not where they should be. Options that should apply to a job running on EMR belong under emr:, not hadoop: (which is the config for running a job on your own Hadoop installation).

If it is a simple Linux command like pip or apt-get, then you should be able to install the packages like this:

    runners:
      emr:
        aws_access_key_id: ***
        # ... all the other stuff ...
        bootstrap_cmds:
          - sudo apt-get install -y python-boto
          - sudo pip install simplejson

I have never tried to install NLTK specifically, so I cannot help you there, but you should be able to install it along these lines.

For a potentially more complicated installation, I would recommend SSHing into your master node with the EMR CLI:

    $ ./elastic-mapreduce -j JOB_FLOW_ID --ssh

and actually try installing the package there. Once you find a sequence of shell commands that installs the package successfully, you can copy them into the bootstrap_cmds list in your mrjob.conf.
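Before copying the commands over, it can help to confirm on the node that the imports your job needs actually succeed. A minimal sketch of such a check follows; the helper name missing_modules is made up for illustration and is not part of mrjob, and the module names you pass in should be the ones your own job imports:

```python
def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            # __import__ works on both Python 2 and Python 3.
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing
```

On the node, calling missing_modules(["nltk", "simplejson"]) should return an empty list once the install has worked. It is probably worth running this with the same Python interpreter that the job will use, since a package installed into a different interpreter or virtualenv would still be missing at job time.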

OTHER TIPS

Given that Amazon Elastic MapReduce uses AMIs based on Amazon Linux, I verified that I can install nltk on Amazon Linux AMI 2014.03.1 - ami-fb8e9292 (64-bit) using the following:

    sudo easy_install -U pip
    sudo easy_install -U distribute
    sudo pip install -U pyyaml nltk

You might try incorporating those three lines into your mrjob.conf.

Licensed under: CC-BY-SA with attribution