Question

I really want to take advantage of Python UDFs in Pig on our AWS Elastic MapReduce cluster, but I can't quite get things to work properly. No matter what I try, my Pig job fails with the following exception being logged:

ERROR 2998: Unhandled internal error. org/python/core/PyException

java.lang.NoClassDefFoundError: org/python/core/PyException
        at org.apache.pig.scripting.jython.JythonScriptEngine.registerFunctions(JythonScriptEngine.java:127)
        at org.apache.pig.PigServer.registerCode(PigServer.java:568)
        at org.apache.pig.tools.grunt.GruntParser.processRegister(GruntParser.java:421)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:419)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
        at org.apache.pig.Main.run(Main.java:437)
        at org.apache.pig.Main.main(Main.java:111)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.ClassNotFoundException: org.python.core.PyException
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        ... 14 more

What do you need to do to use Python UDFs for Pig in Elastic MapReduce?


Solution 2

After quite a few wrong turns, I found that, at least on the Elastic MapReduce implementation of Hadoop, Pig seems to ignore the CLASSPATH environment variable. I found that I could instead control the class path using the HADOOP_CLASSPATH variable.

Once I made that realization, it was fairly easy to get things set up to use Python UDFs:

  • Install Jython
    • sudo apt-get install jython -y -qq
  • Set the HADOOP_CLASSPATH environment variable.
    • export HADOOP_CLASSPATH=/usr/share/java/jython.jar:/usr/share/maven-repo/org/antlr/antlr-runtime/3.2/antlr-runtime-3.2.jar
      • jython.jar ensures that Hadoop can find the PyException class
      • antlr-runtime-3.2.jar ensures that Hadoop can find the CharStream class
  • Create the cache directory for Jython (this is documented in the Jython FAQ)
    • sudo mkdir /usr/share/java/cachedir/
    • sudo chmod a+rw /usr/share/java/cachedir

I should point out that this seems to directly contradict other advice I found while searching for solutions to this problem:

  • Setting the CLASSPATH and PIG_CLASSPATH environment variables doesn't seem to do anything.
  • The .py file containing the UDF does not need to be included in the HADOOP_CLASSPATH environment variable.
  • The path to the .py file used in the Pig register statement may be relative or absolute; it doesn't seem to matter.
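
For reference, once the classpath issue is out of the way, the UDF itself is just Python executed by Jython. The following is a minimal sketch of such a file; the file name, function name, and schema are illustrative rather than taken from the setup above:

# example_udfs.py -- minimal sketch of a Jython UDF for Pig (illustrative)
# Pig's Jython script engine injects the outputSchema decorator when the file
# is registered with "using jython", so no import is needed here.

@outputSchema("upper_word:chararray")
def to_upper(word):
    # Return None for null input so Pig propagates nulls instead of failing.
    if word is None:
        return None
    return word.upper()

It would then be registered from the Pig script with register example_udfs.py using jython as myfuncs; the path can be relative or absolute, as noted above.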

Other tips

Hmm... to clarify some of what I just read here: at this point, using a Python UDF stored on s3 with Pig running on EMR is as simple as this line in your Pig script:

REGISTER 's3://path/to/bucket/udfs.py' using jython as mynamespace

That is, no classpath modifications are necessary. I'm using this in production right now, though with the caveat that I'm not pulling in any additional Python modules in my UDF. I think that may affect what you need to do to make it work.

I faced the same problem recently. Your answer can be simplified: you don't need to install Jython at all or create the cache directory, but you do need to include the Jython jar in an EMR bootstrap script (or do something similar). I wrote an EMR bootstrap script with the following lines. You can simplify this even further by not using s3cmd at all and instead using your job flow to place the files in a certain directory. Getting the UDF via s3cmd is definitely inconvenient; however, I was unable to register a UDF file on s3 when using the EMR version of Pig.

If you are using CharStream, you have to include that jar in the piglib path as well. Depending on the framework you use, you can pass these bootstrap scripts as options to your job; EMR supports this via its elastic-mapreduce ruby client. A simple option is to place the bootstrap scripts on s3.
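
For instance, with the classic elastic-mapreduce ruby client, both bootstrap scripts can be attached, in the order they should run, when the job flow is created; the bucket and script names below are placeholders:

elastic-mapreduce --create --alive --name "pig-with-jython" \
  --bootstrap-action s3://your-bucket/bootstrap-s3cmd.sh \
  --bootstrap-action s3://your-bucket/bootstrap-jython-jars.sh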

If you are using s3cmd in the bootstrap script, you need another bootstrap script that does something like this, and it should run before the other one in bootstrap order. I am moving away from s3cmd, but for my successful attempt, s3cmd did the trick. Also, the s3cmd executable is already installed in the Pig image for Amazon (e.g. AMI version 2.0 with Hadoop version 0.20.205).

Script #1 (Seeding s3cmd)

#!/bin/bash
# Write an s3cmd configuration so that later bootstrap steps can fetch files
# from s3; the keys and passphrase below are placeholders.
cat <<-OUTPUT > /home/hadoop/.s3cfg
[default]
access_key = YOUR KEY
bucket_location = US
cloudfront_host = cloudfront.amazonaws.com
cloudfront_resource = /2010-07-15/distribution
default_mime_type = binary/octet-stream
delete_removed = False
dry_run = False
encoding = UTF-8
encrypt = False
follow_symlinks = False
force = False
get_continue = False
gpg_command = /usr/local/bin/gpg
gpg_decrypt = %(gpg_command)s -d --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
gpg_encrypt = %(gpg_command)s -c --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
gpg_passphrase = YOUR PASSPHRASE
guess_mime_type = True
host_base = s3.amazonaws.com
host_bucket = %(bucket)s.s3.amazonaws.com
human_readable_sizes = False
list_md5 = False
log_target_prefix =
preserve_attrs = True
progress_meter = True
proxy_host =
proxy_port = 0
recursive = False
recv_chunk = 4096
reduced_redundancy = False
secret_key = YOUR SECRET
send_chunk = 4096
simpledb_host = sdb.amazonaws.com
skip_existing = False
socket_timeout = 10
urlencoding_mode = normal
use_https = False
verbosity = WARNING
OUTPUT

Script #2 (Seeding Jython jars)

#!/bin/bash
# Fetch the Jython jars and the UDF from s3 and stage them where Pig and
# Hadoop can find them; abort the bootstrap if any step fails.
set -e

s3cmd get <jython.jar>
# Very useful for extra libraries not available in the jython jar. I got these libraries from the 
# jython site and created a jar archive.
s3cmd get <jython_extra_libs.jar>
s3cmd get <UDF>

PIG_LIB_PATH=/home/hadoop/piglibs

mkdir -p $PIG_LIB_PATH

mv <jython.jar> $PIG_LIB_PATH
mv <jython_extra_libs.jar> $PIG_LIB_PATH
mv <UDF> $PIG_LIB_PATH

# Add the jars to the Hadoop classpath as well.
echo "HADOOP_CLASSPATH=$PIG_LIB_PATH/<jython.jar>:$PIG_LIB_PATH/<jython_extra_libs.jar>" >> /home/hadoop/conf/hadoop-user-env.sh

As of today, using Pig 0.9.1 on EMR, I found that the following is sufficient:

env HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/path/to/jython.jar pig -f script.pig

where script.pig registers the Python script, but not jython.jar:

register Pig-UDFs/udfs.py using jython as mynamespace;
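
Functions defined in udfs.py are then invoked through the registered namespace inside script.pig; for example, assuming udfs.py defines a function named normalize (the function name and the aliases here are hypothetical):

-- call the Jython UDF on each record
cleaned = FOREACH raw_data GENERATE mynamespace.normalize(line);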
Licensed under: CC-BY-SA with attribution