Environment variables set in bootstrap do not take effect in AWS EMR
04-07-2023
Question
I am setting environment variables in my bootstrap script:
export HADOOP_HOME=/home/hadoop
export HADOOP_CMD=/home/hadoop/bin/hadoop
export HADOOP_STREAMING=/home/hadoop/contrib/streaming/hadoop_streaming.jar
export JAVA_HOME=/usr/lib64/jvm/java-7-oracle/
This is followed by use of one of the variables defined above:
$HADOOP_CMD fs -mkdir /home/hadoop/contents
$HADOOP_CMD fs -put /home/hadoop/contents/* /home/hadoop/contents/
The execution fails with this error message:
/mnt/var/lib/bootstrap-actions/2/cycle0_unix.sh: line 3: fs: command not found
/mnt/var/lib/bootstrap-actions/2/cycle0_unix.sh: line 4: fs: command not found
cycle0_unix.sh is the name of my bootstrap script.
Any comments as to what is happening here?
Solution
I found a proper solution to my problem. My attempts to copy data files from S3 to EMR using hadoop fs commands were futile. I have just learned about the S3DistCp command available in EMR for file transfer, so I am skipping the $HADOOP_CMD approach. For those who care how S3DistCp works, see the AWS EMR docs. I still do not understand why the bootstrap script will not accept an environment variable in subsequent statements.
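For reference, a minimal sketch of copying files with S3DistCp as an EMR step; the cluster ID, bucket name, and paths are hypothetical, and this assumes an EMR release that ships command-runner.jar:
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=CUSTOM_JAR,Name=S3DistCpStep,Jar=command-runner.jar,Args=[s3-dist-cp,--src=s3://my-bucket/contents/,--dest=hdfs:///home/hadoop/contents/]'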
OTHER TIPS
To get back to the topic of the question: the error indicates that $HADOOP_CMD is empty when those lines run, so the shell tries to execute fs itself as a command. It seems that environment variables can't be set from arbitrary bootstrap code; they can only be set or updated from a script that must be named
hadoop-user-env.sh
More details here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hadoop-config_hadoop-user-env.sh.html
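A minimal sketch of such a script, reusing the exports from the question (the linked page describes the exact mechanism; the file name is the essential part):
#!/bin/bash
# hadoop-user-env.sh - per the AWS doc above, EMR picks up Hadoop user
# environment variables from a bootstrap script with this exact name
export HADOOP_HOME=/home/hadoop
export HADOOP_CMD=/home/hadoop/bin/hadoop
export HADOOP_STREAMING=/home/hadoop/contrib/streaming/hadoop_streaming.jar
export JAVA_HOME=/usr/lib64/jvm/java-7-oracle/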
I think you don't need the environment variable. Just change
$HADOOP_CMD fs
to
hadoop fs
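Applied to the question's script, that would look something like the following, assuming the hadoop binary is on PATH (otherwise use its full path, e.g. /home/hadoop/bin/hadoop):
hadoop fs -mkdir /home/hadoop/contents
hadoop fs -put /home/hadoop/contents/* /home/hadoop/contents/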
You can configure such Spark-specific (and other) environment variables with configuration classifications; see https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
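A sketch of what that could look like for the variables in the question; the classification names should be checked against the page above:
[
  {
    "Classification": "hadoop-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "HADOOP_HOME": "/home/hadoop",
          "JAVA_HOME": "/usr/lib64/jvm/java-7-oracle"
        }
      }
    ]
  }
]
Saved as, say, env.json, this could be supplied at cluster creation with aws emr create-cluster ... --configurations file://env.json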
Another (rather dirty) option is to enrich .bashrc with some export FOO=bar lines in the bootstrap action.
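A sketch of that approach as a bootstrap action; the variable is taken from the question, and the file path assumes the default hadoop user:
#!/bin/bash
# Append the export so later login shells pick it up; note this only helps
# processes that actually source /home/hadoop/.bashrc
echo 'export HADOOP_CMD=/home/hadoop/bin/hadoop' >> /home/hadoop/.bashrc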