I found a proper solution to my problem. My attempts to copy data files from S3 to EMR using hadoop fs commands were futile. I have since learned about the S3DistCp command available in EMR for file transfer, so I am abandoning the $HADOOP_CMD method. For those who care how S3DistCp works: Link to AWS EMR Docs. I still do not understand why the bootstrap script will not accept an environment variable in subsequent statements.
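For reference, a minimal S3DistCp invocation might look like the sketch below. The bucket name and paths are placeholders, not from my actual job, and the command only exists on EMR cluster nodes:

```shell
# Copy files from S3 into HDFS using S3DistCp (preinstalled on EMR nodes).
# s3://my-bucket/input/ and the HDFS destination are illustrative placeholders.
s3-dist-cp \
  --src s3://my-bucket/input/ \
  --dest hdfs:///home/hadoop/contents/
```

It can also be submitted as an EMR step rather than run from a shell on the master node.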
Environment variables set in bootstrap do not take effect in AWS EMR
04-07-2023
Problem
I am setting environment variables in my bootstrap code:
export HADOOP_HOME=/home/hadoop
export HADOOP_CMD=/home/hadoop/bin/hadoop
export HADOOP_STREAMING=/home/hadoop/contrib/streaming/hadoop_streaming.jar
export JAVA_HOME=/usr/lib64/jvm/java-7-oracle/
Later in the script, I use some of the variables defined above:
$HADOOP_CMD fs -mkdir /home/hadoop/contents
$HADOOP_CMD fs -put /home/hadoop/contents/* /home/hadoop/contents/
The execution fails with the error message -
/mnt/var/lib/bootstrap-actions/2/cycle0_unix.sh: line 3: fs: command not found
/mnt/var/lib/bootstrap-actions/2/cycle0_unix.sh: line 4: fs: command not found
cycle0.sh is the name of my bootstrap script.
Any comments as to what is happening here?
Solution
Other tips
To get back to the topic of the question: it seems that environment variables cannot be set from arbitrary bootstrap code; they can only be set or updated from a script that must be named
hadoop-user-env.sh
More details here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hadoop-config_hadoop-user-env.sh.html
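A minimal sketch of such a bootstrap action, reusing the variable names and paths from the question (the conf directory follows the old AMI-era layout described in the linked docs, so this only applies on an EMR node):

```shell
# Append exports to hadoop-user-env.sh so that Hadoop picks them up;
# /home/hadoop/conf is the AMI-era config directory from the linked docs.
cat >> /home/hadoop/conf/hadoop-user-env.sh <<'EOF'
export HADOOP_CMD=/home/hadoop/bin/hadoop
export HADOOP_STREAMING=/home/hadoop/contrib/streaming/hadoop_streaming.jar
EOF
```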
I think you don't need the environment variable at all. Just change
fs
to
hadoop fs
since the hadoop binary is already on the PATH on EMR nodes, e.g. hadoop fs -mkdir /home/hadoop/contents.
You configure such Spark-specific (and other) environment variables with configuration classifications; see https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
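As a sketch, a classification like the following could be passed at cluster creation; the cluster name, release label, instance settings, and the FOO=bar variable are all placeholders, not values from the question:

```shell
# Set a Spark environment variable via the spark-env classification
# (nested "export" classification) when creating the cluster.
# All names and values below are illustrative placeholders.
aws emr create-cluster \
  --name "my-cluster" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles \
  --configurations '[
    {
      "Classification": "spark-env",
      "Configurations": [
        {
          "Classification": "export",
          "Properties": { "FOO": "bar" }
        }
      ]
    }
  ]'
```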
Another (rather dirty) option is to enrich .bashrc with some export FOO=bar in the bootstrap action.
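A one-line sketch of that approach, using the placeholder FOO=bar from above; the export is appended during the bootstrap action and takes effect in later login shells on the node:

```shell
# Dirty-but-simple: append the export to .bashrc in the bootstrap action
# so later login shells inherit it. FOO=bar is a placeholder.
echo 'export FOO=bar' >> "$HOME/.bashrc"
```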