Question

I want to be able to create EMR clusters, and for those clusters to send messages back to some central queue. In order for this to work, I need to have some sort of agent running on each master node. Each one of those agents will have to identify itself in this message so that the recipient knows which cluster the message is about.

Does the master node know its ID (j-*************)? If not, then is there some other piece of identifying information that could allow the message recipient to infer this ID?

I've taken a look through the config files in /home/hadoop/conf, and I haven't found anything useful. I found the ID in /mnt/var/log/instance-controller/instance-controller.log, but it looks like it'll be difficult to grep for. I'm wondering where instance-controller might get that ID from in the first place.


Solution

You can look at /mnt/var/lib/info/ on the master node to find a lot of info about your EMR cluster setup. More specifically, /mnt/var/lib/info/job-flow.json contains the jobFlowId (the cluster ID).

You can use the pre-installed JSON parser (jq) to get the job flow ID:

cat /mnt/var/lib/info/job-flow.json | jq -r ".jobFlowId"

(updated as per @Marboni)
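
Since the stated goal is for an agent on each master node to tag its messages with the cluster ID, here is a minimal sketch of that, assuming the central queue is an SQS queue; the queue URL and the "heartbeat" payload are hypothetical, and reading job-flow.json mirrors the jq one-liner above:

import json

import boto3

# Hypothetical queue URL: replace with your own central queue.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/emr-agent-events"

# Read the cluster ID the same way the jq one-liner above does.
with open("/mnt/var/lib/info/job-flow.json") as f:
    job_flow_id = json.load(f)["jobFlowId"]

# Tag every message with the cluster ID so the consumer knows which cluster sent it.
boto3.client("sqs").send_message(
    QueueUrl=QUEUE_URL,
    MessageBody=json.dumps({"clusterId": job_flow_id, "event": "heartbeat"}),
)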

OTHER TIPS

You can use the Amazon EC2 API to figure it out. The example below uses shell commands for simplicity; in real life you should use the appropriate API or SDK for these steps.

First you should find out your instance ID:

INSTANCE=$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)

Then you can use your instance ID to find out the cluster ID:

ec2-describe-instances $INSTANCE | grep TAG | grep aws:elasticmapreduce:job-flow-id
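
Note that ec2-describe-instances belongs to the legacy EC2 API tools. A rough boto3 equivalent of the same two steps, assuming the instance's IAM role is allowed to call ec2:DescribeTags, might look like this:

import urllib.request

import boto3

# Step 1: instance ID from the instance metadata service (same endpoint as the wget above).
instance_id = urllib.request.urlopen(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
).read().decode()

# Step 2: read the aws:elasticmapreduce:job-flow-id tag that EMR puts on its instances.
response = boto3.client("ec2").describe_tags(
    Filters=[
        {"Name": "resource-id", "Values": [instance_id]},
        {"Name": "key", "Values": ["aws:elasticmapreduce:job-flow-id"]},
    ]
)
tags = response["Tags"]
print(tags[0]["Value"] if tags else "no job-flow-id tag found")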

Hope this helps.

As specified above, the information is in the job-flow.json file. This file has several other attributes, so, knowing where it's located, you can extract the ID very easily:

cat /mnt/var/lib/info/job-flow.json | grep jobFlowId | cut -f2 -d: | cut -f2 -d'"'

Edit: This command works on core nodes as well.

Another option is to query the instance metadata service:

curl -s http://169.254.169.254/2016-09-02/user-data/ | sed -r 's/.*clusterId":"(j-[A-Z0-9]+)",.*/\1/g'
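
If you would rather do the same thing from Python, here is a sketch mirroring the curl/sed pipeline above; it queries the same endpoint and makes the same assumption that the user data contains a "clusterId" field:

import re
import urllib.request

# Pull the EMR user-data blob from the instance metadata service.
user_data = urllib.request.urlopen(
    "http://169.254.169.254/2016-09-02/user-data/", timeout=2
).read().decode()

# Extract the first j-XXXX cluster ID, as the sed expression does.
match = re.search(r'"clusterId"\s*:\s*"(j-[A-Z0-9]+)"', user_data)
print(match.group(1) if match else "clusterId not found")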

Apparently a Hadoop MapReduce job has no way to know which cluster it is running on; I was surprised to find this out myself.

BUT: you can use other identifiers to uniquely identify the mapper that is running and the job it belongs to.

These are exposed as environment variables passed to each mapper. If you are writing a Hadoop Streaming job in Python, the code would be:

import os

# Input file currently being processed by this mapper
if 'map_input_file' in os.environ:
    fileName = os.environ['map_input_file']
# Task ID; the last underscore-separated field is the task number
if 'mapred_tip_id' in os.environ:
    mapper_id = os.environ['mapred_tip_id'].split("_")[-1]
# ID of the MapReduce job this mapper belongs to
if 'mapred_job_id' in os.environ:
    jobID = os.environ['mapred_job_id']

That gives you the input file name, the task ID, and the job ID. Using one or a combination of those three values, you should be able to uniquely identify which mapper is running.

If you are looking for a specific job, "mapred_job_id" might be what you want.
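
For example, a minimal sketch that composes the three values into one identity string (the variable names and the "unknown" fallback are my own):

import os

# Compose one identifier from the three environment variables above;
# "unknown" is a fallback for when a variable is absent.
file_name = os.environ.get("map_input_file", "unknown")
task_id = os.environ.get("mapred_tip_id", "unknown").split("_")[-1]
job_id = os.environ.get("mapred_job_id", "unknown")

print("%s/%s/%s" % (job_id, task_id, file_name))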

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow