Does the Amazon EMR framework copy data from S3 before it is consumed in MapReduce jobs?

https://stackoverflow.com/questions/19373279

hadoop
amazon-s3
hdfs
amazon-emr

30-06-2022

Question

I have noticed a long wait between invoking an EMR job and the actual start of the MapReduce processing when the input files are located in S3. My question is: does EMR run directly on the data residing in the native S3 filesystem, or does it first copy the data over to the HDFS cluster on the provisioned EC2 machines (in the EMR cluster)? In the latter case, the copy itself would take a large amount of time.

Solution

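S3 is a storage service, so it cannot process the data itself. The data has to be transferred to the EC2 nodes before the MR job can process it. That transfer does not have to be a bulk copy into HDFS, though: when the job input is an S3 path, the map tasks typically read their input splits from S3 over the network as they run. The data ends up in the cluster's HDFS only if you copy it there explicitly, for example with S3DistCp.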
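
The two options raised in the question (reading the input in place from S3, or staging it into HDFS first) can be sketched as a small planning helper. This is only an illustration, not EMR's own logic: `plan_job_input`, `is_s3_uri`, the bucket name, and the `hdfs:///staged-input/` directory are hypothetical names made up for this example. The only real tool referenced is the `s3-dist-cp` command with its `--src` and `--dest` options.

```python
from urllib.parse import urlparse

# URI schemes commonly used for S3-backed Hadoop filesystems.
S3_SCHEMES = {"s3", "s3n", "s3a"}


def is_s3_uri(uri):
    """Return True if the URI points at S3 rather than HDFS or local disk."""
    return urlparse(uri).scheme in S3_SCHEMES


def plan_job_input(input_uri, stage_to_hdfs=False, hdfs_dir="hdfs:///staged-input/"):
    """Return (staging_command, job_input_uri).

    staging_command is None when the job reads its input in place.
    hdfs_dir is a hypothetical staging directory chosen for this sketch.
    """
    if stage_to_hdfs and is_s3_uri(input_uri):
        # Explicit bulk copy into the cluster's HDFS before the job runs.
        command = ["s3-dist-cp", "--src", input_uri, "--dest", hdfs_dir]
        return command, hdfs_dir
    # No staging step: the job is pointed at input_uri directly.
    return None, input_uri


# Hypothetical bucket name, for illustration only.
print(plan_job_input("s3://example-bucket/logs/"))
print(plan_job_input("s3://example-bucket/logs/", stage_to_hdfs=True))
```

Staging tends to pay off when the same input is read several times, since later passes then hit HDFS instead of S3. For a single pass over the data, pointing the job at the S3 path avoids the extra copy.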

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow