Question

I have noticed a long wait between invoking an EMR job and the actual start of MapReduce processing when the input is a set of files in S3. Does EMR run directly on the data residing in S3, or does it first copy the data over to HDFS on the provisioned EC2 machines of the EMR cluster, in which case the copy would take a large amount of time?


Solution

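S3 is only a storage service; it cannot run any computation on the data itself. So the data has to be transferred from S3 to the EC2 nodes before the MR job can process it. Note that when the job's input path points at S3, the tasks read their input from S3 over the network; a separate copy into HDFS only happens if you add an explicit copy step (for example with S3DistCp).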
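
To get a feel for how long moving the input from S3 to the EC2 nodes might take, here is a rough back-of-the-envelope sketch in Python. The function name and the throughput figures are hypothetical placeholders, not measured EMR or S3 numbers; plug in values observed on your own cluster.

```python
def estimate_transfer_seconds(input_gb: float, nodes: int, mb_per_sec_per_node: float) -> float:
    """Rough time to pull `input_gb` of data from S3 when `nodes` machines
    read in parallel, each at `mb_per_sec_per_node` MB/s.

    This ignores request latency, file listing, and task scheduling, so it
    is an optimistic lower bound rather than a prediction.
    """
    if nodes <= 0 or mb_per_sec_per_node <= 0:
        raise ValueError("nodes and mb_per_sec_per_node must be positive")
    total_mb = input_gb * 1024              # GB -> MB
    aggregate_mb_per_sec = nodes * mb_per_sec_per_node
    return total_mb / aggregate_mb_per_sec


if __name__ == "__main__":
    # Hypothetical example: 100 GB of input, 10 nodes, 50 MB/s per node (assumed).
    seconds = estimate_transfer_seconds(input_gb=100, nodes=10, mb_per_sec_per_node=50)
    print(f"Estimated transfer time: {seconds:.1f} s")
```

If the wait you observe is much longer than such an estimate, the delay is probably not dominated by raw data transfer alone.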

Licensed under: CC-BY-SA with attribution