I have noticed a long wait between invoking an EMR job and the actual start of the MapReduce processing when the input files are in S3. My question is: does EMR run directly on the data residing in the native S3 filesystem, or does it first copy the data over to the HDFS cluster on the provisioned EC2 machines (in the EMR cluster), in which case the copy is going to take a large amount of time?

Solution

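S3 is only a storage service, so it cannot process the data itself. The data therefore has to be transferred over the network to the EC2 nodes before the MR jobs can process it. That does not necessarily mean a separate copy into HDFS first: EMR can read input directly from S3, with each task pulling its share of the data from S3 as it runs.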
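If you do want the input staged in HDFS before the job starts, one common option on EMR is the s3-dist-cp tool run as a separate step. The sketch below only assembles the command line for such a copy; the helper name build_s3distcp_command, the bucket, and the paths are hypothetical placeholders, not something taken from the question.

```python
import shlex


def build_s3distcp_command(src, dest):
    """Return the argv list for an s3-dist-cp copy from S3 into HDFS.

    Hypothetical helper for illustration: it only builds the command,
    it does not run it.
    """
    if not src.startswith("s3://"):
        raise ValueError("src is expected to be an s3:// URI")
    if not dest.startswith("hdfs://"):
        raise ValueError("dest is expected to be an hdfs:// URI")
    # --src and --dest are the source and destination options of s3-dist-cp.
    return ["s3-dist-cp", "--src", src, "--dest", dest]


if __name__ == "__main__":
    # Placeholder bucket and paths; replace with your own locations.
    cmd = build_s3distcp_command("s3://example-bucket/input/", "hdfs:///data/input/")
    print(shlex.join(cmd))
```

The printed command would be run on the cluster (for example as an EMR step), after which the MR job can point at the hdfs:/// path instead of the s3:// one. Whether this is faster overall depends on how many times the job re-reads the input.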

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow