Question

I have a large text file stored in S3 and can access it directly from within EMR (say, from Pig) using the 's3:///folder/folder/file' format on a multi-node cluster.

My question is about the efficiency of the data transfer to the data nodes. I believe that the data in S3 is stored in blocks in a similar way to HDFS.

  1. When reading the file, how is it split and sent to each of the data nodes?
  2. Is the allocation to the data nodes controlled by the Master Node/Job Tracker?
  3. Is it more efficient to copy the file into HDFS and then access it?

Solution

  1. Generally, there is no difference between reading from HDFS and reading from S3 as far as splitting is concerned. The S3 file system class (the storage class used for S3 input) fetches each block of an S3 file by location and offset, issuing an HTTP request that carries the location and offset information in its headers. For more detail, check the source code in the Hadoop release. (A rough sketch of such a ranged read follows this list.)

  2. Yes, the allocation is handled the same way as it is for HDFS.

  3. It depends on your workflow. If you load the data once and query it many times, you may want to copy the files to HDFS so you benefit from local I/O. Otherwise you can just use S3 as your storage. S3 is more stable and offers effectively unlimited storage, though it may be a little slower than HDFS. (Netflix uses S3 as its EMR storage in many cases and, by their account, it works just fine.)
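As a rough illustration of point 1 (a sketch only, not the actual Hadoop split code): a task that owns a particular split can open the S3 file through Hadoop's FileSystem API, seek to the split's offset, and read its bytes, and that seek/read is turned into a ranged HTTP request against S3. The bucket, key, s3n:// scheme, and offsets below are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3RangeReadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder bucket/key; newer Hadoop/EMR would use s3:// or s3a://.
            Path file = new Path("s3n://my-bucket/folder/file.txt");
            FileSystem fs = FileSystem.get(file.toUri(), conf);

            long splitStart = 128L * 1024 * 1024;   // pretend this task owns the second 128 MB split
            byte[] buffer = new byte[4096];

            try (FSDataInputStream in = fs.open(file)) {
                in.seek(splitStart);                // translated into an HTTP range request by the S3 file system
                in.readFully(buffer, 0, buffer.length);
            }
            System.out.println("Read " + buffer.length + " bytes starting at offset " + splitStart);
        }
    }

The point is that a split is essentially (path, offset, length): each map task asks S3 for only its own byte range, much as it would ask a DataNode for a local block.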

PS: S3DistCp can help you copy data quickly between HDFS and S3.
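For example, a one-off copy from S3 onto the cluster's HDFS could be run on the master node roughly like this (bucket and paths are placeholders; s3-dist-cp ships with EMR, and plain hadoop distcp also works):

    s3-dist-cp --src s3://my-bucket/folder/ --dest hdfs:///data/folder/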

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow