Question

I have a large collection of 1 kB data chunks (on the order of several hundred million) and need a way to store and query them. The chunks are added but never deleted or updated. Our service is deployed on AWS (EC2 and S3).

I know Amazon SimpleDB exists, but I want a solution that is platform-agnostic (in case we need to move off AWS, for example).

So my question is: what are the pros and cons of these two options for storing and retrieving the data chunks, and how would the performance compare?

  • Store the data chunks as files on S3 and GET them when needed
  • Store the data chunks on a MySQL Server cluster

Would there be that much of a performance difference?

Solution

Do you need to give the users of your application direct access to these data chunks? If not, then S3 and HTTP GET requests are overkill. Bear in mind also that S3 is an authenticated service, so the per-request overhead of every GET (for just 1 kB of data) will be considerable.
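
To make that overhead concrete, you can time a single fetch yourself; a minimal sketch, assuming boto3 and a hypothetical bucket and key:

```python
# A minimal sketch (assuming boto3 and a hypothetical bucket/key) that
# times one authenticated GET, to make the per-request overhead visible.
import time

import boto3

s3 = boto3.client("s3")

BUCKET = "my-chunk-bucket"  # hypothetical bucket name
KEY = "chunks/000001"       # hypothetical key of one 1 kB chunk

start = time.perf_counter()
response = s3.get_object(Bucket=BUCKET, Key=KEY)
data = response["Body"].read()  # the ~1 kB payload itself
elapsed_ms = (time.perf_counter() - start) * 1000

# The fixed cost (request signing plus an HTTP round trip) dominates
# the transfer time of such a small object.
print(f"fetched {len(data)} bytes in {elapsed_ms:.1f} ms")
```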

A MySQL server cluster would be a better idea, but to run it on EC2 you will need to use Elastic Block Store (EBS) for persistent storage. Finally, do not rule out SimpleDB; it may well be the best fit for your problem. Design your system carefully and you will be able to migrate to other database systems (distributed or relational) later, as sketched below.
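
One way to keep that migration path open is to hide the store behind a small interface, so the backend (SimpleDB, MySQL, or S3) can be swapped without touching application code. A minimal sketch, with illustrative names that are not from the original answer:

```python
# A minimal sketch of a backend-agnostic chunk store; the class and
# method names are illustrative, not from the original answer.
from abc import ABC, abstractmethod


class ChunkStore(ABC):
    """Append-only store for ~1 kB data chunks, keyed by id."""

    @abstractmethod
    def put(self, chunk_id: str, data: bytes) -> None:
        ...

    @abstractmethod
    def get(self, chunk_id: str) -> bytes:
        ...


class MySQLChunkStore(ChunkStore):
    """Backend using a single keyed table (see the schema sketch below)."""

    def __init__(self, conn):
        self.conn = conn  # e.g. a mysql.connector connection

    def put(self, chunk_id: str, data: bytes) -> None:
        cur = self.conn.cursor()
        cur.execute("INSERT INTO chunks (id, data) VALUES (%s, %s)",
                    (chunk_id, data))
        self.conn.commit()

    def get(self, chunk_id: str) -> bytes:
        cur = self.conn.cursor()
        cur.execute("SELECT data FROM chunks WHERE id = %s", (chunk_id,))
        row = cur.fetchone()
        if row is None:
            raise KeyError(chunk_id)
        return bytes(row[0])
```

A SimpleDB- or S3-backed class implementing the same two methods could then replace MySQLChunkStore with a one-line change at the call site.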

OTHER TIPS

I tried using S3 as a sort of "database", with tiny XML files holding my structured data objects, relying on the S3 keys to look up these objects.

The performance was unacceptable, even from EC2: the latency to S3 is just too high.

Running MySQL on an EBS device will be an order of magnitude faster, even with so many records.
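
For reference, the storage side of that setup can be a single keyed InnoDB table; a minimal sketch, assuming mysql-connector-python and hypothetical connection details:

```python
# A minimal sketch of the table layout for append-only ~1 kB chunks.
# Connection details and names are hypothetical.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="chunkdb")
cur = conn.cursor()

# VARBINARY(1024) holds the fixed-size chunks; the primary key index
# keeps point lookups fast even with hundreds of millions of rows.
cur.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id   VARCHAR(64)     NOT NULL PRIMARY KEY,
        data VARBINARY(1024) NOT NULL
    ) ENGINE=InnoDB
""")
conn.commit()
```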

Licensed under: CC-BY-SA with attribution