Question

We have a process that uploads files to S3, though indirectly. We use Amazon Elastic MapReduce (EMR), and Hadoop commits the files to S3 from many different task nodes. Then, after that Hadoop job has completed successfully, another part of the process uses Hadoop's FileSystem.createNewFile() to create some files from the master node.

The files created from these various machines have timestamps in S3. We assume the timestamps of the files committed from the task nodes are earlier than those of the files created from the master node.

I believe that is sometimes untrue, but why?

What assigns the timestamp to an S3 file? Is it the Amazon EMR Hadoop client, or some S3 machine?

If I have two machines uploading to S3 whose local clocks differ by 30 minutes, will the timestamps be 30 minutes apart?


Solution

You cannot set the Last-Modified value yourself; S3 assigns it:

https://forums.aws.amazon.com/thread.jspa?messageID=209241

OTHER TIPS

The only timestamp in S3 appears to be the "Last Modified" metadata. I believe that the last-modified date/time is set by the S3 system itself and reflects the time when the file finished uploading to S3 (S3 does not show incomplete transfers).

So it shouldn't matter which node uploads a file: the "Last Modified" timestamp comes from S3 rather than from the uploading machine's clock, so it should be consistent when you list the files on S3.
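To check the ordering assumption from the question, you could compare the Last-Modified values that S3 returns for two objects. The sketch below is only an illustration: it assumes you already have the two values as HTTP-date strings (the format of the Last-Modified response header), and the function name and sample values are hypothetical. Note that this header format only has one-second resolution, so two objects written within the same second compare as equal.

```python
from email.utils import parsedate_to_datetime


def committed_before(task_last_modified: str, master_last_modified: str) -> bool:
    """Return True if the task-node object's timestamp is strictly earlier.

    Both arguments are HTTP-date strings such as those found in a
    Last-Modified response header.
    """
    task_time = parsedate_to_datetime(task_last_modified)
    master_time = parsedate_to_datetime(master_last_modified)
    return task_time < master_time


# Hypothetical header values, not taken from a real bucket.
task_header = "Mon, 12 Oct 2009 17:50:00 GMT"
master_header = "Mon, 12 Oct 2009 17:50:30 GMT"

print(committed_before(task_header, master_header))
```

Because the comparison is strict, a task-node file and a master-node file stamped in the same second would not count as "before", which may be one way the assumption appears to fail.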

Licensed under: CC-BY-SA with attribution