Question

I have written an archival system with Python Boto that tars several directories of files and uploads them to Glacier. This is all working great, and I am storing all of the archive IDs.

I wanted to test downloading a large archive (about 120 GB). I initiated the retrieval, but the download took more than 24 hours, and at the end I got a 403 because the retrieved copy was no longer available, so the download failed.

If I archived straight from my server to Glacier (skipping S3), is it possible to initiate a restore that restores an archive to an S3 bucket so I can take longer than 24 hours to download a copy? I didn't see anything in either the S3 or Glacier Boto docs.

Ideally I'd do this with Boto, but I'm open to other scriptable options. Does anyone know how, given an archiveId, I might go about moving an archive from AWS Glacier to an S3 bucket? If this is not possible, are there other options to give myself more time to download large files?

Thanks!

http://docs.pythonboto.org/en/latest/ref/glacier.html http://docs.pythonboto.org/en/latest/ref/s3.html


Solution

The direct Glacier API and the S3/Glacier integration are not connected to each other in a way that is accessible to AWS users.

If you upload directly to Glacier, the only way to get the data back is to fetch it back directly from Glacier.

Conversely, if you add content to Glacier via S3 lifecycle policies, then there is no exposed Glacier archive ID, and the only way to get the content is to do an S3 restore.

It's essentially as if "you" aren't the Glacier customer, but rather "S3" is the Glacier customer, when you use the Glacier/S3 integration. (In fact, that's a pretty good mental model -- the Glacier storage charges are even billed differently -- files stored through the S3 integration are billed together with the other S3 charges on the monthly invoice, not with the Glacier charges).
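For reference, fetching an archive back directly from Glacier with boto looks roughly like the sketch below; the region, vault name, archive ID, and output path are placeholders of my own choosing, not values from the question:

    import boto.glacier

    # Sketch of a whole-archive retrieval using boto's layer2 Glacier API.
    # The region, vault name, archive ID, and output path are placeholders.
    conn = boto.glacier.connect_to_region('us-east-1')
    vault = conn.get_vault('my-vault')

    # Initiate the retrieval job; Glacier typically takes hours to stage it.
    job = vault.retrieve_archive('YOUR-ARCHIVE-ID')

    # Later (e.g. from a cron job), check whether the job has finished and
    # download the staged data before the ~24-hour availability window closes.
    job = vault.get_job(job.id)
    if job.completed:
        job.download_to_file('/tmp/restored-archive.tar')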

The closest you can get to what you're trying to accomplish is to use range retrievals, where you request that Glacier restore only a portion of the archive.

Another reason you could choose to perform a range retrieval is to manage how much data you download from Amazon Glacier in a given period. When data is retrieved from Amazon Glacier, a retrieval job is first initiated, which will typically complete in 3-5 hours. The data retrieved is then available for download for 24 hours. You could therefore retrieve an archive in parts in order to manage the schedule of your downloads. You may also choose to perform range retrievals in order to reduce or eliminate your retrieval fees.

http://aws.amazon.com/glacier/faqs/

You'd then need to reassemble the pieces. That last part seems like a big advantage as well, since Glacier charges more the more data you "restore" at a time. Note that this isn't a charge for downloading the data; it's a charge for the restore operation, whether or not you download it.
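With boto, a range retrieval would look roughly like the sketch below, using the lower-level layer1 client since it exposes the RetrievalByteRange job parameter. The vault name, archive ID, chunk size, and output path are again placeholders:

    import boto.glacier

    # Sketch of a range retrieval via boto's layer1 Glacier client.
    # Vault name, archive ID, chunk size, and output path are placeholders.
    conn = boto.glacier.connect_to_region('us-east-1')
    glacier = conn.layer1

    chunk = 1024 * 1024 * 1024   # ask for 1 GB per job
    start = 0                    # advance by `chunk` for each subsequent job

    job_data = {
        'Type': 'archive-retrieval',
        'ArchiveId': 'YOUR-ARCHIVE-ID',
        # Inclusive byte range; Glacier expects megabyte-aligned ranges.
        'RetrievalByteRange': '%d-%d' % (start, start + chunk - 1),
    }
    response = glacier.initiate_job('my-vault', job_data)
    job_id = response['JobId']

    # Hours later, once the job is complete, download just that slice and
    # append it to the local file; repeat with the next range until done.
    output = glacier.get_job_output('my-vault', job_id)
    with open('/tmp/restored-archive.tar', 'ab') as f:
        f.write(output.read())

Each part only has to be downloaded within 24 hours of its own job completing, which is what spreads your download schedule out.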

One advantage I see of the S3 integration is that you can leave your data "cooling off" in S3 for a few hours/days/weeks before you put it "on ice" in Glacier, which happens automatically... so you can fetch it back from S3 without paying a retrieval charge, until it's been sitting in S3 for the amount of time you've specified, after which it automatically migrates. The potential downside is that it seems to introduce more moving parts.
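Setting up that cooling-off period is a one-time lifecycle configuration on the bucket. A minimal sketch with boto, assuming a bucket name, key prefix, and 30-day delay of my own choosing:

    import boto
    from boto.s3.lifecycle import Lifecycle, Rule, Transition

    # Sketch of a lifecycle rule that migrates objects to Glacier after they
    # have sat in S3 for 30 days; bucket, prefix, and delay are placeholders.
    s3 = boto.connect_s3()
    bucket = s3.get_bucket('my-archive-bucket')

    to_glacier = Transition(days=30, storage_class='GLACIER')
    rule = Rule(id='archive-to-glacier', prefix='backups/',
                status='Enabled', transition=to_glacier)

    lifecycle = Lifecycle()
    lifecycle.append(rule)
    bucket.configure_lifecycle(lifecycle)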

OTHER TIPS

Using object lifecycle policies, you can move files directly from S3 to Glacier, and you can also restore those objects back to S3 using the restore method of the boto.s3.Key object. Also, see the section of the S3 docs on restoring objects for more information on how restore works.
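A minimal sketch of that restore call, assuming a bucket name, key name, and 7-day download window of my own choosing:

    import boto

    # Sketch of restoring a Glacier-tier object back into S3 with boto so it
    # can be downloaded at leisure; bucket/key names and days are placeholders.
    s3 = boto.connect_s3()
    bucket = s3.get_bucket('my-archive-bucket')
    key = bucket.get_key('backups/archive-2013-01.tar')

    # Ask S3 to stage a temporary copy and keep it readable for 7 days,
    # which leaves plenty of time to download a very large file.
    key.restore(days=7)

    # Later, re-fetch the key's metadata; once the restore has finished,
    # the object can be downloaded like any other S3 object.
    key = bucket.get_key('backups/archive-2013-01.tar')
    if key.ongoing_restore is False:
        key.get_contents_to_filename('/tmp/archive-2013-01.tar')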

Licensed under: CC-BY-SA with attribution