Question

I am using Python and Boto in a script to copy several files from my local disks, turn them into .tar files and upload to AWS Glacier.

I based my script on: http://www.withoutthesarcasm.com/using-amazon-glacier-for-personal-backups/#highlighter_243847

Which uses the concurrent.ConcurrentUploader

I am just curious how sure I can be that the data is all in Glacier after successfully getting an ID back. Does the ConcurrentUploader do any kind of hash checking to ensure all the bits arrived?

I want to remove the files from my local disk but fear I should be implementing some kind of hash check... I am hoping this is happening under the hood. I have tried and successfully retrieved a couple of archives and was able to un-tar them. Just trying to be very cautious.

Does anyone know if there is checking under the hood that all pieces of the transfer were successfully uploaded? If not, does anyone have any python example code of how to implement an upload with hash checking?

Many thanks!

Boto Concurrent Uploader Docs: http://docs.pythonboto.org/en/latest/ref/glacier.html#boto.glacier.concurrent.ConcurrentUploader

UPDATE: Looking at the actual Boto code (https://github.com/boto/boto/blob/develop/boto/glacier/concurrent.py), line 132 appears to show that the hashes are computed automatically, but I am unclear what the

[None] * total_parts

means. Does this mean that the hashes are indeed calculated or is this left to the user to implement?


Solution

Glacier itself is designed to make it impossible for any application to complete a multipart upload without an assurance of data integrity.

http://docs.aws.amazon.com/amazonglacier/latest/dev/api-multipart-complete-upload.html

The API call that returns the archive id is sent with the "tree hash" -- a SHA-256 of the SHA-256 hashes of each MiB of the uploaded content, calculated as a tree coalescing up to a single hash -- along with the total bytes uploaded. If these don't match what was actually uploaded in each part (each of which was itself validated against SHA-256 hashes and sub-tree hashes as it was uploaded), then the "complete multipart" operation fails.
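As a sketch, the tree-hash scheme described above can be reproduced in a few lines of Python. This is a simplified illustration of the algorithm, not Glacier's or boto's actual implementation:

```python
import hashlib

MIB = 1024 * 1024  # Glacier hashes the payload in 1 MiB chunks


def tree_hash(data: bytes) -> str:
    """SHA-256 each 1 MiB chunk, then combine the hashes pairwise,
    level by level, until a single root hash remains."""
    # An empty payload is treated as one empty chunk.
    chunks = [data[i:i + MIB] for i in range(0, len(data), MIB)] or [b""]
    hashes = [hashlib.sha256(c).digest() for c in chunks]
    # Coalesce up the tree: hash each adjacent pair together; an odd
    # leftover hash is promoted unchanged to the next level.
    while len(hashes) > 1:
        level = [
            hashlib.sha256(hashes[i] + hashes[i + 1]).digest()
            for i in range(0, len(hashes) - 1, 2)
        ]
        if len(hashes) % 2:
            level.append(hashes[-1])
        hashes = level
    return hashes[0].hex()
```

For a payload of 1 MiB or less, the tree hash is just the SHA-256 of the data; for larger payloads it differs from a plain linear SHA-256, which is what lets the server verify parts independently and still check the whole archive at completion time.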

It should be virtually impossible by the design of the Glacier API for an application to "successfully" upload a file that isn't intact and yet return an archive id.

OTHER TIPS

Yes, the concurrent uploader does compute a tree hash of each part and a linear hash of the entire uploaded payload. The line:

[None] * total_parts

just creates a list containing total_parts None values. Later, each None is replaced by the tree hash for the corresponding part. Finally, the list of tree hash values is used to compute the final hash of the entire upload.
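The pattern can be seen in isolation. This is a toy sketch with placeholder strings standing in for real digests, not boto's actual code:

```python
total_parts = 4  # hypothetical upload split into 4 parts

# Pre-allocate one slot per part; each slot starts as None.
tree_hashes = [None] * total_parts

# Parts may finish uploading in any order; each fills in its own slot
# by index, so the final list is in part order regardless.
for part, digest in [(2, "hash2"), (0, "hash0"), (3, "hash3"), (1, "hash1")]:
    tree_hashes[part] = digest

print(tree_hashes)  # ['hash0', 'hash1', 'hash2', 'hash3']
```

Pre-allocating with `[None] * total_parts` is what lets the concurrent workers write results by part index without any coordination over list ordering.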

So, there are a lot of integrity checks happening as required by the Glacier API.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow