Question

I find if you create a compute engine (CentOS or Debian) machine and using gsutil to download (cp) a tgz file will cause a crcmod error...

$ gsutil cp gs://mybucket/data.tgz .
Copying gs://mybucket/data.tgz...
CommandException:
Downloading this composite object requires integrity checking with CRC32c, but
your crcmod installation isn't using the module's C extension, so the the hash
computation will likely throttle download performance. For help installing the
extension, please see:
  $ gsutil help crcmod
To download regardless of crcmod performance or to skip slow integrity checks,
see the "check_hashes" option in your boto config file.

Currently I use "check_hashes = never" to bypass the check...

$ vi /etc/boto.cfg
[GSUtil]
default_project_id = 429100748693
default_api_version = 2
check_hashes = never
...

But, what is the root cause? and is there any good solution to solve the problem?

Was it helpful?

Solution

The object you're trying to download is a composite object, which basically means it was uploaded in parallel chunks. gsutil automatically does this when uploading objects larger than 150M (a configurable threshold), to provide better performance.

Composite objects only have a crc32c checksum (no MD5), so in order to validate data integrity when downloading composite objects, gsutil needs to perform a crc32c checksum. Unfortunately, the libraries distributed with Python don't include a compiled crc32c implementation, so unless you install a compiled crc32c, gsutil will use a non-compiled Python implementation of crc32c that's quite slow. That warning is printed to let you know there's a way to fix that performance problem: Please run:

gsutil help crcmod

and follow the instructions there for installing a compiled crc32c. It's pretty easy to do it, and worth the effort.

One other note: I strongly recommend against setting check_hashes = never in your boto config file. That will disable integrity checking, which means it's possible your download could get corrupted and you wouldn't know it. You want data integrity checking enabled to ensure you're working with correct data.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top