Question

I'm trying to upload data to Google Cloud Storage from a disk with ~3000 files totalling 1TB. I'm using gsutil cp -R <disk-top-directory> <bucket>. My understanding is that, if gsutil is resumed/restarted, it uses checksums to determine when a file has already been uploaded and skips over it.

It doesn't appear to be doing this: it seems to restart the upload from the top and replace the files all over again. When I run gsutil ls -Rl <bucket/disk-top-directory> twice, ten minutes apart, and compare the two listings with diff, I see the same files with the same sizes but a newer date, which is consistent with the same file being re-uploaded.

For example:

<  404104811  2014-04-08T14:13:44Z  gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
---
>  404104811  2014-04-08T14:43:48Z  gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
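
The comparison described above could be automated along these lines. This is only a rough sketch: the function name changed_timestamps and the two listing file names are made up for illustration, and it assumes each object line has exactly three whitespace-separated fields (size, timestamp, URL), so object names containing spaces would not be handled.

```shell
# changed_timestamps OLD_LISTING NEW_LISTING
# Prints every object that has the same size in both listings but a
# different timestamp, i.e. objects that look like they were re-uploaded.
# (Hypothetical helper; assumes "size  timestamp  gs://url" lines.)
changed_timestamps() {
  awk '
    NF < 3 { next }                    # skip blank lines and directory headers
    $3 !~ /^gs:\/\// { next }          # skip the TOTAL summary line
    NR == FNR {                        # first file: remember size and time
      size[$3] = $1; time[$3] = $2; next
    }
    ($3 in size) && size[$3] == $1 && time[$3] != $2 {
      print $3, time[$3], "->", $2     # same size, different timestamp
    }
  ' "$1" "$2"
}

# Usage (listing file names are placeholders):
#   gsutil ls -Rl <bucket/disk-top-directory> > listing-1.txt
#   ... wait ten minutes ...
#   gsutil ls -Rl <bucket/disk-top-directory> > listing-2.txt
#   changed_timestamps listing-1.txt listing-2.txt
```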

The machine I'm using to read the disk and transfer files is running Ubuntu 13.10. I installed gsutil using the pip instructions for Debian and Ubuntu.

Am I misunderstanding how gsutil's resumable transfers are supposed to work? If not, is there a diagnosis and fix to get the correct resume behavior? Thanks in advance!


Solution

You need to use the -n (No-clobber) switch to prevent the re-uploading of objects that already exist at the destination.

gsutil cp -Rn <disk-top-directory> <bucket>

From the help (gsutil help cp):

-n            No-clobber. When specified, existing files or objects at the
              destination will not be overwritten. Any items that are skipped
              by this option will be reported as being skipped. This option
              will perform an additional HEAD request to check if an item
              exists before attempting to upload the data. This will save
              retransmitting data, but the additional HTTP requests may make
              small object transfers slower and more expensive.

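Also, according to the gsutil documentation, when transferring files larger than 2MB, gsutil automatically uses a resumable transfer mode. Resumable mode lets an interrupted transfer of a single file pick up where it left off; it does not by itself skip files that have already been uploaded in full, which is what -n is for.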
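
Because -n skips objects that already exist at the destination, re-running the same command after an interruption does not re-upload the files that already made it. A minimal sketch of wrapping the command in a retry loop is shown below; the function name retry_upload and the attempt count are made up for illustration and are not part of gsutil.

```shell
# retry_upload MAX_ATTEMPTS COMMAND [ARGS...]
# Re-runs COMMAND until it exits with status 0 or MAX_ATTEMPTS is reached.
# (Hypothetical helper, not a gsutil feature.)
retry_upload() {
  max_attempts=$1
  shift
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0                         # command succeeded, stop retrying
    fi
    echo "attempt $attempt of $max_attempts failed" >&2
    attempt=$((attempt + 1))
  done
  return 1                             # every attempt failed
}

# Usage (placeholders as in the command above):
#   retry_upload 5 gsutil cp -Rn <disk-top-directory> <bucket>
```

Each retry still issues the existence check for every file, so the per-object request cost mentioned in the help text applies on every pass.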

Other tips

If you're open to working with the (still beta) gsutil v4, that version of gsutil has an rsync command. You can get this by running:

gsutil update gs://prerelease/gsutil_4.0beta2pre_minus_m_sugg.tar.gz

Please be sure to read the release notes before switching to this major new release, especially if you're using gsutil v3 in scripts.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow