Question

I'm running GSUTIL v3.42 from a Windows CMD script on Windows Server 2008 R2 using Python 2.7.6. Files to be uploaded arrive in an "outgoing" directory and are uploaded in parallel by GSUTIL to an "incoming" bucket. After uploading has finished, the script requests a listing of the "incoming" bucket and compares the files listed with those it attempted to upload, in order to detect any upload failures. A separate script moves files from the "incoming" bucket to a "processed" bucket afterwards.

If I attempt to upload the identical file (same name/size/content/date etc.) a second time, it doesn't upload, although I get no errors and nothing in my logging to indicate failure. I am not using the "no clobber" option, so I would expect gsutil to just upload the file.

In the scenario below, assume that the file has been successfully uploaded and moved to the "processed" bucket already on that day. In case timings matter, the second upload is being attempted within half an hour of the first.

  1. File A arrives in "outgoing" directory.
  2. I get a file listing of "outgoing" and write this to dirListing.txt
  3. I perform a GSUTIL upload using

    type dirListing.txt | python gsutil -m cp -I -L myGsutilLogFile.txt gs://myIncomingBucket

  4. I then perform a GSUTIL listing

    python gsutil ls -l -h gs://myIncomingBucket > bucketListing.txt

  5. Compare dirListing.txt with bucketListing.txt to detect mismatches and hence upload failures.
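
For anyone reproducing the check in step 5, here is a minimal sketch of the comparison in Python. It assumes dirListing.txt holds one local path per line and that each object line in bucketListing.txt ends with its gs:// URL; the function names are hypothetical and not part of gsutil.

```python
import ntpath


def local_names(dir_listing_lines):
    """Base names of the local files, assuming one path per line."""
    return {ntpath.basename(line.strip())
            for line in dir_listing_lines if line.strip()}


def bucket_names(bucket_listing_lines):
    """Object names taken from the gs:// URL on each listing line."""
    names = set()
    for line in bucket_listing_lines:
        idx = line.find("gs://")
        if idx == -1:
            continue  # skip blank lines and the TOTAL summary line
        url = line[idx:].strip()
        names.add(url.rsplit("/", 1)[-1])
    return names


def missing_uploads(dir_listing_lines, bucket_listing_lines):
    """Files that were listed locally but do not appear in the bucket."""
    return sorted(local_names(dir_listing_lines)
                  - bucket_names(bucket_listing_lines))
```

Reading both text files with `open(...).readlines()` and passing the results to `missing_uploads` gives the list of files that would count as mismatches.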

On the second run, File A isn't being uploaded in step 3 and consequently isn't returned in step 4, causing a mismatch in step 5. [I've checked the content of all of the relevant files and it's definitely in dirListing.txt and not in bucketListing.txt]

I need the ability to re-process a file in case the separate script that moves the file from the "incoming" to the "processed" bucket fails for some reason or doesn't do what it should do. I have to upload in parallel because there are normally hundreds of files on each run.

Is what I've described above expected behaviour from GSUTIL? (I haven't seen anything in the documentation that suggests this) If so, is there any way of forcing GSUTIL to re-attempt the upload? Or am I missing something obvious, please? I have debug output from GSUTIL if that's necessary/useful.


Solution

From the above, it looks like you're uploading using "-L" to log to a manifest file. If you're using the same manifest file, and the file has already been uploaded once, then gsutil will not try to re-upload the file. From the docs on "-L" in "gsutil help cp":

"If the log file already exists, gsutil will use the file as an input to the copy process, and will also append log items to the existing file. Files/objects that are marked in the existing log file as having been successfully copied (or skipped) will be ignored."
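
One way to force a re-upload, given that behaviour, is to point "-L" at a manifest file that does not exist yet on each run (deleting the old manifest before the run should have the same effect). A rough sketch in Python, assuming a timestamped naming scheme; the helper names and the naming scheme are hypothetical, and only the command line itself comes from the question:

```python
from datetime import datetime


def manifest_name(now, prefix="myGsutilLogFile"):
    """Build a per-run manifest file name (hypothetical naming scheme)."""
    return "{}_{}.txt".format(prefix, now.strftime("%Y%m%d_%H%M%S"))


def build_cp_command(manifest, bucket="gs://myIncomingBucket"):
    """Same gsutil command as in the question, with a fresh manifest."""
    return ["python", "gsutil", "-m", "cp", "-I", "-L", manifest, bucket]


if __name__ == "__main__":
    # Print the command for this run; no upload is started here.
    print(" ".join(build_cp_command(manifest_name(datetime.now()))))
```

The resulting argument list could be handed to `subprocess` with dirListing.txt as standard input. Because the manifest is new on every run, gsutil has no record of earlier successful copies to skip.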

License: CC-BY-SA with attribution
Not affiliated with StackOverflow