Question

I'm running into the following error when running an export-to-CSV job on AppEngine using the new Google Cloud Storage library (appengine-gcs-client). I have about 30 MB of data I need to export on a nightly basis. Occasionally, I need to rebuild the entire table. Today, I had to rebuild everything (~800 MB total) and only about 300 MB of it actually made it across. I checked the logs and found this exception:

/task/bigquery/ExportVisitListByDayTask java.lang.RuntimeException: Unexpected response code 200 on non-final chunk: Request: PUT https://storage.googleapis.com/moose-sku-data/visit_day_1372392000000_1372898225040.csv?upload_id=AEnB2UrQ1cw0-Jbt7Kr-S4FD2fA3LkpYoUWrD3ZBkKdTjMq3ICGP4ajvDlo9V-PaKmdTym-zOKVrtVVTrFWp9np4Z7jrFbM-gQ x-goog-api-version: 2 Content-Range: bytes 4718592-4980735/*

262144 bytes of content

Response: 200 with 0 bytes of content ETag: "f87dbbaf3f7ac56c8b96088e4c1747f6" x-goog-generation: 1372898591905000 x-goog-metageneration: 1 x-goog-hash: crc32c=72jksw== x-goog-hash: md5=+H27rz96xWyLlgiOTBdH9g== Vary: Origin Date: Thu, 04 Jul 2013 00:43:17 GMT Server: HTTP Upload Server Built on Jun 28 2013 13:27:54 (1372451274) Content-Length: 0 Content-Type: text/html; charset=UTF-8 X-Google-Cache-Control: remote-fetch Via: HTTP/1.1 GWA
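As a side note, the failing chunk can be sanity-checked from the `Content-Range` header in the log above: the range `4718592-4980735` is exactly one 256 KiB chunk (262144 bytes), aligned on a 256 KiB boundary, meaning eighteen chunks had already been accepted before this one failed. A quick check (all numbers taken from the log; nothing here touches the GCS API):

```java
public class ChunkCheck {
    public static void main(String[] args) {
        // Values copied from the Content-Range header in the exception.
        long start = 4718592L;
        long end = 4980735L;

        long chunkSize = end - start + 1;
        System.out.println(chunkSize);          // 262144 bytes = 256 KiB
        System.out.println(start % chunkSize);  // 0: the chunk is aligned
        System.out.println(start / chunkSize);  // 18: chunks 0-17 were accepted
    }
}
```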

at com.google.appengine.tools.cloudstorage.oauth.OauthRawGcsService.put(OauthRawGcsService.java:254)
at com.google.appengine.tools.cloudstorage.oauth.OauthRawGcsService.continueObjectCreation(OauthRawGcsService.java:206)
at com.google.appengine.tools.cloudstorage.GcsOutputChannelImpl$2.run(GcsOutputChannelImpl.java:147)
at com.google.appengine.tools.cloudstorage.GcsOutputChannelImpl$2.run(GcsOutputChannelImpl.java:144)
at com.google.appengine.tools.cloudstorage.RetryHelper.doRetry(RetryHelper.java:78)
at com.google.appengine.tools.cloudstorage.RetryHelper.runWithRetries(RetryHelper.java:123)
at com.google.appengine.tools.cloudstorage.GcsOutputChannelImpl.writeOut(GcsOutputChannelImpl.java:144)
at com.google.appengine.tools.cloudstorage.GcsOutputChannelImpl.waitForOutstandingWrites(GcsOutputChannelImpl.java:186)
at com.moose.task.bigquery.ExportVisitListByDayTask.doPost(ExportVisitListByDayTask.java:196)

The task is pretty straightforward, but I'm wondering if there is something wrong with the way I'm using waitForOutstandingWrites() or the way I'm serializing my outputChannel for the next task run. One thing to note is that each task is broken into daily groups, each outputting its own individual file. The day tasks are scheduled 10 minutes apart and run concurrently to push out all 60 days.

In the task, I create a PrintWriter like so:

 OutputStream outputStream = Channels.newOutputStream( outputChannel );
 PrintWriter printWriter = new PrintWriter( outputStream );

and then write data out to it 50 lines at a time, calling waitForOutstandingWrites() to push everything over to GCS. When I'm coming up on the open-file limit (~22 seconds) I put the outputChannel into Memcache and then reschedule the task with the data iterator's cursor.

 printWriter.print( outputString.toString() );
 printWriter.flush();
 outputChannel.waitForOutstandingWrites();
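One subtlety in this pattern worth noting: PrintWriter never throws an IOException, so a write failure on the underlying channel is silent unless you call checkError() after each flush. Below is a minimal, self-contained sketch of the write loop; the ByteArrayOutputStream-backed channel is a stand-in for the real GcsOutputChannel (which you would obtain from the GCS client), and the batch contents are made up for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintWriter;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

public class WriterSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for the GcsOutputChannel; a real task would get its
        // channel from the appengine-gcs-client instead.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        WritableByteChannel channel = Channels.newChannel(sink);

        OutputStream outputStream = Channels.newOutputStream(channel);
        PrintWriter printWriter = new PrintWriter(outputStream);

        for (int batch = 0; batch < 3; batch++) {
            printWriter.print("row-" + batch + "\n");  // one batch of lines
            printWriter.flush();
            // PrintWriter swallows IOExceptions; a failed write only
            // surfaces here, so check after every flush.
            if (printWriter.checkError()) {
                throw new IllegalStateException("write to channel failed");
            }
            // With the real client, outputChannel.waitForOutstandingWrites()
            // would be called at this point.
        }
        System.out.println(sink.toString("UTF-8").trim());
    }
}
```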

This seems to work most of the time, but I'm getting these errors, which are creating corrupted and incomplete files on GCS. Is there anything obvious I'm doing wrong in these calls? Can I only have one channel open to GCS at a time per application? Is there some other issue going on?

Appreciate any tips you could lend!

Thanks!

Evan


Solution

A 200 response indicates that the file has been finalized. If this occurs on any call other than close, the library throws an error, because it is not expected.

This is likely occurring due to the way you are rescheduling the task. When you reschedule it, the task queue may deliver the task twice (this can happen), and if there are no checks to prevent this, two instances can end up writing to the same file at the same time. When one of them closes the file, the other sees an error. The net result is a corrupt file.
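If you do keep the rescheduling design, one way to defuse a duplicate delivery is to make each (file, cursor) step claimable exactly once, so the second delivery bails out instead of opening a second writer on the same file. The sketch below uses an in-memory map as a stand-in for a durable store (in a real task this claim would live in the datastore, e.g. behind a transaction, or use a named task so the queue itself rejects the duplicate); all names here are hypothetical:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ExportClaim {
    // Stand-in for a durable store; in production this must survive
    // across instances (datastore transaction or named task).
    private static final ConcurrentMap<String, String> claims =
            new ConcurrentHashMap<>();

    /** Returns true only for the first delivery of a given (file, cursor) step. */
    static boolean claim(String fileName, String cursor, String taskId) {
        String key = fileName + "@" + cursor;
        // putIfAbsent is atomic: a duplicate delivery sees the existing
        // claim and returns false instead of starting a second writer.
        return claims.putIfAbsent(key, taskId) == null;
    }

    public static void main(String[] args) {
        System.out.println(claim("visit_day.csv", "cursor-42", "task-A")); // true
        System.out.println(claim("visit_day.csv", "cursor-42", "task-B")); // false
    }
}
```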

The simple solution is not to reschedule the task at all. There is no time limit on how long a file can be held open with the GCS client (unlike with the deprecated Files API).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow