Question

I have 1.6 million entities in a Google App Engine app that I would like to download. I tried the built-in bulkloader mechanism but found it terribly slow: I can only download ~30 entities/second via the bulkloader, versus ~500 entities/second by querying the datastore from a backend. A backend is necessary to circumvent the 60-second request limit. In addition, datastore queries can only live for up to 30 seconds, so the fetches have to be broken up across multiple queries using query cursors.

The code on the server side fetches 1000 entities and returns a query cursor:

import time

# Inside the backend request handler: resume from the cursor passed
# by the client, if any.
cursor = request.get('cursor')
devices = Pushdev.all()

if cursor:
    devices.with_cursor(cursor)

next1000 = devices.fetch(1000)

# Emit one line per entity: name/alias/created-timestamp.
for d in next1000:
    t = int(time.mktime(d.created.timetuple()))
    response.out.write('%s/%s/%d\n' % (d.name, d.alias, t))

# The final line of the response is the cursor for the next request.
response.out.write(devices.cursor())

On the client side, I have a loop that invokes the handler on the server with a null cursor to begin with, then passes the cursor returned by the previous invocation. It terminates when it gets an empty result.
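The client loop described above might be sketched like this (the response format — one line per entity with the cursor on the final line — follows the handler shown; `download_all` and the `fetch_page` callable are hypothetical names, with the actual HTTP call left to the caller):

```python
def download_all(fetch_page):
    """Drive cursor pagination to completion.

    fetch_page(cursor) -> response body as text, where the body is
    entity lines followed by the cursor on the last line. Returns the
    list of all entity lines.
    """
    entities = []
    cursor = ''
    while True:
        body = fetch_page(cursor).rstrip('\n')
        lines = body.split('\n') if body else []
        if len(lines) <= 1:              # cursor only (or nothing): done
            break
        entities.extend(lines[:-1])      # all but the trailing cursor line
        cursor = lines[-1]               # resume here on the next request
    return entities
```

In practice `fetch_page` would be something like `lambda c: urllib2.urlopen(BACKEND_URL + '?cursor=' + c).read()` against the backend handler.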

PROBLEM: I am only able to fetch a fraction - ~20% of the entities using this method. I get a response with empty data even though the full set of entities has not been traversed. Why does this method not fetch everything comprehensively?


Solution

I couldn't find anything to confirm or deny this in the docs, but my guess is that all() has a non-deterministic ordering, so eventually one of your fetch(1000) calls hits the "last element" and devices.cursor() returns nothing.

Try this:

devices = Pushdev.all().order('__key__')
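The intuition behind the fix can be illustrated with an in-memory model (not App Engine code — plain Python, with the cursor modeled as the last key returned): when pages are drawn in a stable total order such as the entity key, each page resumes strictly after the previous one, so every entity is visited exactly once and the walk provably terminates only after exhausting the data.

```python
def fetch_page(entities, cursor, limit):
    """Return (page_keys, new_cursor) for one key-ordered page.

    entities: dict mapping key -> entity; cursor: last key already
    returned, or None to start from the beginning.
    """
    keys = sorted(k for k in entities if cursor is None or k > cursor)
    page = keys[:limit]
    return page, (page[-1] if page else cursor)

def fetch_all(entities, limit=2):
    """Page through every entity; an empty page signals completion."""
    seen, cursor = [], None
    while True:
        page, cursor = fetch_page(entities, cursor, limit)
        if not page:
            return seen
        seen.extend(page)
```

Without the stable ordering there is no such guarantee, which matches the ~20% truncation observed in the question.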
Licensed under: CC-BY-SA with attribution