Question

For 100k+ entities in Google Datastore, ndb.query().count() is going to be cancelled by the deadline, even with an index. I've tried the produce_cursors option, but only iter() and fetch_page() return a cursor; count() doesn't.

How can I count large entities?


The solution

To do something that expensive you should take a look at the Task Queue Python API. On top of the Task Queue API, Google App Engine provides the deferred library, which simplifies the whole process of running background tasks.

Here is an example of how you could use the deferred library in your app:

import logging

def count_large_query(query):
  total = query.count()
  logging.info('Total entities: %d' % total)

Then you can call the above function from within your app like this:

from google.appengine.ext import deferred

# Somewhere in a request handler; MyModel stands in for your ndb model class:
deferred.defer(count_large_query, MyModel.query())

While I'm still not sure whether count() will return any results with such a large datastore, you could use this count_large_query() function instead, which uses cursors (untested):

import logging

LIMIT = 1024

def count_large_query(query):
  cursor = None
  more = True
  total = 0
  while more:
    # keys_only avoids fetching full entities; we only need to count them.
    keys, cursor, more = query.fetch_page(LIMIT, start_cursor=cursor, keys_only=True)
    total += len(keys)

  logging.info('Total entities: %d' % total)

To try the above locally, set LIMIT to 4 and check whether the Total entities: ## line appears in your console.


As Guido mentioned in the comments, this will not scale either:

This still doesn't scale (though it may postpone the problem). A task has a 10 minute deadline instead of 1 minute, so maybe you can count 10x as many entities. But it's pretty expensive! Have a search for sharded counters if you want to solve this properly (unfortunately it's a lot of work).

So you might want to take a look at the best practices for writing scalable applications, and especially at sharding counters.
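The sharded-counter idea itself is small enough to sketch without App Engine: writes are spread across N independent shard records (so no single entity becomes a write bottleneck), and a read sums the shards. Below is a framework-free sketch of that pattern; in real App Engine code each shard would be its own ndb entity updated in a transaction, and NUM_SHARDS and the dict-based store here are purely illustrative assumptions:

```python
import random

NUM_SHARDS = 20  # illustrative; more shards = more write throughput

# Stand-in for the datastore: one record per shard, keyed by shard index.
shards = {i: 0 for i in range(NUM_SHARDS)}

def increment():
    """Bump one randomly chosen shard (in GAE: a transactional get/put)."""
    shard = random.randrange(NUM_SHARDS)
    shards[shard] += 1

def get_count():
    """Sum all shards; in GAE this is NUM_SHARDS small reads, often cached."""
    return sum(shards.values())

for _ in range(1000):
    increment()
print(get_count())  # -> 1000
```

Because contending writes land on different shards, increments that would serialize on one counter entity can proceed in parallel; the price is a slightly more expensive read.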

Other tips

This is indeed a frustrating issue. I've been doing some work in this area lately to get some general count stats - basically, the number of entities that satisfy some query. count() is a great idea, but it is hobbled by the datastore RPC timeout.

It would be nice if count() supported cursors somehow so that you could cursor across the result set and simply add up the resulting integers rather than returning a large list of keys only to throw them away. With cursors, you could continue across all 1-minute / 10-minute boundaries, using the "pass the baton" deferred approach. With count() (as opposed to fetch(keys_only=True)) you can greatly reduce the waste and hopefully increase the speed of the RPC calls, e.g., it takes a shocking amount of time to count to 1,000,000 using the fetch(keys_only=True) approach - an expensive proposition on backends.
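The "pass the baton" approach mentioned above can be sketched as: each task counts one page, then hands the cursor and the running total to a fresh task, so no single task ever approaches its deadline. Here is a framework-free sketch using a plain list in place of the datastore and a direct call in place of deferred.defer; in real code the handoff line would be deferred.defer(count_page, cursor, total), and all names here (count_page, fetch_page, PAGE_SIZE) are assumptions for illustration:

```python
PAGE_SIZE = 3

DATA = list(range(10))  # stand-in for the entities being counted

def fetch_page(limit, start_cursor=None):
    """Mimic ndb's fetch_page: return (results, next_cursor, more)."""
    start = start_cursor or 0
    page = DATA[start:start + limit]
    cursor = start + len(page)
    return page, cursor, cursor < len(DATA)

def count_page(cursor=None, total=0):
    """One 'task': count a single page, then pass the baton."""
    page, cursor, more = fetch_page(PAGE_SIZE, start_cursor=cursor)
    total += len(page)
    if more:
        # In App Engine: deferred.defer(count_page, cursor, total)
        return count_page(cursor, total)
    return total  # in App Engine: log or persist the final total here

print(count_page())  # -> 10
```

Each hop carries only two small values (cursor and total), which is what makes the chain restartable across task deadlines.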

Sharded counters are a lot of overhead if you only need/want periodic count statistics (e.g., a daily count of all my accounts in the system by, e.g., country).

It is better to use Google App Engine backends. Backends are exempt from the 60-second deadline for user requests and the 10-minute deadline for tasks, and can run indefinitely. Please take a look at the documentation here: https://developers.google.com/appengine/docs/java/backends/overview

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow