Question

I have a few hundred keys, all of the same Model, which I have pre-computed:

candidate_keys = [db.Key(...), db.Key(...), db.Key(...), ...]

Some of these keys refer to actual entities in the datastore, and some do not. I wish to determine which keys do correspond to entities.

It is not necessary to know the data within the entities, just whether they exist.

One solution would be to use db.get():

keys_with_entities = set()
for entity in db.get(candidate_keys):
  if entity:
    keys_with_entities.add(entity.key())

However, this procedure would fetch all entity data from the store, which is unnecessary and costly.

A second idea is to use a Query with an IN filter on key_name, manually fetching in chunks of 30 to fit the requirements of the IN pseudo-filter. However keys-only queries are not allowed with the IN filter.

Is there a better way?

Solution

IN filters are not supported directly by the App Engine datastore; they're a convenience that's implemented in the client library. An IN query with 30 values is translated into 30 equality queries on one value each, resulting in 30 regular queries!
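
To make the fan-out concrete, here is a rough sketch of what an IN filter amounts to once the client library expands it. MyModel, prop, and values are placeholders of my own, and the real library additionally merges the per-value results, but the cost profile is roughly this:

# Illustration only: an IN filter behaves roughly like one equality query per value.
results = []
for value in values:
  results.extend(MyModel.all().filter('prop =', value).fetch(1000))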

Due to round-trip times and the expense of even keys-only queries, I suspect you'll find that simply attempting to fetch all the entities in one batch fetch is the most efficient. If your entities are large, however, you can make a further optimization: For every entity you insert, insert an empty 'presence' entity as a child of that entity, and use that in queries. For example:

foo = AnEntity(...)
foo.put()
presence = PresenceEntity(key_name='x', parent=foo)
presence.put()
...
def exists(keys):
  # Build each entity's 'presence' child key and batch-get those instead;
  # db.get() returns None for keys with no stored entity.
  test_keys = [db.Key.from_path('PresenceEntity', 'x', parent=x) for x in keys]
  return [x is not None for x in db.get(test_keys)]
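
For completeness, a sketch of how this could be wired up to the question's candidate_keys; the zip bookkeeping is my own and not part of the original answer:

keys_with_entities = set()
for key, present in zip(candidate_keys, exists(candidate_keys)):
  if present:
    keys_with_entities.add(key)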

OTHER TIPS

At this point, the only solution I have is to manually query by key with keys_only=True, once per key.

for key in candidate_keys:
  if MyModel.all(keys_only=True).filter('__key__ =', key).count():
    keys_with_entities.add(key)

This may in fact be slower than just loading the entities in a batch and discarding them, although the batch load also hammers the Data Received from API quota.
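
For comparison, a hedged sketch of that batch-load approach, chunked defensively (the helper name and chunk size are my own choices; db.get() returns None for keys with no entity):

def existing_keys(keys, chunk_size=500):
  # Batch-get the candidates in chunks and keep the keys whose result is not None.
  found = set()
  for i in range(0, len(keys), chunk_size):
    chunk = keys[i:i + chunk_size]
    for key, entity in zip(chunk, db.get(chunk)):
      if entity is not None:
        found.add(key)
  return found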

How not to do it (update based on Nick Johnson's answer):

I was also considering adding a property specifically for the purpose of being able to scan for it with an IN filter.

class MyModel(db.Model):
  """Some model"""
  # ... all the old stuff
  the_key = db.StringProperty(required=True) # just a duplicate of the key_name

#... meanwhile back in the example

for key_batch in batches_of_30(candidate_keys):
  key_names = [x.name() for x in key_batch]
  found_keys = MyModel.all(keys_only=True).filter('the_key IN', key_names)
  keys_with_entities.update(found_keys)
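
The batches_of_30 helper above is not defined in the original; a minimal sketch of what it is assumed to do:

def batches_of_30(keys):
  # Yield successive chunks of at most 30 keys (sketch, not from the original post).
  for i in range(0, len(keys), 30):
    yield keys[i:i + 30]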

The reason this should be avoided is that an IN filter on a property performs a sequential index scan plus a lookup for each item in the IN set. Each lookup takes 160-200 ms, so a batch of 30 keys costs roughly 5-6 seconds, which very quickly becomes a very slow operation.

Licensed under: CC-BY-SA with attribution