Question

In my Python AppEngine application, I have an ndb class which I am running a MapReduce job over in order to remove some old objects that are no longer needed.

The class is as follows:

from google.appengine.ext import ndb

class UserModel(ndb.Model):
    is_backup_object = ndb.BooleanProperty(default=False)
    # ... other properties ...

And the MapReduce job is as follows:

from mapreduce import operation as op
# Note: userobject is an instantiation of UserModel
def mapreduce_update_userobject(userobject):
    # This will remove "backup" userobjects from the database, while leaving
    # "normal" userobjects alone
    if userobject.is_backup_object:
        yield op.db.Delete(userobject)

When I run the MapReduce job over the many userobjects I want to delete (is_backup_object = True), some of them are not removed, even though their is_backup_object value is True.

Questions:

  1. Are the MapReduce datastore mutation operations operation.db.Put and operation.db.Delete designed to work with NDB entities?
  2. Would the NDB automatic caching interfere with the removal of the yielded objects (or perhaps show outdated objects in the datastore viewer)?
  3. Is there a specific way we should be yielding NDB objects that is different from standard db objects?
  4. Is there any other possible explanation for this strange behaviour that I have witnessed?
  5. If I am doing something incorrectly, then what is the best way to efficiently batch process NDB database entities with mapreduce?

Solution

To answer your questions:

  1. Yes. I use operation.db.Put in my own MapReduce pipeline and ndb models work fine.
  2. No, caching does not seem to interfere with db operations.
  3. No, it is the same for db and ndb.
  4. It could be due to eventual consistency. Since you are iterating over your entities with MapReduce, you are probably not using ancestor queries, so you cannot be sure the deletions will be visible immediately. There could be other factors; see below.
  5. MapReduce is excellent for batch processing, so you are on the right path (a sketch of how such a job can be started is given after this list).
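
For points 1 and 5, here is a minimal sketch of how a job like this can be started programmatically with the MapReduce library's control.start_map and the bundled DatastoreInputReader. The dotted module paths (myapp.mappers, myapp.models) and the shard count are placeholders; adjust them to your application.

from mapreduce import control

# Kick off the deletion job. DatastoreInputReader iterates over every
# UserModel entity and passes it to the mapper from the question, which
# yields op.db.Delete for the backup objects.
control.start_map(
    name="Delete backup UserModel entities",
    handler_spec="myapp.mappers.mapreduce_update_userobject",
    reader_spec="mapreduce.input_readers.DatastoreInputReader",
    mapper_parameters={"entity_kind": "myapp.models.UserModel"},
    shard_count=16)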

The problem you are having with entities that are seemingly not being deleted could be due to a number of reasons. Here are a few:

  • Eventual consistency -- as mentioned above -- so the entities only appear undeleted and are in fact removed a little later (the verification sketch after this list can help rule this in or out).
  • MapReduce not touching all entities, for example because of a bad filter or namespace setting at the start of the pipeline.
  • Errors in the pipeline. These should show up in your logs.
  • Odd caching issues. Confirming or ruling this out would require rigorous testing.
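
If you want to distinguish the eventual-consistency case from entities that genuinely survived the job, a strongly consistent check along the following lines can help. This is only a sketch; the import path for UserModel is a placeholder.

from google.appengine.ext import ndb
from myapp.models import UserModel  # hypothetical import path

def surviving_backup_keys():
    # The keys-only query may lag behind the deletions (eventual consistency),
    # but fetching by key with get_multi() is strongly consistent, so any
    # non-None result here is an entity that really was not deleted.
    keys = UserModel.query(
        UserModel.is_backup_object == True).fetch(keys_only=True)
    return [k for k, e in zip(keys, ndb.get_multi(keys)) if e is not None]
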
Licensed under: CC-BY-SA with attribution