Question

I am trying to move my Python code from dynamodb to dynamodb2 to get access to the global secondary index capability. One concept that is much less clear to me in ddb2 than in ddb is that of a batch. Here's one version of my new code, which is basically a modification of my original ddb code:

item_pIds = []
batch = table.batch_write()
count = 0
while True:
   m = inq.read()
   count = count + 1
   mStr = json.dumps(m)
   pid = m['primaryId']
   if pid in item_pIds:
       print "pid=%d already exists in the batch, ignoring" % pid
       continue
   item_pIds.append(pid)
   sid = m['secondaryId']
   item_data = {"primaryId" : pid, "secondaryId"] : sid, "message"] : mStr}
   batch.put_item(data=item_data)

   if count >= 25:
       batch = table.batch_write()
       count = 0
       item_pIds = []

So what I am doing here is getting (JSON) messages from a queue. Each message has a primaryId and a secondaryId. The secondaryId is not unique: I might get several messages at about the same time that share it. The primaryId is mostly unique; that is, if I get a set of messages at about the same time with the same primaryId, that's bad. However, from time to time, say once every few hours, I may get a message that needs to override an existing message with the same primaryId. So this seems to align well with the statement from the dynamodb2 documentation page (similar to that of ddb):

DynamoDB’s maximum batch size is 25 items per request. If you attempt to put/delete more than that, the context manager will batch as many as it can up to that number, then flush them to DynamoDB and continue batching as more calls come in.

However, what I noticed is that a large chunk of the messages I get through the queue never makes it to the database. That is, when I try to retrieve them later, they are not there. So I was told that a better way of handling batch writes is to do something like this:

with table.batch_write() as batch:
   while True:
        m = inq.read()
        mStr = json.dumps(m)
        pid = m['primaryId']
        sid = m['secondaryId']
        item_data = {"primaryId" : pid, "secondaryId"] : sid, "message"] : mStr}
        batch.put_item(data=item_data)

That is, I call batch_write() only once, similar to how I would open a file once and then write into it continuously. But in this case, I don't understand what the "rule of 25 max" means. When does a batch start and end? And how do I check for duplicate primaryIds? Remembering all messages I have ever received through the queue is not realistic, since (i) I have too many of them (the system runs 24/7) and (ii) as I stated before, occasional repeated ids are OK.

Sorry for the long message.

Solution

A batch will start whenever the request is sent and end when the last request in the batch is completed.

As with any RESTful API, every request comes with a cost, i.e. how many resources it takes to complete that request. With batch_write() in DynamoDB2, the requests are wrapped into a group and queued for processing, which reduces the cost because they are no longer sent as individual requests.

The batch_write() method returns a context manager that handles the individual requests; what you get back loosely resembles a Table object, but it only exposes put_item and delete_item.
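For example, a minimal sketch of the context-manager form (assuming `table` is an already-connected dynamodb2 Table, and the item data is just placeholder values):

# Minimal sketch: put_item calls are collected by the context manager,
# sent to DynamoDB in groups of up to 25, and anything still pending
# is flushed automatically when the "with" block exits.
with table.batch_write() as batch:
    batch.put_item(data={"primaryId": 1, "secondaryId": 10, "message": "..."})
    batch.put_item(data={"primaryId": 2, "secondaryId": 10, "message": "..."})
    # ... more put_item()/delete_item() calls ...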

DynamoDB's max batch size is 25, just like you've read. From the comments in the source code:

DynamoDB's maximum batch size is 25 items per request. If you attempt to put/delete more than that, the context manager will batch as many as it can up to that number, then flush them to DynamoDB & continue batching as more calls come in.
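Applied to your queue loop, a rough sketch of one way to handle the duplicate primaryIds might look like this (assuming `table` and `inq` are set up as in your code; it tracks duplicates only within the current group of 25, since that is the unit sent to DynamoDB in one request):

import json

# Sketch: de-duplicate primaryIds per group of 25 puts, the size at
# which the context manager flushes a request to DynamoDB.
seen_pids = set()
puts_in_group = 0

with table.batch_write() as batch:
    while True:
        m = inq.read()
        pid = m['primaryId']
        if pid in seen_pids:
            # Same primaryId already queued in this group; skip it.
            continue
        seen_pids.add(pid)
        batch.put_item(data={"primaryId": pid,
                             "secondaryId": m['secondaryId'],
                             "message": json.dumps(m)})
        puts_in_group += 1
        if puts_in_group >= 25:
            # At this point the current group should have been flushed
            # to DynamoDB, so the duplicate check can start over.
            seen_pids.clear()
            puts_in_group = 0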

You can also read about migrating from DynamoDB to DynamoDB2 (batches in particular) here.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow