Serverless event-sourced architecture using AWS offerings

https://softwareengineering.stackexchange.com/questions/376325

07-02-2021

Question

This is my first post on the Software Engineering Stack Exchange so let me know if something is wrong with it.

I'm looking into the serverless offerings of Amazon to try to figure out if that is the way to go for a few new projects I have in mind. I'm particularly interested in an event-sourced, CQRS model, as I find the purported advantages of such a model very attractive in this instance. But I'm having a little bit of trouble understanding all of the services Amazon has to offer, what their pros and cons are, and how it all fits together. I'll give some context first and state my questions afterwards.

I'll use an example application to illustrate what I'm after:

It's a simple (static) web application, hosted in S3 and served over Cloudflare.

It has two actions: One command and one query (in CQRS terms).

The command posts an event onto the event stream to increment a counter.

The query gets the current state of the counter, i.e. how many times it has been incremented.

That's it, so how do I implement this using serverless AWS technology? Here's what I'm thinking so far:

To send the command to increment the counter, the web application sends AJAX requests to a lambda L1 (through an API gateway). This lambda L1 posts an event to the event stream.

Another lambda L2 listens to the event stream and stores a record of the event/command so that it can be replayed at a later date if need be.

Yet another lambda L3 listens to the event stream and executes the command. In other words, it fetches the current state of the counter, increments it and persists the new state atomically.

To send the query, the web application sends an AJAX request to lambda L4 (through an API gateway), which queries the state and returns the result.

This seems like it should be a fairly straightforward, minimal project. Here are my concerns so far:

First of all, what should my event stream look like? I have seen many suggestions floating around, each one more convoluted and contrived than the last. Various fanning out strategies, mixtures of SNS, SQS, Kinesis, DynamoDB streams, you name it... I fear I will end up with too many moving parts, a cost-ineffective system that's difficult to scale in the sense that the complexity makes it difficult to develop for.

Second, can I achieve atomicity? The event stream services I mentioned above typically have some sort of "at-least-once delivery" property, which needs to be handled by the consumer. One suggestion I have seen is to make every event idempotent, but that does not seem feasible in my example application. Two clients could increment the counter at the same time, and one of the increments could get "lost" because both of the commands would say "the counter is now at 17 (for example)". You could argue that this is correct behavior, since both clients saw the number as 16 and wanted to increment it to 17, but let's say in this situation we would like both increments to count toward the total. We want our command to represent only a delta between the two states. Is there any way to achieve this?

Third, lambdas L3 and L4 both need to be able to access some sort of persistence layer. Ideally I would like this to be a relational database (SQL) so that I can perform advanced queries on the current application state. It's not necessary for my incrementing counter example, but will be necessary for the projects I have in mind. I think this only leaves me with one option if I want to stay serverless: Aurora Serverless. That's fine by me, but it's my understanding that Aurora needs to run in a VPC, and that lambdas need to run in the same VPC to have access to Aurora. I'm very concerned about performance here, as L3 is the single congestion point in my example (everything else is append-only or read-only). My understanding is that VPCs incur a pretty hefty performance cost (throughput, number of connections, bandwidth), and that lambdas in VPCs can have cold starts of upwards of 10 seconds. How can I tackle these problems? Alarm bells are going off in my head that this just introduces more problems than it solves. I would probably have to ping L4 continuously so that it never cold starts (a 10-second load time is unacceptable), and at that point, am I really going serverless? If this is a bad idea, are there any better alternatives? Do I have to persist state in DynamoDB as well, losing querying capabilities?

This post is already pretty lengthy so I'll leave it at these three concerns for now. Aside from answering my questions directly, if you could help me clear up any misunderstandings, offer alternative solutions, etc. I would be grateful!

Solution

I don't think you are going to find any authoritative answers to this, although there are a fair number of articles on the subject.

You should probably anticipate (in the first pass, anyway) that any mutable state in your solution could have more than one writer. So you should anticipate that all mutable writes use some sort of predicate/validator to support a conditional PUT.

Both DynamoDB and S3 support conditional puts, so they are options -- but they aren't necessarily free, in the sense that you'll need to think through your storage strategy and implement the appropriate semantics on top of them.

During the breakout on event sourcing at re:Invent 2017, DynamoDB was the primary persistence choice in the discussion. The conclusion reached there was that batch writes would not work in a multiple-writer scenario -- each conditional write would insert a single row.
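As a rough sketch of what "one conditional write per row" could look like, the helper below builds the arguments for a single DynamoDB `PutItem` request that appends one event and is rejected if a row with that version already exists. The table name `events`, the key attributes `stream_id` and `version`, and the function `build_append_event_request` are assumptions made for illustration; they do not come from the session.

```python
import json


def build_append_event_request(stream_id, expected_version, payload, table="events"):
    """Build kwargs for a low-level DynamoDB PutItem call.

    Assumes a hypothetical table keyed by (stream_id, version). The writer
    states which version it last saw; the new event is written at the next
    version, and the condition makes the write fail if another writer has
    already stored an item under that key.
    """
    next_version = expected_version + 1
    return {
        "TableName": table,
        "Item": {
            "stream_id": {"S": stream_id},
            "version": {"N": str(next_version)},
            "payload": {"S": json.dumps(payload)},
        },
        # Evaluated against any existing item with the same primary key.
        "ConditionExpression": "attribute_not_exists(#v)",
        "ExpressionAttributeNames": {"#v": "version"},
    }


request = build_append_event_request(
    "counter-1", 16, {"type": "Incremented", "delta": 1}
)
```

The resulting dictionary could be passed to something like `boto3.client("dynamodb").put_item(**request)`. A writer that loses the race gets a `ConditionalCheckFailedException`, at which point it would reload the stream and decide whether to retry.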

You also need to think about what reliability guarantees you need -- is it safe to broadcast events before you store them? Should L1 be reporting success before the persistent changes have been verified?

My best guess at an MVP would be to have your L1 endpoints reading and writing to DynamoDB, and then chain other idempotent consumers off the back of that (i.e., a lambda that reads events out of DynamoDB and writes them to SNS, Kinesis, etc.).
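One way to picture an idempotent consumer for the counter example is a projection that remembers which event ids it has already applied, so a redelivered event is ignored while distinct increments still add up. This is an in-memory sketch only; the class `CounterProjection` and the event fields `id` and `delta` are hypothetical names.

```python
class CounterProjection:
    """Read model that tolerates at-least-once delivery by tracking event ids."""

    def __init__(self):
        self.value = 0
        self.seen_event_ids = set()

    def apply(self, event):
        """Apply an event once; return False if it was a duplicate delivery."""
        if event["id"] in self.seen_event_ids:
            return False
        self.seen_event_ids.add(event["id"])
        self.value += event["delta"]
        return True


projection = CounterProjection()
deliveries = [
    {"id": "e1", "delta": 1},
    {"id": "e2", "delta": 1},
    {"id": "e1", "delta": 1},  # the stream redelivers e1
]
results = [projection.apply(event) for event in deliveries]
print(projection.value)  # 2
```

In a real deployment the set of seen ids and the counter value would have to be persisted together (for example in one conditional write or one transaction), otherwise a crash between the two updates reintroduces the duplicate problem.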

Depending on your guarantees, you might be able to simplify the writers somewhat via L1 -> SNS -> L2 -> DynamoDB -> etc.

When you say mutable writes should use a predicate, do you mean I should write my events like "the state has been updated from 16 to 17", and then make L2 validate that the state is currently 16 before updating?

Yes -- think "compare and swap", the HTTP If-Match header, the JSON Patch test operation, and so on. If multiple writers need to be enforcing an invariant, then you need some way to ensure that the state being overwritten is the correct one.
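To make the compare-and-swap idea concrete, here is a small single-threaded model of the counter: a write only succeeds if the version the writer read is still current, and a writer that loses re-reads and tries again, so both increments end up counted. `VersionedStore` and `increment` are illustrative names, and the in-memory store stands in for whatever conditional-write mechanism the real persistence layer offers.

```python
class VersionedStore:
    """In-memory stand-in for a store that supports conditional writes."""

    def __init__(self, value=0):
        self.value = value
        self.version = 0

    def read(self):
        return self.value, self.version

    def compare_and_swap(self, expected_version, new_value):
        """Write new_value only if nobody has written since expected_version."""
        if self.version != expected_version:
            return False
        self.value = new_value
        self.version += 1
        return True


def increment(store, delta=1, max_attempts=5):
    """Read, compute, and conditionally write; retry on a lost race."""
    for _ in range(max_attempts):
        value, version = store.read()
        if store.compare_and_swap(version, value + delta):
            return True
    return False


store = VersionedStore(value=16)

# Two writers both read the counter at 16 and try to write 17.
_, stale_version = store.read()
first = store.compare_and_swap(stale_version, 17)   # succeeds
second = store.compare_and_swap(stale_version, 17)  # rejected: version moved on

# The losing writer retries from fresh state, so its increment is not lost.
retried = increment(store)
print(store.value)  # 18
```

The design choice here is that the command carries only the delta, while the predicate (the expected version) is what protects against overwriting state the writer never saw.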

Licensed under: CC-BY-SA with attribution
