Question

Let's assume we have two services, user and auth, with a message broker in between. The user service handles CRUD actions on the user entity and the auth service handles authentication. When a user is created/updated/deleted, the user service publishes an event that the auth service consumes and writes to its own database.

If something goes wrong, for example a new version of the user service is deployed that publishes events in a format the auth service can't process, then when a user is deleted in the user service the deletion never reaches the auth service, and the user will still be able to authenticate.

What can one do to handle or recover from this situation?


Solution

Microservices aim to be loosely coupled and independently deployable.

Could there be something wrong?

If every change to your user service risks altering the event message in a way the auth service can no longer process, there is probably something wrong with your design:

  • either the two services are in reality tightly coupled: since they are not independent, they should belong to the same microservice. Maybe consider another decomposition strategy (see decomposition patterns here);
  • or the interface between the services is at the wrong level of abstraction: the interface (here the message format) leaks details that create a dependency that shouldn't be there. Then consider rethinking your interface.

Or could it be about facilitating deployment?

If it's very rare and caused by a major evolution of one of the services, you need to engineer a transition strategy into your services to support independent deployment over the long run.

You could for example consider:

  • Format versioning: the message format should be versioned so that any service consuming it can verify message version compatibility dynamically (see the sketch after this list). Take the real-life example of SAML, whose xmlns:saml namespace allows consumers to determine which version is used (1.1 is backwards compatible with 1.0, but 2.0 is not backwards compatible). SemVer-like versioning could facilitate compatibility checking.
  • Backwards compatibility: design an evolutionary format for your event message, with a format version number, the old part remaining unchanged, and the new information being added for the services that can use it.
  • Transitional compatibility: a variant of backwards compatibility where the format multiplexes different format versions. Once all the consumers support the new format, you can release a new version that drops the old parts. A real-world example could be multipart MIME, which allows for alternative subparts. But in your context, I'd see this more as a workaround: evolution would be best managed with backwards compatibility, whereas the disruption caused by a new major version can be much more than the disruption of reading a message format, and may therefore be better managed with the next proposal:
  • Side-by-side, also called the Darwinistic approach: your new major version is released with new events. Old and new services coexist as long as there are still subscribers to the old one. Service discovery can help the other services find the most suitable user service to rely on. You could consider making a bridge service to keep old and new in sync for a while, if relevant. You could also divert new incoming user registrations to the new version of the service using a "Blue/green" deployment.
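To illustrate the format-versioning idea, here is a minimal sketch of a consumer-side compatibility check. It assumes a hypothetical JSON event envelope carrying a SemVer-like "schema_version" field and a "payload"; the names and the handling function are illustrative, not a specific library's API.

```python
import json


class UnsupportedSchemaVersion(Exception):
    """Raised when a message uses a major version the consumer doesn't know."""


SUPPORTED_MAJOR_VERSIONS = {1}  # this auth service understands the 1.x formats


def apply_user_change(payload: dict) -> None:
    ...  # hypothetical: write the change to the auth service's own database


def handle_user_event(raw_message: bytes) -> None:
    event = json.loads(raw_message)
    major = int(str(event.get("schema_version", "0.0")).split(".")[0])

    if major not in SUPPORTED_MAJOR_VERSIONS:
        # Unknown major version: don't guess, park the message (e.g. in a
        # dead letter queue) and alert the team instead of misreading it.
        raise UnsupportedSchemaVersion(event.get("schema_version"))

    apply_user_change(event["payload"])  # normal processing path
```

With backwards-compatible (minor) changes the check passes untouched; only a new major version forces the consumer to stop and ask for help.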

OTHER TIPS

I believe there are three typical ways you could handle this:

  1. Guarantee delivery of your messages
  2. Run a reconciliation process
  3. Switch to a pulled events approach

A little detail on each approach...

Guarantee delivery of your messages

This is for when you really want to make sure that your messages get through, and as soon as possible. Firstly, use the transactional outbox pattern in the user service to ensure that all messages that should be sent as a result of a database transaction are successfully sent to the message broker. Secondly, design and deploy your message broker to have high availability and (more importantly) very high durability (i.e. 99.99....?% of messages are not lost). Thirdly, ensure that messages are not ACK'd by the auth service until they've been processed and the results committed to your db (the receive-side analog of the transactional outbox). If it's really, really important that you never lose messages, you might also want to keep a Sent Messages log file at the user service. In the case of data loss in your message broker, you can then replay messages from the log.
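To make the sending side concrete, here is a rough sketch of the transactional outbox, assuming an SQLite-style DB-API connection `db`, a hypothetical `broker` client, and an `outbox` table with `id`, `topic`, `payload` and `sent_at` columns (all assumptions). The point is that the business change and the outbox row are committed in the same transaction, and a separate relay publishes unsent rows.

```python
import json
import uuid


def delete_user(db, user_id: str) -> None:
    # One transaction: the business change and the outbox row commit together,
    # so a message can never be "lost" between the database and the broker.
    with db:
        db.execute("DELETE FROM users WHERE id = ?", (user_id,))
        db.execute(
            "INSERT INTO outbox (id, topic, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "user.deleted", json.dumps({"user_id": user_id})),
        )


def relay_outbox(db, broker) -> None:
    # Runs periodically (or tails the table) and publishes rows not yet sent.
    rows = db.execute("SELECT id, topic, payload FROM outbox WHERE sent_at IS NULL")
    for msg_id, topic, payload in rows:
        broker.publish(topic, payload)  # at-least-once delivery; consumers dedupe
        db.execute(
            "UPDATE outbox SET sent_at = CURRENT_TIMESTAMP WHERE id = ?", (msg_id,)
        )
```

The receive side mirrors this: process, commit to the auth database, and only then ACK the message.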

Run a reconciliation process

If you're okay with the "eventual" of your eventual consistency being a little longer, and the amount of data that's shared between the services is not prohibitively large, you can run a reconciliation process. This would typically be done either by the user service regularly exporting a dump, or by the auth service regularly requesting all the data owned by the user service which it is caching[1]. Either way, auth regularly receives user service's full picture of the world and can either update itself, if that's relatively easy to do, or alert humans to intervene if it detects an inconsistency that can't be automatically handled. It's a good idea to ensure that you don't overwrite changes in auth that came from recently received messages with stale data from user that predates those messages.
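A rough sketch of such a reconciliation pass in the auth service is below. The helpers `fetch_user_dump`, `local_user_ids`, `delete_local_user` and `alert` are hypothetical placeholders for the real integration points; the `only_if_older_than` guard is the "don't overwrite fresher data" rule from above.

```python
def reconcile(fetch_user_dump, local_user_ids, delete_local_user, alert):
    dump = fetch_user_dump()                     # the user service's full view
    dump_taken_at = dump["taken_at"]             # when that snapshot was made
    source_ids = {u["id"] for u in dump["users"]}
    cached_ids = set(local_user_ids())           # what auth currently caches

    # Users auth still knows about but the owner has deleted:
    for user_id in cached_ids - source_ids:
        # Skip rows updated after the snapshot, so a fresh event isn't undone.
        delete_local_user(user_id, only_if_older_than=dump_taken_at)

    # Users the owner has but auth never heard about: needs investigation.
    for user_id in source_ids - cached_ids:
        alert(f"user {user_id} exists in user service but not in auth")
```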

Switch to a pulled events approach

This one's an alternative to using a message broker, so a little orthogonal to your question, but it's worth considering. Instead of pushing events through a message broker, you can provide an events/ endpoint on your user service. The auth service then becomes responsible for knowing how far it has read in the event stream, for processing events in order, and for calling developers for help if it can't understand the data it receives.
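A sketch of that pull loop is below, assuming a hypothetical `GET /events?after=<offset>` endpoint on the user service and caller-supplied helpers for persisting the read offset and applying events; it uses the `requests` library for the HTTP call.

```python
import time

import requests

USER_SERVICE_EVENTS_URL = "http://user-service/events"  # assumed endpoint


def poll_forever(load_offset, save_offset, apply_event):
    while True:
        offset = load_offset()  # last offset the auth service has processed
        resp = requests.get(USER_SERVICE_EVENTS_URL, params={"after": offset})
        resp.raise_for_status()

        for event in resp.json():          # events returned in order
            apply_event(event)             # may raise -> stop here and alert
            save_offset(event["offset"])   # persist progress after each event

        time.sleep(5)                      # simple fixed polling interval
```

Because the consumer owns its offset, recovering from a bad deployment is just "fix the consumer, resume from the last good offset".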

Avoiding the problem in the first place

You do want to design for the day when this kind of error occurs. But it's a good idea to also put into place practices that will greatly lower the likelihood. One such practice you probably want to look into is consumer-driven contracts, which essentially try to catch such errors at build time by breaking the build.
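A minimal, home-grown flavour of that idea is sketched below: the auth team declares which fields it relies on, and the user service's build runs a test asserting its published event still satisfies that contract. The event factory is a hypothetical stand-in for the user service's real code; dedicated tooling such as Pact does the same thing with more machinery.

```python
# Contract declared by the consumer (auth service) and checked in the
# producer's (user service's) build.
AUTH_SERVICE_CONTRACT = {
    "required_fields": {"schema_version", "event_type", "user_id"},
}


def build_user_deleted_event(user_id: str) -> dict:
    # Stand-in for the user service's real event factory.
    return {"schema_version": "1.2", "event_type": "user.deleted", "user_id": user_id}


def test_user_deleted_event_satisfies_auth_contract():
    event = build_user_deleted_event(user_id="42")
    missing = AUTH_SERVICE_CONTRACT["required_fields"] - event.keys()
    assert not missing, f"breaking the auth service contract, missing: {missing}"
```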

Handling the problem well

Also, when the problem does happen, it's nice to handle it gracefully. This is typically done using what's called a Dead Letter Queue, where any message that causes an error at the receiver gets removed from the inbox and placed in a separate queue. The messages are then kept in that queue until something is changed in the system to allow them to be processed again. In your scenario, you would probably push all the DLQ messages back into the inbox after deploying a new version of the auth service that has been updated to understand the new message format.
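A sketch of dead-letter handling on the consumer side is below; `inbox` and `dead_letter_queue` are hypothetical broker handles (most brokers, e.g. RabbitMQ or SQS, offer this natively via configuration), and `process` is whatever parses the event and writes to the auth database.

```python
def consume(inbox, dead_letter_queue, process):
    for message in inbox:
        try:
            process(message.body)  # parse + commit to the auth database
            message.ack()          # only ACK after the result is committed
        except Exception as error:
            # Can't handle it now: park it instead of dropping it or
            # retrying forever and blocking the queue.
            dead_letter_queue.publish(message.body, headers={"error": repr(error)})
            message.ack()          # remove the poison message from the inbox


# After deploying an auth service that understands the new format, a small
# script can republish everything from the DLQ back into the inbox.
```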

[1] I say "caching" here because the data should only ever be owned by one service. It seems user service owns this data, and auth service is caching it for its own purposes. While this duplicates data, it's a good pattern because it increases service autonomy.

How would you recover from that if this was paperwork between departments?

Perhaps someone at the auth department would notice the error and request the paperwork be redone.

Perhaps someone at the user department would notice that the auth department failed to reply in the expected time frame (or replied with a negative acknowledgement), and raise a flag.

Additionally, we might have versioning approaches that only allow optional fields to be added, and that allow the optional fields to be ignored.
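In code, that "only add optional fields, ignore what you don't know" style could look like the small sketch below; the event shape is hypothetical.

```python
def read_user_event(event: dict) -> tuple[str, str]:
    # Required fields: fail loudly if they are absent.
    user_id = event["user_id"]
    action = event["event_type"]
    # Newer, optional fields get defaults; unknown keys are simply ignored.
    _display_name = event.get("display_name", "")  # added in a later version
    return user_id, action
```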

Further, there might be some kind of exhaustive testing perhaps driven off of a detailed description of the message schema, so both message senders and message receivers could be independently tested to work to spec (above and beyond just working with each other).

Even when using a message broker that addresses some delivery problems, we might also digitally sign or encrypt messages so they cannot be tampered with, e.g. accidentally by dropped or flipped bits, or otherwise.
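If you go that route, a minimal integrity check could look like the sketch below, using Python's standard hmac library; how the key is shared and how the signature travels with the message are assumptions.

```python
import hashlib
import hmac

SHARED_KEY = b"replace-with-a-real-secret"  # assumed to be distributed out of band


def sign(payload: bytes) -> str:
    return hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, signature: str) -> bool:
    # A corrupted or tampered payload fails this check on the consumer side.
    return hmac.compare_digest(sign(payload), signature)
```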

And lastly, perhaps user and auth should be the same service.

Licensed under: CC-BY-SA with attribution