Question

We are working on an event-driven system built on a streaming technology (Event Hubs/Kinesis/Kafka). Imagine some system is generating events that are sent to the event stream. Multiple different processors then consume those events, each doing something different with them and updating its own internal state (persisted in a DB). The question is: what if one of the processors has a bug, and the state it persisted in the DB is therefore wrong? What are the best practices for recovering from such bugs?

PS: I kind of assume that if the bug were in the event producer I would issue compensating events, but what if one of the processors (a consumer) is the one that is wrong?

Solution

A Bug!

Fix the bug.

Otherwise the problem will manifest again.

And it corrupted the database!!!

Have you been maintaining good database hygiene?

  • Backups
  • Transaction logs
  • Reconciliation with other systems
  • Copies of events on the event stream

Then:

  • If you have a known good backup and all of the events that have happened since, you can rebuild the database after fixing the bug and then swap over to the recomputed database.

  • If you do not have a known good backup but you do have a list of transactions, you can at least identify the bad state and correct it manually. Being in the database does not put data beyond manual modification. (Of course, the modification should be scripted and tested.)

  • If you have comprehensive reconciliation, use it to guide your fixes.

  • If the data is a copy from an upstream system, purge and resync it, or verify and recover it. This can be done manually with a database tool or through an automated batch process.
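
The first three options above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the `Event` shape, the account-balance state, and the names `apply_fixed`, `rebuild`, and `diff` are all hypothetical, and it assumes the backup records the stream offset it was taken at and that the events since then are still available in order.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical event shape: a signed amount applied to an account."""
    offset: int   # position of the event in the stream
    account: str
    amount: int


State = Dict[str, int]  # hypothetical persisted state: account -> balance


def apply_fixed(state: State, event: Event) -> None:
    """The processor logic *after* the bug fix (illustrative only)."""
    state[event.account] = state.get(event.account, 0) + event.amount


def rebuild(
    backup: State,
    backup_offset: int,
    events: Iterable[Event],
    apply: Callable[[State, Event], None] = apply_fixed,
) -> State:
    """Replay every event newer than the backup onto a copy of the backup."""
    state = dict(backup)  # never mutate the known good backup
    for event in sorted(events, key=lambda e: e.offset):
        if event.offset > backup_offset:
            apply(state, event)
    return state


def diff(live: State, rebuilt: State) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Report keys whose live value differs from the recomputed value."""
    keys = sorted(live.keys() | rebuilt.keys())
    return {
        k: (live.get(k), rebuilt.get(k))
        for k in keys
        if live.get(k) != rebuilt.get(k)
    }


# Known good backup, taken after the event at offset 2 was processed.
backup = {"a": 10, "b": 5}
events = [
    Event(1, "a", 10),
    Event(2, "b", 5),
    Event(3, "a", -4),
    Event(4, "c", 7),
]
# State written by the buggy processor (here it ignored the negative amount).
live = {"a": 10, "b": 5, "c": 7}

rebuilt = rebuild(backup, backup_offset=2, events=events)
print(diff(live, rebuilt))  # → {'a': (10, 6)}
```

The recomputed state can be written to a fresh database and swapped in, or the diff alone can drive a scripted, tested correction of the affected rows in the live database.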

Otherwise, you are in a tough spot:

  • No known good point to return to
  • No knowledge of what that faulty processor did
  • Nothing to reconcile the database with, so you cannot tell whether a mismatch remains, even after you believe you have fixed it
  • No authoritative source
Licensed under: CC-BY-SA with attribution