Ops in event-driven paradigm

https://softwareengineering.stackexchange.com/questions/411649

12-03-2021

Question

TL;DR: Event-driven systems seem to focus on a high-level view of the system ("error rate is 0.5%"). How are IT operations supposed to locate and remedy individual issues in such systems?

In today's mainstream push towards distributed systems, event-driven architecture is often highly regarded. The benefits most often quoted for this naturally async architecture are:

ability to achieve loose coupling
clean design of one-to-many calls
ability to apply backpressure/feedback
often natural horizontal scalability.

The drawbacks usually mentioned are things like:

lack of transactional processing
no promise of event order
exactly-once event delivery being hard.

All this totally makes sense to me. However, I can't really figure out the operations side of this.

In a non-trivial business application there are often dozens of services connected by dozens of queues/topics/?. Each event travels through a subset of the application's services and queues/topics/? during processing. Ops typically need to know when an error has occurred in the processing of an event, and need the ability to react to it.

A typical approach seems to be employing event observability: each event carries a unique ID as a correlation ID through the processing. This way one can log the lineage of an event and obtain KPIs like error ratio, average throughput, etc. But this is a very high-level view for typical ops tasks.
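To make the approach concrete, here is a minimal sketch of correlation-ID logging. The event shape and the names `new_event`, `derive_event`, `lineage_record` and `error_ratio` are assumptions made for illustration, not part of any particular broker or logging library.

```python
import json
import uuid


def new_event(event_type, payload):
    """Create an event carrying a fresh correlation ID."""
    return {
        "correlation_id": str(uuid.uuid4()),
        "type": event_type,
        "payload": payload,
    }


def derive_event(parent, event_type, payload):
    """Create a follow-up event that inherits the parent's correlation ID."""
    return {
        "correlation_id": parent["correlation_id"],
        "type": event_type,
        "payload": payload,
    }


def lineage_record(service, event, status):
    """Return one JSON log line describing what a service did with an event."""
    return json.dumps(
        {
            "correlation_id": event["correlation_id"],
            "service": service,
            "event_type": event["type"],
            "status": status,
        },
        sort_keys=True,
    )


def error_ratio(log_lines):
    """Compute the share of log lines whose status is 'error'."""
    records = [json.loads(line) for line in log_lines]
    if not records:
        return 0.0
    return sum(r["status"] == "error" for r in records) / len(records)
```

Grouping such log lines by `correlation_id` gives the lineage of one event, while aggregating over all of them gives KPIs such as the error ratio.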

Ops have to handle classic tasks like: "What's the status of invoice no. X?" or "Why wasn't user John Doe able to order product Y?". How does one find and remedy those cases in an event-driven application?

The first issue is to somehow locate the right event(s). From the observability logs ops have to be able to find the right event ID, so one has to log virtually every single attribute of every event. That doesn't sound right.

Then they need to locate the event in the system: it might be in almost any dead-letter queue (DLQ), it might be stuck in any broken/slow queue/topic/?, etc. Is there a common ability to somehow query such systems?

Lastly, after the fix is in place, ops need to replay the event. It doesn't seem to be common to manually pick events from queues/topics/? and place or reroute them somewhere else. Is this widely supported?

Solution

Tooling

You are going to need to get the developers on board with providing you with tools to help aggregate and make a picture out of the data, along with the ability to replay events (or generate new messages to get the system repaired).
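As a rough idea of what such tooling could look like, the sketch below keeps an index from a few business keys (for example an invoice number) to correlation IDs, reports which queue currently holds an event, and moves it from a dead-letter queue back to a working queue. Everything here is hypothetical: `OpsToolkit`, the `business_keys` field and the in-memory queues are stand-ins, and a real tool would have to use whatever inspection and republish facilities your broker actually offers.

```python
from collections import defaultdict, deque


class OpsToolkit:
    """In-memory stand-in for an ops tool; a real one would talk to the broker."""

    def __init__(self):
        self.index = defaultdict(set)     # (key name, value) -> correlation IDs
        self.queues = defaultdict(deque)  # queue name -> pending events

    def publish(self, queue, event):
        """Put an event on a queue and index its business keys."""
        for key, value in event.get("business_keys", {}).items():
            self.index[(key, value)].add(event["correlation_id"])
        self.queues[queue].append(event)

    def find_ids(self, key, value):
        """Return the correlation IDs recorded for one business key."""
        return set(self.index.get((key, value), set()))

    def locate(self, correlation_id):
        """Return the names of every queue currently holding the event."""
        return [
            name
            for name, queue in self.queues.items()
            if any(e["correlation_id"] == correlation_id for e in queue)
        ]

    def replay(self, correlation_id, source, target):
        """Move matching events from source (e.g. a DLQ) to target; return the count."""
        if source == target:
            raise ValueError("source and target must differ")
        kept, moved = deque(), 0
        for event in list(self.queues[source]):
            if event["correlation_id"] == correlation_id:
                self.queues[target].append(event)
                moved += 1
            else:
                kept.append(event)
        self.queues[source] = kept
        return moved
```

Indexing only a handful of agreed business keys, rather than every attribute of every event, is the design choice that keeps "What's the status of invoice no. X?" answerable without logging everything.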

You are also going to need to push back on the business to get the funds/bandwidth to get these tools and keep them current with new work.

Why would the Business invest in Tooling?

The business is not paying to build software, but to operate it.

Contrast X developers * Y hours vs. A incidents * B operators * C hours + Overtime + Downtime.
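The comparison can be written out as two small functions. This is only a sketch: the single `hourly_rate` parameter is an added assumption so that both sides come out in money, and the figures in the usage example are made up for illustration.

```python
def development_cost(developers, hours, hourly_rate):
    """X developers * Y hours, priced at an assumed hourly rate."""
    return developers * hours * hourly_rate


def operating_cost(incidents, operators, hours_per_incident, hourly_rate,
                   overtime=0, downtime=0):
    """A incidents * B operators * C hours, plus overtime and downtime costs."""
    return incidents * operators * hours_per_incident * hourly_rate + overtime + downtime


# Hypothetical figures: 2 developers for 80 hours versus 30 incidents a year,
# each tying up 2 operators for 4 hours, plus overtime and downtime costs.
build = development_cost(developers=2, hours=80, hourly_rate=100)
run = operating_cost(incidents=30, operators=2, hours_per_incident=4,
                     hourly_rate=100, overtime=3000, downtime=10000)
print(build, run)
```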

The upfront development has a cost, but running costs will almost always swamp it.

As a bonus, if done well, that same tooling can also be used to test system integration, which lets testing (an expensive component of development) be done with greater certainty and at lower cost.

Quantifying

This depends on the desires the business has, and the structure of your own team.

The first step is research, find out:

what the sources of incidents are, frequency, total time, overtime taken, and what opportunity cost they presented (failed to be online during high sales period, during big sporting event, for a press conference, etc...).
what the sources of friction are in resolving incidents, frequency, confusions, overtime taken, and opportunity costs.

That should give you a picture of grievances that can be resolved/reduced.

This is where knowing the business's desires comes in. Pick a short list of incidents/frictions with common factors, like having to locate a misplaced event.

Ask the development team to estimate two pieces of work.
- The first is to address the cause of incident/friction
- The second is to provide a tool that reduces the time required to identify the corrective action.

Using your own research:
- Work out the average overtime spend (in time, and money)
- Work out the average normal time spend (in time, and money)
- Work out the approx. opportunity cost (in time, and money)
- Work out how much quicker the incidents can be resolved.
- Project how many days/months till break even (where development cost = saved operational cost)
  - Only use overtime spend and opportunity cost if you are not reducing head count on the team.
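The break-even projection from the list above can be sketched as follows. The function names and the monthly granularity are assumptions, and the head-count rule is one reading of the last bullet: normal-time savings only count as money saved if head count is actually reduced.

```python
import math


def monthly_saving(normal_time_saved, overtime_saved, opportunity_cost_saved,
                   reducing_head_count):
    """Money saved per month once incidents are resolved faster."""
    saving = overtime_saved + opportunity_cost_saved
    if reducing_head_count:
        # Normal working hours are paid either way unless the team shrinks.
        saving += normal_time_saved
    return saving


def break_even_months(development_cost, saving_per_month):
    """Whole months until cumulative savings cover the development cost."""
    if saving_per_month <= 0:
        return None  # the work never pays for itself
    return math.ceil(development_cost / saving_per_month)
```

With made-up inputs, a development cost of 16000 and a saving of 4000 per month give `break_even_months(16000, 4000) == 4`, which is the "it will pay for itself in this many months" figure to take to the business.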

Now you can go to the business with an argument.

If you spend time on fixing/providing us with this tooling, it will pay for itself in this many months, and during these high visibility events we can restore service/resolve issues this much quicker.

No guarantees that you will get your demands, but at least the demand will have teeth.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange