Question

Default values are often suggested as part of a failover mechanism for microservices. At a high level, the operations of any service (a microservice here) can be broadly classified as Reads or Writes.

  1. Having default values for Write operations doesn't sound reliable.
  2. Return values for Read operations can be roughly categorized by data size:

    • Reads returning a small/medium amount of data
    • Reads returning a huge amount of data

Let's assume the source of data is a highly available cache (used for performance, round-trip avoidance, etc., and with its own refresh cycle).

Now, when the cache is down, the failover plan can be:

  • When the data size is small, going back to the actual system to fetch the data over a real-time invocation (assuming the time taken is in the range of milliseconds) seems okay.
  • When the data size is huge and a real-time invocation takes several minutes, doing it over a synchronous call doesn't seem correct.

The solutions I can think of are as follows:

  1. Keep the actual data in persistent storage backed by high availability and use it as a fallback. Data availability is then governed by the HA policy of the persistent storage.
  2. Use some kind of request caching. The cache can have a fixed upper limit on size and keep only the latest requests. It can be reset periodically (with the same frequency as the HA cache refresh) after checking the health of the HA cache: if the HA cache is available, the request cache is reset; otherwise it retains its last state. This essentially moves the data-availability assurance to the platform hosting the microservice(s).
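To make option 2 concrete, here is a minimal sketch of such a bounded request cache. All names here (`BoundedRequestCache`, `on_refresh_cycle`, `is_primary_healthy`) are illustrative, not from any particular library:

```python
from collections import OrderedDict

class BoundedRequestCache:
    """Sketch of option 2: a small local cache of the latest responses,
    cleared on each refresh cycle only if the primary HA cache is
    healthy; otherwise it retains its last state as the fallback."""

    def __init__(self, max_items=1000):
        self.max_items = max_items
        self._items = OrderedDict()

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.max_items:
            self._items.popitem(last=False)  # evict the oldest entry

    def get(self, key):
        return self._items.get(key)

    def on_refresh_cycle(self, is_primary_healthy):
        # Reset only when the HA cache can serve requests; otherwise
        # keep the last known state so it can act as the fallback.
        if is_primary_healthy:
            self._items.clear()
```

The key design point is the asymmetry in `on_refresh_cycle`: the local cache is allowed to go stale precisely when the primary is down, because stale data is the fallback being proposed.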

It would be really helpful to know from the community which of the above is a better fit, or whether there is a better way of handling the problem described in (2).


Solution

You're overthinking it.

A distributed cache's goal is to optimize performance, nothing more. If you expect data to always be in the cache, your design is flawed. You may not have the data for a number of reasons other than (and far more common than) the unavailability of the cache service:

  • The data was not cached yet,
  • The data became obsolete and should be regenerated,
  • The cache service was low on memory and removed the item from the cache.

For this reason, you have to consider the scenario of the data not being in the cache anyway. In your case, this means you should explicitly handle the case of a request taking minutes (by doing it asynchronously).
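The point about treating a cache outage like any other cache miss can be sketched as a small read wrapper. The callables `cache_get` and `fetch_from_source` are hypothetical placeholders for your actual cache client and system of record:

```python
def read_with_fallback(key, cache_get, fetch_from_source):
    """Treat a cache outage exactly like a cache miss: any failure to
    read from the cache falls through to the system of record."""
    try:
        value = cache_get(key)
        if value is not None:
            return value
    except ConnectionError:
        pass  # cache down: take the same code path as a plain miss
    return fetch_from_source(key)
```

Since the miss path must exist regardless, the outage path adds no new branches to reason about; it only changes how often the slow path runs.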

The only additional problem caused by the cache service being potentially down isn't with long requests, but with many short ones. If you expect to make a few hundred requests (1 ms each) per second to the cache and the cache stops responding, so that every request now takes 500 ms (the timeout), you may exhaust the pool of HTTP connections, not counting the consequences for your users' experience. To protect yourself from this scenario, use the microservices' circuit breaker pattern.
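In production you would use an existing library for this, but as a rough sketch of the pattern (thresholds and class name are made up for illustration):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: after `max_failures` consecutive
    failures the circuit opens, and calls fail fast for `reset_timeout`
    seconds instead of each waiting out a slow timeout."""

    def __init__(self, max_failures=5, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

The value here is exactly the scenario described above: once the cache stops responding, callers stop paying the 500 ms timeout on every request and fail in microseconds until the reset window elapses.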


Following your comment, it seems further explanation is needed. A difficulty with your question is that you're talking about huge amounts of data while assuming it will take seconds or milliseconds to get this data from the cache, but minutes to get it without the cache. If it's only a question of data size, it will still take minutes to get the data from the cache as well.

Let's imagine another scenario. The response is relatively small and can be downloaded in a matter of milliseconds, but it takes minutes to generate in the first place. Here, the cache represents a huge performance improvement.

In this case, the service may respond with an HTTP 202 Accepted, indicating that it has started generating the response and that the response will be available later. This effectively means the client has to handle two cases, the one where the answer is ready (HTTP 200) and the one where the response is being regenerated (HTTP 202), and act accordingly.
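The 200/202 protocol can be sketched roughly as follows, framework-free for brevity. `AsyncReportService` and its `generate` callable are invented names standing in for your service and the slow regeneration step:

```python
import threading

class AsyncReportService:
    """Sketch of the 200/202 protocol: the first request for a slow
    resource returns 202 and kicks off generation in the background;
    once the result is ready, subsequent requests return 200 with it."""

    def __init__(self, generate):
        self._generate = generate  # slow function producing the body
        self._results = {}
        self._pending = set()
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            if key in self._results:
                return 200, self._results[key]
            if key not in self._pending:
                self._pending.add(key)
                threading.Thread(target=self._run, args=(key,)).start()
        return 202, None  # accepted; the client should retry later

    def _run(self, key):
        body = self._generate(key)
        with self._lock:
            self._results[key] = body
            self._pending.discard(key)
```

A real implementation would also return a status URL in the 202 response's `Location` header and expire finished results, but the client-visible contract is the two status codes above.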

Does persistent storage work as a fallback mechanism? Sure. Personally, I would prefer not to use one, since it makes the whole system quite complex. As soon as you store data in two different systems (primary cache and fallback storage), maintenance tends to become too costly, and you have to handle invalidation properly (for instance, what happens if the primary cache is invalidated, but the attempt to invalidate the data in persistent storage fails?). Moreover, how far would you take that? Don't you also need to handle the case where both the cache and the persistent storage are empty?

In my opinion, the goal should be to:

  • Make sure the cache service is reliable. If it's down for an hour once per day, you have more important things to do than think about fallback strategies.

  • Ensure the case where the cache service is down (or empty) is handled, i.e. that the user sees something other than “Server error”. Depending on the specific case, it may be as simple as an explicit and helpful error message explaining what happened and what the user can do next. Or it may be a mechanism where the user will have to wait for a few minutes to get the content regenerated. Or it may be a very complex fallback mechanism which ensures 99.99999% reliability. It's up to you to determine if it's worth the effort.

Licensed under: CC-BY-SA with attribution