How to handle database connection drops?

https://softwareengineering.stackexchange.com/questions/414672

14-03-2021

Question

I'm currently writing some microservices, some of them communicating using RabbitMQ, RedisDB, Kafka and other communication streams.

When any of those connections drops, I can't know for sure whether a query has already executed.

For example, if I insert a new key into a database, and the connection drops, two scenarios can happen:

Key was inserted and only then the connection was dropped. In this case I don't need to insert again.
Key wasn't inserted, in which case I need to re-insert.

Always retrying the query can cause duplicate keys to be inserted.

Is there any general pattern to handle connection drops, that avoids this issue altogether?
What do I do with the user during this time? I bet large companies like Google don't return a 500 every time one of their servers goes offline.

Solution

This is an area where you don't have canned patterns. So let's look at what your stated needs are:

You need to know if an insert was successful
The update could come from any of 3 different sources

Ideally, we would need a means of generating a unique key that is derived from the data you are receiving in some way.

If we only had one source of information, we would be able to use the message Id to identify if the record was inserted or not. Another option would be to codify the source and the message id together. Example: source is codified as 1,2 or 3, so you append the message Id to the 1, 2, or 3 prefix. It can work, assuming every message Id is unique. That may or may not be true.
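A minimal sketch of the source-plus-message-id idea, assuming the sources are coded as small integers and each source's message ids are unique within that source. The names `dedup_key` and `first_time` are made up for illustration, and the in-memory set stands in for whatever store you actually check against.

```python
def dedup_key(source_code: int, message_id: str) -> str:
    """Combine a source code with that source's message id."""
    # The separator keeps (1, "23") and (12, "3") from producing the same key
    # if you ever end up with more than nine sources.
    return f"{source_code}:{message_id}"


# Stand-in for a persistent store of keys that were already processed.
seen_keys = set()


def first_time(source_code: int, message_id: str) -> bool:
    """Return True only the first time a (source, message id) pair is seen."""
    key = dedup_key(source_code, message_id)
    if key in seen_keys:
        return False
    seen_keys.add(key)
    return True
```

The same message id arriving from two different sources is treated as two different messages, which is the point of the prefix.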

Another option is to have a creation date, trace ID and trace source in the table you are writing to. This allows you to query before writing. In this case I would have a transaction:

Query to see if there is a record written since the message was authored that came from the same source and has the same message id.
- WHERE creationDate > ? AND messageSource = ? AND messageId = ? where the ? marks parameters for the query.
If nothing is found, write the update (including the source and trace id)--otherwise it has already been written
Complete the transaction
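The steps above can be sketched roughly like this. It uses SQLite from the Python standard library only so the example runs on its own. The table name `updates`, the `payload` column and the function names are assumptions, not anything from your schema. `BEGIN IMMEDIATE` is SQLite's way of taking the write lock before the check, so the query and the insert cannot interleave with another writer; other databases need their own equivalent, or a unique constraint on source and message id.

```python
import sqlite3


def open_db():
    # isolation_level=None means we issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE updates ("
        " creationDate TEXT NOT NULL,"
        " messageSource INTEGER NOT NULL,"
        " messageId TEXT NOT NULL,"
        " payload TEXT)"
    )
    return conn


def write_once(conn, source, message_id, authored_at, payload):
    """Write the update unless it is already there. Returns True if written."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        # Step 1: is there a record from this source with this message id,
        # written since the message was authored?
        found = conn.execute(
            "SELECT 1 FROM updates "
            "WHERE creationDate > ? AND messageSource = ? AND messageId = ?",
            (authored_at, source, message_id),
        ).fetchone()
        # Step 2: nothing found, so write it, including source and message id.
        if found is None:
            conn.execute(
                "INSERT INTO updates "
                "(creationDate, messageSource, messageId, payload) "
                "VALUES (datetime('now'), ?, ?, ?)",
                (source, message_id, payload),
            )
        # Step 3: complete the transaction.
        conn.execute("COMMIT")
        return found is None
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Calling `write_once` again after a dropped connection is then safe: the second call finds the row and does nothing.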

On the topic of connection drops

If connections are dropping intermittently, but often enough that this is a real problem, then something is wrong. It could be that your configuration is set for tolerances that are unreasonable. It could also be that you need to change your approach. For example, timeouts are a symptom that you need to step back and take stock of the larger picture.

Don't request a connection until you are ready to do something with the database
If it's going to be a while until you do the next thing, release the connection when you are done
Determine if the timeout is network related, record related, or due to some other resource contention
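As a rough illustration of the first two points, here is a sketch that does the slow work first and only then opens a connection, holds it for one short transaction, and closes it again. SQLite is used purely so the example is runnable; `expensive_preparation`, `store`, `count_items` and the `items` table are hypothetical names. With a connection pool the idea is the same, except "open" and "close" become "borrow" and "return".

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def expensive_preparation(raw):
    # Stand-in for slow work (parsing, calling other services, ...).
    # No database connection is held while this runs.
    return [(item.strip().lower(),) for item in raw]


def store(db_path, raw):
    rows = expensive_preparation(raw)
    # Ask for the connection only once there is something to write.
    with closing(sqlite3.connect(db_path, timeout=5)) as conn:
        with conn:  # one short transaction: commit on success, rollback on error
            conn.execute("CREATE TABLE IF NOT EXISTS items (name TEXT)")
            conn.executemany("INSERT INTO items (name) VALUES (?)", rows)
    # The connection is closed here, before any further slow work happens.
    return len(rows)


def count_items(db_path):
    with closing(sqlite3.connect(db_path, timeout=5)) as conn:
        return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    store(path, [" Alpha ", "beta"])
    print(count_items(path))
```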

When you are getting timeouts due to the network, something is very wrong. I was on a program where actions that had been taking milliseconds suddenly started taking minutes. It turned out that the infrastructure team had moved the DNS server without updating our servers. In self-defense we put entries in our HOSTS file so our servers could always find the other servers we deployed to, and we also fixed the IP address of the DNS server.

Sometimes it's not the network layer, and your database is suffering from severe record locking problems. This can happen if your database silently promotes record locking to page locking, or worse, table locking (here's looking at you, MS SQL Server). Your options here are to offload queries from your database or to ensure that queries run against snapshots of data (i.e. they do not have to wait for transactions to resolve). In this case, make use of Redis when reading individual records, and ElasticSearch (or equivalent) when performing complex queries. The idea is that the database serves as gold master and everything else is a slave to that data. The more contention you can relieve from the database, the faster your system will feel.
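To make the "database is the gold master" idea concrete, here is a small sketch of reads going through a cache and writes going to the database first. A plain dict stands in for Redis so the example runs without a server, and the `records` table, `CachedReader` class and `db_reads` counter are invented for illustration. A real setup would also need expiry and a plan for when the cache itself is unreachable.

```python
import sqlite3


def open_records_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, value TEXT)")
    return conn


class CachedReader:
    def __init__(self, conn):
        self.conn = conn
        self.cache = {}    # stand-in for Redis
        self.db_reads = 0  # only here to show how often the database is hit

    def get(self, key):
        # Serve single-record reads from the cache when possible.
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1
        row = self.conn.execute(
            "SELECT value FROM records WHERE id = ?", (key,)
        ).fetchone()
        if row is None:
            return None
        self.cache[key] = row[0]
        return row[0]

    def put(self, key, value):
        # Writes always go to the database; the cache is only derived data,
        # so the cached copy is dropped rather than trusted.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO records (id, value) VALUES (?, ?)",
                (key, value),
            )
        self.cache.pop(key, None)
```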

Finally, there can be other types of resource contention. Examples include disk access during a security update, network bandwidth due to very chatty communications, etc.

It's always good to have a solution to ensure a write once semantic, but when you are constantly dealing with something that should not be a problem, sometimes you need to take a look at what's causing the issue. That's a pain, but the general process is the same:

Look for correlations (i.e. events happening at the same time)
Go through a process of elimination until you find the cause

OTHER TIPS

There's no perfect solution to exactly-once messaging. But the impossibility of the solution relies on the possibility of missing multiple messages, distributed processing and bad actors.

For normal scenarios you can reduce the probability to virtually zero.

Generate an id before you send, query it afterwards and store it to prevent duplicates.
Hold a sequential count; raise an error and request a resend if you receive an out-of-sequence message.
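Both ideas fit in a few lines on the receiving side. This is only a sketch under the assumption that the sender numbers its messages from 1 and attaches an id it generated itself; `Receiver` and `OutOfSequence` are made-up names, and a real receiver would persist its state rather than keep it in memory.

```python
class OutOfSequence(Exception):
    """Raised when a gap is detected; the caller should request a resend."""

    def __init__(self, expected, got):
        super().__init__(f"expected sequence {expected}, got {got}")
        self.expected = expected


class Receiver:
    def __init__(self):
        self.next_seq = 1      # the sequential count
        self.seen_ids = set()  # ids stored to prevent duplicates
        self.accepted = []

    def receive(self, seq, message_id, body):
        """Return True if accepted, False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False  # a retry of something we already have
        if seq != self.next_seq:
            # Something in between went missing: ask for it again.
            raise OutOfSequence(self.next_seq, seq)
        self.seen_ids.add(message_id)
        self.accepted.append(body)
        self.next_seq += 1
        return True
```

A sender that is unsure whether message 1 arrived can simply send it again: the second copy is reported as a duplicate and nothing is applied twice.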

Generally these things are handled by the communication protocol and you don't need to worry about them, but with high-volume and/or distributed systems you want to build immutability into everything and have a way to pick up errors after the fact so they can be repaired.

So in your example where the commit command errors on the client but the transaction has completed on the db, you have been super unlucky multiple times.

It should be such an infrequent occurrence that simply writing the error and transaction to the log and having a human check the db manually in the morning is acceptable.

If you are designing something like the TCP protocol, however, where missed packets are a common thing, you'll want to include acknowledgement and anti-duplication mechanisms.

It sounds like you are using some sort of sequential key in your tables (like an identity column). If you change to a Universally Unique Identifier (UUID) which is generated by the sender, then you can retry as many times as you'd like, because you will be able to check whether the UUID already exists in the database.

(You can also use a hybrid, if there is a reason for your sequential identifier.)
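A sketch of what that retry could look like, assuming the UUID is the primary key of the table. SQLite's `INSERT OR IGNORE` is used here so the example runs on its own; other databases have their own spelling of "insert unless the key exists". The `items` table, `FlakyInsert` and the other names are invented for the example. `FlakyInsert` simulates the unlucky case from the question: the row is committed and then the connection drops.

```python
import sqlite3
import uuid


def open_items_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, value TEXT)")
    return conn


def insert_once(conn, row_id, value):
    with conn:
        # If a row with this id already exists, the statement does nothing.
        conn.execute(
            "INSERT OR IGNORE INTO items (id, value) VALUES (?, ?)",
            (row_id, value),
        )


class FlakyInsert:
    """First call commits the row and then fails, like a dropped connection."""

    def __init__(self):
        self.calls = 0

    def __call__(self, conn, row_id, value):
        self.calls += 1
        insert_once(conn, row_id, value)
        if self.calls == 1:
            raise ConnectionError("connection dropped after commit")


def insert_with_retry(conn, value, do_insert=insert_once, attempts=3):
    # The id is generated once, by the sender, before the first attempt,
    # so every retry carries the same key.
    row_id = str(uuid.uuid4())
    for attempt in range(attempts):
        try:
            do_insert(conn, row_id, value)
            return row_id
        except ConnectionError:
            if attempt == attempts - 1:
                raise
```

With an identity column the retry would have created a second row; with the sender-generated UUID the second attempt is a no-op.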

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange