How to detect Oracle broken/stalled connection?

https://stackoverflow.com/questions/1431560

07-07-2019
|

Question

In our server/client-setup we're experiencing some weird behaviour. The client is a C/C++-application which uses OCI to connect to an Oracle server (using the OTL library).

Every now and then the DB server dies in a way (yes this is the core issue, but from application-side we're unable to solve it but have to deal with it anyway), that the machine does not respond anymore to new requests/connections but the existing ones, like the Oracle-connections, do not drop or time out. Queries sent to the DB just never return successfully anymore.

What possibilities (if any) are provided by Oracle to detect these stalled connections from the client-application side and recover in a more or less safe way?

Solution

This is a bug in Oracle ( or call it a feature ) till 11.1.0.6 and they said the patch on Oracle 11g release 1 ( patch 11.1.0.7 ) which has the fix. Need to see that. If it happens you will have to cancel ( kill ) the thread performing this action. Not good approach though

OTHER TIPS

In all my DB schema i have a table with one constant record. Just poll such table periodically by simple SQL request. All other methods unreliable.

There's a set_timeout API in OTL that might be useful for this.

Edit: Actually, ignore that. set_timeout doesn't work with OCI. Have a look at the set_timeout description from here where it describes a technique that can be used with OCI

Sounds like you need to fire off a query to the database (eg SELECT * FROM dual;), then if the database hasn't responded within a specified amount of time, assume the server has died and react accordingly. I'm afraid I don't know C/C++, but can you use multi-threading to fire off the statement then wait for the response, without hanging the application?

This works - I have done exactly what you are looking for. Have a parent process (A) create a child process (B). The child process (B) connects to the database, performs a query (something like "select 1 from a_table" - you will get better performance if you avoid using "dual" for this and create your own table). If (B) is successful then it writes out that it was successful and exits. (A) is waiting for a specified amount of time. I used 15 seconds. If (A) detects that (B) is still running - then it can assume that the database is hung - it Kills (B) and takes necessary actions (Like calling me on the phone with a SMS).

If you configure SQL*NET to use a timeout you will probably notice that large queries will fail because of it. The OCI set_timeout configuration will also cause this.

There is a manual way to avoid this. You can open a firewall and do something like ping database after every specified duration of time. In this way the database connection will not get lost.

idea

If (current_time - lastPingTime > configuredPingTime)
{
     //Dummy query
     select 1 from dual;
}

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow