Will SQL Server go offline if it loses network connectivity to SAN where master and msdb system databases reside?

https://dba.stackexchange.com/questions/283570

14-03-2021

Question

I have a setup where 3 servers are combined into an Availability Group.

All 3 servers have local, directly attached SSD drives, and the user database files are hosted on these drives.

But the system databases (master and msdb) of each server in the AG are hosted on a SAN device that is accessed over the network.

I have not moved those to the local SSD drives yet.

Questions:

In a hypothetical situation where the network connection between any of the servers and the SAN device is lost (bad cable, bad NIC, some temporary network glitch, etc.):

Will the SQL Server service on that server go offline or stop working properly immediately?
Or will it continue to work for some time if master and msdb were cached in RAM before the network went down?

Solution

From documentation

The Caveats section of the Availability group database level health detection failover option doc has some info that might improve our guesses on the question:

It is important to note that the Database Level Health Detection option currently does not cause SQL Server to monitor disk uptime and SQL Server does not directly monitor database file availability. Should a disk drive fail or become unavailable, that alone will not necessarily trigger the availability group to automatically failover.

As an example, when a database is idle with no active transactions, and with no physical writes occurring, should some of the database files become inaccessible, SQL Server may not do any read or write IO to the files, and may not change the status for that database immediately, so no failover would be triggered. Later, when a database checkpoint occurs, or a physical read or write occurs for fulfilling a query, then SQL Server may then notice the file issue, and react by changing the database status, and subsequently the availability group with database level health detection set on would failover due to the database health change.

As another example, when the SQL Server database engine needs to read a data page to fulfill a query, if the data page is cached in the buffer pool memory, then no disk read with physical access may be required to fulfill the query request. Therefore, a missing or unavailable data file may not immediately trigger an automatic failover even when database health option is enabled, since database status is not immediately changed.
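
To get a rough idea of how much of master and msdb is actually sitting in the buffer pool at a given moment, something like the query below could be used. It is only a sketch: it assumes you have VIEW SERVER STATE permission, and a page being cached now says nothing about whether the next request will need a page that is not.

```sql
-- Count the pages of master and msdb currently held in the buffer pool.
-- Each page is 8 KB, so cached_pages * 8 / 1024 gives an approximate size in MB.
SELECT DB_NAME(database_id)  AS database_name,
       COUNT(*)              AS cached_pages,
       COUNT(*) * 8 / 1024   AS cached_mb
FROM sys.dm_os_buffer_descriptors
WHERE database_id IN (DB_ID(N'master'), DB_ID(N'msdb'))
GROUP BY database_id;
```

Scanning sys.dm_os_buffer_descriptors can be slow on servers with a lot of memory, so it is better suited to a one-off check than to regular monitoring.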

From a (close enough) lab test

1. I placed the master and msdb data and log files on a pen drive (drive D:). For the sake of brevity I'm not going to describe this process.
2. Started the instance and ran some DML on my lab database, Lab.
3. Connected to the master database, I ran select name, state_desc from sys.databases;
4. Unplugged the pen drive (no Safely Remove Hardware and Eject Media, just pulled it from the desktop).
5. Ran some more DML on my lab database Lab. All fine; I even updated a table.
6. SQL Server only noticed the problem when I tried to run CREATE DATABASE StorageOffline;. I got the following error message:

Msg 823, Level 24, State 2, Line 4 The operating system returned error 21(The device is not ready.) to SQL Server during a read at offset 0x0000000041c000 in file 'D:\MSSQL\master.mdf'. Additional messages in the SQL Server error log and operating system error log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

7. After I got the error, I repeated step 3 and the output was the same: the state of all databases was still ONLINE. So, despite the fact that SQL Server was aware of a problem with the file on drive D:\, it neither changed the state of the databases nor took the instance offline.

I kept using the Lab database with no (apparent) major problem for a few minutes, and the instance only stopped working while I was writing this answer. Of course it's not a reliable state to keep working in for production, but it took some time to go offline.
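
For anyone who wants to repeat the test, the T-SQL side of the steps above can be sketched roughly as follows. The table Lab.dbo.StorageTest is a made-up name for illustration (the original test did not say which tables were used), and the physical removal of the drive obviously happens outside of SQL. Only do this on a throwaway lab instance.

```sql
-- Confirm where the master and msdb files live before starting.
SELECT DB_NAME(database_id) AS database_name, type_desc, physical_name
FROM sys.master_files
WHERE database_id IN (DB_ID(N'master'), DB_ID(N'msdb'));

-- Some DML on the lab database (hypothetical table, created while storage is healthy).
CREATE TABLE Lab.dbo.StorageTest (id int IDENTITY PRIMARY KEY, note varchar(50));
INSERT INTO Lab.dbo.StorageTest (note) VALUES ('before unplug');

-- Step 3: baseline database states.
SELECT name, state_desc FROM sys.databases;

-- ... pull the drive that holds master and msdb here ...

-- More DML on the lab database; in the test above this still worked.
UPDATE Lab.dbo.StorageTest SET note = 'after unplug' WHERE id = 1;

-- This has to touch master on disk; in the test above it failed with Msg 823.
CREATE DATABASE StorageOffline;

-- Repeat step 3 and compare with the baseline.
SELECT name, state_desc FROM sys.databases;
```

How long the instance survives after the drive is gone will depend on what else is running, so the timings from one run should not be treated as typical.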

Conclusion

Based on that info, my thoughts are:

Will SQL Server service on that server go offline or stop working properly immediately?

I'd say no. I haven't worked with availability groups yet, but if the feature is meant to keep important databases online and it doesn't monitor disk uptime or database file availability for databases that are actively being monitored, it won't notice the problem faster on databases that are not part of the availability group.
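
If you want to check whether database level health detection is turned on for your availability group, a sketch like the following should work on SQL Server 2016 or later. The group name MyAg is a placeholder, not something from the question.

```sql
-- db_failover = 1 means database level health detection is enabled for the group.
SELECT name, db_failover
FROM sys.availability_groups;

-- Enable it for a given availability group (run on the primary replica).
-- [MyAg] is a hypothetical name; replace it with your own group.
ALTER AVAILABILITY GROUP [MyAg] SET (DB_FAILOVER = ON);
```

As the quoted caveats say, enabling this option does not make SQL Server monitor disk or file availability, so it would not by itself catch a lost SAN connection for master and msdb.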

Or it continues to work for some time if master and msdb were cached in RAM before network went down?

Yes, but it depends on how busy your environment is. The databases will stay online until SQL Server tries to read or write something on the master or msdb database files.

But I agree with J.D.: you should not rely on that situation to give you enough time to take any action that would prevent your instance from going offline.
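
One way to get a feel for how often your instance physically touches the master and msdb files is to look at the cumulative file I/O statistics. This is a sketch, assuming VIEW SERVER STATE permission; the counters are cumulative since the instance started, so you would compare two samples taken some time apart.

```sql
-- Physical reads and writes against the master and msdb files since startup.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.num_of_writes,
       vfs.io_stall              -- total ms spent waiting on I/O for this file
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
WHERE vfs.database_id IN (DB_ID(N'master'), DB_ID(N'msdb'));
```

If the counters for these files barely move between samples, the instance might run for a while after losing the SAN; if they move constantly (for example because of frequent Agent jobs writing to msdb), expect the problem to surface quickly.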

OTHER TIPS

You can't depend on the system databases being cached in memory; more likely they won't be, because they are accessed less frequently than the user databases.

I think you'll end up in a quasi-functioning state, where your user databases will still be accessible but certain features of the server instance that rely on master and msdb will throw some weird errors, depending on what else your server is doing. The service for your SQL Server instance should remain online ("started" state). For example, if you have any scheduled Agent jobs, I would bet my money (but couldn't say for sure without testing) that they'd encounter errors (either silently or visibly) when trying to run, since most of their metadata is stored in the msdb database.

If that were to occur, you'd be best off restoring access to those system databases as soon as possible to guarantee 100% reliability in all features and functions.
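
Once storage is reachable again, a query along these lines could help you see which Agent jobs failed in the meantime. It is a sketch using the standard msdb job tables; note that it reads msdb itself, so it may fail too while msdb is still unavailable, and jobs that could not log their history at all would not show up here.

```sql
-- Most recent failed Agent job steps (run_status = 0 means Failed).
SELECT TOP (20)
       j.name      AS job_name,
       h.step_id,
       h.step_name,
       h.run_date,  -- int, YYYYMMDD
       h.run_time,  -- int, HHMMSS
       h.message
FROM msdb.dbo.sysjobhistory AS h
JOIN msdb.dbo.sysjobs       AS j
  ON j.job_id = h.job_id
WHERE h.run_status = 0
ORDER BY h.instance_id DESC;
```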

Depends on what kind of offline. I had it get itself into a state where it had no idea which transactions were committed, because the failure mode it was seeing was that writes to the databases were failing at the block level. It spammed the log nicely, but couldn't recover until I manually bounced it, as it would believe the in-memory copy was correct after hitting the I/O error.

I'm sure somebody's going to come by and say that's just nuts. I agree. It's terrible behavior. But I observed it in situ. When bringing the server back up, it appeared from a SELECT watcher that the database was rolled back. Note that while anybody running a COMMIT saw it error out, further SELECT statements could see the results of the failed commits as though they had succeeded, until I manually recycled it.

Yuck.

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange