Scenario:

A client recently had an issue with an Azure VM where the (non-OS) 'premium' storage disks were erroneously removed without warning by Microsoft. This is the virtual equivalent of the hard drives being yanked out of a running server.

They have SQL Server 2014 running on this VM with the data and log files placed separately on the two 'failed' drives.

The live, production, non-clustered databases are used fairly heavily by various services and applications, so there is a high probability that commands and transactions were in progress or mid-commit when the issue occurred.

After the issue, the VM was forcibly powered-off, and only became 'operational' after a couple of reboots.

Question:

While seemingly no issues have been identified so far, my question is: what are the possible SQL failure modes this could present? Does the transaction log / SQL Server architecture account for this type of whole-disk failure? Is it possible to have data loss in this situation (without considering corruption of the disk itself)? And what checks should/can be performed after such an event?


Solution:

what are the possible SQL failure modes this could present? Does the transaction log / SQL server architecture account for this type of whole disk failure?

SQL Server uses write-ahead logging (WAL): the log records for a transaction are hardened to the log file on disk before the commit is acknowledged (unless you are using delayed durability). This means that when your database comes back online, crash recovery replays (redoes) the committed transactions from the log file and rolls back (undoes) any transactions that were still in progress.
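The redo/undo idea can be illustrated with a toy model. This is only a sketch of the general write-ahead-logging technique, not SQL Server's actual log format or recovery code; the record layout and the `recover` function are made up for the illustration.

```python
# Toy model of write-ahead-log crash recovery (redo, then undo).
# Illustration only: the record shapes below are hypothetical.

def recover(data_pages, log):
    """Rebuild a consistent state from on-disk pages and the durable log.

    data_pages: dict of key -> value as found on disk after the crash
    log: records that reached the disk, in order, each either
         ("update", txn_id, key, old_value, new_value) or ("commit", txn_id)
    """
    pages = dict(data_pages)
    committed = {rec[1] for rec in log if rec[0] == "commit"}

    # Redo phase: reapply every logged update in log order, so changes
    # whose data pages never reached the disk are not lost.
    for rec in log:
        if rec[0] == "update":
            _, _txn, key, _old, new = rec
            pages[key] = new

    # Undo phase: walk the log backwards and reverse every update that
    # belongs to a transaction without a commit record.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            _, _txn, key, old, _new = rec
            if old is None:
                pages.pop(key, None)
            else:
                pages[key] = old
    return pages


# Disk state at the crash: T1 committed but its page was never flushed;
# T2 was still in flight and its change had already reached the disk.
disk = {"A": 1, "B": 99}
log = [
    ("update", "T1", "A", 1, 2),
    ("commit", "T1"),
    ("update", "T2", "B", 5, 99),
]
recovered = recover(disk, log)
print(recovered)  # {'A': 2, 'B': 5}
```

T1's committed change survives even though its data page was stale on disk, and T2's uncommitted change is reversed even though it had already been written. Both outcomes depend on the log records themselves having reached the disk intact.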

Is it possible to have data loss in this situation (without considering corruption of the disk itself)?

Despite the WAL architecture, I'm sure Windows didn't appreciate the abrupt disk loss. It certainly seems possible that data could have been corrupted. For instance, if a write to the log file (or a write of an 8 KB data page) was only half finished at the instant the disk was removed, that log block or page could be left torn, and therefore corrupt.
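A torn write of this kind is normally detectable rather than silent. SQL Server has a PAGE_VERIFY database option (CHECKSUM or TORN_PAGE_DETECTION) for this purpose. The sketch below shows the general idea with a CRC32 stored in a page header; the page layout, the 512-byte sector size, and the helper names are assumptions for illustration and do not reflect SQL Server's real page format or checksum algorithm.

```python
import zlib

PAGE_SIZE = 8192    # SQL Server data pages are 8 KB
SECTOR_SIZE = 512   # assumed sector size for this illustration
HEADER_SIZE = 4     # hypothetical header: just a CRC32 of the body


def make_page(payload: bytes) -> bytes:
    """Build a page: 4-byte CRC32 header followed by the zero-padded body."""
    if len(payload) > PAGE_SIZE - HEADER_SIZE:
        raise ValueError("payload does not fit in one page")
    body = payload.ljust(PAGE_SIZE - HEADER_SIZE, b"\x00")
    return zlib.crc32(body).to_bytes(HEADER_SIZE, "big") + body


def is_intact(page: bytes) -> bool:
    """Return True if the stored checksum matches the page body."""
    stored = int.from_bytes(page[:HEADER_SIZE], "big")
    return stored == zlib.crc32(page[HEADER_SIZE:])


def torn_write(old_page: bytes, new_page: bytes, sectors_written: int) -> bytes:
    """Simulate losing the disk after only some sectors were overwritten."""
    cut = sectors_written * SECTOR_SIZE
    return new_page[:cut] + old_page[cut:]


old = make_page(b"balance=100;" * 400)
new = make_page(b"balance=250;" * 400)

# Only the first 8 of 16 sectors made it to disk before the disk vanished.
torn = torn_write(old, new, sectors_written=8)

print(is_intact(old), is_intact(new), is_intact(torn))
```

The half-written page carries the new header but a mix of new and old body bytes, so the checksum no longer matches and the page is flagged as damaged when it is next read or checked.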

If you're using the Full recovery model, you could likely still do a point-in-time restore to just before the corruption, but you would lose whatever was written after that point because of the partial write.

And what checks should/can be performed after such an event?

There's really not much to be done other than running DBCC CHECKDB against each database that had files on the affected disks. This will report whether the database was corrupted during this unusual event.

Another option would be to restore a backup (from before this event) to a separate database / server, and then use a comparison tool (like SQL Data Compare from Red Gate) to see if there is any data that's different between the backup and the current database.
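The comparison itself boils down to matching rows by primary key. As a rough sketch of what such a tool does (this is not how SQL Data Compare is implemented, and the `diff_tables` helper and sample rows are invented for the example), assuming both tables have already been fetched as lists of dicts:

```python
def diff_tables(backup_rows, current_rows, key):
    """Compare two row sets by primary key and report the differences."""
    backup = {row[key]: row for row in backup_rows}
    current = {row[key]: row for row in current_rows}
    shared = backup.keys() & current.keys()
    return {
        "missing_in_current": sorted(backup.keys() - current.keys()),
        "added_in_current": sorted(current.keys() - backup.keys()),
        "changed": sorted(k for k in shared if backup[k] != current[k]),
    }


# Hypothetical sample data: rows from the restored backup and the live table.
backup_rows = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": 200},
    {"id": 3, "amount": 300},
]
current_rows = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": 250},
    {"id": 4, "amount": 400},
]

report = diff_tables(backup_rows, current_rows, key="id")
print(report)
```

Keep in mind that legitimate activity between the backup and the incident also shows up as differences, so the report is a list of rows to review, not a list of corrupted rows.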

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange