Question

I have 14 production servers with SQL 2016 standard version, 128 gb of RAM, 16 CPU and SSD drives.

Microsoft SQL Server 2016 (SP2-CU15-GDR) (KB4583461) - 13.0.5865.1 (X64) Oct 31 2020 02:43:57 Copyright (c) Microsoft Corporation Standard Edition (64-bit) on Windows Server 2019 Standard 10.0 (Build 17763: )

All servers have the same hardware but two of my servers are installed on Windows Server 2019 as the others are on Windows Server 2012.

I have the same logon trigger enabled on all my databases. This logon trigger runs fine most of the time. But, on one of my servers that is on Windows Server 2019, the logon trigger fails while full backups are executing. (Note that the other server that is on Windows Server 2019 is new and still only has small databases on it so I can't really be sure it will run fine when we put bigger databases on it.)

When the full backups starts, latch wait time goes high like crazy, as it is normally around 0 (on SSD drives). This is the only server that has this problem with latch wait time during full backups. (all other stats are fine, CPU is fine, ect.)

latches

And while this happens, my logs are full of the errer

Logon failed for login 'login' due to trigger execution

After investigation, I've found out that if I disable the logon trigger on this server, the full backups are able to execute fine, with a small occasional peak to latch time, but stays fine most of the time.

latch without trigger

My questions are:

  • Could it be possible that the OS (Windows Server 2019 vs Windows Server 2012) makes a difference as my only server that has a problem is on Windows Server 2019?
  • Why would the logon trigger affect the full backups that much?

Right now my only solution is to disable the logon trigger on that server, as clients cannot connect on the server while full backups are taken, since logon trigger execution fails. But we use the logon trigger for security, to log people who login on the servers outside of the application.

Here is the definition of my logon trigger

CREATE TRIGGER [Logon_Trigger_Track_IP]
ON ALL SERVER FOR LOGON
AS
BEGIN
  INSERT INTO [master].[dbo].[TRACETABLE2]
   --the auditing snippet below works fine in a 
  --login trigger, 
  --database trigger 
  --or any stored procedure.
  SELECT 
    getdate()                                    AS EventDate,
    HOST_NAME()                                  AS HostName,
    SUSER_SNAME()                                AS sUserName
    WHERE convert(nvarchar(max),ConnectionProperty('local_net_address')) not like '192.168.0%'

  END
GO

The TRACETABLE2 table on master permission has been given to PUBLIC.

CREATE TABLE [dbo].[TRACETABLE2](
    [EVENTDATE] [datetime] NOT NULL,
    [HOSTNAME] [nvarchar](128) NULL,
    [USERNAME] [nvarchar](128) NULL
) ON [PRIMARY]
GO


GO
GRANT INSERT ON [master].[dbo].[TRACETABLE2] TO PUBLIC
GRANT SELECT on [master].[sys].[dm_exec_connections] TO PUBLIC

Collation is Latin1_General_CI_AS on all servers.

Here are the latches from the server that is having troubles.

Latches from query

Here is the Cost threshold for parallelism.

ctp

I forgot to mention all my backups are encrypted with a server certificate in AES_256


I did some tests based on the comment of rois and simplifying the trigger to this produces the same problem.

CREATE TRIGGER [Logon_Trigger_Track_IP]
ON ALL SERVER FOR LOGON
AS
   BEGIN
     DECLARE @i INT
  END
GO


Update 2021-03-04 : Microsoft was able to reproduce the problem! Waiting for news from them. Will keep you updated.

No correct solution

OTHER TIPS

This isn't an answer really, but it's too big for a comment. I hope it will be helpful.

Using version (got the same results with SQL Server 2016 on Windows 10):

Microsoft SQL Server 2019 (RTM-CU9) (KB5000642) - 15.0.4102.2 (X64)
Jan 25 2021 20:16:12 Copyright (C) 2019 Microsoft Corporation Developer Edition (64-bit) on Windows 10 Pro 10.0 (Build 19042: )

I see peculiar behaviour with the trigger in question. I used SqlQueryStress to simulate some workload on my local instance (you probably need to build it from newest commit, older versions were too slow for me). As query I used: SELECT 1, Number of iterations: 100000, Number of Threads: 10. Everything else left as default. I started the workload with GO button and after few seconds stopped it with Cancel. When the trigger is enabled, the moment I stop the workload, I get following pair(s) of messages in error log:

Logon failed for login '***' due to trigger execution. [CLIENT: ]

The client was unable to reuse a session with SPID 73, which had been reset for connection pooling. The failure ID is 1. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.

This isn't the same issue, because it seems I can use the instance normally. I also haven't observed any latch problems. The only queries affected are the ones I cancelled. But the underlying cause may be the same. This happens only under certain workload (1k+ batch requests in my case). It could be some race condition in code, which is exposed on newer Windows versions. I would speculate it manifests itself differently under different workloads. Personally I think you should contact Microsoft.

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange
scroll top