Question

I am having a strange issue and am hoping you guys might be able to help!

Problem: I have a two-node SQL Server 2012 Availability Group cluster using a file share witness (FSW). Both nodes run under the same DBEngine service account (as is normal in these setups).

There are currently 3 Availability Groups on this cluster, and it's been working fine for quite some time.

Today I restarted the DBEngine service on the passive node (as I was adding an SSL certificate for a new application and a new availability group). When the node came back up, it was no longer synchronising with node 1. The state of the replica was disconnected, and I could see lots of login failures in the SQL Server logs on node 1 (the active node).
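For anyone chasing a similar lockout, it can help to tally those login failures by login and client address to see which machine is generating them. The following is only a minimal sketch: it assumes the error log has been exported to plain text and that failures appear in the common "Login failed for user '...'. ... [CLIENT: ...]" form. The sample lines, the account name CONTOSO\svc_sql and the IP addresses are hypothetical, not taken from the servers described here.

```python
import re
from collections import Counter

# Matches the usual "Login failed for user '<login>'. ... [CLIENT: <address>]" text.
# Assumption: the log was exported as plain text, one message per line.
LOGIN_FAILED = re.compile(
    r"Login failed for user '(?P<login>[^']*)'.*?\[CLIENT: (?P<client>[^\]]+)\]"
)


def tally_login_failures(lines):
    """Count failed logins per (login, client address) pair."""
    counts = Counter()
    for line in lines:
        match = LOGIN_FAILED.search(line)
        if match:
            counts[(match.group("login"), match.group("client"))] += 1
    return counts


# Hypothetical sample lines, for illustration only.
SAMPLE_LOG = [
    r"2015-03-02 10:15:01.12 Logon Login failed for user 'CONTOSO\svc_sql'. Reason: Could not find a login matching the name provided. [CLIENT: 10.0.0.12]",
    r"2015-03-02 10:15:02.40 Logon Login failed for user 'CONTOSO\svc_sql'. Reason: Could not find a login matching the name provided. [CLIENT: 10.0.0.12]",
    r"2015-03-02 10:15:03.77 Logon Login failed for user 'CONTOSO\svc_sql'. Reason: Could not find a login matching the name provided. [CLIENT: 10.0.0.11]",
    r"2015-03-02 10:15:04.02 spid12s Service Broker manager has started.",
]

if __name__ == "__main__":
    for (login, client), count in tally_login_failures(SAMPLE_LOG).most_common():
        print(f"{login} from {client}: {count} failure(s)")
```

If one client address dominates the tally, that is the node to look at first.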

I found that the DBEngine service account had locked. I had it unlocked, but it soon locked again.

I removed the SSL Certificate and restarted the passive node again (so it was in the same state I started with) and it started up fine, but within a minute, the service account had locked out again.

Steps I tried:

  • Created a new service account to rule out the account being used elsewhere, and started both nodes under the new account. The account locked out when node 2 started.

  • Unlocked the account, stopped node 2 and restarted node 1. The account was fine; I waited and it was still fine. Started the node 2 service and the account locked out.

  • Recreated the mirroring endpoints on both nodes and reapplied connect permissions to the DBEngine service account. This didn't fix it.

  • Restarted both VMs.

  • Removed the node 2 replica from the availability group, removed all databases (from node 2) and dropped the mirroring endpoint on node 2. Restarted the node 2 service. At this point both nodes were happily running under the same service account.

  • Tried re-adding node 2 as a replica using the wizard. It added it, backed up the database, restored to node 2, and got to the very last step where it connects it, and the account locked out again!

Has anyone got any ideas? Any input would be gratefully received!

Sam


Solution

So just in case someone wants to know what caused this: it was a group policy!

Some time ago, unbeknown to me, the domain controllers for the domain in question had been upgraded to Windows Server 2012. Along with this came a whole bunch of Windows Server 2012 group policies. Additional policies had been added to one of the parent server OUs with a filter applied.

Unfortunately the filter had a typo, and so those policies were being applied to all of my database servers on this domain.

I had almost ruled out it being a GP issue, as I seemed to have all the permissions I needed, and I could see connections coming in on the correct ports between each node. The servers were happily running as single nodes!

I asked the server team to move the servers (just to see) to the 'computers' OU, and after a forced gpupdate, bingo!

Unfortunately, I am unsure exactly which policy caused the problem as there were a lot! There were a number to do with NTLM usage and authentication, and I'm convinced the issue was related to these. I will at some point set up a test lab to try and replicate the issue.
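One way to narrow down the candidates is to compare the applied GPOs before and after the OU move. The sketch below assumes you have saved the text output of `gpresult /r /scope computer` from each state and that it contains an "Applied Group Policy Objects" heading followed by an underline and one GPO name per line. That layout can vary by Windows version and language, so treat the parsing as an assumption. The sample reports and the GPO name "Hypothetical NTLM Restriction Policy" are made up for illustration.

```python
def applied_gpos(report_text):
    """Return the GPO names listed under 'Applied Group Policy Objects'."""
    names = []
    in_section = False
    for raw in report_text.splitlines():
        line = raw.strip()
        if line == "Applied Group Policy Objects":
            in_section = True
            continue
        if not in_section:
            continue
        if set(line) == {"-"}:  # the dashed underline beneath the heading
            continue
        if not line:
            if names:
                break  # a blank line after the names ends the section
            continue
        names.append(line)
    return names


# Hypothetical reports: one saved before the OU move, one after.
BEFORE = """
COMPUTER SETTINGS
------------------
    Applied Group Policy Objects
    -----------------------------
        Default Domain Policy
        Hypothetical NTLM Restriction Policy

    The following GPOs were not applied because they were filtered out
    -------------------------------------------------------------------
        Local Group Policy
"""

AFTER = """
COMPUTER SETTINGS
------------------
    Applied Group Policy Objects
    -----------------------------
        Default Domain Policy

    The following GPOs were not applied because they were filtered out
    -------------------------------------------------------------------
        Local Group Policy
"""

# GPOs that stopped applying once the server left the original OU.
no_longer_applied = sorted(set(applied_gpos(BEFORE)) - set(applied_gpos(AFTER)))

if __name__ == "__main__":
    for name in no_longer_applied:
        print(name)
```

The GPOs that drop out of the list after the move are the ones worth testing individually in a lab.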

So there you go! Always check group policies, even if you believe (like I did) that nothing has changed!

Licensed under: CC-BY-SA with attribution