Question

Roughly a year ago, I had this problem with a 2008 R2 cluster. I asked about it here, but I didn't capture enough logs at the time. Now I have the same problem on my 2012 failover cluster, so I'm posting a separate question about the "new" problem.

I find it hard to believe it's just a coincidence that both clusters have the same problem. But I can't find a solution, and scheduling downtime to test fixes takes a lot of planning. So I'm posting it here to see if anyone has any ideas.

The cluster has two physical nodes running Windows Server 2012 R2 Standard with SQL Server 2012 SP2. The SQL Server instance contains 101 databases ranging in size from 2 MB to 150 GB. Most databases are around 200-300 MB, use the simple recovery model, and see little use. (The 2008 cluster is very similar, but with roughly 150 databases.)

When I install SP3 on the passive node, it works fine with no errors. But when I fail over, the storage, server name, file server, and DTC resources come online, the SQL Server resource stays in Online Pending, and SQL Server Agent is down. After 10 minutes the SQL Server resource changes to Failed and the role fails back to the other node.

Log Name:      System
Source:        Microsoft-Windows-Security-Kerberos
Date:          -
Event ID:      4
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      ACTIVE_NODE.domain.se
Description:
The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server PASSIVE_NODE$. The target name used was RPCSS/CLUSTER_NAME.domain.se. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Ensure that the target SPN is only registered on the account used by the server. This error can also happen if the target service account password is different than what is configured on the Kerberos Key Distribution Center for that target service. Ensure that the service on the server and the KDC are both configured to use the same password. If the server name is not fully qualified, and the target domain (DOMAIN.SE) is different from the client domain (DOMAIN.SE), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Security-Kerberos" Guid="{98E6CFCB-EE0A-41E0-A57B-622D4E1B30B1}" EventSourceName="Kerberos" />
    <EventID Qualifiers="16384">4</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2016-02-23T20:21:01.000000000Z" />
    <EventRecordID>1806734</EventRecordID>
    <Correlation />
    <Execution ProcessID="0" ThreadID="0" />
    <Channel>System</Channel>
    <Computer>ACTIVE_NODE.domain.se</Computer>
    <Security />
  </System>
  <EventData>
    <Data Name="Server">PASSIVE_NODE$</Data>
    <Data Name="TargetRealm">DOMAIN.SE</Data>
    <Data Name="Targetname">RPCSS/CLUSTER_NAME.domain.se</Data>
    <Data Name="ClientRealm">domain.SE</Data>
    <Binary>
    </Binary>
  </EventData>
</Event>

And this:

Log Name:      System
Source:        Microsoft-Windows-Security-Kerberos
Date:          -
Event ID:      4
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PASSIVE_NODE.domain.se
Description:
The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server ACTIVE_NODE$. The target name used was cifs/CLUSTER_NAME.domain.se. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Ensure that the target SPN is only registered on the account used by the server. This error can also happen if the target service account password is different than what is configured on the Kerberos Key Distribution Center for that target service. Ensure that the service on the server and the KDC are both configured to use the same password. If the server name is not fully qualified, and the target domain (DOMAIN.SE) is different from the client domain (DOMAIN.SE), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Security-Kerberos" Guid="{98E6CFCB-EE0A-41E0-A57B-622D4E1B30B1}" EventSourceName="Kerberos" />
    <EventID Qualifiers="16384">4</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2016-02-23T20:19:57.000000000Z" />
    <EventRecordID>1735401</EventRecordID>
    <Correlation />
    <Execution ProcessID="0" ThreadID="0" />
    <Channel>System</Channel>
    <Computer>PASSIVE_NODE.domain.se</Computer>
    <Security />
  </System>
  <EventData>
    <Data Name="Server">ACTIVE_NODE$</Data>
    <Data Name="TargetRealm">domain.SE</Data>
    <Data Name="Targetname">cifs/CLUSTER_NAME.domain.se</Data>
    <Data Name="ClientRealm">domain.SE</Data>
    <Binary>
    </Binary>
  </EventData>
</Event>

I have added all the SPNs it is complaining about with:

setspn -S cifs/CLUSTER_NAME.domain.se CLUSTER_NAME
Checking domain DC=domain,DC=se
Registering ServicePrincipalNames for CN=CLUSTER_NAME,OU=Clustername,OU=Servers, DC=domain,DC=se
        cifs/CLUSTER_NAME.domain.se
Updated object

Other entries in the event logs:

Log Name:      Application
Source:        Application Error
Date:          -
Event ID:      1000
Task Category: (100)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PASSIVE_NODE.ltdalarna.se
Description:
Faulting application name: rhs.exe, version: 6.3.9600.17396, time stamp: 0x5434e29b
Faulting module name: KERNELBASE.dll, version: 6.3.9600.18202, time stamp: 0x569e7eb1
Exception code: 0x80000003
Fault offset: 0x00000000000de0e2
Faulting process id: 0x206c
Faulting application start time: 0x01d16e778b9bb4fb
Faulting application path: C:\Windows\Cluster\rhs.exe
Faulting module path: C:\Windows\system32\KERNELBASE.dll
Report Id: 4459c209-da6b-11e5-80d8-fc15b41e47f0
Faulting package full name: 
Faulting package-relative application ID: 
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Application Error" />
    <EventID Qualifiers="0">1000</EventID>
    <Level>2</Level>
    <Task>100</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2016-02-23T20:23:15.000000000Z" />
    <EventRecordID>502077</EventRecordID>
    <Channel>Application</Channel>
    <Computer>PASSIVE_NODE.domain.se</Computer>
    <Security />
  </System>
  <EventData>
    <Data>rhs.exe</Data>
    <Data>6.3.9600.17396</Data>
    <Data>5434e29b</Data>
    <Data>KERNELBASE.dll</Data>
    <Data>6.3.9600.18202</Data>
    <Data>569e7eb1</Data>
    <Data>80000003</Data>
    <Data>00000000000de0e2</Data>
    <Data>206c</Data>
    <Data>01d16e778b9bb4fb</Data>
    <Data>C:\Windows\Cluster\rhs.exe</Data>
    <Data>C:\Windows\system32\KERNELBASE.dll</Data>
    <Data>4459c209-da6b-11e5-80d8-fc15b41e47f0</Data>
    <Data>
    </Data>
    <Data>
    </Data>
  </EventData>
</Event>

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          -
Event ID:      1146
Task Category: Resource Control Manager
Level:         Critical
Keywords:      
User:          SYSTEM
Computer:      PASSIVE_NODE.domain.se
Description:
The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
    <EventID>1146</EventID>
    <Version>0</Version>
    <Level>1</Level>
    <Task>3</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000000</Keywords>
    <TimeCreated SystemTime="2016-02-23T19:36:32.356702900Z" />
    <EventRecordID>1735312</EventRecordID>
    <Correlation />
    <Execution ProcessID="3292" ThreadID="7588" />
    <Channel>System</Channel>
    <Computer>PASSIVE_NODE.domain.se</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <EventData>
    <Data Name="NodeName">PASSIVE_NODE</Data>
  </EventData>
</Event>

Regarding this one, I tried raising the maximum failures value for the resource, without luck:

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          -
Event ID:      1254
Task Category: Resource Control Manager
Level:         Error
Keywords:      
User:          SYSTEM
Computer:      PASSIVE_NODE.ltdalarna.se
Description:
Clustered role 'SQL Server (MSSQLSERVER)' has exceeded its failover threshold.  It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state.  No additional attempts will be made to bring the role online or fail it over to another node in the cluster.  Please check the events associated with the failure.  After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
    <EventID>1254</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>3</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000000</Keywords>
    <TimeCreated SystemTime="2016-02-23T19:13:16.839580300Z" />
    <EventRecordID>1735228</EventRecordID>
    <Correlation />
    <Execution ProcessID="3292" ThreadID="7432" />
    <Channel>System</Channel>
    <Computer>PASSIVE_NODE.domain.se</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <EventData>
    <Data Name="ResourceGroup">SQL Server (MSSQLSERVER)</Data>
  </EventData>
</Event>

And then a bunch of errors about opening a log file. I tried granting rights on that folder to the AD account that the SQL Server resource runs under, but no luck; I still get these:

Log Name:      Application
Source:        ESENT
Date:          -
Event ID:      490
Task Category: General
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PASSIVE_NODE.domain.se
Description:
msmdsrv (5744) An attempt to open the file "C:\Windows\system32\LogFiles\Sum\Api.chk" for read / write access failed with system error 5 (0x00000005): "Access is denied. ".  The open file operation will fail with error -1032 (0xfffffbf8).
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="ESENT" />
    <EventID Qualifiers="0">490</EventID>
    <Level>2</Level>
    <Task>1</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2016-02-23T19:33:49.000000000Z" />
    <EventRecordID>501908</EventRecordID>
    <Channel>Application</Channel>
    <Computer>PASSIVE_NODE.domain.se</Computer>
    <Security />
  </System>
  <EventData>
    <Data>msmdsrv</Data>
    <Data>5744</Data>
    <Data>
    </Data>
    <Data>C:\Windows\system32\LogFiles\Sum\Api.chk</Data>
    <Data>-1032 (0xfffffbf8)</Data>
    <Data>5 (0x00000005)</Data>
    <Data>Access is denied. </Data>
  </EventData>
</Event>

Log Name:      Application
Source:        ESENT
Date:          -
Event ID:      489
Task Category: General
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PASSIVE_NODE.domain.se
Description:
msmdsrv (5744) An attempt to open the file "C:\Windows\system32\LogFiles\Sum\Api.log" for read only access failed with system error 5 (0x00000005): "Access is denied. ".  The open file operation will fail with error -1032 (0xfffffbf8).
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="ESENT" />
    <EventID Qualifiers="0">489</EventID>
    <Level>2</Level>
    <Task>1</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2016-02-23T19:33:59.000000000Z" />
    <EventRecordID>501909</EventRecordID>
    <Channel>Application</Channel>
    <Computer>PASSIVE_NODE.domain.se</Computer>
    <Security />
  </System>
  <EventData>
    <Data>msmdsrv</Data>
    <Data>5744</Data>
    <Data>
    </Data>
    <Data>C:\Windows\system32\LogFiles\Sum\Api.log</Data>
    <Data>-1032 (0xfffffbf8)</Data>
    <Data>5 (0x00000005)</Data>
    <Data>Access is denied. </Data>
  </EventData>
</Event>

Log Name:      Application
Source:        ESENT
Date:          2016-02-23 20:33:59
Event ID:      455
Task Category: Logging/Recovery
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PASSIVE_NODE.domain.se
Description:
msmdsrv (5744) Error -1032 (0xfffffbf8) occurred while opening logfile C:\Windows\system32\LogFiles\Sum\Api.log.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="ESENT" />
    <EventID Qualifiers="0">455</EventID>
    <Level>2</Level>
    <Task>3</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2016-02-23T19:33:59.000000000Z" />
    <EventRecordID>501910</EventRecordID>
    <Channel>Application</Channel>
    <Computer>PASSIVE_NODE.domain.se</Computer>
    <Security />
  </System>
  <EventData>
    <Data>msmdsrv</Data>
    <Data>5744</Data>
    <Data>
    </Data>
    <Data>C:\Windows\system32\LogFiles\Sum\Api.log</Data>
    <Data>-1032 (0xfffffbf8)</Data>
  </EventData>
</Event>

Log Name:      Application
Source:        ESENT
Date:          -
Event ID:      489
Task Category: General
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PASSIVE_NODE.domain.se
Description:
msmdsrv (5744) An attempt to open the file "C:\Windows\system32\LogFiles\Sum\Api.log" for read only access failed with system error 5 (0x00000005): "Access is denied. ".  The open file operation will fail with error -1032 (0xfffffbf8).
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="ESENT" />
    <EventID Qualifiers="0">489</EventID>
    <Level>2</Level>
    <Task>1</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2016-02-23T19:34:09.000000000Z" />
    <EventRecordID>501911</EventRecordID>
    <Channel>Application</Channel>
    <Computer>PASSIVE_NODE.domain.se</Computer>
    <Security />
  </System>
  <EventData>
    <Data>msmdsrv</Data>
    <Data>5744</Data>
    <Data>
    </Data>
    <Data>C:\Windows\system32\LogFiles\Sum\Api.log</Data>
    <Data>-1032 (0xfffffbf8)</Data>
    <Data>5 (0x00000005)</Data>
    <Data>Access is denied. </Data>
  </EventData>
</Event>

Log Name:      Application
Source:        ESENT
Date:          -
Event ID:      455
Task Category: Logging/Recovery
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PASSIVE_NODE.domain.se
Description:
msmdsrv (5744) Error -1032 (0xfffffbf8) occurred while opening logfile C:\Windows\system32\LogFiles\Sum\Api.log.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="ESENT" />
    <EventID Qualifiers="0">455</EventID>
    <Level>2</Level>
    <Task>3</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2016-02-23T19:34:09.000000000Z" />
    <EventRecordID>501912</EventRecordID>
    <Channel>Application</Channel>
    <Computer>PASSIVE_NODE.domain.se</Computer>
    <Security />
  </System>
  <EventData>
    <Data>msmdsrv</Data>
    <Data>5744</Data>
    <Data>
    </Data>
    <Data>C:\Windows\system32\LogFiles\Sum\Api.log</Data>
    <Data>-1032 (0xfffffbf8)</Data>
  </EventData>
</Event>

These also show up, but they appear regardless of the service pack installation:

Log Name:      System
Source:        Microsoft-Windows-DistributedCOM
Date:          -
Event ID:      10016
Task Category: None
Level:         Error
Keywords:      Classic
User:          DOMAIN\SQL_AD_ACCOUNT
Computer:      ACTIVE_NODE.domain.se
Description:
The application-specific permission settings do not grant Local Activation permission for the COM Server application with CLSID 
{FDC3723D-1588-4BA3-92D4-42C430735D7D}
 and APPID 
{83B33982-693D-4824-B42E-7196AE61BB05}
 to the user LTDALARNA\sys309 SID (S-1-5-21-910452376-877226765-825688854-92084) from address LocalHost (Using LRPC) running in the application container Unavailable SID (Unavailable). This security permission can be modified using the Component Services administrative tool.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-DistributedCOM" Guid="{1B562E86-B7AA-4131-BADC-B6F3A001407E}" EventSourceName="DCOM" />
    <EventID Qualifiers="0">10016</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8080000000000000</Keywords>
    <TimeCreated SystemTime="2016-02-23T19:40:01.178905000Z" />
    <EventRecordID>1806578</EventRecordID>
    <Correlation />
    <Execution ProcessID="976" ThreadID="19656" />
    <Channel>System</Channel>
    <Computer>ACTIVE_NODE.domain.se</Computer>
    <Security UserID="S-1-5-21-910452376-877226765-825688854-92084" />
  </System>
  <EventData>
    <Data Name="param1">application-specific</Data>
    <Data Name="param2">Local</Data>
    <Data Name="param3">Activation</Data>
    <Data Name="param4">{FDC3723D-1588-4BA3-92D4-42C430735D7D}</Data>
    <Data Name="param5">{83B33982-693D-4824-B42E-7196AE61BB05}</Data>
    <Data Name="param6">DOMAIN</Data>
    <Data Name="param7">sys309</Data>
    <Data Name="param8">S-1-5-21-910452376-877226765-825688854-92084</Data>
    <Data Name="param9">LocalHost (Using LRPC)</Data>
    <Data Name="param10">Unavailable</Data>
    <Data Name="param11">Unavailable</Data>
  </EventData>
</Event>

I have also been looking through the Windows cluster log (Get-ClusterLog), and can't find anything that stands out.
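The event blocks quoted above come from two nodes and two logs and are not listed in chronological order, which makes the sequence hard to follow. As a minimal sketch (not something from the original troubleshooting), the "Event Xml" blocks could be sorted by their `SystemTime` value in Python. The function names `summarize_event` and `build_timeline` are made up for illustration, and the sketch assumes each block is passed in as a complete `<Event>` string.

```python
import xml.etree.ElementTree as ET

# Namespace used by the <Event> elements quoted in this post.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}


def summarize_event(event_xml):
    """Return (timestamp, computer, provider, event_id) for one <Event> block."""
    system = ET.fromstring(event_xml).find("e:System", NS)
    return (
        system.find("e:TimeCreated", NS).get("SystemTime"),
        system.find("e:Computer", NS).text,
        system.find("e:Provider", NS).get("Name"),
        int(system.find("e:EventID", NS).text),
    )


def build_timeline(event_xml_blocks):
    """Sort event summaries by timestamp.

    The SystemTime strings in these logs all share one fixed-width UTC
    format, so sorting them as plain strings gives chronological order.
    """
    return sorted(summarize_event(block) for block in event_xml_blocks)
```

Calling `build_timeline` on the blocks above would place the 1254 event (19:13 UTC) first and the rhs.exe crash (20:23 UTC) last, which makes it easier to see which failures preceded which.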

With this problem on two servers with 100+ databases each, could it be that the upgrade takes too long, and the Windows cluster gets impatient and decides it has failed?

I looked into this article: [https://blogs.msdn.microsoft.com/clustering/2013/01/24/understanding-how-failover-clustering-recovers-from-unresponsive-resources/] and tried doubling the DeadlockTimeout value, without luck.

Anyone with any idea? I'm treading water here.


Solution

I found the issue after a long time. It was caused by more than 1 million files in the \MSSQL\log folder.

After I set up a job that clears that folder, the failover after the SP install worked fine.

The solution was confirmed on both this 2012 cluster and the 2008 R2 cluster where we had the same problem.
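The answer does not say how the cleanup job was implemented. As a rough sketch only, a scheduled script along these lines could keep the folder small by removing files older than a retention period. The function name `purge_old_files`, the 30-day retention in the usage note, and the choice to delete by modification time are all assumptions for illustration. Which files are safe to remove depends on the instance, so the sketch defaults to a dry run.

```python
import os
import time


def purge_old_files(folder, max_age_days, dry_run=True, now=None):
    """Remove regular files in `folder` last modified more than `max_age_days` ago.

    Returns the sorted names of the files that were removed (or, in a dry
    run, that would be removed). Subfolders are not touched.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day

    # Collect candidates first, so nothing is deleted while the directory
    # is still being enumerated. scandir avoids building a full listing
    # with a separate stat call per file, which matters with 1M+ entries.
    candidates = []
    with os.scandir(folder) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                candidates.append(entry.name)

    removed = []
    for name in sorted(candidates):
        if not dry_run:
            try:
                os.remove(os.path.join(folder, name))
            except OSError:
                # Skip files that are locked or otherwise cannot be removed.
                continue
        removed.append(name)
    return removed
```

A first run such as `purge_old_files(log_folder, 30)` only reports what would go, where `log_folder` is a placeholder for the instance's log directory. Passing `dry_run=False` performs the deletion.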

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange