Sql-Server-2012

SQL Server 2012 cluster Service Pack 3 fails on failover node

  • June 21, 2016

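About a year ago I had this problem with a 2008R2 cluster. I asked about it here, but I didn't capture enough logs in the process. Now I'm having the same problem on my 2012 failover cluster, so I'm asking a separate question about the "new" problem.

I find it hard to believe it's just a coincidence that both clusters have the same problem. But I can't find a solution, and it takes a lot of planning to get time to test one. I'm putting it out here to see if anyone has any ideas.

The cluster is two physical nodes running Windows Server 2012 R2 Standard and SQL Server 2012 SP2. The SQL Server instance holds 101 databases, ranging in size from 2 MB to 150 GB. Most databases are around 200-300 MB, use the simple recovery model, and see very little use. (The 2008 cluster is very similar, but with 150 databases.)

When I install SP3 on the passive node, it works fine with no errors. But when I fail over, it brings the storage, server name, file server, and DTC resources online; SQL Server shows as online and SQL Server Agent is offline. After 10 minutes, it changes the SQL Server resource to "Failed" and fails back to the other node.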

Log Name:      System
Source:        Microsoft-Windows-Security-Kerberos
Date:          -
Event ID:      4
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      ACTIVE_NODE.domain.se
Description:
The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server PASSIVE_NODE$. The target name used was RPCSS/CLUSTER_NAME.domain.se. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Ensure that the target SPN is only registered on the account used by the server. This error can also happen if the target service account password is different than what is configured on the Kerberos Key Distribution Center for that target service. Ensure that the service on the server and the KDC are both configured to use the same password. If the server name is not fully qualified, and the target domain (DOMAIN.SE) is different from the client domain (DOMAIN.SE), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
 <System>
   <Provider Name="Microsoft-Windows-Security-Kerberos" Guid="{98E6CFCB-EE0A-41E0-A57B-622D4E1B30B1}" EventSourceName="Kerberos" />
   <EventID Qualifiers="16384">4</EventID>
   <Version>0</Version>
   <Level>2</Level>
   <Task>0</Task>
   <Opcode>0</Opcode>
   <Keywords>0x80000000000000</Keywords>
   <TimeCreated SystemTime="2016-02-23T20:21:01.000000000Z" />
   <EventRecordID>1806734</EventRecordID>
   <Correlation />
   <Execution ProcessID="0" ThreadID="0" />
   <Channel>System</Channel>
   <Computer>ACTIVE_NODE.domain.se</Computer>
   <Security />
 </System>
 <EventData>
   <Data Name="Server">PASSIVE_NODE$</Data>
   <Data Name="TargetRealm">DOMAIN.SE</Data>
   <Data Name="Targetname">RPCSS/CLUSTER_NAME.domain.se</Data>
   <Data Name="ClientRealm">domain.SE</Data>
   <Binary>
   </Binary>
 </EventData>
</Event>

And this one:

Log Name:      System
Source:        Microsoft-Windows-Security-Kerberos
Date:          -
Event ID:      4
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PASSIVE_NODE.domain.se
Description:
The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server ACTIVE_NODE$. The target name used was cifs/CLUSTER_NAME.domain.se. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Ensure that the target SPN is only registered on the account used by the server. This error can also happen if the target service account password is different than what is configured on the Kerberos Key Distribution Center for that target service. Ensure that the service on the server and the KDC are both configured to use the same password. If the server name is not fully qualified, and the target domain (DOMAIN.SE) is different from the client domain (DOMAIN.SE), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
 <System>
   <Provider Name="Microsoft-Windows-Security-Kerberos" Guid="{98E6CFCB-EE0A-41E0-A57B-622D4E1B30B1}" EventSourceName="Kerberos" />
   <EventID Qualifiers="16384">4</EventID>
   <Version>0</Version>
   <Level>2</Level>
   <Task>0</Task>
   <Opcode>0</Opcode>
   <Keywords>0x80000000000000</Keywords>
   <TimeCreated SystemTime="2016-02-23T20:19:57.000000000Z" />
   <EventRecordID>1735401</EventRecordID>
   <Correlation />
   <Execution ProcessID="0" ThreadID="0" />
   <Channel>System</Channel>
   <Computer>PASSIVE_NODE.domain.se</Computer>
   <Security />
 </System>
 <EventData>
   <Data Name="Server">ACTIVE_NODE$</Data>
   <Data Name="TargetRealm">domain.SE</Data>
   <Data Name="Targetname">cifs/CLUSTER_NAME.domain.se</Data>
   <Data Name="ClientRealm">domain.SE</Data>
   <Binary>
   </Binary>
 </EventData>
</Event>

I have added all the SPNs it complains about:

setspn -S cifs/CLUSTER_NAME.domain.se CLUSTER_NAME
Checking domain DC=domain,DC=se
Registering ServicePrincipalNames for CN=CLUSTER_NAME,OU=Clustername,OU=Servers,DC=domain,DC=se
        cifs/CLUSTER_NAME.domain.se
Updated object

Other entries in the error log:

Log Name:      Application
Source:        Application Error
Date:          -
Event ID:      1000
Task Category: (100)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PASSIVE_NODE.ltdalarna.se
Description:
Faulting application name: rhs.exe, version: 6.3.9600.17396, time stamp: 0x5434e29b
Faulting module name: KERNELBASE.dll, version: 6.3.9600.18202, time stamp: 0x569e7eb1
Exception code: 0x80000003
Fault offset: 0x00000000000de0e2
Faulting process id: 0x206c
Faulting application start time: 0x01d16e778b9bb4fb
Faulting application path: C:\Windows\Cluster\rhs.exe
Faulting module path: C:\Windows\system32\KERNELBASE.dll
Report Id: 4459c209-da6b-11e5-80d8-fc15b41e47f0
Faulting package full name: 
Faulting package-relative application ID: 
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
 <System>
   <Provider Name="Application Error" />
   <EventID Qualifiers="0">1000</EventID>
   <Level>2</Level>
   <Task>100</Task>
   <Keywords>0x80000000000000</Keywords>
   <TimeCreated SystemTime="2016-02-23T20:23:15.000000000Z" />
   <EventRecordID>502077</EventRecordID>
   <Channel>Application</Channel>
   <Computer>PASSIVE_NODE.domain.se</Computer>
   <Security />
 </System>
 <EventData>
   <Data>rhs.exe</Data>
   <Data>6.3.9600.17396</Data>
   <Data>5434e29b</Data>
   <Data>KERNELBASE.dll</Data>
   <Data>6.3.9600.18202</Data>
   <Data>569e7eb1</Data>
   <Data>80000003</Data>
   <Data>00000000000de0e2</Data>
   <Data>206c</Data>
   <Data>01d16e778b9bb4fb</Data>
   <Data>C:\Windows\Cluster\rhs.exe</Data>
   <Data>C:\Windows\system32\KERNELBASE.dll</Data>
   <Data>4459c209-da6b-11e5-80d8-fc15b41e47f0</Data>
   <Data>
   </Data>
   <Data>
   </Data>
 </EventData>
</Event>

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          -
Event ID:      1146
Task Category: Resource Control Manager
Level:         Critical
Keywords:      
User:          SYSTEM
Computer:      PASSIVE_NODE.domain.se
Description:
The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
 <System>
   <Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
   <EventID>1146</EventID>
   <Version>0</Version>
   <Level>1</Level>
   <Task>3</Task>
   <Opcode>0</Opcode>
   <Keywords>0x8000000000000000</Keywords>
   <TimeCreated SystemTime="2016-02-23T19:36:32.356702900Z" />
   <EventRecordID>1735312</EventRecordID>
   <Correlation />
   <Execution ProcessID="3292" ThreadID="7588" />
   <Channel>System</Channel>
   <Computer>PASSIVE_NODE.domain.se</Computer>
   <Security UserID="S-1-5-18" />
 </System>
 <EventData>
   <Data Name="NodeName">PASSIVE_NODE</Data>
 </EventData>
</Event>

Regarding this next one, I tried raising the maximum failures value for the resource, without luck:

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          -
Event ID:      1254
Task Category: Resource Control Manager
Level:         Error
Keywords:      
User:          SYSTEM
Computer:      PASSIVE_NODE.ltdalarna.se
Description:
Clustered role 'SQL Server (MSSQLSERVER)' has exceeded its failover threshold.  It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state.  No additional attempts will be made to bring the role online or fail it over to another node in the cluster.  Please check the events associated with the failure.  After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
 <System>
   <Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
   <EventID>1254</EventID>
   <Version>0</Version>
   <Level>2</Level>
   <Task>3</Task>
   <Opcode>0</Opcode>
   <Keywords>0x8000000000000000</Keywords>
   <TimeCreated SystemTime="2016-02-23T19:13:16.839580300Z" />
   <EventRecordID>1735228</EventRecordID>
   <Correlation />
   <Execution ProcessID="3292" ThreadID="7432" />
   <Channel>System</Channel>
   <Computer>PASSIVE_NODE.domain.se</Computer>
   <Security UserID="S-1-5-18" />
 </System>
 <EventData>
   <Data Name="ResourceGroup">SQL Server (MSSQLSERVER)</Data>
 </EventData>
</Event>

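Then there are a bunch of errors about opening log files. I tried granting the AD account that runs the SQL Server resource permissions on that folder, but no luck; I still get these: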

Log Name:      Application
Source:        ESENT
Date:          -
Event ID:      490
Task Category: General
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PASSIVE_NODE.domain.se
Description:
msmdsrv (5744) An attempt to open the file "C:\Windows\system32\LogFiles\Sum\Api.chk" for read / write access failed with system error 5 (0x00000005): "Access is denied. ".  The open file operation will fail with error -1032 (0xfffffbf8).
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
 <System>
   <Provider Name="ESENT" />
   <EventID Qualifiers="0">490</EventID>
   <Level>2</Level>
   <Task>1</Task>
   <Keywords>0x80000000000000</Keywords>
   <TimeCreated SystemTime="2016-02-23T19:33:49.000000000Z" />
   <EventRecordID>501908</EventRecordID>
   <Channel>Application</Channel>
   <Computer>PASSIVE_NODE.domain.se</Computer>
   <Security />
 </System>
 <EventData>
   <Data>msmdsrv</Data>
   <Data>5744</Data>
   <Data>
   </Data>
   <Data>C:\Windows\system32\LogFiles\Sum\Api.chk</Data>
   <Data>-1032 (0xfffffbf8)</Data>
   <Data>5 (0x00000005)</Data>
   <Data>Access is denied. </Data>
 </EventData>
</Event>

Log Name:      Application
Source:        ESENT
Date:          -
Event ID:      489
Task Category: General
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PASSIVE_NODE.domain.se
Description:
msmdsrv (5744) An attempt to open the file "C:\Windows\system32\LogFiles\Sum\Api.log" for read only access failed with system error 5 (0x00000005): "Access is denied. ".  The open file operation will fail with error -1032 (0xfffffbf8).
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
 <System>
   <Provider Name="ESENT" />
   <EventID Qualifiers="0">489</EventID>
   <Level>2</Level>
   <Task>1</Task>
   <Keywords>0x80000000000000</Keywords>
   <TimeCreated SystemTime="2016-02-23T19:33:59.000000000Z" />
   <EventRecordID>501909</EventRecordID>
   <Channel>Application</Channel>
   <Computer>PASSIVE_NODE.domain.se</Computer>
   <Security />
 </System>
 <EventData>
   <Data>msmdsrv</Data>
   <Data>5744</Data>
   <Data>
   </Data>
   <Data>C:\Windows\system32\LogFiles\Sum\Api.log</Data>
   <Data>-1032 (0xfffffbf8)</Data>
   <Data>5 (0x00000005)</Data>
   <Data>Access is denied. </Data>
 </EventData>
</Event>

Log Name:      Application
Source:        ESENT
Date:          2016-02-23 20:33:59
Event ID:      455
Task Category: Logging/Recovery
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PASSIVE_NODE.domain.se
Description:
msmdsrv (5744) Error -1032 (0xfffffbf8) occurred while opening logfile C:\Windows\system32\LogFiles\Sum\Api.log.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
 <System>
   <Provider Name="ESENT" />
   <EventID Qualifiers="0">455</EventID>
   <Level>2</Level>
   <Task>3</Task>
   <Keywords>0x80000000000000</Keywords>
   <TimeCreated SystemTime="2016-02-23T19:33:59.000000000Z" />
   <EventRecordID>501910</EventRecordID>
   <Channel>Application</Channel>
   <Computer>PASSIVE_NODE.domain.se</Computer>
   <Security />
 </System>
 <EventData>
   <Data>msmdsrv</Data>
   <Data>5744</Data>
   <Data>
   </Data>
   <Data>C:\Windows\system32\LogFiles\Sum\Api.log</Data>
   <Data>-1032 (0xfffffbf8)</Data>
 </EventData>
</Event>

Log Name:      Application
Source:        ESENT
Date:          -
Event ID:      489
Task Category: General
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PASSIVE_NODE.domain.se
Description:
msmdsrv (5744) An attempt to open the file "C:\Windows\system32\LogFiles\Sum\Api.log" for read only access failed with system error 5 (0x00000005): "Access is denied. ".  The open file operation will fail with error -1032 (0xfffffbf8).
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
 <System>
   <Provider Name="ESENT" />
   <EventID Qualifiers="0">489</EventID>
   <Level>2</Level>
   <Task>1</Task>
   <Keywords>0x80000000000000</Keywords>
   <TimeCreated SystemTime="2016-02-23T19:34:09.000000000Z" />
   <EventRecordID>501911</EventRecordID>
   <Channel>Application</Channel>
   <Computer>PASSIVE_NODE.domain.se</Computer>
   <Security />
 </System>
 <EventData>
   <Data>msmdsrv</Data>
   <Data>5744</Data>
   <Data>
   </Data>
   <Data>C:\Windows\system32\LogFiles\Sum\Api.log</Data>
   <Data>-1032 (0xfffffbf8)</Data>
   <Data>5 (0x00000005)</Data>
   <Data>Access is denied. </Data>
 </EventData>
</Event>

Log Name:      Application
Source:        ESENT
Date:          -
Event ID:      455
Task Category: Logging/Recovery
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PASSIVE_NODE.domain.se
Description:
msmdsrv (5744) Error -1032 (0xfffffbf8) occurred while opening logfile C:\Windows\system32\LogFiles\Sum\Api.log.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
 <System>
   <Provider Name="ESENT" />
   <EventID Qualifiers="0">455</EventID>
   <Level>2</Level>
   <Task>3</Task>
   <Keywords>0x80000000000000</Keywords>
   <TimeCreated SystemTime="2016-02-23T19:34:09.000000000Z" />
   <EventRecordID>501912</EventRecordID>
   <Channel>Application</Channel>
   <Computer>PASSIVE_NODE.domain.se</Computer>
   <Security />
 </System>
 <EventData>
   <Data>msmdsrv</Data>
   <Data>5744</Data>
   <Data>
   </Data>
   <Data>C:\Windows\system32\LogFiles\Sum\Api.log</Data>
   <Data>-1032 (0xfffffbf8)</Data>
 </EventData>
</Event>

These also show up, but they appear regardless of the service pack installation:

Log Name:      System
Source:        Microsoft-Windows-DistributedCOM
Date:          -
Event ID:      10016
Task Category: None
Level:         Error
Keywords:      Classic
User:          DOMAIN\SQL_AD_ACCOUNT
Computer:      ACTIVE_NODE.domain.se
Description:
The application-specific permission settings do not grant Local Activation permission for the COM Server application with CLSID 
{FDC3723D-1588-4BA3-92D4-42C430735D7D}
and APPID 
{83B33982-693D-4824-B42E-7196AE61BB05}
to the user LTDALARNA\sys309 SID (S-1-5-21-910452376-877226765-825688854-92084) from address LocalHost (Using LRPC) running in the application container Unavailable SID (Unavailable). This security permission can be modified using the Component Services administrative tool.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
 <System>
   <Provider Name="Microsoft-Windows-DistributedCOM" Guid="{1B562E86-B7AA-4131-BADC-B6F3A001407E}" EventSourceName="DCOM" />
   <EventID Qualifiers="0">10016</EventID>
   <Version>0</Version>
   <Level>2</Level>
   <Task>0</Task>
   <Opcode>0</Opcode>
   <Keywords>0x8080000000000000</Keywords>
   <TimeCreated SystemTime="2016-02-23T19:40:01.178905000Z" />
   <EventRecordID>1806578</EventRecordID>
   <Correlation />
   <Execution ProcessID="976" ThreadID="19656" />
   <Channel>System</Channel>
   <Computer>ACTIVE_NODE.domain.se</Computer>
   <Security UserID="S-1-5-21-910452376-877226765-825688854-92084" />
 </System>
 <EventData>
   <Data Name="param1">application-specific</Data>
   <Data Name="param2">Local</Data>
   <Data Name="param3">Activation</Data>
   <Data Name="param4">{FDC3723D-1588-4BA3-92D4-42C430735D7D}</Data>
   <Data Name="param5">{83B33982-693D-4824-B42E-7196AE61BB05}</Data>
   <Data Name="param6">DOMAIN</Data>
   <Data Name="param7">sys309</Data>
   <Data Name="param8">S-1-5-21-910452376-877226765-825688854-92084</Data>
   <Data Name="param9">LocalHost (Using LRPC)</Data>
   <Data Name="param10">Unavailable</Data>
   <Data Name="param11">Unavailable</Data>
 </EventData>
</Event>

I have also been looking through the Windows cluster log (Get-ClusterLog), but I can't find anything that stands out.
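With this many events pasted in arbitrary order, one way to look for a pattern is to sort them by their `TimeCreated` value. The following is only a minimal Python sketch of that idea: it assumes the "Event Xml" blocks have been saved as strings, and the function names `summarize_event` and `build_timeline` and the trimmed `SAMPLES` are made up for illustration (the two samples reuse the IDs and timestamps of events 1146 and 1254 above).

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace used by the "Event Xml" blocks in this post.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}


def summarize_event(event_xml):
    """Extract provider, event ID, UTC time and computer from one Event XML string."""
    system = ET.fromstring(event_xml).find("e:System", NS)
    raw_time = system.find("e:TimeCreated", NS).get("SystemTime")
    # SystemTime can carry 9 fractional digits; keep whole seconds only.
    when = datetime.strptime(raw_time[:19], "%Y-%m-%dT%H:%M:%S")
    return {
        "time": when.replace(tzinfo=timezone.utc),
        "provider": system.find("e:Provider", NS).get("Name"),
        "event_id": int(system.find("e:EventID", NS).text),
        "computer": system.find("e:Computer", NS).text,
    }


def build_timeline(event_xmls):
    """Return the event summaries sorted from oldest to newest."""
    return sorted((summarize_event(x) for x in event_xmls), key=lambda e: e["time"])


# Hypothetical, trimmed samples (deliberately out of order).
SAMPLES = [
    '<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event"><System>'
    '<Provider Name="Microsoft-Windows-FailoverClustering" />'
    '<EventID>1146</EventID>'
    '<TimeCreated SystemTime="2016-02-23T19:36:32.356702900Z" />'
    '<Computer>PASSIVE_NODE.domain.se</Computer>'
    '</System></Event>',
    '<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event"><System>'
    '<Provider Name="Microsoft-Windows-FailoverClustering" />'
    '<EventID>1254</EventID>'
    '<TimeCreated SystemTime="2016-02-23T19:13:16.839580300Z" />'
    '<Computer>PASSIVE_NODE.domain.se</Computer>'
    '</System></Event>',
]

if __name__ == "__main__":
    for event in build_timeline(SAMPLES):
        print(event["time"].isoformat(), event["computer"], event["provider"], event["event_id"])
```

Sorting both nodes' events into one list makes it easier to see which error came first on which node, although it does not by itself explain the cause.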

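Since I'm seeing this on 2 servers with 100+ databases, could it be that the upgrade takes a long time and the Windows cluster gets impatient and decides it has failed?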

I looked at this article: [https://blogs.msdn.microsoft.com/clustering/2013/01/24/understanding-how-failover-clustering-recovers-from-unresponsive-resources/] and tried doubling the DeadlockTimeout value, without luck.

Does anyone have any ideas? I'm treading water here.

It took a long time to track this down. The cause was that there were over 1 million files in the \MSSQL\log folder.
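A folder with that many files is slow to open in Explorer, so a quick count from a script can be easier. This is a minimal Python sketch, not the method used here; `count_files` is a made-up helper name, and the folder path is taken from the command line because the full path is not given in this post.

```python
import os
import sys


def count_files(folder):
    """Count regular files directly inside folder without building a full list in memory."""
    total = 0
    with os.scandir(folder) as entries:
        for entry in entries:
            # Skip subdirectories and symbolic links; only plain files are counted.
            if entry.is_file(follow_symlinks=False):
                total += 1
    return total


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_files.py <path to the log folder>
    print(count_files(sys.argv[1]))
```

`os.scandir` iterates lazily, which matters when the directory holds on the order of a million entries.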

After I set up a job that purges that folder, failover after the SP installation works fine.

The solution was confirmed both on this 2012 cluster and on the 2008R2 cluster where we had the same problem.
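The purge job mentioned above could look roughly like the following Python sketch. This is an illustration under assumptions, not the actual job: the post does not say which files were in the folder or how old they were when removed, so the age threshold, the helper name `purge_old_files`, and the choice to delete by modification time are all hypothetical. It defaults to a dry run so that the selection can be reviewed first.

```python
import os
import time


def purge_old_files(folder, max_age_days, dry_run=True, now=None):
    """Select regular files in folder older than max_age_days; delete them unless dry_run.

    Returns the sorted names of the selected files.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    selected = []
    with os.scandir(folder) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                selected.append(entry.name)
    if not dry_run:
        for name in selected:
            try:
                os.remove(os.path.join(folder, name))
            except OSError:
                # Files still held open (for example by a running service) are skipped.
                pass
    return sorted(selected)
```

Calling it first with `dry_run=True` returns the names that would be removed; scheduling it with `dry_run=False` (for instance from a SQL Server Agent job step or Task Scheduler) keeps the folder from growing back.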

Quoted from: https://dba.stackexchange.com/questions/130889