升級到 SQL Server 2017,現在可以看到 WSFC 租用超時
我最近將生產數據庫環境從 SQL Server 2016 標準版升級到 2017 企業版 (CU3)。我們在標準版上使用了基本可用性組,並在 SQL Server 就地升級後在 2017 Enterprise 上重新創建了 AG。
升級後,我們遇到了幾個 SQL Server 日誌顯示 Windows Server 故障轉移群集 WSFC 租約超時的實例。結果,我們通過偵聽器引用 AG 的應用程序失去了瞬時連接。
下面的螢幕截圖是我們的 SQL Server 日誌,從底部的第一條消息開始
錯誤:19421,嚴重性:16,狀態:1。
錯誤行寫到上午 11:01:17,表明一切在 7 秒內恢復正常:
該 AG 中的兩台 SQL Server 執行在 Windows Server 2012R2 上。在 SQL Server 升級期間,Windows Server 故障轉移群集沒有任何變化,而且在升級之前我們沒有遇到此問題。
SQL Server 引擎是否存在潛在問題?或者這是否指向 WSFC 的問題?您可以提供的任何指針將不勝感激!
3/15 更新 - 在事件期間從 PS 命令 Get-ClusterLog -UseLocalTime 添加集群日誌。
00006670.000045b0::2018/03/12-11:01:10.150 ERR [RES] SQL Server Availability Group <AG_Membership>: [hadrag] Lease renewal failed because the existing lease is no longer valid. 00006670.000045b0::2018/03/12-11:01:10.151 ERR [RES] SQL Server Availability Group <AG_Membership>: [hadrag] The lease is expired. The lease should have been renewed by 2018/03/12-16:01:00.068 000027d4.00001804::2018/03/12-11:01:10.203 INFO [GUM] Node 1: executing request locally, gumId:44530, my action: /dm/update, # of updates: 1 000027d4.00001804::2018/03/12-11:01:10.219 INFO [GUM] Node 1: executing request locally, gumId:44531, my action: /dm/update, # of updates: 1 000027d4.00001804::2018/03/12-11:01:10.463 INFO [RCM] rcm::RcmApi::FailResource: (AG_Membership) 000027d4.00004ce0::2018/03/12-11:01:10.464 INFO [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'AG_Membership', gen(4) result 0/0. 000027d4.00004ce0::2018/03/12-11:01:10.464 INFO [RCM] Res AG_Membership: Online -> ProcessingFailure( StateUnknown ) 000027d4.00004ce0::2018/03/12-11:01:10.464 INFO [RCM] TransitionToState(AG_Membership) Online-->ProcessingFailure. 000027d4.00004ce0::2018/03/12-11:01:10.464 INFO [RCM] rcm::RcmGroup::UpdateStateIfChanged: (AG_Membership, Online --> Pending) 000027d4.00004ce0::2018/03/12-11:01:10.464 ERR [RCM] rcm::RcmResource::HandleFailure: (AG_Membership) 000027d4.00002508::2018/03/12-11:01:10.464 INFO [RCM] ignored non-local state Pending for group AG_Membership 000027d4.00004ce0::2018/03/12-11:01:10.464 INFO [RCM] resource AG_Membership: failure count: 1, restartAction: 2 persistentState: 1. 000027d4.00004ce0::2018/03/12-11:01:10.464 INFO [RCM] numDependents is zero, auto-returning true 000027d4.00004ce0::2018/03/12-11:01:10.464 INFO [RCM] Greater than restartPeriod time has elapsed since first failure of AG_Membership, resetting failureTime and failureCount. 000027d4.00004ce0::2018/03/12-11:01:10.464 INFO [RCM] Will queue immediate restart (500 milliseconds) of AG_Membership after terminate is complete. 000027d4.00004ce0::2018/03/12-11:01:10.464 INFO [RCM] Res AG_Membership: ProcessingFailure -> WaitingToTerminate( DelayRestartingResource ) 000027d4.00004ce0::2018/03/12-11:01:10.464 INFO [RCM] TransitionToState(AG_Membership) ProcessingFailure-->[WaitingToTerminate to DelayRestartingResource]. 000027d4.00004ce0::2018/03/12-11:01:10.464 INFO [RCM] Res AG_Membership: [WaitingToTerminate to DelayRestartingResource] -> Terminating( DelayRestartingResource ) 000027d4.00004ce0::2018/03/12-11:01:10.464 INFO [RCM] TransitionToState(AG_Membership) [WaitingToTerminate to DelayRestartingResource]-->[Terminating to DelayRestartingResource]. 00006670.0000194c::2018/03/12-11:01:10.465 WARN [RES] SQL Server Availability Group: [hadrag] Lease timeout detected, logging perf counter data collected so far 00006670.0000194c::2018/03/12-11:01:10.465 WARN [RES] SQL Server Availability Group: [hadrag] Date/Time, Processor time(%), Available memory(bytes), Avg disk read(secs), Avg disk write(secs) 00006670.0000194c::2018/03/12-11:01:10.465 WARN [RES] SQL Server Availability Group: [hadrag] 3/12/2018 16:0:21.0, 95.167191, 1079693312.000000, 0.004913, 0.004337 00006670.0000194c::2018/03/12-11:01:10.465 WARN [RES] SQL Server Availability Group: [hadrag] 3/12/2018 16:0:31.0, 98.749704, 1048801280.000000, 0.013747, 0.003205 00006670.0000194c::2018/03/12-11:01:10.465 WARN [RES] SQL Server Availability Group: [hadrag] 3/12/2018 16:0:41.0, 91.797399, 1201926144.000000, 0.013994, 0.002175 00006670.0000194c::2018/03/12-11:01:10.465 WARN [RES] SQL Server Availability Group: [hadrag] 3/12/2018 16:0:51.0, 72.669750, 1377648640.000000, 0.003069, 0.001442 00006670.0000194c::2018/03/12-11:01:10.465 WARN [RES] SQL Server Availability Group: [hadrag] 3/12/2018 16:1:1.0, 62.595431, 1232269312.000000, 0.005762, 0.001885 00006670.0000194c::2018/03/12-11:01:10.465 INFO [RES] SQL Server Availability Group: [hadrag] Stopping Health Worker Thread 00006670.00002f90::2018/03/12-11:01:10.465 INFO [RES] SQL Server Availability Group: [hadrag] Health worker was asked to terminate 000027d4.00004ce0::2018/03/12-11:01:10.741 INFO [RCM] rcm::RcmApi::AddPossibleOwner: (AG_Membership, 1) 000027d4.00004ce0::2018/03/12-11:01:10.741 INFO [GUM] Node 1: executing request locally, gumId:44532, my action: /rcm/gum/AddPossibleOwner, # of updates: 1 000027d4.00004ce0::2018/03/12-11:01:10.741 INFO [RCM] rcm::RcmGum::AddPossibleOwner(AG_Membership,1) 000027d4.00004ce0::2018/03/12-11:01:10.741 ERR mscs::GumAgent::ExecuteHandlerLocally: (5010)' because of 'The specified node is already a possible owner.' 000027d4.00004ce0::2018/03/12-11:01:10.741 WARN [DM] Aborting group transaction 67:67:44313+1 000027d4.00004ce0::2018/03/12-11:01:10.741 ERR [RCM] rcm::RcmApi::AddPossibleOwner: (5010)' because of 'Gum handler completed as failed' 000027d4.0000569c::2018/03/12-11:01:10.741 INFO [RCM] rcm::RcmApi::MoveGroup: (AG_Membership, 1, 0, MoveType::Manual ) 000027d4.0000569c::2018/03/12-11:01:10.743 INFO [RCM] rcm::RcmApi::OnlineResource: (AG_Membership, 0) 000027d4.0000569c::2018/03/12-11:01:10.743 ERR [RCM] rcm::RcmApi::OnlineResource: (5023) because of 'The API call is not valid while resource is in the [Terminating to DelayRestartingResource] state.' 00006670.00002128::2018/03/12-11:01:10.949 INFO [RES] SQL Server Availability Group: [hadrag] SQLMoreResults() returns -1 with following information 00006670.00002f90::2018/03/12-11:01:10.950 INFO [RES] SQL Server Availability Group: [hadrag] Change diagnostics interval worker is stopped 00006670.00002128::2018/03/12-11:01:10.989 ERR [RES] SQL Server Availability Group: [hadrag] ODBC Error: [HY008] [Microsoft][SQL Server Native Client 11.0]Operation canceled (0) 00006670.00002128::2018/03/12-11:01:10.989 ERR [RES] SQL Server Availability Group: [hadrag] ODBC Error: [01000] [Microsoft][SQL Server Native Client 11.0][SQL Server] (0) 00006670.00002128::2018/03/12-11:01:10.989 INFO [RES] SQL Server Availability Group: [hadrag] No more diagnostics results 00006670.00002128::2018/03/12-11:01:10.989 INFO [RES] SQL Server Availability Group: [hadrag] Diagnostics is stopped 00006670.00002128::2018/03/12-11:01:10.989 INFO [RES] SQL Server Availability Group: [hadrag] Disconnect from SQL Server 00006670.00002128::2018/03/12-11:01:11.218 INFO [RES] SQL Server Availability Group: [hadrag] Extended Event logging is stopped 00006670.00002128::2018/03/12-11:01:11.228 INFO [RES] SQL Server Availability Group: [hadrag] Extended Event target state: 00006670.00002128::2018/03/12-11:01:11.228 INFO [RES] SQL Server Availability Group: [hadrag] Extended Event session summary: dropped buffers = 0, dropped events = 0 00006670.0000194c::2018/03/12-11:01:11.284 INFO [RES] SQL Server Availability Group: [hadrag] Stopping Change Diagnostics interval Worker Thread 00006670.0000194c::2018/03/12-11:01:11.341 INFO [RES] SQL Server Availability Group <AG_Membership>: [hadrag] Connect to SQL Server ... 00006670.0000194c::2018/03/12-11:01:11.667 INFO [RES] SQL Server Availability Group <AG_Membership>: [hadrag] The connection was established successfully 00006670.0000194c::2018/03/12-11:01:11.688 INFO [RES] SQL Server Availability Group <AG_Membership>: [hadrag] Disconnect from SQL Server 00006670.0000194c::2018/03/12-11:01:11.697 INFO [RES] SQL Server Availability Group <AG_Membership>: [hadrag] Offline call successful in Terminate function 000027d4.00004ce0::2018/03/12-11:01:11.697 INFO [RCM] HandleMonitorReply: TERMINATERESOURCE for 'AG_Membership', gen(5) result 0/0. 000027d4.00004ce0::2018/03/12-11:01:11.697 INFO [RCM] Res AG_Membership: [Terminating to DelayRestartingResource] -> DelayRestartingResource( StateUnknown ) 000027d4.00004ce0::2018/03/12-11:01:11.697 INFO [RCM] TransitionToState(AG_Membership) [Terminating to DelayRestartingResource]-->DelayRestartingResource. 000027d4.00004ce0::2018/03/12-11:01:11.697 WARN [RCM] Queueing immediate delay restart of resource AG_Membership in 500 ms. 000027d4.000047cc::2018/03/12-11:01:12.197 INFO [RCM] Delay-restarting AG_Membership and any waiting dependents. 000027d4.000047cc::2018/03/12-11:01:12.197 INFO [RCM-rbtr] giving default token to group AG_Membership 000027d4.000047cc::2018/03/12-11:01:12.197 INFO [RCM-rbtr] giving default token to group AG_Membership 000027d4.000047cc::2018/03/12-11:01:12.198 INFO [RCM] Res AG_Membership: DelayRestartingResource -> OnlineCallIssued( StateUnknown ) 000027d4.000047cc::2018/03/12-11:01:12.198 INFO [RCM] TransitionToState(AG_Membership) DelayRestartingResource-->OnlineCallIssued. 000027d4.0000569c::2018/03/12-11:01:12.198 INFO rcm::RcmResource::OnlineWorker[RCM] Issuing Online(AG_Membership) to RHS. 000027d4.0000569c::2018/03/12-11:01:12.198 INFO [RCM] HandleMonitorReply: ONLINERESOURCE for 'AG_Membership', gen(5) result 997/0. 000027d4.0000569c::2018/03/12-11:01:12.198 INFO [RCM] Res AG_Membership: OnlineCallIssued -> OnlinePending( StateUnknown ) 000027d4.0000569c::2018/03/12-11:01:12.198 INFO [RCM] TransitionToState(AG_Membership) OnlineCallIssued-->OnlinePending. 00006670.00004048::2018/03/12-11:01:12.200 INFO [RES] SQL Server Availability Group <AG_Membership>: [hadrag] The DeadLockTimeout property has a value of 300000 00006670.00004048::2018/03/12-11:01:12.204 INFO [RES] SQL Server Availability Group <AG_Membership>: [hadrag] The PendingTimeout property has a value of 180000 00006670.00004048::2018/03/12-11:01:12.207 INFO [RES] SQL Server Availability Group <AG_Membership>: [hadrag] Connect to SQL Server ... 00006670.00004048::2018/03/12-11:01:12.283 INFO [RES] SQL Server Availability Group <AG_Membership>: [hadrag] The connection was established successfully 00006670.00004048::2018/03/12-11:01:12.331 INFO [RES] SQL Server Availability Group <AG_Membership>: [hadrag] Current SQL Instance is not part of Failover clustering 00006670.00004048::2018/03/12-11:01:12.331 INFO [RES] SQL Server Availability Group: [hadrag] Starting Health Worker Thread 00006670.00005afc::2018/03/12-11:01:12.405 INFO [RES] SQL Server Availability Group: [hadrag] XEvent session MSSQLSERVER is created with RolloverCount 10, MaxFileSizeInMBytes 100, and LogPath 'D:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\LOG\' 00006670.00005afc::2018/03/12-11:01:12.406 INFO [RES] SQL Server Availability Group: [hadrag] Extended Event logging is started 00006670.00005afc::2018/03/12-11:01:12.406 INFO [RES] SQL Server Availability Group: [hadrag] Health worker started for instance SQLSERVER1 00006670.00005afc::2018/03/12-11:01:12.406 INFO [RES] SQL Server Availability Group: [hadrag] Connect to SQL Server ... 00006670.00005afc::2018/03/12-11:01:12.410 INFO [RES] SQL Server Availability Group: [hadrag] The connection was established successfully 00006670.00005afc::2018/03/12-11:01:12.438 INFO [RES] SQL Server Availability Group: [hadrag] Diagnostics is started 00006670.00005afc::2018/03/12-11:01:12.439 INFO [RES] SQL Server Availability Group: [hadrag] SQL Server component 'system' health state has been changed from '' to 'clean' at 2018-03-12 11:01:12.437 00006670.00005afc::2018/03/12-11:01:12.439 INFO [RES] SQL Server Availability Group: [hadrag] SQL Server component 'resource' health state has been changed from '' to 'clean' at 2018-03-12 11:01:12.437 00006670.00005afc::2018/03/12-11:01:12.444 INFO [RES] SQL Server Availability Group: [hadrag] SQL Server component 'query_processing' health state has been changed from '' to 'clean' at 2018-03-12 11:01:12.437 00006670.00005afc::2018/03/12-11:01:12.445 INFO [RES] SQL Server Availability Group: [hadrag] SQL Server component 'io_subsystem' health state has been changed from '' to 'clean' at 2018-03-12 11:01:12.437 00006670.00005afc::2018/03/12-11:01:12.447 INFO [RES] SQL Server Availability Group: [hadrag] SQL Server component 'events' health state has been changed from '' to 'unknown' at 2018-03-12 11:01:12.437 00006670.00004e90::2018/03/12-11:01:12.449 INFO [RES] SQL Server Availability Group: [hadrag] Change diagnostics interval worker is started 000027d4.00001804::2018/03/12-11:01:13.066 INFO [GUM] Node 1: executing request locally, gumId:44532, my action: /dm/update, # of updates: 1 000027d4.00001804::2018/03/12-11:01:13.101 INFO [GUM] Node 1: executing request locally, gumId:44533, my action: /dm/update, # of updates: 1 000027d4.000047cc::2018/03/12-11:01:13.149 INFO [RCM] rcm::RcmApi::OnlineGroup: (AG_Membership, 0) 000027d4.000047cc::2018/03/12-11:01:13.149 INFO [GUM] Node 1: executing request locally, gumId:44534, my action: /rcm/gum/SetGroupPersistentState, # of updates: 1 000027d4.000047cc::2018/03/12-11:01:13.149 INFO [RCM] rcm::RcmGum::SetGroupPersistentState(AG_Membership,1) 000027d4.000047cc::2018/03/12-11:01:13.149 INFO [DM] An empty single transaction is cancelled 67:67:44315+1::0 00006670.00004048::2018/03/12-11:01:13.159 INFO [RES] SQL Server Availability Group <AG_Membership>: [hadrag] Disconnect from SQL Server 00006670.00004048::2018/03/12-11:01:13.160 INFO [RHS] Resource AG_Membership has come online. RHS is about to report status change to RCM 000027d4.0000569c::2018/03/12-11:01:13.161 INFO [RCM] HandleMonitorReply: ONLINERESOURCE for 'AG_Membership', gen(5) result 0/0. 000027d4.0000569c::2018/03/12-11:01:13.161 INFO [RCM] Res AG_Membership: OnlinePending -> Online( StateUnknown ) 000027d4.0000569c::2018/03/12-11:01:13.161 INFO [RCM] TransitionToState(AG_Membership) OnlinePending-->Online. 000027d4.0000569c::2018/03/12-11:01:13.161 INFO [RCM] rcm::RcmGroup::UpdateStateIfChanged: (AG_Membership, Pending --> Online) 000027d4.00002508::2018/03/12-11:01:13.161 INFO [RCM] ignored non-local state Online for group AG_Membership 000027d4.00004ce0::2018/03/12-11:01:13.161 INFO [RCM] moved 0 tasks from staging set to task set. TaskSetSize=0 000027d4.00004ce0::2018/03/12-11:01:13.161 INFO [RCM] rcm::RcmPriorityManager::StartGroups: [RCM] done, executed 0 tasks 000027d4.000047cc::2018/03/12-11:01:13.163 INFO [RCM] Group AG_Membership: done going through resources, returning true 000027d4.000047cc::2018/03/12-11:01:13.166 INFO [RCM] rcm::RcmApi::AddPossibleOwner: (AG_Membership, 1) 000027d4.000047cc::2018/03/12-11:01:13.166 INFO [GUM] Node 1: executing request locally, gumId:44535, my action: /rcm/gum/AddPossibleOwner, # of updates: 1 000027d4.000047cc::2018/03/12-11:01:13.166 INFO [RCM] rcm::RcmGum::AddPossibleOwner(AG_Membership,1) 000027d4.000047cc::2018/03/12-11:01:13.167 ERR mscs::GumAgent::ExecuteHandlerLocally: (5010)' because of 'The specified node is already a possible owner.' 000027d4.000047cc::2018/03/12-11:01:13.167 WARN [DM] Aborting group transaction 67:67:44315+1 000027d4.000047cc::2018/03/12-11:01:13.167 ERR [RCM] rcm::RcmApi::AddPossibleOwner: (5010)' because of 'Gum handler completed as failed' 000027d4.000047cc::2018/03/12-11:01:13.181 WARN [RCM] AG_Membership cannot be hosted on node 2 000027d4.000047cc::2018/03/12-11:01:13.181 WARN [RCM] Possible owners: <vector len='1'> 000027d4.000047cc::2018/03/12-11:01:13.181 WARN <item>1</item> 000027d4.000047cc::2018/03/12-11:01:13.181 WARN </vector> 000027d4.0000569c::2018/03/12-11:01:13.181 INFO [GUM] Node 1: executing request locally, gumId:44535, my action: /rcm/gum/SetGroupPreferredOwners, # of updates: 1 000027d4.0000569c::2018/03/12-11:01:13.181 INFO [RCM] rcm::RcmGum::SetGroupPreferredOwners(AG_Membership,<vector len='1'> 000027d4.0000569c::2018/03/12-11:01:13.181 INFO <item>1</item> 000027d4.0000569c::2018/03/12-11:01:13.181 INFO </vector> 000027d4.0000569c::2018/03/12-11:01:13.181 INFO ) 000027d4.00004ce0::2018/03/12-11:01:13.270 INFO [GUM] Node 1: executing request locally, gumId:44536, my action: /dm/update, # of updates: 1 000047bc.00004d8c::2018/03/12-11:01:13.271 INFO [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:212d9bab-6bb4-4c46-8b56-20e1e60fd7ca:Netbios 000027d4.00001804::2018/03/12-11:01:15.802 INFO [RCM] rcm::RcmApi::OnlineResource: (AG_Membership, 0) 000027d4.0000569c::2018/03/12-11:01:17.362 INFO [GUM] Node 1: executing request locally, gumId:44537, my action: /dm/update, # of updates: 1 000027d4.0000569c::2018/03/12-11:01:17.368 INFO [GUM] Node 1: executing request locally, gumId:44538, my action: /dm/update, # of updates: 1 000027d4.00001aac::2018/03/12-11:01:17.404 INFO [GUM] Node 1: executing request locally, gumId:44539, my action: /dm/update, # of updates: 1 000027d4.0000569c::2018/03/12-11:01:17.411 INFO [GUM] Node 1: executing request locally, gumId:44540, my action: /dm/update, # of updates: 1 000027d4.0000569c::2018/03/12-11:01:17.425 INFO [GUM] Node 1: executing request locally, gumId:44541, my action: /dm/update, # of updates: 1 000027d4.0000569c::2018/03/12-11:01:17.430 INFO [GUM] Node 1: executing request locally, gumId:44542, my action: /dm/update, # of updates: 1 000027d4.00004ce0::2018/03/12-11:01:17.436 INFO [GUM] Node 1: executing request locally, gumId:44543, my action: /dm/update, # of updates: 1 000027d4.0000569c::2018/03/12-11:01:17.442 INFO [GUM] Node 1: executing request locally, gumId:44544, my action: /dm/update, # of updates: 1 000027d4.00001aac::2018/03/12-11:01:17.445 INFO [GUM] Node 1: executing request locally, gumId:44545, my action: /dm/update, # of updates: 1 000027d4.00001804::2018/03/12-11:01:17.456 INFO [GUM] Node 1: executing request locally, gumId:44546, my action: /dm/update, # of updates: 1 000027d4.00001804::2018/03/12-11:01:17.460 INFO [GUM] Node 1: executing request locally, gumId:44547, my action: /dm/update, # of updates: 1 000027d4.00006198::2018/03/12-11:01:17.473 INFO [GUM] Node 1: executing request locally, gumId:44548, my action: /dm/update, # of updates: 1 000047bc.00004d8c::2018/03/12-11:01:18.272 INFO [RES] Network Name: Agent: Sending request Netname/RecheckConfig to NN:212d9bab-6bb4-4c46-8b56-20e1e60fd7ca:Netbios 000047bc.00004d8c::2018/03/12-11:01:19.980 INFO [RES] Network Name <AG_Membership_LISTERNER01>: Dns: HealthCheck: LISTERNER01
根據更新的資訊,伺服器似乎出現了CPU 不足的情況。我說點什麼,因為它可能在 SQL Server 內部 - 例如錯誤的並行計劃(或多次呼叫它)、不正確的 MaxDOP 等 - 或者在 SQL Server 之外,我們僅從日誌中不知道哪些程序佔用了最多的CPU。
實際問題並沒有告訴我們太多,只是租約預計會提前 10 秒更新:
2018/03/12- 11:01:10.151錯誤
$$ RES $$SQL Server 可用性組:$$ hadrag $$租約到期。租約應在 2018/03/12- 16:01:00.068 之前續簽
它最有可能的原因(但不是 100%,因為如果這是一個 VM,它可能已被過度訂閱、暫停快照、暫停遷移等)是由於集群日誌中稍後的資訊:
日期時間 CPU%
3/12/2018 16:0:21.0, 95.167191
3/12/2018 16:0:31.0, 98.749704
3/12/2018 16:0:41.0, 91.797399
3/12/2018 16:0:51.0, 72.669750
3/12/2018 16:1:1.0, 62.595431
如果您看一下,租約在 16:01:00 超時,而前 30 秒的 CPU 使用率為 92-99%,這恰好與預設租約超時設置一致。
怎麼辦?
這是從 2016 年到 2017 年的升級**,**以及從標準版到企業版的版本升級。我首先會調查諸如 MAXDOP、數據庫兼容性級別(針對優化器和計劃)之類的事情,並觀察是否有任何查詢在消耗 CPU 時間時並行進行。我的猜測是它與此有關,否則它是 SQL Server 之外的東西導致高 CPU(您可以通過環形緩衝區在 SQL Server 中觀看)。