為什麼我的 Always On AG 鎖定了?
我在這裡。tl;博士; 做了一個收縮文件(我知道,我知道)(將一個 100gb 的文件和 10gb 的空白空間縮小 2gb),它執行了大約 15 分鐘,然後一切都壞了。我不明白,為什麼一切都壞了?我們在一個同步的 AlwaysOn 可用性組中有兩台伺服器。sql server 2014 sp1. 伺服器 2012 r2。主節點開始拒絕帶有日誌消息的事務
Remote harden of transaction 'user_transaction' (ID 0x000000887e6bc779 0000:002ede71) started at Mar 13 2018 4:20PM in database 'AAAA' at LSN (296:17783:1) failed.
為什麼單個數據庫上的收縮文件操作會鎖定輔助數據庫上的每個數據庫?我們嘗試的第一件事是重新啟動輔助 sql 實例。這沒有效果。
從 AG (ALTER REMOVE DATABASE) 中刪除數據庫允許在該特定數據庫上恢復事務。正如預期的那樣,這使數據庫在輔助節點上處於 RESTORING 狀態。為了將數據庫放回 AG,我們嘗試在輔助節點上應用最新的 tx 日誌備份。輔助節點拒絕了 tx 日誌備份,並顯示有關 LSN 不匹配的錯誤。不幸的是,我沒有寫下具體的消息,但這不是簡單的“您需要恢復較早的 tx 備份”消息。我們給它提供了 tx 日誌備份,它會旋轉 30 秒然後出錯。
注意的主要日誌條目
2018-03-13 16:45:05.650 Backup BACKUP failed to complete the command BACKUP LOG AAAA. Check the backup application log for detailed messages. 2018-03-13 16:42:34.830 spid40s A connection for availability group 'SQLDB' from availability replica 'SQLDB02' with id [90C6C7E0-E41B-45D6-92D0-1CDCD33F49BF] to 'SQLDB03' with id [659B20D9-F54D-4CE3-A492-B1CFD4D9BC22] has been successfully established. This is an informational message only. No user action is required. (we restarted the secondary sql service) 2018-03-13 16:42:08.820 spid28s A connection timeout has occurred on a previously established connection to availability replica 'SQLDB03' with id [659B20D9-F54D-4CE3-A492-B1CFD4D9BC22]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role. 2018-03-13 16:27:40.670 BACKUP failed to complete the command BACKUP LOG AAAAA. Check the backup application log for detailed messages. 2018-03-13 16:22:18.900 DBCC SHRINKFILE for file ID 1 is waiting for the snapshot transaction with timestamp 22194769242 and other snapshot transactions linked to timestamp 22194769242 or with timestamps older than 22194835703 to finish. 2018-03-13 16:21:13.460 A connection for availability group 'SQLDB' from availability replica 'SQLDB02' with id [90C6C7E0-E41B-45D6-92D0-1CDCD33F49BF] to 'SQLDB03' with id [659B20D9-F54D-4CE3-A492-B1CFD4D9BC22] has been successfully established. This is an informational message only. No user action is required. 2018-03-13 16:21:13.390 The state of the local availability replica in availability group 'SQLDB' has changed from 'PRIMARY_PENDING' to 'PRIMARY_NORMAL'. The state changed because the local replica has completed processing Online command from Windows Server Failover Clustering (WSFC). For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log. 2018-03-13 16:21:13.310 AlwaysOn: The local replica of availability group 'SQLDB' is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required. 2018-03-13 16:20:55.550 (x 50) AlwaysOn Availability Groups connection with secondary database terminated for primary database '' on the availability replica '' with Replica ID: {659b20d9-f54d-4ce3-a492-b1cfd4d9bc22}. This is an informational message only. No user action is required. 2018-03-13 16:20:55.540 (x 1000) The client was unable to reuse a session with SPID 276, which had been reset for connection pooling. The failure ID is 46. This error may have been caused by an earlier operation failing. Check the error logs for failed AAAA immediately before this error message. 3/13/2018 4:20:44 PM: The state of the local availability replica in availability group 'SQLDB' has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL' 3/13/2018 4:20:44 PM: AlwaysOn: The local replica of availability group 'SQLDB' is preparing to transition to the resolving role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required. 3/13/2018 4:17:07 PM DBCC SHRINKFILE for file ID 1 is waiting for the snapshot transaction with timestamp 22194769242 and other snapshot transactions linked to timestamp 22194769242 or with timestamps older than 22194835703 to finish.
注意的輔助日誌條目
2018-03-13 17:02:00.990 spid63s The database 'AAAA' is marked RESTORING and is in a state that does not allow recovery to be run. 2018-03-13 17:02:00.980 spid63s Starting up database 'AAAA'. 2018-03-13 17:02:00.970 spid63s State information for database 'AAAA' - Hardended Lsn: '(2665414:190311:1)' Commit LSN: '(2665414:190298:27)' Commit Time: 'Mar 13 2018 4:20PM' 2018-03-13 17:02:00.940 spid63s State information for database 'AAAA' - Hardended Lsn: '(2665414:190311:1)' Commit LSN: '(2665414:190298:27)' Commit Time: 'Mar 13 2018 4:20PM' 2018-03-13 17:02:00.810 spid63s Nonqualified transactions are being rolled back in database AAAA for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required. 2018-03-13 17:02:00.810 spid112s AlwaysOn Availability Groups connection with primary database terminated for secondary database 'AAAA' on the availability replica 'SQLDB02' with Replica ID: {90c6c7e0-e41b-45d6-92d0-1cdcd33f49bf}. This is an informational message only. No user action is required.
此時,我們將 AAAA 從主節點上的 AG 中刪除。(5:02,不是 4:43)
2018-03-13 16:43:22.760 Logon Error: 976, Severity: 14, State: 1. 2018-03-13 16:43:22.760 Logon The target database, 'AAAA', is participating in an availability group and is currently not accessible for queries. Either data movement is suspended or the availability replica is not enabled for read access. To allow read-only access to this and other databases in the availability group, enable read access to one or more secondary availability replicas in the group. For more information, see the ALTER AVAILABILITY GROUP statement in SQL Server Books Online. 2018-03-13 16:42:02.740 spid27s CHECKDB for database 'AAAA' finished without errors on 2018-02-24 23:18:46.247 (local time). This is an informational message only; no user action is required. 2018-03-13 16:42:02.690 spid27s Recovery completed for database AAAA (database ID 6) in 26 second(s) (analysis 38 ms, redo 1040 ms, undo 0 ms.) This is an informational message only. No user action is required. 2018-03-13 16:42:02.650 spid27s 5977 transactions rolled forward in database 'AAAA' (6:0). This is an informational message only. No user action is required. 2018-03-13 16:41:34.400 spid27s State information for database 'AAAA' - Hardended Lsn: '(0:0:0)' Commit LSN: '(0:0:0)' Commit Time: 'Jan 1 1900 12:00AM' 2018-03-13 16:41:34.400 spid50s AlwaysOn Availability Groups connection with primary database terminated for secondary database 'AAAA' on the availability replica 'SQLDB02' with Replica ID: {90c6c7e0-e41b-45d6-92d0-1cdcd33f49bf}. This is an informational message only. No user action is required. 2018-03-13 16:41:34.400 spid27s State information for database 'AAAA' - Hardended Lsn: '(0:0:0)' Commit LSN: '(0:0:0)' Commit Time: 'Jan 1 1900 12:00AM' 2018-03-13 16:41:34.390 spid27s Nonqualified transactions are being rolled back in database AAAA for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required. 2018-03-13 16:41:34.110 spid22s The state of the local availability replica in availability group 'SQLDB' has changed from 'NOT_AVAILABLE' to 'RESOLVING_NORMAL'. The state changed because the local instance of SQL Server is starting up. For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log. 2018-03-13 16:41:34.100 Server The SQL Server Network Interface library could not register the Service Principal Name (SPN) [ MSSQLSvc/$DB.$DOMAIN ] for the SQL Server service. Windows return code: 0x2098, state: 15. Failure to register a SPN might cause integrated authentication to use NTLM instead of Kerberos. This is an informational message. Further action is only required if Kerberos authentication is required by authentication policies and if the SPN has not been manually registered. ...normal startup stuff. 2018-03-13 16:41:09.050 spid9s SQL Server is terminating in response to a 'stop' request from Service Control Manager. This is an informational message only. No user action is required. 2018-03-13 16:20:43.880 (x 50) spid169s AlwaysOn Availability Groups connection with primary database terminated for secondary database 'AAAA' on the availability replica 'SQLDB02' with Replica ID: {90c6c7e0-e41b-45d6-92d0-1cdcd33f49bf}. This is an informational message only. No user action is required. 2018-03-13 16:20:43.880 (x 50) spid169s AlwaysOn Availability Groups connection with primary database terminated for secondary database 'AAAA' on the availability replica 'SQLDB02' with Replica ID: {90c6c7e0-e41b-45d6-92d0-1cdcd33f49bf}. This is an informational message only. No user action is required. 2018-03-13 16:20:43.880 (x 50) spid170s The availability group database "AAAA" is changing roles from "SECONDARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required. 2018-03-13 11:57:34.050 spid106 Server process ID is 2268.
為什麼一切都破裂了?
根據您提供給我們的精選錯誤日誌輸出,shrinkfile 正在等待目前打開的快照事務完成:
2018 年 3 月 13 日下午 4:17:07 文件 ID 1 的 DBCC SHRINKFILE 正在等待時間戳為 22194769242 的快照事務和連結到時間戳 22194769242 或時間戳早於 22194835703 的其他快照事務完成。
可以肯定的是,它以這種方式停留了一段時間,因為稍後有另一個條目具有相同的事務時間戳。
在第一條記錄的消息之後,我們看到…
2018 年 3 月 13 日下午 4:20:44:AlwaysOn:可用性組“SQLDB”的本地副本正準備轉換為解析角色,以響應來自 Windows Server 故障轉移群集 (WSFC) 群集的請求。這只是一條資訊性消息。無需使用者操作。
我們需要查看集群日誌以了解它認為發生了什麼。其他使用日誌將是日誌文件夾中 sp_server_diagnostics 的輸出、故障轉移策略級別以及可能的始終執行狀況擴展事件文件。
日誌的硬化也開始出現問題:
事務“user_transaction”(ID 0x000000887e6bc779 0000:002ede71)的遠端強化於 2018 年 3 月 13 日下午 4:20 在 LSN (296:17783:1) 的數據庫“AAAA”中開始失敗。
很難說為什麼會發生這種情況,但它應該離開初級是一個相當不穩定的狀態。
為什麼單個數據庫上的收縮文件操作會鎖定輔助數據庫上的每個數據庫?
我不確定為什麼它會“鎖定每個數據庫”,因為我們只看到單個數據庫的錯誤消息和問題,我希望一個長時間執行的收縮文件在數據庫文件上保存一個 EX 鎖存器,最終導致問題給予足夠的時間。
不幸的是,沒有足夠的數據來對發生的事情做出有效的假設,但至少這會給你一些需要調查/調查的東西。