SQL_SLAVE_SKIP_COUNTER = 1 fails; setting @@gtid_slave_pos to skip past a given GTID position
I recently broke replication while trying to skip past a bad transaction. This is what I got:
    MariaDB [(none)]> STOP SLAVE;
    Query OK, 0 rows affected (0.05 sec)

    MariaDB [(none)]> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
    ERROR 1966 (HY000): When using parallel replication and GTID with multiple replication domains, @@sql_slave_skip_counter cannot be used. Instead, setting @@gtid_slave_pos explicitly can be used to skip to after a given GTID position.

    MariaDB [(none)]> select @@gtid_slave_pos;
    +---------------------------------------------+
    | @@gtid_slave_pos                            |
    +---------------------------------------------+
    | 0-1051-1391406,1-1050-1182069,57-1051-98897 |
    +---------------------------------------------+
    1 row in set (0.00 sec)

    MariaDB [(none)]> show variables like '%_pos%';
    +----------------------+---------------------------------------------------------+
    | Variable_name        | Value                                                   |
    +----------------------+---------------------------------------------------------+
    | gtid_binlog_pos      | 0-1051-1391406,2-1051-4474,57-1051-98897                |
    | gtid_current_pos     | 0-1051-1391406,1-1050-1182069,2-1051-4474,57-1051-98897 |
    | gtid_slave_pos       | 0-1051-1391406,1-1050-1182069,57-1051-98897             |
    | wsrep_start_position | 00000000-0000-0000-0000-000000000000:-1                 |
    +----------------------+---------------------------------------------------------+
What do I need to do to fix this?
Update 1
    MariaDB [(none)]> show variables like '%gtid%';
    +------------------------+------------------------------------------+
    | Variable_name          | Value                                    |
    +------------------------+------------------------------------------+
    | gtid_binlog_pos        | 1-1050-4820789,2-1051-379101,3-1010-3273 |
    | gtid_binlog_state      | 1-1050-4820789,2-1051-379101,3-1010-3273 |
    | gtid_current_pos       | 1-1050-4819948,2-1051-379101,3-1010-3273 |
    | gtid_domain_id         | 3                                        |
    | gtid_ignore_duplicates | OFF                                      |
    | gtid_seq_no            | 0                                        |
    | gtid_slave_pos         | 1-1050-4819948,2-1051-379101,3-1010-3273 |
    | gtid_strict_mode       | OFF                                      |
    | last_gtid              |                                          |
    | wsrep_gtid_domain_id   | 0                                        |
    | wsrep_gtid_mode        | OFF                                      |
    +------------------------+------------------------------------------+
Following the instructions in the error message, I tried setting @@gtid_slave_pos. This was the slave status beforehand:
    MariaDB [(none)]> show slave status\G
    *************************** 1. row ***************************
    Slave_IO_State: Waiting for master to send event
    Master_Host: [redacted]
    Master_User: [redacted]
    Master_Port: 3306
    Connect_Retry: 5
    Master_Log_File: binary.000591
    Read_Master_Log_Pos: 526511543
    Relay_Log_File: tmsdb-relay-bin.001239
    Relay_Log_Pos: 4
    Relay_Master_Log_File: binary.000591
    Slave_IO_Running: Yes
    Slave_SQL_Running: No
    Replicate_Do_DB:
    Replicate_Ignore_DB:
    Replicate_Do_Table:
    Replicate_Ignore_Table:
    Replicate_Wild_Do_Table:
    Replicate_Wild_Ignore_Table:
    Last_Errno: 1062
    Last_Error: Could not execute Write_rows_v1 event on table [redacted] Duplicate entry '1134890' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log binary.000591, end_log_pos 60726493
    Skip_Counter: 0
    Exec_Master_Log_Pos: 60724897
    Relay_Log_Space: 465787660
    Until_Condition: None
    Until_Log_File:
    Until_Log_Pos: 0
    Master_SSL_Allowed: No
    Master_SSL_CA_File:
    Master_SSL_CA_Path:
    Master_SSL_Cert:
    Master_SSL_Cipher:
    Master_SSL_Key:
    Seconds_Behind_Master: NULL
    Master_SSL_Verify_Server_Cert: No
    Last_IO_Errno: 0
    Last_IO_Error:
    Last_SQL_Errno: 1062
    Last_SQL_Error: Could not execute Write_rows_v1 event on table [redacted] Duplicate entry '1134890' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log binary.000591, end_log_pos 60726493
    Replicate_Ignore_Server_Ids:
    Master_Server_Id: 1050
    Master_SSL_Crl:
    Master_SSL_Crlpath:
    Using_Gtid: Current_Pos
    Gtid_IO_Pos: 1-1050-4827753,2-1051-379101,3-1010-3273
    Replicate_Do_Domain_Ids:
    Replicate_Ignore_Domain_Ids:
    Parallel_Mode: optimistic
    1 row in set (0.00 sec)
Using the gtid_slave_pos variable:
    MariaDB [(none)]> select @@gtid_slave_pos\G;
    *************************** 1. row ***************************
    @@gtid_slave_pos: 1-1050-4819948,2-1051-379101,3-1010-3273

    MariaDB [(none)]> stop slave;
    Query OK, 0 rows affected (0.21 sec)

    MariaDB [(none)]> SET GLOBAL gtid_slave_pos='1-1050-4819948,2-1051-379101,3-1010-3274';
    Query OK, 0 rows affected (0.10 sec)

    MariaDB [(none)]> start slave;
    Query OK, 0 rows affected (0.21 sec)
When I check the status after running the above, I get:
Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 3-1010-3274, which is not in the master's binlog'
    MariaDB [(none)]> show slave status\G
    *************************** 1. row ***************************
    Slave_IO_State:
    Master_Host: 10.56.228.64
    Master_User: maxscale
    Master_Port: 3306
    Connect_Retry: 5
    Master_Log_File: binary.000591
    Read_Master_Log_Pos: 60724897
    Relay_Log_File: tmsdb-relay-bin.001239
    Relay_Log_Pos: 4
    Relay_Master_Log_File: binary.000591
    Slave_IO_Running: No
    Slave_SQL_Running: Yes
    Replicate_Do_DB:
    Replicate_Ignore_DB:
    Replicate_Do_Table:
    Replicate_Ignore_Table:
    Replicate_Wild_Do_Table:
    Replicate_Wild_Ignore_Table:
    Last_Errno: 0
    Last_Error:
    Skip_Counter: 0
    Exec_Master_Log_Pos: 60724897
    Relay_Log_Space: 249
    Until_Condition: None
    Until_Log_File:
    Until_Log_Pos: 0
    Master_SSL_Allowed: No
    Master_SSL_CA_File:
    Master_SSL_CA_Path:
    Master_SSL_Cert:
    Master_SSL_Cipher:
    Master_SSL_Key:
    Seconds_Behind_Master: NULL
    Master_SSL_Verify_Server_Cert: No
    Last_IO_Errno: 1236
    Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 3-1010-3274, which is not in the master's binlog'
    Last_SQL_Errno: 0
    Last_SQL_Error:
    Replicate_Ignore_Server_Ids:
    Master_Server_Id: 1050
    Master_SSL_Crl:
    Master_SSL_Crlpath:
    Using_Gtid: Current_Pos
    Gtid_IO_Pos: 1-1050-4819948,2-1051-379101,3-1010-3274
    Replicate_Do_Domain_Ids:
    Replicate_Ignore_Domain_Ids:
    Parallel_Mode: optimistic
    1 row in set (0.00 sec)
I can get back to the previous position with:
    MariaDB [(none)]> stop slave;
    Query OK, 0 rows affected (0.01 sec)

    MariaDB [(none)]> SET GLOBAL gtid_slave_pos='1-1050-4819948,2-1051-379101,3-1010-3273';
    Query OK, 0 rows affected (0.09 sec)

    MariaDB [(none)]> start slave;
    Query OK, 0 rows affected (0.06 sec)
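The 1236 error is consistent with the position having been advanced in domain 3, which is this server's own gtid_domain_id, rather than in the domain the failing event came from. As a minimal sketch, assuming the goal is to skip exactly one transaction in one replication domain, the helper below takes a gtid_slave_pos string and increments the sequence number for the chosen domain only. The function name `skip_one_in_domain` is hypothetical, and the choice of domain 1 is an assumption based on the master's server id 1050; confirm the failing GTID in the error log before using a computed position.

```python
def skip_one_in_domain(gtid_pos: str, domain_id: int) -> str:
    """Return gtid_pos with the sequence number of one domain incremented by 1.

    gtid_pos is a comma-separated list of domain-server-sequence triples,
    in the format printed by SELECT @@gtid_slave_pos.
    """
    parts = []
    found = False
    for gtid in gtid_pos.split(","):
        domain, server, seq = gtid.strip().split("-")
        if int(domain) == domain_id:
            # Only the chosen domain moves forward; the others stay untouched.
            seq = str(int(seq) + 1)
            found = True
        parts.append(f"{domain}-{server}-{seq}")
    if not found:
        raise ValueError(f"domain {domain_id} not present in {gtid_pos!r}")
    return ",".join(parts)


current = "1-1050-4819948,2-1051-379101,3-1010-3273"
# Assumption: the failing event belongs to domain 1 (server id 1050).
new_pos = skip_one_in_domain(current, 1)
print(f"SET GLOBAL gtid_slave_pos = '{new_pos}';")
```

Incrementing by one assumes the next transaction in that domain was written by the same server id. Taking the failing GTID directly from the error log avoids that assumption.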
In production I found that Parallel_Mode was the most likely cause of my problem.
I would advise against using optimistic, which is what I was running:
    MariaDB [(none)]> select @@slave_parallel_mode\G
    *************************** 1. row ***************************
    @@slave_parallel_mode: optimistic
You may get the following error:
    pt-slave-restart
    2018-02-09T10:39:19 tmsdb-relay-bin.000388 4 1032
    DBD::mysql::st execute failed: When using parallel replication and GTID with multiple replication domains, @@sql_slave_skip_counter can not be used. Instead, setting @@gtid_slave_pos explicitly can be used to skip to after a given GTID position. [for Statement "SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1"] at /bin/pt-slave-restart line 5122.
In the log I see the following:
    tail /var/log/mariadb.log
    2018-02-09 10:35:46 139919003784960 [ERROR] Slave SQL: Could not execute Update_rows_v1 event on table [tablename]; Can't find record in '[tablename]', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binary.000953, end_log_pos 264325215, Gtid 1-1050-13462991, Internal MariaDB error code: 1032
    2018-02-09 10:35:46 139919003784960 [Warning] Slave: Can't find record in '[tablename]' Error_code: 1032
    2018-02-09 10:35:46 139919003784960 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binary.000953' position 262879171; GTID position '1-1050-13462990,2-1051-379101,3-1010-3273'
    2018-02-09 10:35:46 139918776985344 [Note] Slave SQL thread exiting, replication stopped in log 'binary.000953' at position 262879171; GTID position '1-1050-13462990,2-1051-379101,3-1010-3273'
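The log above names both the failing GTID (`Gtid 1-1050-13462991`) and the position the SQL thread stopped at. As a rough sketch, assuming the log lines keep this wording, the snippet below pulls both values out and builds the position that would resume after the failing transaction. The function names are hypothetical and the sample text is an abbreviated copy of the log lines above; treat the printed statement as something to review, not to run blindly.

```python
import re


def failing_gtid(log_text):
    """Return the GTID of the event the SQL thread failed on, or None."""
    match = re.search(r"\bGtid (\d+-\d+-\d+)", log_text)
    return match.group(1) if match else None


def stopped_position(log_text):
    """Return the GTID position reported when the SQL thread stopped, or None."""
    match = re.search(r"GTID position '([^']+)'", log_text)
    return match.group(1) if match else None


def position_after(stopped, failed):
    """Replace the entry for the failing GTID's domain with the failing GTID."""
    failed_domain = failed.split("-")[0]
    entries = [
        failed if entry.split("-")[0] == failed_domain else entry
        for entry in stopped.split(",")
    ]
    if failed not in entries:
        # The domain was not in the stopped position yet, so add it.
        entries.append(failed)
    return ",".join(entries)


# Abbreviated copy of two of the log lines shown above.
log = (
    "[ERROR] Slave SQL: Could not execute Update_rows_v1 event on table "
    "[tablename]; end_log_pos 264325215, Gtid 1-1050-13462991, "
    "Internal MariaDB error code: 1032\n"
    "[ERROR] Error running query, slave SQL thread aborted. We stopped at log "
    "'binary.000953' position 262879171; "
    "GTID position '1-1050-13462990,2-1051-379101,3-1010-3273'\n"
)

failed = failing_gtid(log)
stopped = stopped_position(log)
print(f"SET GLOBAL gtid_slave_pos = '{position_after(stopped, failed)}';")
```

Setting gtid_slave_pos to the failing GTID tells the slave that transaction has already been applied, so replication resumes with the one after it. The skipped change is still missing from the slave's data.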
To restart the slave after it has failed, you can do the following.
Stop all slave_parallel_threads and disable slave_parallel_mode:
    MariaDB [(none)]> stop slave;
    Query OK, 0 rows affected (0.35 sec)

    MariaDB [(none)]> set global slave_parallel_threads = 0;
    Query OK, 0 rows affected (0.00 sec)

    MariaDB [(none)]> set global slave_parallel_mode = none;
    Query OK, 0 rows affected (0.00 sec)

    MariaDB [(none)]> Start SLAVE;
    Query OK, 0 rows affected (0.00 sec)
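I now use pt-slave-restart to restart the slave, because when I just want to get the slave running I don't have to think about sequence numbers and a whole set of other things that take too long. pt-slave-restart keeps running and skipping errors; press ctrl-c to close it once you are happy that your slave has caught up. It does nothing very different from the following, but it does it automatically: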
    STOP SLAVE;
    SET GLOBAL sql_slave_skip_counter = 1;
    START SLAVE;
If you need parallel threads, you can re-enable them once the slave has caught up or has passed the event that caused the problem. I would try a different slave_parallel_mode, such as conservative:

    MariaDB [(none)]> stop slave;
    Query OK, 0 rows affected (0.01 sec)

    MariaDB [(none)]> set global slave_parallel_threads = 4;
    Query OK, 0 rows affected (0.00 sec)

    MariaDB [(none)]> set global slave_parallel_mode = conservative;
    Query OK, 0 rows affected (0.00 sec)

    MariaDB [(none)]> start slave;
    Query OK, 0 rows affected (0.09 sec)
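I found that the following worked for me. It does not restore the slave to exactly the same state as the master; there will be data differences. I will use pt-table-sync to fix those.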
**1. Restart replication without the GTID method
2. Stop the parallel slave threads
3. Enable GTID replication
4. Use percona-toolkit pt-slave-restart to skip all errors.**
1. Restart replication without the GTID method, using the master binlog position
    CHANGE MASTER TO
      MASTER_HOST='12.34.56.789',
      MASTER_USER='slave_user',
      MASTER_PASSWORD='password',
      MASTER_LOG_FILE='mysql-bin.000001',
      MASTER_LOG_POS=107;
This is well documented; search for instructions.
2. Stop the parallel slave threads
As the original question shows, this is part of the problem.
ERROR 1966 (HY000): When using parallel replication and GTID with multiple replication domains, @@sql_slave_skip_counter cannot be used. Instead, setting @@gtid_slave_pos explicitly can be used to skip to after a given GTID position.
I want to be able to skip events without having to work out or increment the GTID position for each one.
    MariaDB [(none)]> stop slave;
    Query OK, 0 rows affected (0.35 sec)

    MariaDB [(none)]> set global slave_parallel_threads = 0;
    Query OK, 0 rows affected (0.00 sec)

    MariaDB [(none)]> set global slave_parallel_mode = none;
    Query OK, 0 rows affected (0.00 sec)

    MariaDB [(none)]> Start SLAVE;
    Query OK, 0 rows affected (0.00 sec)
Now if I check the parallel slave threads, I see:
    MariaDB [(none)]> show slave status \G
    *************************** 1. row ***************************
    ..........
    Parallel_Mode: none
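Once I am done, I can reverse this process to re-enable the parallel slave threads, and I know GTID is working.
3. Enable GTID replication
I can now try to restart the slave with GTID enabled.
On the master: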
    MariaDB [(none)]> SHOW MASTER STATUS\G
    *************************** 1. row ***************************
    File: mariadb-bin.000001
    Position: 510
    Binlog_Do_DB:
    Binlog_Ignore_DB:
    1 row in set (0.00 sec)

    SELECT BINLOG_GTID_POS('mariadb-bin.000001', 510);
    +--------------------------------------------+
    | BINLOG_GTID_POS('mariadb-bin.000001', 510) |
    +--------------------------------------------+
    | 1-101-1                                    |
    +--------------------------------------------+
    1 row in set (0.00 sec)
On the slave:
    STOP SLAVE;
    SET GLOBAL gtid_slave_pos = '1-101-1';
    CHANGE MASTER TO master_use_gtid=slave_pos;
    START SLAVE;
Now when I check the slave, it has some events to skip before it gets back to the same state as the master.
Last_Error: An attempt was made to binlog GTID 1-1050-5004291 which would create an out-of-order sequence number with existing GTID 1-1050-5004322, and gtid strict mode is enabled.
    MariaDB [(none)]> show slave status \G
    *************************** 1. row ***************************
    Slave_IO_State: Waiting for master to send event
    Master_Log_File: binary.000599
    Read_Master_Log_Pos: 364810491
    Relay_Log_File: tmsdb-relay-bin.001240
    Relay_Log_Pos: 716
    Relay_Master_Log_File: binary.000599
    Slave_IO_Running: Yes
    Slave_SQL_Running: No
    Replicate_Do_DB:
    Replicate_Ignore_DB:
    Replicate_Do_Table:
    Replicate_Ignore_Table:
    Replicate_Wild_Do_Table:
    Replicate_Wild_Ignore_Table:
    Last_Errno: 1950
    Last_Error: An attempt was made to binlog GTID 1-1050-5004291 which would create an out-of-order sequence number with existing GTID 1-1050-5004322, and gtid strict mode is enabled.
    Skip_Counter: 0
    Exec_Master_Log_Pos: 286447058
    Relay_Log_Space: 78364447
    Until_Condition: None
    Until_Log_File:
    Until_Log_Pos: 0
    Master_SSL_Allowed: No
    Master_SSL_CA_File:
    Master_SSL_CA_Path:
    Master_SSL_Cert:
    Master_SSL_Cipher:
    Master_SSL_Key:
    Seconds_Behind_Master: NULL
    Master_SSL_Verify_Server_Cert: No
    Last_IO_Errno: 0
    Last_IO_Error:
    Last_SQL_Errno: 1950
    Last_SQL_Error: An attempt was made to binlog GTID 1-1050-5004291 which would create an out-of-order sequence number with existing GTID 1-1050-5004322, and gtid strict mode is enabled.
    Replicate_Ignore_Server_Ids:
    Master_Server_Id: 1050
    Master_SSL_Crl:
    Master_SSL_Crlpath:
    Using_Gtid: Slave_Pos
    Gtid_IO_Pos: 1-1050-5005223,2-1051-379101,3-1010-3273
    Replicate_Do_Domain_Ids:
    Replicate_Ignore_Domain_Ids:
    Parallel_Mode: none
    1 row in set (0.00 sec)
4. Use percona-toolkit pt-slave-restart to skip all errors
    sudo yum install http://www.percona.com/downloads/percona-release/redhat/0.1-4/percona-release-0.1-4.noarch.rpm
    sudo yum search percona-toolkit
pt-slave-restart will skip every event that needs skipping to get the slave back into a working state.
    # pt-slave-restart
    2017-12-22T13:39:59 tmsdb-relay-bin.001240 716 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 69702 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 97912 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 98144 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 363903 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 364135 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 712776 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 713008 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 759737 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 827932 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 828164 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 934851 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 952088 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 952320 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 1084249 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 1084481 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 1351188 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 1351420 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 1621561 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 1693920 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 1711677 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 1711909 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 1880931 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 1881163 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 1916544 1950
    2017-12-22T13:40:00 tmsdb-relay-bin.001240 2124672 1950
    2017-12-22T13:40:01 tmsdb-relay-bin.001240 2124904 1950
    2017-12-22T13:40:01 tmsdb-relay-bin.001240 2125136 1950
    2017-12-22T13:40:01 tmsdb-relay-bin.001240 2452030 1950
    2017-12-22T13:40:01 tmsdb-relay-bin.001240 2452262 1950
    2017-12-22T13:40:01 tmsdb-relay-bin.001240 2819749 1950
    2017-12-22T13:40:01 tmsdb-relay-bin.001240 2819981 1950
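Every line of that output is one skipped event, so the number of lines gives a rough idea of how far the slave's data may have drifted before running pt-table-sync. As a small sketch, assuming the four-column layout shown above (timestamp, relay log file, relay log position, error number), the hypothetical helper below counts skips per error code from saved output.

```python
from collections import Counter


def count_skips(output: str) -> Counter:
    """Count skipped events per error number in saved pt-slave-restart output."""
    counts = Counter()
    for line in output.splitlines():
        fields = line.split()
        # Expected layout: timestamp, relay log file, relay log position, errno.
        if len(fields) >= 4 and "T" in fields[0] and fields[3].isdigit():
            counts[int(fields[3])] += 1
    return counts


# First three lines of the output shown above.
sample = """\
# pt-slave-restart
2017-12-22T13:39:59 tmsdb-relay-bin.001240 716 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 69702 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 97912 1950
"""

print(dict(count_skips(sample)))
```

Lines that do not match the layout, such as the command line itself or a DBD::mysql error message, are ignored.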
Now when I check my slave status:
    MariaDB [(none)]> show slave status \G
    *************************** 1. row ***************************
    Slave_IO_State: Waiting for master to send event
    Master_Host: masterhost
    Master_User: maxscale
    Master_Port: 3306
    Connect_Retry: 5
    Master_Log_File: binary.000600
    Read_Master_Log_Pos: 37801368
    Relay_Log_File: tmsdb-relay-bin.001242
    Relay_Log_Pos: 37801653
    Relay_Master_Log_File: binary.000600
    Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
    Last_Errno: 0
    Last_Error:
    Skip_Counter: 0
    Exec_Master_Log_Pos: 37801368
    Relay_Log_Space: 37801991
    Until_Condition: None
    Seconds_Behind_Master: 0
    Master_SSL_Verify_Server_Cert: No
    Last_IO_Errno: 0
    Last_IO_Error:
    Last_SQL_Errno: 0
    Last_SQL_Error:
    Master_Server_Id: 1050
    Using_Gtid: Slave_Pos
    Gtid_IO_Pos: 1-1050-5014401,2-1051-379101,3-1010-3273
    Parallel_Mode: none
    1 row in set (0.00 sec)
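The fields that matter in that output are Slave_IO_Running, Slave_SQL_Running, Last_Errno and Seconds_Behind_Master. A minimal sketch of checking them from captured `SHOW SLAVE STATUS\G` text follows; it assumes one `Field: value` pair per line, and both function names are hypothetical.

```python
def parse_status(text: str) -> dict:
    """Parse SHOW SLAVE STATUS\\G text into a dict of field name to value."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only; error messages may contain more.
        key, sep, value = line.strip().partition(":")
        if sep and not key.startswith("*"):
            status[key.strip()] = value.strip()
    return status


def is_healthy(status: dict) -> bool:
    """True when both threads run, no error is recorded and lag is known."""
    return (
        status.get("Slave_IO_Running") == "Yes"
        and status.get("Slave_SQL_Running") == "Yes"
        and status.get("Last_Errno") == "0"
        and status.get("Seconds_Behind_Master") not in (None, "NULL")
    )


# A few fields taken from the status output above.
sample = """\
*************************** 1. row ***************************
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Seconds_Behind_Master: 0
Using_Gtid: Slave_Pos
"""

print(is_healthy(parse_status(sample)))
```

A slave can pass this check and still hold different data from the master after events have been skipped, so it complements a pt-table-sync run and does not replace it.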
Finally, I need to restart the server and make sure the setup survives a restart.