Postgresql
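Post-mortem: PostgreSQL replication failure
We have a PostgreSQL 9.4.9 production server replicating to a slave instance, but today I discovered that the instance is out of sync!
The obvious course of action is to recreate the slave and set up metrics and proper alerts for replication activity, so that we can effectively monitor the sync state between master and slave.
However, since the sync has failed, I would first like to diagnose the problem and try to find its root cause, because this is the second time it has happened in roughly 6 months.
Question: how can I diagnose why replication failed, so that it can be done in a better way this time?
Version details: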
PostgreSQL 9.4.9 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit
On the slave, in /var/log/postgresql/postgresql-9.4-main.log, I can see:
    2017-07-18 19:43:55 UTC [12816-1] LOG: started streaming WAL from primary at 125D/68000000 on timeline 1
    2017-07-18 19:43:55 UTC [12816-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000125D00000068 has already been removed
    2017-07-18 19:44:00 UTC [12817-1] LOG: started streaming WAL from primary at 125D/68000000 on timeline 1
    2017-07-18 19:44:00 UTC [12817-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000125D00000068 has already been removed
    2017-07-18 19:44:05 UTC [12821-1] LOG: started streaming WAL from primary at 125D/68000000 on timeline 1
    2017-07-18 19:44:05 UTC [12821-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000125D00000068 has already been removed
    2017-07-18 19:44:10 UTC [12825-1] LOG: started streaming WAL from primary at 125D/68000000 on timeline 1
    2017-07-18 19:44:10 UTC [12825-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000125D00000068 has already been removed
    2017-07-18 19:44:15 UTC [12826-1] LOG: started streaming WAL from primary at 125D/68000000 on timeline 1
    2017-07-18 19:44:15 UTC [12826-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000125D00000068 has already been removed
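For reference, the segment name in the FATAL line can be derived from the streaming start position (125D/68000000 on timeline 1) reported in the LOG line. The following is a minimal Python sketch of that mapping, assuming the default 16 MiB WAL segment size; the function names `parse_lsn` and `wal_segment_name` are made up for illustration and are not part of PostgreSQL.

```python
# Assumption: the server uses the default WAL segment size of 16 MiB.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn: str) -> int:
    """Convert an LSN such as '125D/68000000' into a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_segment_name(timeline: int, lsn: str,
                     segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Build the 24-hex-digit WAL file name that contains the given LSN."""
    segment_number = parse_lsn(lsn) // segment_size
    # Number of segments covered by one 4 GiB "log id" (256 at 16 MiB each).
    segments_per_id = 0x100000000 // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segment_number // segments_per_id,
        segment_number % segments_per_id,
    )


print(wal_segment_name(1, "125D/68000000"))  # 000000010000125D00000068
```

The result matches the segment named in the error, so the slave is asking for exactly the file that holds its restart position, and the master no longer has it.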
New question: how can I see where the actual problem occurred?
Master postgresql.conf: https://pastebin.com/NJX5ku6m
Slave postgresql.conf: https://pastebin.com/CUZcyazC
Slave recovery.conf:
    standby_mode = on
    primary_conninfo = 'host=10.1.1.65 port=5432 user=replicador password=replicador'
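Based on this, I would say that wal_keep_segments on the primary is too low, that you are not using a replication slot, and that either hot_standby_feedback is off or the connection was broken for long enough that the primary removed the required WAL anyway. You are probably also not using WAL archiving (archive_command on the primary, restore_command on the replica) as a fallback. As a result, the primary removed transaction logs that the standby still needed.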
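To make the failure mode concrete, here is a rough Python sketch that compares a standby's lag in bytes with the amount of WAL that wal_keep_segments is set to retain. It assumes the default 16 MiB segment size; the primary position in the usage lines is a made-up value for illustration, and `lag_bytes` and `standby_at_risk` are hypothetical helper names. On 9.4 the real positions could be read with pg_current_xlog_location() on the primary and pg_last_xlog_replay_location() on the standby.

```python
# Assumption: default WAL segment size of 16 MiB.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn: str) -> int:
    """Convert an LSN such as '125D/68000000' into a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
    """Bytes of WAL the standby still has to receive or replay."""
    return parse_lsn(primary_lsn) - parse_lsn(standby_lsn)


def standby_at_risk(primary_lsn: str, standby_lsn: str,
                    wal_keep_segments: int) -> bool:
    """True when the lag exceeds what wal_keep_segments is set to retain.

    Without a replication slot or a WAL archive, the primary may recycle
    older segments once the standby falls this far behind.
    """
    retained = wal_keep_segments * WAL_SEGMENT_SIZE
    return lag_bytes(primary_lsn, standby_lsn) > retained


# Hypothetical primary position, 16 segments ahead of the standby.
print(lag_bytes("125D/78000000", "125D/68000000"))            # 268435456
print(standby_at_risk("125D/78000000", "125D/68000000", 8))   # True
print(standby_at_risk("125D/78000000", "125D/68000000", 32))  # False
```

A check along these lines could feed the alerting mentioned in the question: an alert that fires well before the lag reaches the retained amount leaves time to act before the standby has to be rebuilt.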
You will need to recreate the standby. Then either:
- set the standby to use a replication slot and enable hot_standby_feedback; or
- enable archive_command and restore_command