Postgresql

Post-mortem analysis: PostgreSQL replication failure

  • July 19, 2017

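We have a PostgreSQL 9.4.9 production server replicating to a slave instance, but today I discovered that the instance is out of sync!

The obvious course of action is to recreate the slave node and set up metrics and proper alerts for replication activity, so that we can effectively monitor the sync state between the master and the slave.

However, since the sync has failed, I would first like to diagnose the problem and try to find its root cause, because this is the second time it has happened in about 6 months.

Question: how can I diagnose why replication failed, so that it can be set up in a better way this time?

Version details: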

PostgreSQL 9.4.9 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit

On the slave node, in /var/log/postgresql/postgresql-9.4-main.log, I can see:

2017-07-18 19:43:55 UTC [12816-1] LOG:  started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:43:55 UTC [12816-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000125D00000068 has already been removed

2017-07-18 19:44:00 UTC [12817-1] LOG:  started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:00 UTC [12817-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000125D00000068 has already been removed

2017-07-18 19:44:05 UTC [12821-1] LOG:  started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:05 UTC [12821-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000125D00000068 has already been removed

2017-07-18 19:44:10 UTC [12825-1] LOG:  started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:10 UTC [12825-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000125D00000068 has already been removed

2017-07-18 19:44:15 UTC [12826-1] LOG:  started streaming WAL from primary at 125D/68000000 on timeline 1
2017-07-18 19:44:15 UTC [12826-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000125D00000068 has already been removed

New question: how can I see where the actual problem arose?

Master postgresql.conf: https://pastebin.com/NJX5ku6m

Slave postgresql.conf: https://pastebin.com/CUZcyazC

Slave recovery.conf:

standby_mode = on
primary_conninfo = 'host=10.1.1.65 port=5432 user=replicador password=replicador'

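Based on that, I'd say you don't have enough wal_keep_segments on the master, aren't using a replication slot, and either have hot_standby_feedback turned off or the connection was down long enough for the master to remove the WAL that was needed.

And you're presumably not using WAL archiving (archive_command on the master, restore_command on the replica) as a fallback.

So the master removed transaction log segments that the standby still needed.

You will need to recreate the standby. Then either:

  • set the standby up to use a replication slot and enable hot_standby_feedback; or
  • enable archive_command on the master and restore_command on the standby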
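
To see which file the standby is asking for, the segment name in the FATAL line can be derived from the start position in the LOG line just before it. The following is a minimal sketch, assuming the default 16 MB WAL segment size; the function names are made up for illustration and are not part of PostgreSQL.

```python
# Assumption: default WAL segment size (16 MB). A server built with a
# different segment size would need a different value here.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '125D/68000000' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_segment_name(timeline: int, lsn: str,
                     segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Return the 24-character WAL file name that contains the given LSN."""
    segments_per_log = 0x100000000 // segment_size
    segment_no = lsn_to_int(lsn) // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segment_no // segments_per_log,
        segment_no % segments_per_log,
    )


def lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
    """Byte distance between two LSNs, e.g. for a lag metric or alert."""
    return lsn_to_int(primary_lsn) - lsn_to_int(standby_lsn)


# Values taken from the standby log above: timeline 1, start at 125D/68000000.
print(wal_segment_name(1, "125D/68000000"))  # 000000010000125D00000068
```

The result matches the segment named in the error, so the standby is stuck at exactly the position it logs, and the master no longer has that file in pg_xlog. The same byte arithmetic is what a lag alert would be built on, comparing the master's current position with the position the standby reports.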
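
As a rough sketch of what those two options could look like on 9.4: the slot name standby1 and the archive directory /mnt/wal_archive are placeholders chosen for illustration, not values from your setup, and the primary_conninfo line is copied from your recovery.conf.

```
# --- Option 1: replication slot ---

# Master, postgresql.conf (changing this requires a restart)
max_replication_slots = 1

# Master, run once in psql (slot name is a placeholder):
#   SELECT pg_create_physical_replication_slot('standby1');

# Standby, recovery.conf
standby_mode = on
primary_conninfo = 'host=10.1.1.65 port=5432 user=replicador password=replicador'
primary_slot_name = 'standby1'

# Standby, postgresql.conf
hot_standby_feedback = on

# --- Option 2: WAL archiving as a fallback ---

# Master, postgresql.conf (changing archive_mode requires a restart)
archive_mode = on
archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'

# Standby, recovery.conf (the archive directory must be readable from the standby)
restore_command = 'cp /mnt/wal_archive/%f %p'
```

Keep in mind that a replication slot makes the master retain WAL for as long as the standby is disconnected, so pg_xlog on the master can grow until the disk fills. Whichever option you pick, monitor disk usage on the master as well as the replication lag.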

Quoted from: https://dba.stackexchange.com/questions/180062