PostgreSQL

Postgres replication from primary to standby server fails

  • January 24, 2022

I have a primary/standby Postgres cluster; for availability reasons, I want to add a new standby server.

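So I created a new server, installed Postgres, created the filesystem for the Postgres data directory, and then ran pg_basebackup to the new standby (I tried many times, several from the primary and several from the first standby; every attempt failed).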

pg_basebackup -D - -h localhost -U replicator -Ft --compress=0 --progress | pigz -p $THREADS | ssh -A postgres@$TARGETDB "pigz -dc - | tar xvf - --directory=/var/lib/pgsql/9.6/data/"

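When it finishes, I start Postgres and it fails with missing WAL and a diverging timeline, even though I'm quite sure the requested WAL and history files don't even exist on the primary or the secondary node.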

2022-01-24 11:32:00 GMT [17951]: [1-1] user=,db=,app=,client= LOG:  database system was interrupted while in recovery at log time 2022-01-24 11:10:02 GMT
2022-01-24 11:32:00 GMT [17951]: [2-1] user=,db=,app=,client= HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2022-01-24 11:32:01 GMT [17951]: [3-1] user=,db=,app=,client= LOG:  restored log file "00000007.history" from archive
ERROR: WAL file '00000008.history' not found in server '****' (SSH host: 10.154.129.90)
2022-01-24 11:32:01 GMT [17951]: [4-1] user=,db=,app=,client= LOG:  entering standby mode
2022-01-24 11:32:02 GMT [17951]: [5-1] user=,db=,app=,client= LOG:  restored log file "00000007.history" from archive
ERROR: WAL file '0000000700001C55000000F4' not found in server '****' (SSH host: 10.154.129.90)
2022-01-24 11:32:03 GMT [17951]: [6-1] user=,db=,app=,client= LOG:  restored log file "0000000600001C55000000F4" from archive
ERROR: WAL file '0000000700001C55000000F3' not found in server '****' (SSH host: 10.154.129.90)
2022-01-24 11:32:04 GMT [18134]: [1-1] user=postgres,db=postgres,app=[unknown],client=[local] FATAL:  the database system is starting up
2022-01-24 11:32:04 GMT [18145]: [1-1] user=postgres,db=postgres,app=[unknown],client=[local] FATAL:  the database system is starting up
2022-01-24 11:32:04 GMT [17951]: [7-1] user=,db=,app=,client= LOG:  restored log file "0000000600001C55000000F3" from archive
2022-01-24 11:32:04 GMT [17951]: [8-1] user=,db=,app=,client= FATAL:  requested timeline 7 is not a child of this server's history
2022-01-24 11:32:04 GMT [17951]: [9-1] user=,db=,app=,client= DETAIL:  Latest checkpoint is at 1C56/47CACF28 on timeline 6, but in the history of the requested timeline, the server forked off from that timeline at 1C3F/B7B96B90.
2022-01-24 11:32:04 GMT [17948]: [3-1] user=,db=,app=,client= LOG:  startup process (PID 17951) exited with exit code 1
2022-01-24 11:32:04 GMT [17948]: [4-1] user=,db=,app=,client= LOG:  aborting startup due to startup process failure
2022-01-24 11:32:04 GMT [17948]: [5-1] user=,db=,app=,client= LOG:  database system is shut down
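The segment names in these log lines can be decoded: with the default 16 MB segment size, a name is three 8-hex-digit fields (timeline, high 32 bits of the LSN, segment number). Decoding them shows the server asking for segments 1C55/F3 and 1C55/F4 on timeline 7, not getting them, and then restoring the same segments from timeline 6. A minimal Python sketch, assuming the 16 MB default; `WAL_SEG_SIZE` and the function names are illustrative, not PostgreSQL APIs:

```python
WAL_SEG_SIZE = 16 * 1024 * 1024  # assumed: default 16 MB WAL segments


def parse_wal_name(name: str) -> tuple[int, int, int]:
    """Split a 24-hex-digit segment name into (timeline, log_id, segment)."""
    if len(name) != 24:
        raise ValueError(f"not a WAL segment name: {name!r}")
    timeline = int(name[0:8], 16)
    log_id = int(name[8:16], 16)  # high 32 bits of the LSN
    segment = int(name[16:24], 16)  # segment index within that 4 GB "log"
    return timeline, log_id, segment


def segment_start_lsn(name: str) -> str:
    """LSN (in PostgreSQL's HI/LO hex notation) where the segment begins."""
    _, log_id, segment = parse_wal_name(name)
    return f"{log_id:X}/{segment * WAL_SEG_SIZE:X}"


# Segment names taken from the log above.
for name in ("0000000700001C55000000F4", "0000000600001C55000000F4",
             "0000000700001C55000000F3", "0000000600001C55000000F3"):
    tli, _, _ = parse_wal_name(name)
    print(f"{name}: timeline {tli}, starts at {segment_start_lsn(name)}")
```

The same byte range exists on both timelines; which one the server reads depends on the timeline it has chosen as its recovery target.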

Also, WAL is archived to a Barman server, so WAL lost during the backup shouldn't be the cause either.

recovery.conf file

standby_mode = 'on'
primary_conninfo = 'user=replicator password=C0D5wallop host=$PRIMARYSERVER port=5432 sslmode=prefer sslcompression=1'
trigger_file = '/var/lib/pgsql/9.6/boo'
recovery_target_timeline='latest'
restore_command = 'ssh -o StrictHostKeyChecking=no barman@$BARMANSERVER barman get-wal db-44 %f > %p'
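With `recovery_target_timeline='latest'`, the startup process picks the newest timeline by asking `restore_command` for successive history files (`00000007.history`, `00000008.history`, ...) until one is missing. That is what the log shows: `00000007.history` is restored from Barman, `00000008.history` is not found, and timeline 7 becomes the target. The Python sketch below is a simplified model of that probing loop, not PostgreSQL's code; `archive` is a hypothetical set of file names standing in for whatever `barman get-wal` can return.

```python
def timeline_history_name(tli: int) -> str:
    """Timeline history files are named by the timeline ID in 8 hex digits."""
    return f"{tli:08X}.history"


def find_newest_timeline(start_tli: int, archive: set[str]) -> int:
    """Probe start_tli+1, start_tli+2, ... until a history file is missing.

    `archive` is a hypothetical stand-in for what restore_command can fetch.
    """
    newest = start_tli
    probe = start_tli + 1
    while timeline_history_name(probe) in archive:
        newest = probe
        probe += 1
    return newest


# Hypothetical archive contents, including the stray history file.
archive = {"00000006.history", "00000007.history"}
print(find_newest_timeline(6, archive))  # → 7 (00000008.history is missing)
archive.discard("00000007.history")
print(find_newest_timeline(6, archive))  # → 6
```

Once the stray `00000007.history` is gone from the archive, probing stops at timeline 6, the timeline the base backup's checkpoint is actually on.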

I'm happy to provide more information. Thanks for your help.

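In the end, it turned out that some time ago I had run a **** test restore on a new database instance without removing archive_command from its postgresql.conf file, so that instance archived an empty 00000007.history timeline file.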

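So when the new server tries to fetch archived WAL from Barman, it finds the dummy 00000007.history timeline file, but Barman holds no actual xlog for that timeline, which produces the error log above.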
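The FATAL line in the log follows from an LSN comparison. The stray history file says timeline 7 forked off timeline 6 at 1C3F/B7B96B90, but the new standby's latest checkpoint is at 1C56/47CACF28 on timeline 6, past that fork point, so the base backup cannot lie on timeline 7's history. The Python sketch below parses `HI/LO` LSNs and history lines (parent timeline, switch LSN, reason) and reproduces that check in simplified form. The history content is hypothetical and trimmed to the single line implied by the log; a real file would also list earlier timelines.

```python
def parse_lsn(text: str) -> int:
    """Convert PostgreSQL's 'HI/LO' hex LSN notation to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def parse_history(text: str) -> list[tuple[int, int]]:
    """Return (parent_timeline, switch_lsn) pairs from a .history file body."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no switch points
        fields = line.split(maxsplit=2)  # parent TLI, switch LSN, reason
        entries.append((int(fields[0]), parse_lsn(fields[1])))
    return entries


def timeline_of_point(lsn: int, history: list[tuple[int, int]],
                      target_tli: int) -> int:
    """Timeline that `lsn` falls on, according to the target's history."""
    for parent_tli, switch_lsn in history:  # entries in ascending LSN order
        if lsn < switch_lsn:
            return parent_tli  # lsn precedes this fork, so it is on the parent
    return target_tli  # past every fork point: on the target timeline itself


def is_child_of_history(ckpt_lsn: int, ckpt_tli: int,
                        history: list[tuple[int, int]], target_tli: int) -> bool:
    """Simplified version of the check behind the FATAL in the log."""
    return timeline_of_point(ckpt_lsn, history, target_tli) == ckpt_tli


# Hypothetical, trimmed 00000007.history: only the line implied by the log.
stray_history = "6\t1C3F/B7B96B90\thypothetical reason\n"
history = parse_history(stray_history)
checkpoint = parse_lsn("1C56/47CACF28")  # latest checkpoint, on timeline 6
print(is_child_of_history(checkpoint, 6, history, 7))  # → False
```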

Solution

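  • Connect to the Barman server.
  • Manually move the 00000007.history file aside.
  • Manually delete the 00000007.history line from xlog.db in the Barman server's wals/ directory.
  • Restart Postgres on the secondary node.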

Tip: back up anything you change on the Barman server before you change it.
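The Barman cleanup steps can also be scripted. The Python sketch below follows the tip by copying `xlog.db` first, then moves the history file aside and rewrites `xlog.db` without its line. It rests on assumptions the original does not state: that the history file sits directly in the `wals/` directory, that each `xlog.db` line begins with the file name followed by a tab, and that nothing else writes `xlog.db` while it runs. `drop_history_entry` is a hypothetical helper, not a Barman command.

```python
import shutil
from pathlib import Path


def drop_history_entry(wals_dir: Path, history_name: str,
                       backup_dir: Path) -> int:
    """Back up xlog.db, move the history file aside, and drop its xlog.db line.

    Assumes (unconfirmed) that the history file lives directly in `wals_dir`
    and that each xlog.db line starts with a file name followed by a tab.
    Returns the number of xlog.db lines removed.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    xlogdb = wals_dir / "xlog.db"
    # Back up everything this function changes before changing it.
    shutil.copy2(xlogdb, backup_dir / "xlog.db.bak")

    history_file = wals_dir / history_name
    if history_file.exists():
        shutil.move(str(history_file), str(backup_dir / history_name))

    lines = xlogdb.read_text().splitlines(keepends=True)
    kept = [ln for ln in lines if ln.split("\t", 1)[0] != history_name]
    xlogdb.write_text("".join(kept))
    return len(lines) - len(kept)
```

A call would look like `drop_history_entry(Path("/path/to/db-44/wals"), "00000007.history", Path("/path/to/backup"))`, where both paths are placeholders to check against the actual Barman layout; after it returns, Postgres on the secondary node is restarted as in the last step.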

Source: https://dba.stackexchange.com/questions/306490