Postgresql

postgres:WAL 在線上備份結束之前結束

  • November 22, 2020

我們正在執行 Postgres 9.6,大小為 10+TB。已使用自製工具“pgrsync”進行備份,該工具使用 S3 作為儲存庫。備份文件和 WAL 存檔都儲存在 S3 上。

問題:在嘗試還原時,WAL 存檔還原對於某些備份隨機失敗。

在這種情況下,備份開始位置是00000002000544C60000006B,停止位置是00000002000545210000008D,(基於 pg_stop_backup() 輸出)但它在 005493D 之間停止並終止還原。如果我再次重做恢復,它會在同一點完全停止。更多備份的類似結果,而其他備份成功恢復。

這可能表明某些特定的 WAL 文件在備份/恢復過程中損壞了。這是正確的解釋嗎?

問題:

  1. 有什麼方法可以辨識 WAL 文件是否損壞?
  2. 有沒有辦法在不失去數據的情況下繼續前進?(我對使用很謹慎pg_resetxlog

首先是備份文件(來自 WAL 位置00000002000544C60000006B.0001E9A0.backup

START WAL LOCATION: 544C6/6B01E9A0 (file 00000002000544C60000006B)
STOP WAL LOCATION: 54521/8D235490 (file 00000002000545210000008D)
CHECKPOINT LOCATION: 544C6/BA84E0A8
BACKUP METHOD: streamed
BACKUP FROM: master
START TIME: 2020-11-04 03:24:14 UTC
LABEL: inc04nov
STOP TIME: 2020-11-04 09:21:38 UTC

然後是日誌文件:

2020-11-11 06:38:08 UTC [21731]: [22299-1] user=,db=LOG:  restored log file "000000020005451D00000080" from archive
2020-11-11 06:38:08 UTC [21731]: [22300-1] user=,db=LOG:  restored log file "000000020005451D00000081" from archive
2020-11-11 06:38:08 UTC [21731]: [22301-1] user=,db=LOG:  restored log file "000000020005451D00000082" from archive
2020-11-11 06:38:08 UTC [21731]: [22302-1] user=,db=LOG:  restored log file "000000020005451D00000083" from archive
2020-11-11 06:38:08 UTC [21731]: [22303-1] user=,db=LOG:  restored log file "000000020005451D00000084" from archive
2020-11-11 06:38:08 UTC [21731]: [22304-1] user=,db=LOG:  restored log file "000000020005451D00000085" from archive
2020-11-11 06:38:08 UTC [21731]: [22305-1] user=,db=LOG:  restored log file "000000020005451D00000086" from archive
2020-11-11 06:38:08 UTC [21731]: [22306-1] user=,db=LOG:  restored log file "000000020005451D00000087" from archive
2020-11-11 06:38:08 UTC [21731]: [22307-1] user=,db=LOG:  restored log file "000000020005451D00000088" from archive
2020-11-11 06:38:08 UTC [21731]: [22308-1] user=,db=LOG:  restored log file "000000020005451D00000089" from archive
2020-11-11 06:38:08 UTC [21731]: [22309-1] user=,db=LOG:  restored log file "000000020005451D0000008A" from archive
2020-11-11 06:38:08 UTC [21731]: [22310-1] user=,db=LOG:  restored log file "000000020005451D0000008B" from archive
2020-11-11 06:38:08 UTC [21731]: [22311-1] user=,db=LOG:  redo done at 5451D/8AFFE500
2020-11-11 06:38:08 UTC [21731]: [22312-1] user=,db=LOG:  last completed transaction was at log time 2020-11-04 09:10:42.219935+00
2020-11-11 06:38:11 UTC [21731]: [22314-1] user=,db=FATAL:  WAL ends before end of online backup
2020-11-11 06:38:11 UTC [21731]: [22315-1] user=,db=HINT:  All WAL generated while online backup was taken must be available at recovery.
2020-11-11 06:38:13 UTC [21728]: [3-1] user=,db=LOG:  startup process (PID 21731) exited with exit code 1
2020-11-11 06:38:13 UTC [21728]: [4-1] user=,db=LOG:  terminating any other active server processes
2020-11-11 06:38:16 UTC [4559]: [1-1] user=postgres,db=postgresFATAL:  the database system is in recovery mode
2020-11-11 06:38:16 UTC [4561]: [1-1] user=postgres,db=postgresFATAL:  the database system is in recovery mode
2020-11-11 06:38:16 UTC [4576]: [1-1] user=postgres,db=postgresFATAL:  the database system is in recovery mode
2020-11-11 06:38:25 UTC [21728]: [5-1] user=,db=LOG:  database system is shut down

看起來 WAL 文件可能已損壞。

這是恢復正常的正確 WAL 文件:

-bash-4.2$ /usr/pgsql-9.6/bin/pg_xlogdump 000000020005451D0000008A | head -2
rmgr: Heap        len (rec/tot):    151/   151, tx: 3501354263, lsn: 5451D/8A0001D8, prev 5451D/89FFE1E8, desc: INSERT off 15, blkref #0: rel 3435996123/765803221/4171942326 blk 15806513
rmgr: Btree       len (rec/tot):     72/    72, tx: 3501354263, lsn: 5451D/8A000270, prev 5451D/8A0001D8, desc: INSERT_LEAF off 2, blkref #0: rel 3435996123/765803221/4171944289 blk 3881149

這是實際停止歸檔恢復處理的 WAL 文件:

-bash-4.2$ /usr/pgsql-9.6/bin/pg_xlogdump 000000020005451D0000008B
pg_xlogdump: FATAL:  could not find a valid record after 5451D/8B000000

對此的唯一解釋是恢復過程從未見過BACKUP_ENDWAL 條目,也就是說,它從未讀取包含pg_stop_backup呼叫效果的 WAL 段。

現在您令人信服地爭辯說您已經執行了該函式,否則您將不會backup_label在非獨占備份中擁有由該函式生成的文件。

存檔恢復不允許您在恢復期間跳過 WAL 段,因此恢復不可能跳過該段。

這留下了一些解釋:

  1. 您使用了backup_label其他備份中的文件,因為有些東西混淆了。
  2. 您從不包含BACKUP_END該條目的不同集群中恢復了具有相同名稱的 WAL 段。
  3. 你把時間線弄混了,在備份期間有一個時間線切換,所以BACKUP_END實際上是00000003000545210000008D這樣。

(我不確定這是否可能,或者時間線切換是否會破壞線上備份;我沒有測試。)

如果一切都像您期望的那樣,則00000002000545210000008D必須包含一個BACKUP_END條目。驗證與

pg_waldump 00000002000545210000008D | grep BACKUP_END

處理此條目後,PostgreSQL 將發出日誌行

consistent recovery state reached

引用自:https://dba.stackexchange.com/questions/279578