Postgresql
postgres:WAL 在線上備份結束之前結束
我們正在執行 Postgres 9.6,大小為 10+TB。已使用自製工具“pgrsync”進行備份,該工具使用 S3 作為儲存庫。備份文件和 WAL 存檔都儲存在 S3 上。
問題:在嘗試還原時,WAL 存檔還原對於某些備份隨機失敗。
在這種情況下,備份開始位置是
00000002000544C60000006B
,停止位置是00000002000545210000008D
,(基於 pg_stop_backup() 輸出)但它在 005493D 之間停止並終止還原。如果我再次重做恢復,它會在同一點完全停止。更多備份的類似結果,而其他備份成功恢復。這可能表明某些特定的 WAL 文件在備份/恢復過程中損壞了。這是正確的解釋嗎?
問題:
- 有什麼方法可以辨識 WAL 文件是否損壞?
- 有沒有辦法在不失去數據的情況下繼續前進?(我對使用很謹慎
pg_resetxlog
)首先是備份文件(來自 WAL 位置
00000002000544C60000006B.0001E9A0.backup
)START WAL LOCATION: 544C6/6B01E9A0 (file 00000002000544C60000006B) STOP WAL LOCATION: 54521/8D235490 (file 00000002000545210000008D) CHECKPOINT LOCATION: 544C6/BA84E0A8 BACKUP METHOD: streamed BACKUP FROM: master START TIME: 2020-11-04 03:24:14 UTC LABEL: inc04nov STOP TIME: 2020-11-04 09:21:38 UTC
然後是日誌文件:
2020-11-11 06:38:08 UTC [21731]: [22299-1] user=,db=LOG: restored log file "000000020005451D00000080" from archive 2020-11-11 06:38:08 UTC [21731]: [22300-1] user=,db=LOG: restored log file "000000020005451D00000081" from archive 2020-11-11 06:38:08 UTC [21731]: [22301-1] user=,db=LOG: restored log file "000000020005451D00000082" from archive 2020-11-11 06:38:08 UTC [21731]: [22302-1] user=,db=LOG: restored log file "000000020005451D00000083" from archive 2020-11-11 06:38:08 UTC [21731]: [22303-1] user=,db=LOG: restored log file "000000020005451D00000084" from archive 2020-11-11 06:38:08 UTC [21731]: [22304-1] user=,db=LOG: restored log file "000000020005451D00000085" from archive 2020-11-11 06:38:08 UTC [21731]: [22305-1] user=,db=LOG: restored log file "000000020005451D00000086" from archive 2020-11-11 06:38:08 UTC [21731]: [22306-1] user=,db=LOG: restored log file "000000020005451D00000087" from archive 2020-11-11 06:38:08 UTC [21731]: [22307-1] user=,db=LOG: restored log file "000000020005451D00000088" from archive 2020-11-11 06:38:08 UTC [21731]: [22308-1] user=,db=LOG: restored log file "000000020005451D00000089" from archive 2020-11-11 06:38:08 UTC [21731]: [22309-1] user=,db=LOG: restored log file "000000020005451D0000008A" from archive 2020-11-11 06:38:08 UTC [21731]: [22310-1] user=,db=LOG: restored log file "000000020005451D0000008B" from archive 2020-11-11 06:38:08 UTC [21731]: [22311-1] user=,db=LOG: redo done at 5451D/8AFFE500 2020-11-11 06:38:08 UTC [21731]: [22312-1] user=,db=LOG: last completed transaction was at log time 2020-11-04 09:10:42.219935+00 2020-11-11 06:38:11 UTC [21731]: [22314-1] user=,db=FATAL: WAL ends before end of online backup 2020-11-11 06:38:11 UTC [21731]: [22315-1] user=,db=HINT: All WAL generated while online backup was taken must be available at recovery. 2020-11-11 06:38:13 UTC [21728]: [3-1] user=,db=LOG: startup process (PID 21731) exited with exit code 1 2020-11-11 06:38:13 UTC [21728]: [4-1] user=,db=LOG: terminating any other active server processes 2020-11-11 06:38:16 UTC [4559]: [1-1] user=postgres,db=postgresFATAL: the database system is in recovery mode 2020-11-11 06:38:16 UTC [4561]: [1-1] user=postgres,db=postgresFATAL: the database system is in recovery mode 2020-11-11 06:38:16 UTC [4576]: [1-1] user=postgres,db=postgresFATAL: the database system is in recovery mode 2020-11-11 06:38:25 UTC [21728]: [5-1] user=,db=LOG: database system is shut down
看起來 WAL 文件可能已損壞。
這是恢復正常的正確 WAL 文件:
-bash-4.2$ /usr/pgsql-9.6/bin/pg_xlogdump 000000020005451D0000008A | head -2 rmgr: Heap len (rec/tot): 151/ 151, tx: 3501354263, lsn: 5451D/8A0001D8, prev 5451D/89FFE1E8, desc: INSERT off 15, blkref #0: rel 3435996123/765803221/4171942326 blk 15806513 rmgr: Btree len (rec/tot): 72/ 72, tx: 3501354263, lsn: 5451D/8A000270, prev 5451D/8A0001D8, desc: INSERT_LEAF off 2, blkref #0: rel 3435996123/765803221/4171944289 blk 3881149
這是實際停止歸檔恢復處理的 WAL 文件:
-bash-4.2$ /usr/pgsql-9.6/bin/pg_xlogdump 000000020005451D0000008B pg_xlogdump: FATAL: could not find a valid record after 5451D/8B000000
對此的唯一解釋是恢復過程從未見過
BACKUP_END
WAL 條目,也就是說,它從未讀取包含pg_stop_backup
呼叫效果的 WAL 段。現在您令人信服地爭辯說您已經執行了該函式,否則您將不會
backup_label
在非獨占備份中擁有由該函式生成的文件。存檔恢復不允許您在恢復期間跳過 WAL 段,因此恢復不可能跳過該段。
這留下了一些解釋:
- 您使用了
backup_label
其他備份中的文件,因為有些東西混淆了。- 您從不包含
BACKUP_END
該條目的不同集群中恢復了具有相同名稱的 WAL 段。- 你把時間線弄混了,在備份期間有一個時間線切換,所以
BACKUP_END
實際上是00000003000545210000008D
這樣。(我不確定這是否可能,或者時間線切換是否會破壞線上備份;我沒有測試。)
如果一切都像您期望的那樣,則
00000002000545210000008D
必須包含一個BACKUP_END
條目。驗證與pg_waldump 00000002000545210000008D | grep BACKUP_END
處理此條目後,PostgreSQL 將發出日誌行
consistent recovery state reached