Postgres cluster: a crash of the master also crashes the replicas

December 16, 2020

Environment: Postgres version 9.6; a cluster of 3 servers with Patroni and etcd.

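Scenario: When table indexing was started with 16 parallel requests (on a 16-CPU machine), postgres on the master crashed because of the Linux OOM killer. This is a 124 GB machine. We know that spawning that many parallel requests needs more memory, and we have already fixed that.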

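Problem: What is worrying, however, is that when the master crashed due to OOM, all the replicas crashed as well. This was unexpected and calls the high availability of the cluster into question. We can simulate this easily, and the replicas behave exactly the same way every time.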

Logs on the master at the time of the crash:

2020-12-16 09:54:44 UTC [11619]: [9-1] user=,db=LOG:  checkpointer process (PID 30834) was terminated by signal 9: Killed
2020-12-16 09:54:44 UTC [11619]: [10-1] user=,db=LOG:  terminating any other active server processes
2020-12-16 09:54:44 UTC [30838]: [1-1] user=,db=FATAL:  archive command was terminated by signal 3: Quit
2020-12-16 09:54:44 UTC [16870]: [1-1] user=postgres,db=mydbWARNING:  terminating connection because of crash of another server process
2020-12-16 09:54:44 UTC [16870]: [2-1] user=postgres,db=mydbDETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-12-16 09:54:44 UTC [16870]: [3-1] user=postgres,db=mydbHINT:  In a moment you should be able to reconnect to the database and repeat your command.
...
2020-12-16 09:54:59 UTC [24609]: [1-1] user=postgres,db=mydbFATAL:  the database system is in recovery mode
...
2020-12-16 09:55:04 UTC [22780]: [4-1] user=,db=LOG:  redo done at 52712/BAFFD3C0
2020-12-16 09:55:07 UTC [11619]: [13-1] user=,db=LOG:  database system is ready to accept connections

Logs on the replica at the time of the crash:

WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-12-16 09:54:44 UTC [13293]: [2-1] user=,db=FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
               This probably means the server terminated abnormally
               before or while processing the request.
...
2020-12-16 09:54:55 UTC [12843]: [19675-1] user=,db=LOG:  restored log file "00000007000526C70000005D" from archive
...
2020-12-16 11:19:18 UTC [12843]: [38972-1] user=,db=LOG:  restored log file "0000000700052712000000C1" from archive
2020-12-16 11:19:20 UTC [18104]: [1-1] user=,db=LOG:  started streaming WAL from primary at 52712/C2000000 on timeline 7

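Question: If postgres on the master crashes (due to OOM or corrupted shared memory), must that necessarily cause a similar crash on the replicas as well? Is there a way to work around this?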

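Answer: That is harmless.
The message is sent by the server to its clients when there is a crash and the backends are about to die (quickdie in postgres.c), so that crash recovery can begin.
What you see is just the message that the primary sent to the WAL receiver. Note that it is a WARNING, not even an error. It only gets logged because log_min_messages on the standby is set to warning or lower.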
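One way to check this for yourself is to scan each server's log for the lines the postmaster writes when one of its own child processes dies. The sketch below is only an illustration: the function name has_local_crash and the choice of patterns are assumptions based on the log excerpts in the question, not an official list of crash messages.

```python
import re

# Patterns taken from the master's log in the question. Treating them as the
# marker of a crash on the local server is an assumption of this sketch.
LOCAL_CRASH_PATTERNS = (
    re.compile(r"LOG:\s+.* was terminated by signal \d+"),
    re.compile(r"LOG:\s+terminating any other active server processes"),
)


def has_local_crash(log_lines):
    """Return True if any line looks like a postmaster crash report."""
    return any(
        pattern.search(line)
        for line in log_lines
        for pattern in LOCAL_CRASH_PATTERNS
    )


# Lines copied from the logs in the question.
master_log = [
    "2020-12-16 09:54:44 UTC [11619]: [9-1] user=,db=LOG:  checkpointer process (PID 30834) was terminated by signal 9: Killed",
    "2020-12-16 09:54:44 UTC [11619]: [10-1] user=,db=LOG:  terminating any other active server processes",
]
replica_log = [
    "WARNING:  terminating connection because of crash of another server process",
    "2020-12-16 09:54:44 UTC [13293]: [2-1] user=,db=FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly",
    '2020-12-16 09:54:55 UTC [12843]: [19675-1] user=,db=LOG:  restored log file "00000007000526C70000005D" from archive',
]

print(has_local_crash(master_log))   # True
print(has_local_crash(replica_log))  # False
```

On the replica, the only trace is the relayed WARNING and the lost WAL stream. No local process is reported as killed.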
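The standby keeps running: as you can see, it catches up from the archive while the primary is recovering. Once it has read everything in the archive and the primary is up again, it will reconnect and continue streaming.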
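The replica log in the question lets you put rough numbers on that catch-up phase. The following is a minimal sketch: the helper names parse_ts and segment_number are made up for illustration, and the segment arithmetic assumes the default 16 MB WAL segment size (256 segments per log ID), which the question does not state.

```python
from datetime import datetime


def parse_ts(line):
    # With the log_line_prefix used in the question, the first 19 characters
    # of a line are "YYYY-MM-DD HH:MM:SS".
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def segment_number(wal_file_name):
    # A WAL file name is 24 hex digits: timeline, log ID, segment (8 each).
    # Assumption: default 16 MB segments, i.e. 256 segments per log ID.
    log_id = int(wal_file_name[8:16], 16)
    segment = int(wal_file_name[16:24], 16)
    return log_id * 256 + segment


# Lines and file names copied from the replica log in the question.
stream_lost = "2020-12-16 09:54:44 UTC [13293]: [2-1] user=,db=FATAL:  could not receive data from WAL stream"
stream_resumed = "2020-12-16 11:19:20 UTC [18104]: [1-1] user=,db=LOG:  started streaming WAL from primary at 52712/C2000000 on timeline 7"

gap = parse_ts(stream_resumed) - parse_ts(stream_lost)
print(gap)  # 1:24:36

first_restored = "00000007000526C70000005D"
last_restored = "0000000700052712000000C1"
segments_apart = segment_number(last_restored) - segment_number(first_restored)
print(segments_apart)  # 19300
```

So about an hour and 25 minutes passed between losing the WAL stream and streaming again. Under the 16 MB assumption, the two restored file names shown are 19300 segments apart.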

Quoted from: https://dba.stackexchange.com/questions/281656
