PostgreSQL

Postgres cluster: a crash on the master also crashes the replicas

  • December 16, 2020

Environment: Postgres version 9.6, a 3-server cluster with Patroni and etcd.

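Scenario: When we started table indexing with 16 parallel requests (on a 16-CPU machine), Postgres on the master server crashed because of the Linux OOM killer. The machine has 124 GB of memory. We know that spawning that many parallel requests needs more memory, and we have already resolved that part.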

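Problem: What worries us is that when the master crashed due to OOM, all the replicas crashed as well. This was unexpected and calls the high availability of the cluster into question. We can easily reproduce it, and the replicas behave exactly the same way every time.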

Logs on the master at the time of the crash:

2020-12-16 09:54:44 UTC [11619]: [9-1] user=,db=LOG:  checkpointer process (PID 30834) was terminated by signal 9: Killed
2020-12-16 09:54:44 UTC [11619]: [10-1] user=,db=LOG:  terminating any other active server processes
2020-12-16 09:54:44 UTC [30838]: [1-1] user=,db=FATAL:  archive command was terminated by signal 3: Quit
2020-12-16 09:54:44 UTC [16870]: [1-1] user=postgres,db=mydbWARNING:  terminating connection because of crash of another server process
2020-12-16 09:54:44 UTC [16870]: [2-1] user=postgres,db=mydbDETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-12-16 09:54:44 UTC [16870]: [3-1] user=postgres,db=mydbHINT:  In a moment you should be able to reconnect to the database and repeat your command.
...
2020-12-16 09:54:59 UTC [24609]: [1-1] user=postgres,db=mydbFATAL:  the database system is in recovery mode
...
2020-12-16 09:55:04 UTC [22780]: [4-1] user=,db=LOG:  redo done at 52712/BAFFD3C0
2020-12-16 09:55:07 UTC [11619]: [13-1] user=,db=LOG:  database system is ready to accept connections

Logs on the replica at the time of the crash:

WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-12-16 09:54:44 UTC [13293]: [2-1] user=,db=FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
               This probably means the server terminated abnormally
               before or while processing the request.
...
2020-12-16 09:54:55 UTC [12843]: [19675-1] user=,db=LOG:  restored log file "00000007000526C70000005D" from archive
...
2020-12-16 11:19:18 UTC [12843]: [38972-1] user=,db=LOG:  restored log file "0000000700052712000000C1" from archive
2020-12-16 11:19:20 UTC [18104]: [1-1] user=,db=LOG:  started streaming WAL from primary at 52712/C2000000 on timeline 7

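Question: If Postgres on the master crashes (due to OOM or corrupted shared memory), does that necessarily cause a similar crash on the replicas? Is there a way around this?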

This is harmless.

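That message is sent by the server to its clients when a process has crashed and the backend is about to die (quickdie in postgres.c), so that crash recovery can start.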

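You are only seeing the message that the primary sent to the WAL receiver. Note that it is a WARNING; it is not even an error. It only gets logged because log_min_messages on the standby is set to warning or lower.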
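The threshold rule mentioned here can be sketched in a few lines of Python. The level order is the one the PostgreSQL documentation lists for log_min_messages; the helper name is_logged is made up for illustration and is not part of PostgreSQL. It only illustrates the comparison, not how the server implements it.

```python
# Severity order for log_min_messages, least to most severe, as listed in
# the PostgreSQL documentation. Note that LOG ranks above ERROR here.
LEVELS = ["DEBUG5", "DEBUG4", "DEBUG3", "DEBUG2", "DEBUG1",
          "INFO", "NOTICE", "WARNING", "ERROR", "LOG", "FATAL", "PANIC"]


def is_logged(message_level: str, log_min_messages: str) -> bool:
    """Hypothetical helper: does a message of this level pass the threshold?"""
    return LEVELS.index(message_level.upper()) >= LEVELS.index(log_min_messages.upper())


print(is_logged("WARNING", "warning"))  # True: a WARNING passes a 'warning' threshold
print(is_logged("WARNING", "error"))    # False: a WARNING is below an 'error' threshold
print(is_logged("FATAL", "error"))      # True: FATAL messages still pass
```

With a threshold of warning (the default) or anything lower, WARNING messages pass, which matches the explanation above.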

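The standby kept running. As you can see, it caught up from the archive while the primary was recovering. As soon as it had read all the archived files and the primary was up again, it reconnected and continued streaming.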
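To check a standby log for the same pattern, a rough Python sketch can separate signs of a local crash from the sequence described above. The marker strings are taken from the log excerpts in the question; the function name summarize_standby_log and the keys it returns are made up for illustration, and real logs may use different wording or a different locale.

```python
import re

# Lines the postmaster writes when one of its OWN child processes died
# (both appear in the master log in the question, not in the replica log).
LOCAL_CRASH = re.compile(
    r"was terminated by signal \d+|terminating any other active server processes"
)
STREAM_LOST = "could not receive data from WAL stream"
ARCHIVE_RESTORE = re.compile(r'restored log file "([0-9A-F]{24})" from archive')
STREAM_RESUMED = "started streaming WAL from primary"


def summarize_standby_log(lines):
    """Hypothetical helper: collect the events of interest from standby log lines."""
    summary = {"local_crash": False, "stream_lost": 0,
               "segments_restored": [], "stream_resumed": False}
    for line in lines:
        if LOCAL_CRASH.search(line):
            summary["local_crash"] = True
        if STREAM_LOST in line:
            summary["stream_lost"] += 1
        match = ARCHIVE_RESTORE.search(line)
        if match:
            summary["segments_restored"].append(match.group(1))
        if STREAM_RESUMED in line:
            summary["stream_resumed"] = True
    return summary


# Sample input: the replica log lines quoted in the question.
replica_log = [
    "WARNING:  terminating connection because of crash of another server process",
    "2020-12-16 09:54:44 UTC [13293]: [2-1] user=,db=FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly",
    '2020-12-16 09:54:55 UTC [12843]: [19675-1] user=,db=LOG:  restored log file "00000007000526C70000005D" from archive',
    '2020-12-16 11:19:18 UTC [12843]: [38972-1] user=,db=LOG:  restored log file "0000000700052712000000C1" from archive',
    "2020-12-16 11:19:20 UTC [18104]: [1-1] user=,db=LOG:  started streaming WAL from primary at 52712/C2000000 on timeline 7",
]
print(summarize_standby_log(replica_log))
```

For the lines quoted in the question, the summary shows no local crash marker, one lost WAL stream, the two restored segments that appear in the excerpt, and a resumed stream. A log from a server that really crashed would instead contain the postmaster lines seen in the master log.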

Source: https://dba.stackexchange.com/questions/281656