PostgreSQL
Postgres cluster: a crash of the master also makes the replicas crash
Environment: Postgres version 9.6, a cluster of 3 servers with Patroni and etcd.
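Scenario: When we kicked off table indexing with 16 parallel requests (on a 16-CPU machine), postgres on the master crashed because of the Linux OOM killer. This is a 124 GB machine. We understand that spawning that many parallel requests needs more memory, and we have already addressed that.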
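Problem: What worries us, however, is that when the master crashed due to OOM, all the replicas crashed as well. This was unexpected and calls the high availability of the cluster into question. We can reproduce this easily, and the replicas behave exactly the same way every time.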
Logs on the master at the time of the crash:
2020-12-16 09:54:44 UTC [11619]: [9-1] user=,db=LOG: checkpointer process (PID 30834) was terminated by signal 9: Killed
2020-12-16 09:54:44 UTC [11619]: [10-1] user=,db=LOG: terminating any other active server processes
2020-12-16 09:54:44 UTC [30838]: [1-1] user=,db=FATAL: archive command was terminated by signal 3: Quit
2020-12-16 09:54:44 UTC [16870]: [1-1] user=postgres,db=mydbWARNING: terminating connection because of crash of another server process
2020-12-16 09:54:44 UTC [16870]: [2-1] user=postgres,db=mydbDETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-12-16 09:54:44 UTC [16870]: [3-1] user=postgres,d=mydbHINT: In a moment you should be able to reconnect to the database and repeat your command.
...
2020-12-16 09:54:59 UTC [24609]: [1-1] user=postgres,db=mydbFATAL: the database system is in recovery mode
...
2020-12-16 09:55:04 UTC [22780]: [4-1] user=,db=LOG: redo done at 52712/BAFFD3C0
2020-12-16 09:55:07 UTC [11619]: [13-1] user=,db=LOG: database system is ready to accept connections
Logs on the replica at the time of the crash:
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-12-16 09:54:44 UTC [13293]: [2-1] user=,db=FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
...
2020-12-16 09:54:55 UTC [12843]: [19675-1] user=,db=LOG: restored log file "00000007000526C70000005D" from archive
...
2020-12-16 11:19:18 UTC [12843]: [38972-1] user=,db=LOG: restored log file "0000000700052712000000C1" from archive
2020-12-16 11:19:20 UTC [18104]: [1-1] user=,db=LOG: started streaming WAL from primary at 52712/C2000000 on timeline 7
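Question: Does a crash of postgres on the master (due to OOM / corrupted shared memory) necessarily cause a similar crash on the replicas as well? Is there a way to get around this?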
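That is harmless.

The server sends this message to its clients when a process has crashed and the backend is about to die (quickdie in postgres.c) so that crash recovery can begin. What you see on the replica is simply the message that the primary sent to the WAL receiver, which is a client of the primary. Note the WARNING level: this is not even an error. It is only logged because log_min_messages on the standby is set to warning or lower.

The standby server keeps running. As you can see in its log, it catches up from the archive while the primary recovers. Once it has read everything in the archive and the primary is up again, it reconnects and resumes streaming.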
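To make the distinction concrete, here is a minimal sketch of how one might tell "this server crashed" apart from "this server only lost its WAL stream" by scanning a log excerpt. The function name classify_log and the two marker lists are hypothetical, chosen only for this illustration; the marker strings themselves are taken from the logs quoted in the question. It is a heuristic, not an exhaustive list of PostgreSQL crash messages.

```python
# Messages the postmaster writes when one of its OWN child processes died.
# Both strings appear in the master log quoted in the question.
LOCAL_CRASH_MARKERS = (
    "was terminated by signal",
    "terminating any other active server processes",
)

# Messages that only show the connection to the primary was interrupted.
# Both strings appear in the replica log quoted in the question.
STREAM_LOSS_MARKERS = (
    "terminating connection because of crash of another server process",
    "could not receive data from WAL stream",
)


def classify_log(lines):
    """Return 'local crash', 'lost WAL stream', or 'no crash evidence'.

    Hypothetical helper: a local crash takes priority, because a crashing
    server also logs the 'terminating connection' warning for its backends.
    """
    text = "\n".join(lines)
    if any(marker in text for marker in LOCAL_CRASH_MARKERS):
        return "local crash"
    if any(marker in text for marker in STREAM_LOSS_MARKERS):
        return "lost WAL stream"
    return "no crash evidence"


# Shortened excerpts of the two logs from the question.
master_log = [
    "LOG: checkpointer process (PID 30834) was terminated by signal 9: Killed",
    "LOG: terminating any other active server processes",
    "WARNING: terminating connection because of crash of another server process",
]
replica_log = [
    "WARNING: terminating connection because of crash of another server process",
    "FATAL: could not receive data from WAL stream: server closed the connection unexpectedly",
    'LOG: restored log file "00000007000526C70000005D" from archive',
]

print(classify_log(master_log))
print(classify_log(replica_log))
```

On the excerpts above, the master is classified as a local crash, while the replica only shows a lost WAL stream: its own postmaster never reports a terminated child process.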