Postgresql

PostgreSQL slave 在 master 失敗後失敗;無法推廣

  • October 23, 2017

剛剛在PostgreSQL 9.6的暖備架構上遇到了一個問題:我們不小心把master停掉了,把它的數據銷毀了,所以,幾十分鐘後,當我們注意到這個問題時,我們嘗試提升standby server。master失敗後就停止了,我們無法啟動它,所以我們也沒有推廣它。我不明白為什麼,但我認為我在復製過程中遺漏了一些可以解釋這一點的東西。主故障時的日誌如下:

2017-10-17 10:51:39.182 CEST,,,32236,,59dcdafa.7dec,2,,2017-10-10    16:36:42 CEST,,0,FATAL,XX000,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,""
2017-10-17 10:51:39.540 CEST,,,32142,,59dcdaf4.7d8e,13,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"invalid resource manager ID 32 at 28/B0163E8",,,,,,,,,""
2017-10-17 10:51:39.642 CEST,,,8532,,59e5c49b.2154,1,,2017-10-17 10:51:39 CEST,,0,LOG,00000,"started streaming WAL from primary at 28/B000000 on timeline 1",,,,,,,,,""
2017-10-17 10:51:53.760 CEST,,,8532,,59e5c49b.2154,2,,2017-10-17 10:51:39 CEST,,0,LOG,00000,"replication terminated by primary server","End of WAL reached on timeline 1 at 28/C000098",,,,,,,,""
2017-10-17 10:51:53.760 CEST,,,8532,,59e5c49b.2154,3,,2017-10-17 10:51:39 CEST,,0,FATAL,XX000,"could not send end-of-streaming message to primary: no COPY in progress",,,,,,,,,""
2017-10-17 10:51:54.088 CEST,,,32142,,59dcdaf4.7d8e,14,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"record with incorrect prev-link 3F136/36 at 28/C000098",,,,,,,,,""
2017-10-17 10:51:54.113 CEST,,,8607,,59e5c4aa.219f,1,,2017-10-17 10:51:54 CEST,,0,FATAL,XX000,"could not connect to the primary server: 
could not connect to server: Connection refused
       Is the server running on host « xxxxxx » (2001:41d0:xxxx:xxxx::1) and accepting
       TCP/IP connections on port 5433?
could not connect to server: Connection refused
       Is the server running on host « xxxxxx » (137.xx.xx.xx) and
accepting
       TCP/IP connections on port 5433?",,,,,,,,,""
2017-10-17 10:51:59.133 CEST,,,8610,,59e5c4af.21a2,1,,2017-10-17 10:51:59 CEST,,0,FATAL,XX000,"could not connect to the primary server:
could not connect to server: Connection refused
       Is the server running on host « xxxxxx » (2001:41d0:xxxx:xxxx::1) and accepting
       TCP/IP connections on port 5433 ?
could not connect to server: Connection refused
       Is the server running on host « xxxxxx » (137.xx.xx.xx) and accepting
       TCP/IP connections on port 5433?",,,,,,,,,""
2017-10-17 10:52:03.969 CEST,,,32142,,59dcdaf4.7d8e,15,,2017-10-10 16:36:36 CEST,,0,FATAL,XX000,"could not restore file « 00000001000000280000000C » from archive: child process exited with exit code 255",,,,,,,,,""
2017-10-17 10:52:03.977 CEST,,,32139,,59dcdaf4.7d8b,2,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"startup process (PID 32142) exited with exit code 1",,,,,,,,,""
2017-10-17 10:52:03.977 CEST,,,32139,,59dcdaf4.7d8b,3,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,""
2017-10-17 10:52:03.990 CEST,,,32139,,59dcdaf4.7d8b,4,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"database system is shut down",,,,,,,,,""

據我了解這些行,備用伺服器將 WAL 流失去到主伺服器並嘗試重新連接;失敗時,它試圖檢索最後一個存檔,再次失敗,停止。第一個問題:在這種情況下,我認為待機會保持線上,等待指令,而不是自行關閉。我錯了嗎?如果沒有,我錯過了什麼,導致備用伺服器自行關閉?

然後,當我們嘗試重新啟動從站以提升它時,我們得到了以下幾行:

2017-10-17 10:51:39.182 CEST,,,32236,,59dcdafa.7dec,2,,2017-10-10 16:36:42 CEST,,0,FATAL,XX000,"could not receive data from WAL stream: SSL SYSCALL error: EOF detected",,,,,,,,,""
2017-10-17 10:51:39.540 CEST,,,32142,,59dcdaf4.7d8e,13,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"invalid resource manager ID 32 at 28/B0163E8",,,,,,,,,""
2017-10-17 10:51:39.642 CEST,,,8532,,59e5c49b.2154,1,,2017-10-17 10:51:39 CEST,,0,LOG,00000,"started streaming WAL from primary at 28/B000000 on timeline 1",,,,,,,,,""
2017-10-17 10:51:53.760 CEST,,,8532,,59e5c49b.2154,2,,2017-10-17 10:51:39 CEST,,0,LOG,00000,"replication terminated by primary server","End of WAL reached on timeline 1 at 28/C000098",,,,,,,,""
2017-10-17 10:51:53.760 CEST,,,8532,,59e5c49b.2154,3,,2017-10-17 10:51:39 CEST,,0,FATAL,XX000,"could not send end-of-streaming message to primary: no COPY in progress",,,,,,,,,""
2017-10-17 10:51:54.088 CEST,,,32142,,59dcdaf4.7d8e,14,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"record with incorrect prev-link 3F136/36 at 28/C000098",,,,,,,,,""
2017-10-17 10:51:54.113 CEST,,,8607,,59e5c4aa.219f,1,,2017-10-17 10:51:54 CEST,,0,FATAL,XX000,"could not connect to the primary server:
could not connect to server: Connection refused
       Is the server running on host « xxxxxx » (2001:41d0:xxxx:xxxx::1) and accepting
       TCP/IP connections on port 5433?
could not connect to server: Connection refused
       Is the server running on host « xxxxxx » (137.xx.xx.xx) and
accepting
       TCP/IP connections on port 5433?",,,,,,,,,""
2017-10-17 10:51:59.133 CEST,,,8610,,59e5c4af.21a2,1,,2017-10-17 10:51:59 CEST,,0,FATAL,XX000,"could not connect to the primary server:
could not connect to server: Connection refused
       Is the server running on host « xxxxxx » (2001:41d0:xxxx:xxxx::1) and accepting
       TCP/IP connections on port 5433 ?
could not connect to server: Connection refused
       Is the server running on host « xxxxxx » (137.xx.xx.xx) and accepting
       TCP/IP connections on port 5433?",,,,,,,,,""
2017-10-17 10:52:03.969 CEST,,,32142,,59dcdaf4.7d8e,15,,2017-10-10 16:36:36 CEST,,0,FATAL,XX000,"could not restore file « 00000001000000280000000C » from archive: child process exited with exit code 255",,,,,,,,,""
2017-10-17 10:52:03.977 CEST,,,32139,,59dcdaf4.7d8b,2,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"startup process (PID 32142) exited with exit code 1",,,,,,,,,""
2017-10-17 10:52:03.977 CEST,,,32139,,59dcdaf4.7d8b,3,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,""
2017-10-17 10:52:03.990 CEST,,,32139,,59dcdaf4.7d8b,4,,2017-10-10 16:36:36 CEST,,0,LOG,00000,"database system is shut down",,,,,,,,,""

似乎伺服器抱怨 WAL 流突然停止,並且由於它無法重新連接到主伺服器,它無法檢查其(備用)數據是否是新的,因此它拒絕啟動。如果有,是否一致?我不明白為什麼應該在主伺服器故障時保留數據庫服務的備用伺服器由於主伺服器故障而無法啟動,所以我認為我做錯了什麼,但是什麼?否則,我應該怎麼做才能讓standby重新啟動並被提升?

首先,在你做一些不可逆轉的事情之前,對你留下的東西做一個完整的副本。

你真的是先停止master,然後銷毀它的數據,還是在它還在執行的時候銷毀它的數據,然後停止它?我想知道最後一個 WAL 是否在被複製的過程中被銷毀。

您無需啟動待機即可進行促銷。升級是您從熱備份到主伺服器的方式,不會中斷。由於您的待機已經中斷,您只需將 recovery.conf 重命名為不礙事,然後啟動它。

引用自:https://dba.stackexchange.com/questions/189089