Replication
MongoDB stepDown fails in a PSA architecture
I have set up a MongoDB cluster with a 3-member Primary-Secondary-Arbiter (PSA) architecture (a minimal initiation sketch follows the lists below).
Environment:
- LXC containers
- Debian Stretch (9.8)
- MongoDB server version: 4.0.6
MongoDB containers:
- lxc-mongodb-01 (primary)
- lxc-mongodb-02 (secondary)
- lxc-mongodb-03 (arbiter)
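For reference, a PSA set like this is typically initiated along these lines (a sketch only; the set name np and the hostnames are taken from the rs.conf() output later in this post, everything else is left at defaults):

rs.initiate({
    _id: "np",
    members: [
        { _id: 0, host: "lxc-mongodb-01:27017" },
        { _id: 1, host: "lxc-mongodb-03:27017", arbiterOnly: true },  // arbiter: votes, but holds no data
        { _id: 2, host: "lxc-mongodb-02:27017" }                      // defaults: priority 1, votes 1
    ]
})

With defaults, every data-bearing member gets priority: 1 and votes: 1; as it turns out below, deviating from those defaults is what broke stepDown here.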
Replication status
Everything seems to be fine and replication is working:
np:PRIMARY> rs.printSlaveReplicationInfo()
source: lxc-mongodb-02:27017
    syncedTo: Wed Mar 06 2019 12:08:27 GMT+0100 (CET)
    0 secs (0 hrs) behind the primary
stepDown failure
However, when I try to swap the primary and secondary with rs.stepDown() (here stepping down for 60 seconds, with a 30-second window for a secondary to catch up), it fails with a "No electable secondaries caught up" error:
np:PRIMARY> rs.stepDown(60, 30)
{
    "operationTime" : Timestamp(1551870647, 1),
    "ok" : 0,
    "errmsg" : "No electable secondaries caught up as of 2019-03-06T12:11:19.140+0100Please use the replSetStepDown command with the argument {force: true} to force node to step down.",
    "code" : 262,
    "codeName" : "ExceededTimeLimit",
    "$clusterTime" : {
        "clusterTime" : Timestamp(1551870647, 1),
        "signature" : {
            "hash" : BinData(0,"+/jQR8cG+y/bPtoF7gnv2Pmn2BY="),
            "keyId" : NumberLong("6653042051040411649")
        }
    }
}
Note that this is a non-production cluster, so there are no transactions in flight.
Logs from server01 (the primary):
2019-03-06T12:08:07.709+0100 I ACCESS [conn17] Successfully authenticated as principal root on admin
2019-03-06T12:10:49.140+0100 I COMMAND [conn17] Attempting to step down in response to replSetStepDown command
2019-03-06T12:11:19.140+0100 I COMMAND [conn17] command admin.$cmd appName: "MongoDB Shell" command: replSetStepDown { replSetStepDown: 60.0, secondaryCatchUpPeriodSecs: 30.0, lsid: { id: UUID("8941645a-c582-4353-b216-6e5ee91c08b0") }, $clusterTime: { clusterTime: Timestamp(1551870507, 1), signature: { hash: BinData(0, 484DDC04A03F9CBEDA0E5FA5E4F438F414E43E8F), keyId: 6653042051040411649 } }, $db: "admin" } numYields:0 ok:0 errMsg:"No electable secondaries caught up as of 2019-03-06T12:11:19.140+0100Please use the replSetStepDown command with the argument {force: true} to force node to step down." errName:ExceededTimeLimit errCode:262 reslen:385 locks:{ Global: { acquireCount: { r: 2, W: 2 } } } protocol:op_msg 29999ms
Logs from server02 (the secondary):
2019-03-06T12:10:52.278+0100 I REPL [replication-1] Restarting oplog query due to error: InterruptedDueToReplStateChange: error in fetcher batch callback :: caused by :: operation was interrupted. Last fetched optime (with hash): { ts: Timestamp(1551870647, 1), t: 8 }[-3124663669138993987]. Restarts remaining: 1
2019-03-06T12:10:52.278+0100 I REPL [replication-1] Scheduled new oplog query Fetcher source: lxc-mongodb-01:27017 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp(1551870647, 1) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 8, readConcern: { afterClusterTime: Timestamp(1551870647, 1) } } query metadata: { $replData: 1, $oplogQueryData: 1, $readPreference: { mode: "secondaryPreferred" } } active: 1 findNetworkTimeout: 7000ms getMoreNetworkTimeout: 10000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 6603 -- target:lxc-mongodb-01:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp(1551870647, 1) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 8, readConcern: { afterClusterTime: Timestamp(1551870647, 1) } } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
2019-03-06T12:10:52.279+0100 W REPL [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: Sync source cannot be behind me, and if I am up-to-date with the sync source, it must have a higher lastOpCommitted. My last fetched oplog optime: { ts: Timestamp(1551870647, 1), t: 8 }, latest oplog optime of sync source: { ts: Timestamp(1551870647, 1), t: 8 }, my lastOpCommitted: { ts: Timestamp(1551870647, 1), t: 8 }, lastOpCommitted of sync source: { ts: Timestamp(1551870647, 1), t: 8 }
2019-03-06T12:10:52.279+0100 I REPL [rsBackgroundSync] Clearing sync source lxc-mongodb-01:27017 to choose a new one.
2019-03-06T12:10:52.279+0100 I REPL [rsBackgroundSync] could not find member to sync from
2019-03-06T12:10:57.276+0100 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to lxc-mongodb-01:27017: InvalidSyncSource: Sync source was cleared. Was lxc-mongodb-01:27017
2019-03-06T12:11:27.284+0100 I REPL [rsBackgroundSync] sync source candidate: lxc-mongodb-01:27017
2019-03-06T12:11:27.286+0100 I REPL [rsBackgroundSync] Changed sync source from empty to lxc-mongodb-01:27017
2019-03-06T12:11:28.833+0100 I NETWORK [LogicalSessionCacheRefresh] Starting new replica set monitor for np/lxc-mongodb-01:27017,lxc-mongodb-02:27017
Logs from server03 (the arbiter):
2019-03-06T12:11:29.428+0100 I NETWORK [LogicalSessionCacheRefresh] Starting new replica set monitor for np/lxc-mongodb-01:27017,lxc-mongodb-02:27017
2019-03-06T12:11:29.429+0100 I NETWORK [LogicalSessionCacheRefresh] Starting new replica set monitor for np/lxc-mongodb-01:27017,lxc-mongodb-02:27017
After going through the documentation and a few related threads, I tried tweaking the following settings, without success:
replication.enableMajorityReadConcern = false
writeConcernMajorityJournalDefault = false
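For context, these two settings live in different places: enableMajorityReadConcern is a mongod startup option (set under replication: in the config file), while writeConcernMajorityJournalDefault is a field of the replica set configuration document, as the rs.conf() output below confirms. A sketch of applying the latter from the shell:

// writeConcernMajorityJournalDefault is part of the replica set config,
// so it is changed with a reconfig rather than in mongod.conf
cfg = rs.conf()
cfg.writeConcernMajorityJournalDefault = false
rs.reconfig(cfg)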
Question
So, what am I missing for stepDown to work as expected?
Edit 07/03/2019
Here is the output of rs.status() on the primary:
np:PRIMARY> rs.status()
{
    "set" : "np",
    "date" : ISODate("2019-03-07T08:08:17.623Z"),
    "myState" : 1,
    "term" : NumberLong(8),
    "syncingTo" : "",
    "syncSourceHost" : "",
    "syncSourceId" : -1,
    "heartbeatIntervalMillis" : NumberLong(2000),
    "optimes" : {
        "lastCommittedOpTime" : { "ts" : Timestamp(1551946089, 1), "t" : NumberLong(8) },
        "readConcernMajorityOpTime" : { "ts" : Timestamp(1551946089, 1), "t" : NumberLong(8) },
        "appliedOpTime" : { "ts" : Timestamp(1551946089, 1), "t" : NumberLong(8) },
        "durableOpTime" : { "ts" : Timestamp(1551946089, 1), "t" : NumberLong(8) }
    },
    "members" : [
        {
            "_id" : 0,
            "name" : "lxc-mongodb-01:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 75954,
            "optime" : { "ts" : Timestamp(1551946089, 1), "t" : NumberLong(8) },
            "optimeDate" : ISODate("2019-03-07T08:08:09Z"),
            "syncingTo" : "",
            "syncSourceHost" : "",
            "syncSourceId" : -1,
            "infoMessage" : "",
            "electionTime" : Timestamp(1551870155, 1),
            "electionDate" : ISODate("2019-03-06T11:02:35Z"),
            "configVersion" : 4,
            "self" : true,
            "lastHeartbeatMessage" : ""
        },
        {
            "_id" : 1,
            "name" : "lxc-mongodb-03:27017",
            "health" : 1,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 75952,
            "lastHeartbeat" : ISODate("2019-03-07T08:08:16.005Z"),
            "lastHeartbeatRecv" : ISODate("2019-03-07T08:08:17.410Z"),
            "pingMs" : NumberLong(0),
            "lastHeartbeatMessage" : "",
            "syncingTo" : "",
            "syncSourceHost" : "",
            "syncSourceId" : -1,
            "infoMessage" : "",
            "configVersion" : 4
        },
        {
            "_id" : 2,
            "name" : "lxc-mongodb-02:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 75952,
            "optime" : { "ts" : Timestamp(1551946089, 1), "t" : NumberLong(8) },
            "optimeDurable" : { "ts" : Timestamp(1551946089, 1), "t" : NumberLong(8) },
            "optimeDate" : ISODate("2019-03-07T08:08:09Z"),
            "optimeDurableDate" : ISODate("2019-03-07T08:08:09Z"),
            "lastHeartbeat" : ISODate("2019-03-07T08:08:16.008Z"),
            "lastHeartbeatRecv" : ISODate("2019-03-07T08:08:15.798Z"),
            "pingMs" : NumberLong(0),
            "lastHeartbeatMessage" : "",
            "syncingTo" : "lxc-mongodb-01:27017",
            "syncSourceHost" : "lxc-mongodb-01:27017",
            "syncSourceId" : 0,
            "infoMessage" : "",
            "configVersion" : 4
        }
    ],
    "ok" : 1,
    "operationTime" : Timestamp(1551946089, 1),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1551946089, 1),
        "signature" : {
            "hash" : BinData(0,"ZPnNWVwjB1K9jdaSHlnfnmRPqqM="),
            "keyId" : NumberLong("6653042051040411649")
        }
    }
}
And here is the output of rs.conf() on the primary:
np:PRIMARY> rs.conf()
{
    "_id" : "np",
    "version" : 4,
    "protocolVersion" : NumberLong(1),
    "writeConcernMajorityJournalDefault" : false,
    "members" : [
        {
            "_id" : 0,
            "host" : "lxc-mongodb-01:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 1,
            "host" : "lxc-mongodb-03:27017",
            "arbiterOnly" : true,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 0,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 2,
            "host" : "lxc-mongodb-02:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 0,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 0
        }
    ],
    "settings" : {
        "chainingAllowed" : true,
        "heartbeatIntervalMillis" : 2000,
        "heartbeatTimeoutSecs" : 10,
        "electionTimeoutMillis" : 10000,
        "catchUpTimeoutMillis" : -1,
        "catchUpTakeoverDelayMillis" : 30000,
        "getLastErrorModes" : { },
        "getLastErrorDefaults" : { "w" : 1, "wtimeout" : 0 },
        "replicaSetId" : ObjectId("5c545a7d4e358716c8129ac6")
    }
}
There were no electable secondaries because the secondary's priority was set to 0 (see the rs.conf() output above; note that its votes field was 0 as well, which also made it a non-voting member. Thanks to Mani for the hint!). I updated the priority (and votes) of lxc-mongodb-02 (_id = 2):
cfg = rs.conf();
cfg.members[0].priority = 2;
cfg.members[2].votes = 1;
cfg.members[2].priority = 1;
rs.reconfig(cfg);
lxc-mongodb-02 is now electable as PRIMARY.
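To confirm electability, the relevant fields can be listed straight from the config (a quick sketch using the fields shown in the rs.conf() output above):

rs.conf().members.forEach(function (m) {
    // to be electable, a data-bearing member needs priority > 0 and votes: 1
    print(m.host + "  priority: " + m.priority + "  votes: " + m.votes);
});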
That said, I just realized that changing priorities performs a permanent switch-over, whereas the rs.stepDown() command only triggers a temporary one. So, to promote lxc-mongodb-02 to primary, I ran:
np:PRIMARY> cfg = rs.conf();
np:PRIMARY> cfg.members[2].priority = 3;
np:PRIMARY> rs.reconfig(cfg);
{
    "ok" : 1,
    "operationTime" : Timestamp(1551953687, 1),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1551953687, 1),
        "signature" : {
            "hash" : BinData(0,"r4jVzPM1nUnJ44THZ3E+cJA1SDU="),
            "keyId" : NumberLong("6653042051040411649")
        }
    }
}
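As a side note, for a temporary hand-off (rather than a permanent priority change), the same rs.stepDown() call that failed earlier should now succeed from the current primary, since lxc-mongodb-02 is electable and fully caught up:

// step down for 60 seconds, allowing secondaries up to 30 seconds to catch up first
np:PRIMARY> rs.stepDown(60, 30)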