Replication
MongoDB stepDown fails in a PSA architecture
I have set up a MongoDB cluster with a 3-member Primary-Secondary-Arbiter (PSA) architecture (a minimal initiation sketch follows the lists below).
Environment:
- LXC containers
- Debian Stretch (9.8)
- MongoDB server version: 4.0.6
MongoDB containers:
- lxc-mongodb-01 (primary)
- lxc-mongodb-02 (secondary)
- lxc-mongodb-03 (arbiter)
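For reference, a PSA set like this is typically initiated along these lines (a sketch only; the set name np and the hostnames are taken from the rs.conf() output later in this post, everything else is left at defaults):

rs.initiate({
    _id: "np",
    members: [
        { _id: 0, host: "lxc-mongodb-01:27017" },
        { _id: 1, host: "lxc-mongodb-03:27017", arbiterOnly: true },  // arbiter: votes, but holds no data
        { _id: 2, host: "lxc-mongodb-02:27017" }                      // defaults: priority 1, votes 1
    ]
})

With defaults, every data-bearing member gets priority: 1 and votes: 1; as it turns out below, deviating from those defaults is what broke stepDown here.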
Replication status
Everything seems to be fine and replication is working:
np:PRIMARY> rs.printSlaveReplicationInfo()
source: lxc-mongodb-02:27017
    syncedTo: Wed Mar 06 2019 12:08:27 GMT+0100 (CET)
    0 secs (0 hrs) behind the primary
stepDown failure
However, when I try to swap the primary and secondary with rs.stepDown() (here stepping down for 60 seconds, with a 30-second window for a secondary to catch up), it fails with a "No electable secondaries caught up" error:
np:PRIMARY> rs.stepDown(60, 30)
{
    "operationTime" : Timestamp(1551870647, 1),
    "ok" : 0,
    "errmsg" : "No electable secondaries caught up as of 2019-03-06T12:11:19.140+0100Please use the replSetStepDown command with the argument {force: true} to force node to step down.",
    "code" : 262,
    "codeName" : "ExceededTimeLimit",
    "$clusterTime" : {
        "clusterTime" : Timestamp(1551870647, 1),
        "signature" : {
            "hash" : BinData(0,"+/jQR8cG+y/bPtoF7gnv2Pmn2BY="),
            "keyId" : NumberLong("6653042051040411649")
        }
    }
}
Note that this is a non-production cluster, so there are no transactions in flight.
Logs from server01 (the primary):
2019-03-06T12:08:07.709+0100 I ACCESS [conn17] Successfully authenticated as principal root on admin
2019-03-06T12:10:49.140+0100 I COMMAND [conn17] Attempting to step down in response to replSetStepDown command
2019-03-06T12:11:19.140+0100 I COMMAND [conn17] command admin.$cmd appName: "MongoDB Shell" command: replSetStepDown { replSetStepDown: 60.0, secondaryCatchUpPeriodSecs: 30.0, lsid: { id: UUID("8941645a-c582-4353-b216-6e5ee91c08b0") }, $clusterTime: { clusterTime: Timestamp(1551870507, 1), signature: { hash: BinData(0, 484DDC04A03F9CBEDA0E5FA5E4F438F414E43E8F), keyId: 6653042051040411649 } }, $db: "admin" } numYields:0 ok:0 errMsg:"No electable secondaries caught up as of 2019-03-06T12:11:19.140+0100Please use the replSetStepDown command with the argument {force: true} to force node to step down." errName:ExceededTimeLimit errCode:262 reslen:385 locks:{ Global: { acquireCount: { r: 2, W: 2 } } } protocol:op_msg 29999ms
Logs from server02 (the secondary):
2019-03-06T12:10:52.278+0100 I REPL [replication-1] Restarting oplog query due to error: InterruptedDueToReplStateChange: error in fetcher batch callback :: caused by :: operation was interrupted. Last fetched optime (with hash): { ts: Timestamp(1551870647, 1), t: 8 }[-3124663669138993987]. Restarts remaining: 1
2019-03-06T12:10:52.278+0100 I REPL [replication-1] Scheduled new oplog query Fetcher source: lxc-mongodb-01:27017 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp(1551870647, 1) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 8, readConcern: { afterClusterTime: Timestamp(1551870647, 1) } } query metadata: { $replData: 1, $oplogQueryData: 1, $readPreference: { mode: "secondaryPreferred" } } active: 1 findNetworkTimeout: 7000ms getMoreNetworkTimeout: 10000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 6603 -- target:lxc-mongodb-01:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp(1551870647, 1) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 8, readConcern: { afterClusterTime: Timestamp(1551870647, 1) } } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
2019-03-06T12:10:52.279+0100 W REPL [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: Sync source cannot be behind me, and if I am up-to-date with the sync source, it must have a higher lastOpCommitted. My last fetched oplog optime: { ts: Timestamp(1551870647, 1), t: 8 }, latest oplog optime of sync source: { ts: Timestamp(1551870647, 1), t: 8 }, my lastOpCommitted: { ts: Timestamp(1551870647, 1), t: 8 }, lastOpCommitted of sync source: { ts: Timestamp(1551870647, 1), t: 8 }
2019-03-06T12:10:52.279+0100 I REPL [rsBackgroundSync] Clearing sync source lxc-mongodb-01:27017 to choose a new one.
2019-03-06T12:10:52.279+0100 I REPL [rsBackgroundSync] could not find member to sync from
2019-03-06T12:10:57.276+0100 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to lxc-mongodb-01:27017: InvalidSyncSource: Sync source was cleared. Was lxc-mongodb-01:27017
2019-03-06T12:11:27.284+0100 I REPL [rsBackgroundSync] sync source candidate: lxc-mongodb-01:27017
2019-03-06T12:11:27.286+0100 I REPL [rsBackgroundSync] Changed sync source from empty to lxc-mongodb-01:27017
2019-03-06T12:11:28.833+0100 I NETWORK [LogicalSessionCacheRefresh] Starting new replica set monitor for np/lxc-mongodb-01:27017,lxc-mongodb-02:27017
Logs from server03 (the arbiter):
2019-03-06T12:11:29.428+0100 I NETWORK [LogicalSessionCacheRefresh] Starting new replica set monitor for np/lxc-mongodb-01:27017,lxc-mongodb-02:27017
2019-03-06T12:11:29.429+0100 I NETWORK [LogicalSessionCacheRefresh] Starting new replica set monitor for np/lxc-mongodb-01:27017,lxc-mongodb-02:27017
After going through the documentation and a few related threads, I tried tweaking the following settings, without success:
replication.enableMajorityReadConcern = false
writeConcernMajorityJournalDefault = false
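For context, these two settings live in different places: enableMajorityReadConcern is a mongod startup option (set under replication: in the config file), while writeConcernMajorityJournalDefault is a field of the replica set configuration document, as the rs.conf() output below confirms. A sketch of applying the latter from the shell:

// writeConcernMajorityJournalDefault is part of the replica set config,
// so it is changed with a reconfig rather than in mongod.conf
cfg = rs.conf()
cfg.writeConcernMajorityJournalDefault = false
rs.reconfig(cfg)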
Question
So, what am I missing for stepDown to work as expected?
Edit 07/03/2019
Here is the output of rs.status() on the primary:
np:PRIMARY> rs.status()
{
    "set" : "np",
    "date" : ISODate("2019-03-07T08:08:17.623Z"),
    "myState" : 1,
    "term" : NumberLong(8),
    "syncingTo" : "",
    "syncSourceHost" : "",
    "syncSourceId" : -1,
    "heartbeatIntervalMillis" : NumberLong(2000),
    "optimes" : {
        "lastCommittedOpTime" : { "ts" : Timestamp(1551946089, 1), "t" : NumberLong(8) },
        "readConcernMajorityOpTime" : { "ts" : Timestamp(1551946089, 1), "t" : NumberLong(8) },
        "appliedOpTime" : { "ts" : Timestamp(1551946089, 1), "t" : NumberLong(8) },
        "durableOpTime" : { "ts" : Timestamp(1551946089, 1), "t" : NumberLong(8) }
    },
    "members" : [
        {
            "_id" : 0,
            "name" : "lxc-mongodb-01:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 75954,
            "optime" : { "ts" : Timestamp(1551946089, 1), "t" : NumberLong(8) },
            "optimeDate" : ISODate("2019-03-07T08:08:09Z"),
            "syncingTo" : "",
            "syncSourceHost" : "",
            "syncSourceId" : -1,
            "infoMessage" : "",
            "electionTime" : Timestamp(1551870155, 1),
            "electionDate" : ISODate("2019-03-06T11:02:35Z"),
            "configVersion" : 4,
            "self" : true,
            "lastHeartbeatMessage" : ""
        },
        {
            "_id" : 1,
            "name" : "lxc-mongodb-03:27017",
            "health" : 1,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 75952,
            "lastHeartbeat" : ISODate("2019-03-07T08:08:16.005Z"),
            "lastHeartbeatRecv" : ISODate("2019-03-07T08:08:17.410Z"),
            "pingMs" : NumberLong(0),
            "lastHeartbeatMessage" : "",
            "syncingTo" : "",
            "syncSourceHost" : "",
            "syncSourceId" : -1,
            "infoMessage" : "",
            "configVersion" : 4
        },
        {
            "_id" : 2,
            "name" : "lxc-mongodb-02:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 75952,
            "optime" : { "ts" : Timestamp(1551946089, 1), "t" : NumberLong(8) },
            "optimeDurable" : { "ts" : Timestamp(1551946089, 1), "t" : NumberLong(8) },
            "optimeDate" : ISODate("2019-03-07T08:08:09Z"),
            "optimeDurableDate" : ISODate("2019-03-07T08:08:09Z"),
            "lastHeartbeat" : ISODate("2019-03-07T08:08:16.008Z"),
            "lastHeartbeatRecv" : ISODate("2019-03-07T08:08:15.798Z"),
            "pingMs" : NumberLong(0),
            "lastHeartbeatMessage" : "",
            "syncingTo" : "lxc-mongodb-01:27017",
            "syncSourceHost" : "lxc-mongodb-01:27017",
            "syncSourceId" : 0,
            "infoMessage" : "",
            "configVersion" : 4
        }
    ],
    "ok" : 1,
    "operationTime" : Timestamp(1551946089, 1),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1551946089, 1),
        "signature" : {
            "hash" : BinData(0,"ZPnNWVwjB1K9jdaSHlnfnmRPqqM="),
            "keyId" : NumberLong("6653042051040411649")
        }
    }
}
And here is the output of rs.conf() on the primary:
np:PRIMARY> rs.conf()
{
    "_id" : "np",
    "version" : 4,
    "protocolVersion" : NumberLong(1),
    "writeConcernMajorityJournalDefault" : false,
    "members" : [
        {
            "_id" : 0,
            "host" : "lxc-mongodb-01:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 1,
            "host" : "lxc-mongodb-03:27017",
            "arbiterOnly" : true,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 0,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 2,
            "host" : "lxc-mongodb-02:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 0,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 0
        }
    ],
    "settings" : {
        "chainingAllowed" : true,
        "heartbeatIntervalMillis" : 2000,
        "heartbeatTimeoutSecs" : 10,
        "electionTimeoutMillis" : 10000,
        "catchUpTimeoutMillis" : -1,
        "catchUpTakeoverDelayMillis" : 30000,
        "getLastErrorModes" : { },
        "getLastErrorDefaults" : { "w" : 1, "wtimeout" : 0 },
        "replicaSetId" : ObjectId("5c545a7d4e358716c8129ac6")
    }
}
There were no electable secondaries because the secondary's priority was set to 0 (see the rs.conf() output above; note that its votes field was 0 as well, which also made it a non-voting member. Thanks to Mani for the hint!). I updated the priority (and votes) of lxc-mongodb-02 (_id = 2):
cfg = rs.conf();
cfg.members[0].priority = 2;
cfg.members[2].votes = 1;
cfg.members[2].priority = 1;
rs.reconfig(cfg);
lxc-mongodb-02 is now electable as PRIMARY.
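To confirm electability, the relevant fields can be listed straight from the config (a quick sketch using the fields shown in the rs.conf() output above):

rs.conf().members.forEach(function (m) {
    // to be electable, a data-bearing member needs priority > 0 and votes: 1
    print(m.host + "  priority: " + m.priority + "  votes: " + m.votes);
});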
That said, I just realized that changing priorities performs a permanent switch-over, whereas the rs.stepDown() command only triggers a temporary one. So, to promote lxc-mongodb-02 to primary, I ran:
np:PRIMARY> cfg = rs.conf();
np:PRIMARY> cfg.members[2].priority = 3;
np:PRIMARY> rs.reconfig(cfg);
{
    "ok" : 1,
    "operationTime" : Timestamp(1551953687, 1),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1551953687, 1),
        "signature" : {
            "hash" : BinData(0,"r4jVzPM1nUnJ44THZ3E+cJA1SDU="),
            "keyId" : NumberLong("6653042051040411649")
        }
    }
}
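As a side note, for a temporary hand-off (rather than a permanent priority change), the same rs.stepDown() call that failed earlier should now succeed from the current primary, since lxc-mongodb-02 is electable and fully caught up:

// step down for 60 seconds, allowing secondaries up to 30 seconds to catch up first
np:PRIMARY> rs.stepDown(60, 30)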