MongoDB-12-Replication Module-Sequence Diagrams
1. Sequence Diagram Overview
The sequence diagrams for the replication module illustrate the complete flows of MongoDB replica set core scenarios, including initial sync, steady-state replication, elections, and failover. Each diagram highlights the cooperation between nodes and the data-consistency guarantees of the distributed system.
2. Initial Sync Sequence Diagram
2.1 A New Node Joins the Replica Set
sequenceDiagram
autonumber
participant NewNode as New Node
participant Primary as Primary
participant Secondary as Secondary
participant IC as InitialSyncer
participant DC as DatabaseCloner
participant CC as CollectionCloner
participant Storage as Storage Engine
Note over NewNode: Start the initial sync process
NewNode->>IC: startup(opCtx, maxAttempts)
IC->>IC: chooseSyncSource()
IC->>Primary: connect to the sync source
Note over IC: Phase 1: determine the start timestamp
IC->>Primary: rs.status()
Primary-->>IC: currentOpTime
IC->>IC: recordBeginTimestamp(opTime)
Note over IC: Phase 2: clone all data
IC->>Primary: listDatabases
Primary-->>IC: databases[]
loop For each database
IC->>DC: cloneDatabase(dbName)
DC->>Primary: listCollections(dbName)
Primary-->>DC: collections[]
loop For each collection
DC->>CC: cloneCollection(nss)
CC->>Primary: find(nss).limit(batchSize)
Primary-->>CC: documents[]
CC->>Storage: insertDocuments(docs)
Storage-->>CC: insert complete
Note over CC: Clone indexes
CC->>Primary: listIndexes(nss)
Primary-->>CC: indexes[]
CC->>Storage: createIndexes(indexes)
Storage-->>CC: index build complete
end
DC-->>IC: database clone complete
end
Note over IC: Phase 3: apply the incremental oplog
IC->>Primary: find(oplog).gte(beginTimestamp)
Primary-->>IC: oplogEntries[]
loop Apply oplog entries
IC->>Storage: applyOplogEntry(entry)
Storage-->>IC: apply complete
end
Note over IC: Phase 4: final consistency check
IC->>Primary: getLastCommittedOpTime()
Primary-->>IC: finalOpTime
IC->>IC: verifyConsistency(finalOpTime)
IC->>Storage: setInitialSyncComplete()
Storage-->>IC: marked complete
IC-->>NewNode: initialSyncCompleted(finalOpTime)
NewNode->>NewNode: transition to SECONDARY
Note over NewNode: Begin normal replication
NewNode->>Primary: start heartbeats and oplog tailing
2.1.1 Diagram Summary
Initial sync of a new node is a four-phase process: determine the start point, clone the data, apply the incremental oplog, and verify consistency, ensuring the new node ends up fully in sync with the replica set.
2.1.2 Key Fields / Interfaces
- startup: start the initial sync process
- chooseSyncSource: select the sync source node
- cloneDatabase: database-level cloning
- applyOplogEntry: apply a single oplog entry
2.1.3 Boundary Conditions
- Time window: managing the oplog time window during initial sync
- Network disconnects: handling network interruptions mid-sync
- Storage capacity: checking disk space on the target node
- Concurrency limit: capping how many nodes perform initial sync at once
2.1.4 Errors and Fallbacks
- When the sync source becomes unavailable, another node is selected automatically
- Cloning resumes from a checkpoint after a network interruption
- On insufficient disk space, the sync pauses and reports an error
- On detected data inconsistency, initial sync restarts from scratch
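The four phases and the restart-from-scratch fallback can be sketched as a retry loop. This is a minimal illustrative Python simulation, not MongoDB's actual implementation (the real logic lives in the server's C++ InitialSyncer); the `SyncSource` class and all names here are hypothetical.

```python
class SyncSource:
    """Stand-in for the remote node we clone from (hypothetical)."""
    def __init__(self, data, oplog):
        self.data = data            # {db: {collection: [documents]}}
        self.oplog = oplog          # list of (timestamp, operation) pairs

    def current_op_time(self):
        return self.oplog[-1][0] if self.oplog else 0

def initial_sync(source, max_attempts=3):
    """Run the four phases; restart from scratch on failure."""
    for _ in range(max_attempts):
        try:
            # Phase 1: record the begin timestamp on the sync source.
            begin_ts = source.current_op_time()
            # Phase 2: clone every database and collection.
            cloned = {db: {coll: list(docs) for coll, docs in colls.items()}
                      for db, colls in source.data.items()}
            # Phase 3: apply oplog entries written since begin_ts.
            applied_ts = begin_ts
            for ts, _op in source.oplog:
                if ts > begin_ts:
                    applied_ts = ts      # a real applier replays _op here
            # Phase 4: consistency check against the source's final optime.
            if applied_ts < source.current_op_time():
                raise RuntimeError("fell behind during sync")
            return cloned, applied_ts
        except RuntimeError:
            continue                     # restart, as the fallback list says
    raise RuntimeError("initial sync failed after max attempts")
```

A single successful pass returns the cloned data plus the optime at which the node is consistent, which is where the real node transitions to SECONDARY.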
2.2 Incremental Data Sync
sequenceDiagram
autonumber
participant Primary as Primary
participant Secondary as Secondary
participant OF as OplogFetcher
participant OA as OplogApplier
participant RC as ReplicationCoordinator
participant Storage as Storage Engine
Note over Primary: Primary executes a write
Primary->>Primary: insert({name:"Alice", age:25})
Primary->>Storage: writeToOplog(oplogEntry)
Storage-->>Primary: oplog write complete
Note over Secondary: Secondary pulls the oplog
Secondary->>OF: fetchNextBatch()
OF->>Primary: find(oplog).gt(lastFetched)
Primary-->>OF: oplogEntries[]
OF-->>Secondary: newEntries[]
Note over Secondary: Apply the oplog entries
Secondary->>OA: applyOplogBatch(entries)
loop For each oplog entry
OA->>OA: validateOplogEntry(entry)
alt insert operation
OA->>Storage: insertDocument(entry.o)
Storage-->>OA: insert ok
else update operation
OA->>Storage: updateDocument(entry.o2, entry.o)
Storage-->>OA: update ok
else delete operation
OA->>Storage: deleteDocument(entry.o)
Storage-->>OA: delete ok
end
OA->>Storage: advanceAppliedOpTime(entry.ts)
end
OA-->>Secondary: batch applied
Note over Secondary: Update replication state
Secondary->>RC: updateLastAppliedOpTime(opTime)
RC->>RC: checkWriteConcernWaiters()
RC->>Primary: updatePosition(appliedOpTime)
Primary-->>Secondary: position update acknowledged
Note over Primary: Check whether write concern is satisfied
Primary->>RC: checkWriteConcernSatisfied()
RC-->>Primary: write concern satisfied
2.2.1 Diagram Summary
Incremental sync is the core replication mechanism: secondaries continuously pull the primary's oplog and apply it, keeping the data eventually consistent.
2.2.2 Key Fields / Interfaces
- fetchNextBatch: fetch the next batch of oplog entries
- applyOplogBatch: apply a batch of oplog entries
- updatePosition: report apply progress to the primary
- checkWriteConcernSatisfied: check whether a write concern is satisfied
2.2.3 Boundary Conditions
- Batch size: controlling oplog batch size and memory usage
- Apply lag: the maximum allowed secondary lag
- Network bandwidth: limits on oplog transfer bandwidth
- Storage performance: balancing apply speed against storage I/O
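The per-entry alt branches (insert / update / delete) amount to a dispatch over the oplog entry's "op" field. A minimal sketch, with an in-memory dict standing in for the storage engine; the entry fields ("op", "o", "o2", "ts") mirror real oplog documents, but the function itself is illustrative.

```python
def apply_oplog_batch(store, entries):
    """Apply a batch of oplog entries to `store`; return the last applied ts.

    `store` maps _id -> document, a toy stand-in for the storage engine.
    """
    last_ts = None
    for e in entries:
        if e["op"] == "i":                       # insert: "o" is the document
            store[e["o"]["_id"]] = e["o"]
        elif e["op"] == "u":                     # update: "o2" locates, "o" mutates
            store[e["o2"]["_id"]].update(e["o"])
        elif e["op"] == "d":                     # delete: "o" locates the document
            store.pop(e["o"]["_id"], None)
        last_ts = e["ts"]                        # advance the applied optime
    return last_ts
```

The returned timestamp is what the secondary would report back via updatePosition.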
3. Election Sequence Diagrams
3.1 Primary Election
sequenceDiagram
autonumber
participant Node1 as Node 1 (candidate)
participant Node2 as Node 2
participant Node3 as Node 3
participant Node4 as Node 4
participant Node5 as Node 5
participant TC as TopologyCoordinator
Note over Node1: Detects that the primary is unreachable
Node1->>Node1: detectPrimaryDown()
Node1->>TC: shouldStartElection()
TC-->>Node1: true
Note over Node1: Start the election
Node1->>Node1: startElection()
Node1->>TC: incrementTerm()
TC-->>Node1: new term begins
Note over Node1: Request votes from all nodes
Node1->>Node2: requestVote(term, candidateId, lastOpTime)
Node1->>Node3: requestVote(term, candidateId, lastOpTime)
Node1->>Node4: requestVote(term, candidateId, lastOpTime)
Node1->>Node5: requestVote(term, candidateId, lastOpTime)
Note over Node2,Node5: Each node processes the vote request
Node2->>Node2: validateVoteRequest()
Node2->>Node2: checkOpTimeAndTerm()
Node2-->>Node1: voteGranted:true
Node3->>Node3: validateVoteRequest()
Node3->>Node3: checkOpTimeAndTerm()
Node3-->>Node1: voteGranted:true
Node4->>Node4: validateVoteRequest()
Node4->>Node4: alreadyVotedInTerm()
Node4-->>Node1: voteGranted:false
Node5->>Node5: validateVoteRequest()
Node5->>Node5: checkOpTimeAndTerm()
Node5-->>Node1: voteGranted:true
Note over Node1: Tally the votes
Node1->>Node1: countVotes()
Node1->>Node1: checkMajority(4/5 >= 3)
Note over Node1: Election won; becoming primary
Node1->>TC: processWinElection()
TC->>TC: becomeLeader()
TC-->>Node1: transition to PRIMARY state
Note over Node1: Send heartbeats to all nodes to assert primary identity
Node1->>Node2: heartbeat(term, PRIMARY)
Node1->>Node3: heartbeat(term, PRIMARY)
Node1->>Node4: heartbeat(term, PRIMARY)
Node1->>Node5: heartbeat(term, PRIMARY)
Node2-->>Node1: heartbeatResponse(term, SECONDARY)
Node3-->>Node1: heartbeatResponse(term, SECONDARY)
Node4-->>Node1: heartbeatResponse(term, SECONDARY)
Node5-->>Node1: heartbeatResponse(term, SECONDARY)
Note over Node1: Election complete; start accepting writes
Node1->>Node1: acceptWrites()
3.1.1 Diagram Summary
Primary election follows the Raft protocol: a candidate increments its term, requests votes from the other nodes and, after winning a majority (here 3 grants plus its own vote, 4 of 5), becomes the new primary and broadcasts its identity to the cluster.
3.1.2 Key Fields / Interfaces
- detectPrimaryDown: detect primary failure
- requestVote: request votes from the other nodes
- validateVoteRequest: validate an incoming vote request
- processWinElection: handle winning the election
3.1.3 Boundary Conditions
- Vote timeout: timeout settings for vote requests
- Term management: guaranteeing monotonically increasing election terms
- Network partitions: election behavior under a network partition
- Concurrent elections: handling multiple simultaneous candidacies
3.1.4 Errors and Fallbacks
- Restart the election when vote requests time out
- Under a network partition, wait for the partition to heal
- Fall back to secondary state when the election is lost
- Abandon the candidacy upon seeing a higher term
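The vote-granting checks in the diagram (oplog comparison, one vote per term) and the majority tally can be sketched as follows. This is a simplified Raft-style model, not the server's TopologyCoordinator; optimes are plain integers and all names are illustrative.

```python
class Voter:
    """A node that answers requestVote, granting at most one vote per term."""
    def __init__(self, last_op_time):
        self.last_op_time = last_op_time
        self.voted_in_term = {}            # term -> candidate id

    def request_vote(self, term, candidate_id, candidate_op_time):
        if candidate_op_time < self.last_op_time:
            return False                   # candidate's oplog is behind ours
        if term in self.voted_in_term:
            return False                   # already voted in this term
        self.voted_in_term[term] = candidate_id
        return True

def run_election(candidate_id, term, candidate_op_time, voters):
    """Return True if the candidate wins a strict majority of the set."""
    votes = 1                              # the candidate votes for itself
    for v in voters:
        if v.request_vote(term, candidate_id, candidate_op_time):
            votes += 1
    total = len(voters) + 1                # voters plus the candidate
    return votes > total // 2
```

With one node having already voted in the term (like Node 4 above), the candidate still wins 4 of 5 votes.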
3.2 Primary Step-Down
sequenceDiagram
autonumber
participant Client as Client
participant Primary as Primary
participant Secondary1 as Secondary 1
participant Secondary2 as Secondary 2
participant RC as ReplicationCoordinator
Client->>Primary: rs.stepDown(120)
Primary->>RC: stepDown(force:false, waitTime:120s)
Note over Primary: Check step-down preconditions
RC->>RC: validateStepDownRequest()
RC->>RC: checkCatchUpStatus()
alt Secondaries not yet caught up
RC->>Primary: wait for secondaries to catch up
Primary->>Primary: waitForCatchUp(120s)
loop Poll until caught up or timed out
Primary->>Secondary1: checkAppliedOpTime()
Primary->>Secondary2: checkAppliedOpTime()
Secondary1-->>Primary: appliedOpTime
Secondary2-->>Primary: appliedOpTime
break A secondary has caught up
Primary->>Primary: caught up; proceed with the step-down
end
break Timed out before catch-up
Primary-->>Client: stepDown failed (secondaries not caught up)
end
end
end
Note over Primary: Execute the step-down
Primary->>RC: executeStepDown()
RC->>RC: transitionToSecondary()
RC->>RC: setFollowerMode(SECONDARY)
Note over Primary,Secondary1: State transition complete
Primary->>Primary: stopAcceptingWrites()
Primary->>Primary: killClientCursors()
Primary->>Primary: dropConnections()
Note over Primary: Notify the other nodes of the state change
Primary->>Secondary1: heartbeat(SECONDARY)
Primary->>Secondary2: heartbeat(SECONDARY)
Secondary1-->>Primary: heartbeatResponse()
Secondary2-->>Primary: heartbeatResponse()
Note over Secondary1,Secondary2: The other nodes detect the primary's state change
Secondary1->>Secondary1: detectPrimaryStepDown()
Secondary2->>Secondary2: detectPrimaryStepDown()
Note over Secondary1: May trigger a new election
Secondary1->>Secondary1: startElectionIfNeeded()
Primary-->>Client: stepDown succeeded
3.2.1 Diagram Summary
Primary step-down is a deliberate state transition: the primary ensures the secondaries have caught up before it stops serving writes, preserving cluster availability.
3.2.2 Key Fields / Interfaces
- stepDown: the primary step-down command
- validateStepDownRequest: validate the step-down request
- waitForCatchUp: wait for secondaries to catch up
- transitionToSecondary: transition to the secondary state
3.2.3 Boundary Conditions
- Wait timeout: the maximum time to wait for secondaries to catch up
- Forced step-down: a force mode that skips waiting for secondaries
- Client connections: handling client connections during step-down
- Cursor management: cleaning up open query cursors
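The catch-up wait with its timeout branch can be modeled as a bounded polling loop. A sketch using a simulated clock instead of real sleeps; the polling callback and all parameter names are assumptions, not MongoDB APIs.

```python
def wait_for_catch_up(target_op_time, poll_secondaries, timeout_s,
                      now=0.0, poll_interval=1.0):
    """Return True if any secondary reaches target_op_time within timeout_s.

    poll_secondaries() returns the current applied optimes of the
    secondaries; `now` and `poll_interval` simulate the passage of time.
    """
    deadline = now + timeout_s
    while now < deadline:
        applied = poll_secondaries()          # e.g. [optime1, optime2]
        if max(applied) >= target_op_time:
            return True                       # safe to step down
        now += poll_interval                  # simulated sleep between polls
    return False                              # stepDown fails with a timeout
```

A False return corresponds to the "stepDown failed (secondaries not caught up)" branch in the diagram.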
4. Failover Sequence Diagram
4.1 Primary Failure Detection and Failover
sequenceDiagram
autonumber
participant Primary as Primary (failed)
participant Secondary1 as Secondary 1
participant Secondary2 as Secondary 2
participant Secondary3 as Secondary 3
participant TC as TopologyCoordinator
participant Client as Client
Note over Primary: The primary fails
Primary->>Primary: networkFailure() / crash()
Note over Secondary1,Secondary3: Secondaries detect heartbeat timeouts
Secondary1->>Primary: heartbeat()
Secondary2->>Primary: heartbeat()
Secondary3->>Primary: heartbeat()
Note over Primary: The primary is unresponsive
Primary-->Secondary1: timeout
Primary-->Secondary2: timeout
Primary-->Secondary3: timeout
Note over Secondary1: Secondary 1 detects the failure
Secondary1->>TC: processHeartbeatResponse(timeout)
TC->>TC: markHostDown(primary)
TC->>TC: checkElectionTimeout()
TC-->>Secondary1: shouldStartElection:true
Note over Secondary2,Secondary3: The other secondaries also detect the failure
Secondary2->>TC: processHeartbeatResponse(timeout)
Secondary3->>TC: processHeartbeatResponse(timeout)
Note over Secondary1: Secondary 1 starts an election
Secondary1->>Secondary1: startElection()
Secondary1->>Secondary2: requestVote(term:5, lastOpTime)
Secondary1->>Secondary3: requestVote(term:5, lastOpTime)
Note over Secondary2,Secondary3: Process the vote requests
Secondary2->>Secondary2: validateCandidate()
Secondary2->>Secondary2: compareOpTime()
Secondary2-->>Secondary1: voteGranted:true
Secondary3->>Secondary3: validateCandidate()
Secondary3->>Secondary3: compareOpTime()
Secondary3-->>Secondary1: voteGranted:true
Note over Secondary1: Majority won; becoming the new primary
Secondary1->>TC: processWinElection()
TC->>TC: becomeLeader(term:5)
TC-->>Secondary1: transitionToPrimary()
Note over Secondary1: The new primary starts serving
Secondary1->>Secondary1: acceptWrites()
Secondary1->>Secondary2: heartbeat(PRIMARY, term:5)
Secondary1->>Secondary3: heartbeat(PRIMARY, term:5)
Secondary2->>Secondary2: acceptNewPrimary(term:5)
Secondary3->>Secondary3: acceptNewPrimary(term:5)
Secondary2-->>Secondary1: heartbeatResponse(SECONDARY)
Secondary3-->>Secondary1: heartbeatResponse(SECONDARY)
Note over Client: The client reconnects to the new primary
Client->>Secondary1: connect()
Secondary1-->>Client: connection established
Client->>Secondary1: insert({name:"Bob"})
Secondary1-->>Client: insert successful
Note over Secondary1: Failover complete
Secondary1->>Secondary2: replicate oplog
Secondary1->>Secondary3: replicate oplog
4.1.1 Diagram Summary
Failover relies on heartbeats to detect primary failure; the secondaries automatically hold an election and choose a new primary, keeping the service available.
4.1.2 Key Fields / Interfaces
- processHeartbeatResponse: process a heartbeat response
- markHostDown: mark a node as unavailable
- checkElectionTimeout: check the election timeout
- acceptNewPrimary: accept the new primary
4.1.3 Boundary Conditions
- Heartbeat timeout: the heartbeat detection timeout threshold
- Election window: handling simultaneous elections from multiple nodes
- Split-brain protection: mechanisms that prevent two primaries
- Client switchover: clients automatically discovering the new primary
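Heartbeat-timeout detection comes down to comparing the time since the last successful heartbeat against the election timeout, the markHostDown / checkElectionTimeout pair above. A minimal sketch; the 10-second default mirrors MongoDB's `electionTimeoutMillis` of 10000, everything else is illustrative.

```python
class FailureDetector:
    """Tracks last-heard heartbeats and decides when to call an election."""
    def __init__(self, election_timeout_s=10.0):
        self.election_timeout_s = election_timeout_s
        self.last_heartbeat = {}          # host -> time of last response

    def on_heartbeat(self, host, now):
        """Record a successful heartbeat response from `host`."""
        self.last_heartbeat[host] = now

    def should_start_election(self, primary, now):
        """True once the primary has been silent longer than the timeout."""
        last = self.last_heartbeat.get(primary)
        if last is None:
            return False                  # never heard from it; no baseline
        return (now - last) > self.election_timeout_s
```

In the diagram, a True result is what drives `shouldStartElection:true` on Secondary 1.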
5. Write Concern Sequence Diagram
5.1 Majority Write Concern
sequenceDiagram
autonumber
participant Client as Client
participant Primary as Primary
participant Secondary1 as Secondary 1
participant Secondary2 as Secondary 2
participant RC as ReplicationCoordinator
participant WaiterList as Write Concern Waiters
Client->>Primary: insert(doc, {w:"majority", wtimeout:5000})
Primary->>Primary: insertToCollection(doc)
Primary->>Primary: writeToOplog(oplogEntry)
Note over Primary: Register a write-concern waiter
Primary->>RC: awaitReplication(opTime, w:"majority")
RC->>RC: validateWriteConcern(w:"majority")
RC->>WaiterList: addWaiter(opTime, writeConcern)
WaiterList-->>RC: waiterId
Note over Primary: Replicate to the secondaries
Primary->>Secondary1: replicateOplog(oplogEntry)
Primary->>Secondary2: replicateOplog(oplogEntry)
Note over Secondary1: Secondary 1 applies the oplog
Secondary1->>Secondary1: applyOplogEntry(entry)
Secondary1->>Secondary1: updateAppliedOpTime(opTime)
Secondary1->>Primary: updatePosition(appliedOpTime)
Note over Secondary2: Secondary 2 applies the oplog
Secondary2->>Secondary2: applyOplogEntry(entry)
Secondary2->>Secondary2: updateAppliedOpTime(opTime)
Secondary2->>Primary: updatePosition(appliedOpTime)
Note over Primary: Check whether the write concern is satisfied
Primary->>RC: processUpdatePosition(member1, opTime)
RC->>RC: updateMemberAppliedOpTime(member1, opTime)
RC->>RC: checkWriteConcernSatisfied(opTime, w:"majority")
alt Majority not yet reached
RC->>RC: waitForMoreUpdates()
else Majority reached
RC->>WaiterList: notifyWaiters(opTime)
WaiterList->>WaiterList: wakeMatchingWaiters(opTime)
WaiterList-->>RC: waitersNotified
end
RC-->>Primary: writeConcernSatisfied(opTime)
Primary-->>Client: {ok:1, writeConcern:{w:"majority", wtimeout:5000}}
5.1.1 Diagram Summary
Write concern holds the client acknowledgment until the data has replicated to the requested number of nodes, providing durability guarantees at configurable levels.
5.1.2 Key Fields / Interfaces
- awaitReplication: wait for replication to complete
- addWaiter: register a write-concern waiter
- updatePosition: update a member's replication position
- checkWriteConcernSatisfied: check whether a write concern is satisfied
5.1.3 Boundary Conditions
- Timeout handling: what happens when a write-concern wait times out
- Node counts: the node-count requirements of each write-concern mode
- Network latency: the impact of latency on write-concern performance
- Memory management: bounding the memory used by write-concern waiters
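The checkWriteConcernSatisfied step for w:"majority" reduces to counting how many members have applied the write's optime. A sketch under the assumption that optimes are comparable integers and that the primary itself appears in the position map; the function name is illustrative.

```python
def majority_committed(op_time, applied_by_member, replica_set_size):
    """True once a strict majority of members has applied op_time.

    applied_by_member maps member name -> last applied optime, as reported
    via updatePosition; the primary must be included in the map.
    """
    acks = sum(1 for t in applied_by_member.values() if t >= op_time)
    return acks > replica_set_size // 2
```

In a 3-node set, the primary plus one secondary (2 of 3) is enough to wake the waiter; if the check fails, the real coordinator simply waits for more updatePosition messages, or times out after `wtimeout`.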
6. Configuration Change Sequence Diagram
6.1 Replica Set Reconfiguration
sequenceDiagram
autonumber
participant Admin as Administrator
participant Primary as Primary
participant Secondary1 as Secondary 1
participant Secondary2 as Secondary 2
participant NewNode as New Node
participant RC as ReplicationCoordinator
Admin->>Primary: rs.reconfig(newConfig, {force:false})
Primary->>RC: processReplSetReconfig(newConfig)
Note over Primary: Validate the new config
RC->>RC: validateNewConfig(newConfig)
RC->>RC: checkConfigCompatibility(oldConfig, newConfig)
alt Config invalid
RC-->>Primary: configValidationError
Primary-->>Admin: reconfig failed
else Config valid
Note over Primary: Apply the new config
RC->>RC: installNewConfig(newConfig)
RC->>RC: updateMemberStates(newConfig)
Note over Primary: Send the new config to the existing nodes
Primary->>Secondary1: replSetUpdateConfig(newConfig)
Primary->>Secondary2: replSetUpdateConfig(newConfig)
Secondary1->>Secondary1: installConfig(newConfig)
Secondary2->>Secondary2: installConfig(newConfig)
Secondary1-->>Primary: configUpdateAck
Secondary2-->>Primary: configUpdateAck
Note over Primary: If the config adds a new node
Primary->>NewNode: replSetHeartbeat(newConfig)
NewNode->>NewNode: receiveConfig(newConfig)
NewNode->>NewNode: startInitialSync()
Note over NewNode: The new node begins initial sync
NewNode->>Primary: initialSyncRequest()
Primary-->>NewNode: beginInitialSync()
loop Initial sync
NewNode->>Primary: cloneData()
Primary-->>NewNode: dataChunk
end
NewNode->>NewNode: completeInitialSync()
NewNode->>Primary: readyForReplication()
Note over Primary: Update the topology
Primary->>RC: updateTopology(newMember)
RC->>RC: addMemberToTopology(newNode)
Note over Primary: Begin replicating to the new node
Primary->>NewNode: replicateOplog()
NewNode-->>Primary: replicationAck
RC-->>Primary: reconfig complete
Primary-->>Admin: {ok:1, newConfig}
end
6.1.1 Diagram Summary
Replica set reconfiguration supports dynamically adding or removing nodes and changing configuration parameters, using config versioning to keep the cluster state consistent.
6.1.2 Key Fields / Interfaces
- processReplSetReconfig: handle a reconfig request
- validateNewConfig: validate the new configuration
- installNewConfig: install the new configuration
- replSetUpdateConfig: push the config update to the nodes
6.1.3 Boundary Conditions
- Config version: guaranteeing monotonically increasing config versions
- Node state: checking node states during reconfig
- Safety limits: restrictions against removing too many voting nodes
- Rollback handling: rolling back when applying the config fails
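The validateNewConfig-style checks can be sketched as a pure function over the old and new config documents. The monotonic-version rule and the seven-voting-member cap match real replica-set constraints; the dict layout and the function name are illustrative assumptions.

```python
def validate_reconfig(old_config, new_config):
    """Return (ok, message) for a proposed replica set reconfiguration."""
    # Config versions must increase monotonically across reconfigs.
    if new_config["version"] <= old_config["version"]:
        return False, "version must increase monotonically"
    # Members with votes > 0 participate in elections.
    voters = [m for m in new_config["members"] if m.get("votes", 1) > 0]
    if not voters:
        return False, "at least one voting member required"
    if len(voters) > 7:
        return False, "at most 7 voting members allowed"
    return True, "ok"
```

A failed validation corresponds to the configValidationError branch in the diagram; only a passing config is installed and propagated.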
7. Version Compatibility and Evolution
7.1 Compatibility Notes
- Protocol version evolution:
  - PV0 → PV1: upgrade from the legacy election protocol to the Raft-based protocol
  - MongoDB 4.0+: introduced majority read concern
  - MongoDB 4.2+: improved initial sync performance
  - MongoDB 5.0+: support for point-in-time recovery
- Backward-compatible interfaces:
  - Older clients can connect to newer replica sets
  - The heartbeat protocol remains backward compatible
  - oplog format evolution preserves compatibility
- Performance optimization history:
  - Parallel initial sync: faster data cloning
  - Batched oplog application: lower apply latency
  - Smarter sync-source selection: network-topology awareness
7.2 Monitoring and Observability
sequenceDiagram
autonumber
participant Monitor as Monitoring System
participant Admin as Administrator
participant Primary as Primary
participant Secondary1 as Secondary 1
participant Secondary2 as Secondary 2
loop Collect metrics periodically
Monitor->>Primary: db.serverStatus().repl
Primary-->>Monitor: replication statistics
Monitor->>Secondary1: rs.status()
Secondary1-->>Monitor: node status
Monitor->>Secondary2: rs.status()
Secondary2-->>Monitor: node status
Note over Monitor: Analyze key metrics
Monitor->>Monitor: checkLagTime(appliedOpTime)
Monitor->>Monitor: checkElectionMetrics()
Monitor->>Monitor: checkOplogSize()
alt Anomaly detected
Monitor->>Monitor: generateAlert()
Monitor->>Admin: sendNotification()
end
end
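The checkLagTime step compares the primary's latest optime against each secondary's applied optime. A sketch with integer optimes and a hypothetical threshold; real monitoring would derive these values from rs.status() member documents.

```python
def check_lag(primary_op_time, secondary_op_times, max_lag=10):
    """Return the members whose replication lag exceeds max_lag.

    secondary_op_times maps member name -> last applied optime; lag is the
    gap to the primary's latest optime.
    """
    return [name for name, applied in secondary_op_times.items()
            if primary_op_time - applied > max_lag]
```

A non-empty result is what would feed generateAlert / sendNotification in the loop above.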