fix(nodemeta): start periodic reload goroutine for cross-replica node sync#542
Conversation
| defer s.mu.Unlock() | ||
| for nodeID, newSnap := range next { | ||
| if existing, ok := s.nodes[nodeID]; ok { | ||
| // Registration fields: always take from DB. RegisterNode writes to |
There was a problem hiding this comment.
The reasoning in this comment is not strictly correct under concurrency. While RegisterNode does write to MySQL before updating the in-memory map, the reload() DB scan may have already passed that row before the write completed. Consider this sequence:
- T0:
reload()scans the DB, capturing old registration values for node X - T1:
RegisterNode(node X)writes new labels/capacity to DB - T2:
RegisterNodeupdates the in-memory snapshot with the fresh values - T3:
applyReloadResultoverwrites the freshly-updated in-memory registration fields with the stale DB snapshot from T0
In practice, registration field changes are rare and the next reload cycle (≤ SyncMetaDataInterval) will self-correct, so this is unlikely to cause problems. But it is a trade-off worth being explicit about in the comment.
There was a problem hiding this comment.
You're right — the current comment overstates the guarantee. The race you described is real: the DB scan can capture a row before a concurrent RegisterNode write commits, so applyReloadResult may overwrite a fresher in-memory value with a stale snapshot from T0.
I've updated the comment to be explicit about this trade-off:
Registration fields: take from DB. A theoretical race exists where the reload() DB scan captures a row before a concurrent RegisterNode write commits, causing applyReloadResult to overwrite a fresher in-memory value with a stale DB snapshot. This is an accepted trade-off: registration field changes are rare, and any inconsistency self-corrects on the next reload cycle (≤ SyncMetaDataInterval).
In practice this only matters on node re-registration (label/capacity/quota changes after initial registration), which is rare and converges within one reload cycle. But the comment should be honest about the window rather than claiming the DB is always authoritative.
|
@fslongjin This PR is the foundation for the next piece of work. After it is merged, I'll submit a follow-up PR to support adding and removing node labels, which will complete the core functionality for node affinity scheduling. If there are no further concerns, I'd appreciate getting this PR merged so I can continue with the next step. Thanks for your time and feedback! |
Okay. Since there have been some feature updates on the master branch, after reviewing the changes in this PR, I think we can unify the architecture (currently there are three independent paths for periodically syncing data from the database, which doesn't seem ideal). I'm working on these changes now, and once I finish and verify them, I'll push them into this PR. |
@fslongjin I'd also be interested in contributing to this effort. I have a solid background in Kubernetes, scheduling, and related components, and I'd be happy to help with the design, implementation, or review if there's an opportunity. Please let me know If there are any tasks that could use an extra pair of hands, I'd be happy to contribute. |
28ae4c2 to
09b4afc
Compare
09b4afc to
12d30f0
Compare
12d30f0 to
e551f9a
Compare
… sync Signed-off-by: devincd <505259926@qq.com>
e551f9a to
c8bea94
Compare
|
Update: I've reverted the "unify three independent sync paths" refactoring that I mentioned earlier from this branch. Here's why: The sync-path unification was going to touch the same code paths that #630 (node isolation) introduced, but #630 has since been reverted from Because #630 is now gone and its replacement (label-based isolation) hasn't been designed yet, merging the sync-path unification ahead of it would mean unifying against a moving target. I'd rather do that unification in a follow-up PR once the label-based isolation approach lands and the full set of data paths that need syncing is stable. For now, this PR stays focused on its original scope: starting the I'll pick up the three-path unification in the next release cycle once we've settled the isolation-on-labels design. 🙏 |
…in applyReloadResult Signed-off-by: devincd <505259926@qq.com>
c8bea94 to
256fa59
Compare
Problem
In a multi-replica cubemaster deployment, node metadata (labels, capacity, instance type, component versions, etc.) registered on one replica was never visible to other replicas.
The root cause:
loopReload— the goroutine responsible for periodically syncing node metadata from MySQL into the in-memory map — was never started inInit(). Each replica only knew about nodes that contacted it directly, so scheduling decisions on other replicas operated on an incomplete view of the cluster.A secondary bug compounded the problem: even if the goroutine had been running,
reload()replaced the entire in-memory map with a DB snapshot (s.nodes = next). This meant any heartbeat that arrived on the current replica during the DB scan would be silently overwritten, potentially causing a healthy node to briefly appear unhealthy.Bug - loopReload goroutine never started
flowchart TD Node["Cubelet Node"] subgraph ReplicaA["CubeMaster Replica A"] RA_Reg["RegisterNode /\nUpdateNodeStatus"] RA_Mem["in-memory global.nodes\n✅ updated"] RA_MySQL_Write["MySQL write\n✅ persisted"] end subgraph ReplicaB["CubeMaster Replica B"] RB_Loop["❌ loopReload\nNEVER STARTED in Init()"] RB_Mem["in-memory global.nodes\n❌ stays empty forever"] RB_Sched["Scheduler\n❌ incomplete cluster view"] end MySQL[("MySQL")] Node -->|"register / heartbeat"| RA_Reg RA_Reg --> RA_MySQL_Write RA_Reg --> RA_Mem RA_MySQL_Write --> MySQL MySQL -.->|"sync never triggered"| RB_Loop RB_Loop -.->|"goroutine not running"| RB_Mem RB_Mem --> RB_Sched style RB_Loop fill:#ffcccc,color:#000 style RB_Mem fill:#ffcccc,color:#000 style RB_Sched fill:#ffcccc,color:#000 style ReplicaA fill:#dceefb,color:#000 style ReplicaB fill:#fff0f0,color:#000 style MySQL fill:#f5a623,color:#000Consequence: Replica B never learns about nodes registered on Replica A. Its scheduler operates on an empty or stale node list, causing missed scheduling opportunities or incorrect "no available node" errors.
Fix
Start the sync goroutine
A dd
go global.loopReload(ctx)inInit(). The goroutine fires everySyncMetaDataInterval(default 30 s) and pulls the latest state fromMySQL intoglobal.nodes.Replace full map swap with a field-level merge
Introduce
applyReloadResult()with the following strategy:RegisterNodeandpersistVersionswrite to MySQL before updating in-memory state, so the DB is the authoritative source for these fields.HeartbeatTime,Conditions,ReportedReady) keep the in-memory value when it is fresher than the DB snapshot. This prevents a slow DB scan from overwriting a heartbeat that arrived on this replica while the scan was in progress.Fix — Replace full map swap with field-level merge
When does the problem occur?
sequenceDiagram participant Node as Cubelet Node participant DB as MySQL participant Mem as Replica Memory Note over DB,Mem: reload() begins — scans DB, captures snapshot at T0 DB->>Mem: next["node-a"].HeartbeatTime = T0 (snapshot) Node->>DB: heartbeat T1 arrives (T1 > T0), written to DB Node->>Mem: HeartbeatTime = T1 written to memory Note over Mem: Memory is now ahead of snapshot: T1 > T0 ✅ Note over Mem: reload() finishes — applies s.nodes = next ❌ Mem->>Mem: HeartbeatTime overwritten: T1 → T0 💥 Note over Mem: applyCurrentHealth sees stale T0 Note over Mem: now - T0 > healthTimeout → Healthy = false ❌ Note over Mem: node excluded from scheduling 🚫The race window opens whenever a heartbeat is written to DB and memory after the reload scan has already read past that row, but before
s.nodes = nextis applied. With a 30 s reload interval and hundreds of nodes heartbeating every 10 s, this window is hit on virtually every reload cycle.How the fix works
flowchart TD Start(["applyReloadResult(next)"]) Loop["for each nodeID in next"] Exists{"nodeID exists\nin s.nodes?"} subgraph ExistingPath["Existing node — field-level merge"] RegFields["Registration fields\nlabels / capacity / instanceType\nquota / hostIP / grpcPort / versions\n──────────────────────────\nAlways take DB value\nDB write happens before memory write\nso DB is always ≥ memory"] HBCheck{"newSnap.HeartbeatTime\n> existing.HeartbeatTime?"} TakeDB["Take DB heartbeat fields\nConditions / Images\nLocalTemplates / ReportedReady"] KeepMem["Keep in-memory heartbeat fields\nin-memory is fresher —\nheartbeat arrived during DB scan"] Health["applyCurrentHealth(existing)"] end subgraph NewPath["New node — cross-replica discovery"] AddNode["s.nodes[nodeID] = newSnap\nnode registered on another replica"] end Start --> Loop --> Exists Exists -->|yes| RegFields RegFields --> HBCheck HBCheck -->|yes| TakeDB --> Health HBCheck -->|no| KeepMem --> Health Exists -->|no| AddNode style ExistingPath fill:#e8f4e8,color:#000 style NewPath fill:#dceefb,color:#000 style HBCheck fill:#fff9e6,color:#000 style KeepMem fill:#d4edda,color:#000 style TakeDB fill:#d4edda,color:#000 style AddNode fill:#cce5ff,color:#000How node data is synchronized across cubemaster replicas when fixed
flowchart TD Node["Cubelet Node"] subgraph ReplicaA["CubeMaster Replica A (receives request)"] RA_Reg["RegisterNode /\nUpdateNodeStatus"] RA_Mem["in-memory\nglobal.nodes"] RA_LC["localcache"] RA_Metric["fanOutResourceMetric"] end subgraph Storage["Shared Storage"] MySQL[("MySQL\nnode_registrations\nnode_status\nnode_component_versions")] Redis[("Redis\nNodeMetric")] end subgraph ReplicaB["CubeMaster Replica B (sync)"] RB_Loop["loopReload\nevery SyncMetaDataInterval"] RB_Merge["applyReloadResult()\n───────────────────────────\nregistration fields → DB wins\nversions → DB wins\nheartbeat fields → fresher wins\nnew node → add directly"] RB_Mem["in-memory\nglobal.nodes"] RB_LC["localcache\nloop / loopUpdateMetric"] end Node -->|"register / heartbeat"| RA_Reg RA_Reg -->|"1 write first"| MySQL RA_Reg -->|"2 update"| RA_Mem RA_Mem -->|"syncLocalcache"| RA_LC RA_Reg -->|"resource metrics only"| RA_Metric RA_Metric -->|"WriteNodeMetric"| Redis RA_Metric -->|"UpdateNodeMetricInProcess"| RA_LC MySQL -->|"every SyncMetaDataInterval"| RB_Loop RB_Loop --> RB_Merge RB_Merge -->|"merge"| RB_Mem Redis -->|"loopUpdateMetric tick"| RB_LC RB_Mem -->|"ListSchedulerNodes\nevery SyncMetaDataInterval"| RB_LC style MySQL fill:#f5a623,color:#000 style Redis fill:#d0021b,color:#fff style RB_Merge fill:#e8f4e8,color:#000 style ReplicaA fill:#dceefb,color:#000 style ReplicaB fill:#fdf3e3,color:#000 style Storage fill:#f9f9f9,color:#000Testing
Five new unit tests added in
service_reload_test.go, all exercisingapplyReloadResult()directly (no DB required):TestApplyReloadResultUpdatesRegistrationFields— labels, instance type, IP, port and quota fields are updated from the DB snapshotTestApplyReloadResultPreservesInMemoryHeartbeatWhenFresher— a fresher in-memory heartbeat is not overwritten by a stale DB valueTestApplyReloadResultTakesDBHeartbeatWhenFresher— a fresher DB heartbeat is correctly applied to the in-memory snapshotTestApplyReloadResultSyncsVersionsForExistingNode— component versions and their content hash are synced from DB for existing nodesTestApplyReloadResultAddsNewNodeFromDB— a node registered on another replica is discovered and added to the local in-memory map