Skip to content

fix(nodemeta): start periodic reload goroutine for cross-replica node sync#542

Merged
fslongjin merged 2 commits into
TencentCloud:masterfrom
devincd:fix/cross-replica-nodemeta-sync
Jun 26, 2026
Merged

fix(nodemeta): start periodic reload goroutine for cross-replica node sync#542
fslongjin merged 2 commits into
TencentCloud:masterfrom
devincd:fix/cross-replica-nodemeta-sync

Conversation

@devincd

@devincd devincd commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Problem

In a multi-replica cubemaster deployment, node metadata (labels, capacity, instance type, component versions, etc.) registered on one replica was never visible to other replicas.

The root cause: loopReload — the goroutine responsible for periodically syncing node metadata from MySQL into the in-memory map — was never started in Init(). Each replica only knew about nodes that contacted it directly, so scheduling decisions on other replicas operated on an incomplete view of the cluster.

A secondary bug compounded the problem: even if the goroutine had been running, reload() replaced the entire in-memory map with a DB snapshot (s.nodes = next). This meant any heartbeat that arrived on the current replica during the DB scan would be silently overwritten, potentially causing a healthy node to briefly appear unhealthy.

Bug - loopReload goroutine never started

flowchart TD
    Node["Cubelet Node"]

    subgraph ReplicaA["CubeMaster Replica A"]
        RA_Reg["RegisterNode /\nUpdateNodeStatus"]
        RA_Mem["in-memory global.nodes\n✅ updated"]
        RA_MySQL_Write["MySQL write\n✅ persisted"]
    end

    subgraph ReplicaB["CubeMaster Replica B"]
        RB_Loop["❌ loopReload\nNEVER STARTED in Init()"]
        RB_Mem["in-memory global.nodes\n❌ stays empty forever"]
        RB_Sched["Scheduler\n❌ incomplete cluster view"]
    end

    MySQL[("MySQL")]

    Node -->|"register / heartbeat"| RA_Reg
    RA_Reg --> RA_MySQL_Write
    RA_Reg --> RA_Mem
    RA_MySQL_Write --> MySQL

    MySQL -.->|"sync never triggered"| RB_Loop
    RB_Loop -.->|"goroutine not running"| RB_Mem
    RB_Mem --> RB_Sched

    style RB_Loop fill:#ffcccc,color:#000
    style RB_Mem fill:#ffcccc,color:#000
    style RB_Sched fill:#ffcccc,color:#000
    style ReplicaA fill:#dceefb,color:#000
    style ReplicaB fill:#fff0f0,color:#000
    style MySQL fill:#f5a623,color:#000
Loading

Consequence: Replica B never learns about nodes registered on Replica A. Its scheduler operates on an empty or stale node list, causing missed scheduling opportunities or incorrect "no available node" errors.

Fix

Start the sync goroutine
A dd go global.loopReload(ctx) in Init(). The goroutine fires everySyncMetaDataInterval (default 30 s) and pulls the latest state fromMySQL into global.nodes.

Replace full map swap with a field-level merge
Introduce applyReloadResult() with the following strategy:

  • Registration fields (labels, capacity, instance type, quota, versions, etc.) always take the DB value. RegisterNode and persistVersions write to MySQL before updating in-memory state, so the DB is the authoritative source for these fields.
  • Heartbeat / status fields (HeartbeatTime, Conditions, ReportedReady) keep the in-memory value when it is fresher than the DB snapshot. This prevents a slow DB scan from overwriting a heartbeat that arrived on this replica while the scan was in progress.
  • New nodes (present in DB but absent in the in-memory map) are added directly — this is the primary cross-replica sync path.

Fix — Replace full map swap with field-level merge

When does the problem occur?

sequenceDiagram
    participant Node as Cubelet Node
    participant DB as MySQL
    participant Mem as Replica Memory

    Note over DB,Mem: reload() begins — scans DB, captures snapshot at T0

    DB->>Mem: next["node-a"].HeartbeatTime = T0  (snapshot)

    Node->>DB: heartbeat T1 arrives (T1 > T0), written to DB
    Node->>Mem: HeartbeatTime = T1 written to memory

    Note over Mem: Memory is now ahead of snapshot: T1 > T0 ✅

    Note over Mem: reload() finishes — applies s.nodes = next ❌
    Mem->>Mem: HeartbeatTime overwritten: T1 → T0 💥

    Note over Mem: applyCurrentHealth sees stale T0
    Note over Mem: now - T0 > healthTimeout → Healthy = false ❌
    Note over Mem: node excluded from scheduling 🚫
Loading

The race window opens whenever a heartbeat is written to DB and memory after the reload scan has already read past that row, but before s.nodes = next is applied. With a 30 s reload interval and hundreds of nodes heartbeating every 10 s, this window is hit on virtually every reload cycle.

How the fix works

flowchart TD
    Start(["applyReloadResult(next)"])
    Loop["for each nodeID in next"]
    Exists{"nodeID exists\nin s.nodes?"}

    subgraph ExistingPath["Existing node — field-level merge"]
        RegFields["Registration fields\nlabels / capacity / instanceType\nquota / hostIP / grpcPort / versions\n──────────────────────────\nAlways take DB value\nDB write happens before memory write\nso DB is always ≥ memory"]
        HBCheck{"newSnap.HeartbeatTime\n> existing.HeartbeatTime?"}
        TakeDB["Take DB heartbeat fields\nConditions / Images\nLocalTemplates / ReportedReady"]
        KeepMem["Keep in-memory heartbeat fields\nin-memory is fresher —\nheartbeat arrived during DB scan"]
        Health["applyCurrentHealth(existing)"]
    end

    subgraph NewPath["New node — cross-replica discovery"]
        AddNode["s.nodes[nodeID] = newSnap\nnode registered on another replica"]
    end

    Start --> Loop --> Exists
    Exists -->|yes| RegFields
    RegFields --> HBCheck
    HBCheck -->|yes| TakeDB --> Health
    HBCheck -->|no| KeepMem --> Health
    Exists -->|no| AddNode

    style ExistingPath fill:#e8f4e8,color:#000
    style NewPath fill:#dceefb,color:#000
    style HBCheck fill:#fff9e6,color:#000
    style KeepMem fill:#d4edda,color:#000
    style TakeDB fill:#d4edda,color:#000
    style AddNode fill:#cce5ff,color:#000
Loading

How node data is synchronized across cubemaster replicas when fixed

flowchart TD
    Node["Cubelet Node"]

    subgraph ReplicaA["CubeMaster Replica A (receives request)"]
        RA_Reg["RegisterNode /\nUpdateNodeStatus"]
        RA_Mem["in-memory\nglobal.nodes"]
        RA_LC["localcache"]
        RA_Metric["fanOutResourceMetric"]
    end

    subgraph Storage["Shared Storage"]
        MySQL[("MySQL\nnode_registrations\nnode_status\nnode_component_versions")]
        Redis[("Redis\nNodeMetric")]
    end

    subgraph ReplicaB["CubeMaster Replica B (sync)"]
        RB_Loop["loopReload\nevery SyncMetaDataInterval"]
        RB_Merge["applyReloadResult()\n───────────────────────────\nregistration fields → DB wins\nversions          → DB wins\nheartbeat fields  → fresher wins\nnew node          → add directly"]
        RB_Mem["in-memory\nglobal.nodes"]
        RB_LC["localcache\nloop / loopUpdateMetric"]
    end

    Node -->|"register / heartbeat"| RA_Reg
    RA_Reg -->|"1 write first"| MySQL
    RA_Reg -->|"2 update"| RA_Mem
    RA_Mem -->|"syncLocalcache"| RA_LC

    RA_Reg -->|"resource metrics only"| RA_Metric
    RA_Metric -->|"WriteNodeMetric"| Redis
    RA_Metric -->|"UpdateNodeMetricInProcess"| RA_LC

    MySQL -->|"every SyncMetaDataInterval"| RB_Loop
    RB_Loop --> RB_Merge
    RB_Merge -->|"merge"| RB_Mem

    Redis -->|"loopUpdateMetric tick"| RB_LC
    RB_Mem -->|"ListSchedulerNodes\nevery SyncMetaDataInterval"| RB_LC

    style MySQL fill:#f5a623,color:#000
    style Redis fill:#d0021b,color:#fff
    style RB_Merge fill:#e8f4e8,color:#000
    style ReplicaA fill:#dceefb,color:#000
    style ReplicaB fill:#fdf3e3,color:#000
    style Storage fill:#f9f9f9,color:#000
Loading
Data type Cross-replica sync path Latency
Registration / Versions MySQL → loopReload → applyReloadResult ≤ SyncMetaDataInterval
Heartbeat / Status MySQL → loopReload → applyReloadResult (fresher wins) ≤ SyncMetaDataInterval
Resource Metrics Redis → loopUpdateMetric near real-time (50 ms tick)

Testing

Five new unit tests added in service_reload_test.go, all exercising applyReloadResult() directly (no DB required):

  • TestApplyReloadResultUpdatesRegistrationFields — labels, instance type, IP, port and quota fields are updated from the DB snapshot
  • TestApplyReloadResultPreservesInMemoryHeartbeatWhenFresher — a fresher in-memory heartbeat is not overwritten by a stale DB value
  • TestApplyReloadResultTakesDBHeartbeatWhenFresher — a fresher DB heartbeat is correctly applied to the in-memory snapshot
  • TestApplyReloadResultSyncsVersionsForExistingNode — component versions and their content hash are synced from DB for existing nodes
  • TestApplyReloadResultAddsNewNodeFromDB — a node registered on another replica is discovered and added to the local in-memory map

Comment thread CubeMaster/pkg/nodemeta/service.go Outdated
defer s.mu.Unlock()
for nodeID, newSnap := range next {
if existing, ok := s.nodes[nodeID]; ok {
// Registration fields: always take from DB. RegisterNode writes to

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reasoning in this comment is not strictly correct under concurrency. While RegisterNode does write to MySQL before updating the in-memory map, the reload() DB scan may have already passed that row before the write completed. Consider this sequence:

  • T0: reload() scans the DB, capturing old registration values for node X
  • T1: RegisterNode(node X) writes new labels/capacity to DB
  • T2: RegisterNode updates the in-memory snapshot with the fresh values
  • T3: applyReloadResult overwrites the freshly-updated in-memory registration fields with the stale DB snapshot from T0

In practice, registration field changes are rare and the next reload cycle (≤ SyncMetaDataInterval) will self-correct, so this is unlikely to cause problems. But it is a trade-off worth being explicit about in the comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — the current comment overstates the guarantee. The race you described is real: the DB scan can capture a row before a concurrent RegisterNode write commits, so applyReloadResult may overwrite a fresher in-memory value with a stale snapshot from T0.

I've updated the comment to be explicit about this trade-off:

Registration fields: take from DB. A theoretical race exists where the reload() DB scan captures a row before a concurrent RegisterNode write commits, causing applyReloadResult to overwrite a fresher in-memory value with a stale DB snapshot. This is an accepted trade-off: registration field changes are rare, and any inconsistency self-corrects on the next reload cycle (≤ SyncMetaDataInterval).

In practice this only matters on node re-registration (label/capacity/quota changes after initial registration), which is rare and converges within one reload cycle. But the comment should be honest about the window rather than claiming the DB is always authoritative.

@devincd devincd requested a review from fslongjin June 17, 2026 03:33
@devincd

devincd commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

@fslongjin
Could you please help review this PR again when you have time? I believe all previous comments have been addressed.

This PR is the foundation for the next piece of work. After it is merged, I'll submit a follow-up PR to support adding and removing node labels, which will complete the core functionality for node affinity scheduling.

If there are no further concerns, I'd appreciate getting this PR merged so I can continue with the next step. Thanks for your time and feedback!

@fslongjin

Copy link
Copy Markdown
Member

@fslongjin Could you please help review this PR again when you have time? I believe all previous comments have been addressed.

This PR is the foundation for the next piece of work. After it is merged, I'll submit a follow-up PR to support adding and removing node labels, which will complete the core functionality for node affinity scheduling.

If there are no further concerns, I'd appreciate getting this PR merged so I can continue with the next step. Thanks for your time and feedback!

Okay. Since there have been some feature updates on the master branch, after reviewing the changes in this PR, I think we can unify the architecture (currently there are three independent paths for periodically syncing data from the database, which doesn't seem ideal). I'm working on these changes now, and once I finish and verify them, I'll push them into this PR.

@devincd

devincd commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

@fslongjin Could you please help review this PR again when you have time? I believe all previous comments have been addressed.
This PR is the foundation for the next piece of work. After it is merged, I'll submit a follow-up PR to support adding and removing node labels, which will complete the core functionality for node affinity scheduling.
If there are no further concerns, I'd appreciate getting this PR merged so I can continue with the next step. Thanks for your time and feedback!

Okay. Since there have been some feature updates on the master branch, after reviewing the changes in this PR, I think we can unify the architecture (currently there are three independent paths for periodically syncing data from the database, which doesn't seem ideal). I'm working on these changes now, and once I finish and verify them, I'll push them into this PR.

@fslongjin
Thanks for the update! I think unifying the architecture makes a lot of sense, especially if it can eliminate the multiple independent database sync paths.

I'd also be interested in contributing to this effort. I have a solid background in Kubernetes, scheduling, and related components, and I'd be happy to help with the design, implementation, or review if there's an opportunity. Please let me know If there are any tasks that could use an extra pair of hands, I'd be happy to contribute.

@fslongjin fslongjin force-pushed the fix/cross-replica-nodemeta-sync branch from 28ae4c2 to 09b4afc Compare June 24, 2026 17:48
Comment thread CubeMaster/pkg/nodemeta/service.go
Comment thread CubeMaster/pkg/nodemeta/service.go Outdated
Comment thread CubeMaster/pkg/nodemeta/service_reload_test.go
@fslongjin fslongjin force-pushed the fix/cross-replica-nodemeta-sync branch from 09b4afc to 12d30f0 Compare June 25, 2026 03:00
Comment thread CubeMaster/pkg/nodemeta/service.go Outdated
Comment thread CubeMaster/pkg/nodemeta/service.go Outdated
Comment thread CubeMaster/pkg/nodemeta/service.go Outdated
Comment thread CubeMaster/pkg/nodemeta/service.go Outdated
Comment thread CubeMaster/pkg/nodemeta/service.go Outdated
Comment thread CubeMaster/pkg/nodemeta/service.go Outdated
@fslongjin fslongjin force-pushed the fix/cross-replica-nodemeta-sync branch from 12d30f0 to e551f9a Compare June 25, 2026 03:22
… sync

Signed-off-by: devincd <505259926@qq.com>
@fslongjin fslongjin force-pushed the fix/cross-replica-nodemeta-sync branch from e551f9a to c8bea94 Compare June 26, 2026 03:54
@fslongjin

Copy link
Copy Markdown
Member

Update: I've reverted the "unify three independent sync paths" refactoring that I mentioned earlier from this branch. Here's why:

The sync-path unification was going to touch the same code paths that #630 (node isolation) introduced, but #630 has since been reverted from master (#655) due to implementation concerns. Specifically, I believe the node isolation feature should be built on top of the node label management API introduced in #633 rather than as a separate column + filter mechanism — labels are a more composable and Kubernetes-aligned primitive for expressing isolation intent.

Because #630 is now gone and its replacement (label-based isolation) hasn't been designed yet, merging the sync-path unification ahead of it would mean unifying against a moving target. I'd rather do that unification in a follow-up PR once the label-based isolation approach lands and the full set of data paths that need syncing is stable.

For now, this PR stays focused on its original scope: starting the loopReload goroutine (which was never called before — the critical bug fix) and replacing the full-map-swap with the field-level applyReloadResult merge. That part is solid and should go in as-is.

I'll pick up the three-path unification in the next release cycle once we've settled the isolation-on-labels design. 🙏

…in applyReloadResult

Signed-off-by: devincd <505259926@qq.com>
@fslongjin fslongjin force-pushed the fix/cross-replica-nodemeta-sync branch from c8bea94 to 256fa59 Compare June 26, 2026 04:02
@fslongjin fslongjin merged commit c6cb0f8 into TencentCloud:master Jun 26, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants