fix(rest,store,satellite): converge concurrent late vd-adds (BUG-048) by kvaps · Pull Request #164 · cozystack/blockstor

Andrei Kvapil (kvaps) · 2026-06-14T10:35:36Z

Summary

Adding two volumes back-to-back to an existing RD (two concurrent linstor volume-definition create <rd> <size> against an RD that already has placed diskful replicas) wedged the second volume permanently: on every replica it sat Inconsistent with replication:Established, resync-suspended:no, and no peer as SyncSource — the resync never bootstrapped and the volume was never usable. A single late vd-add, or sequential adds (wait for vol N UpToDate before adding N+1), converged fine. Only the concurrent / back-to-back case broke.

Root-causing on the stand revealed two independent races, both fixed here:

1. Control-plane lost-update (REST/store) — the second volume was silently dropped

POST /v1/resource-definitions/{rd}/volume-definitions (number-less) auto-assigned the smallest-free VolumeNumber in a REST-side List taken before the store Create. Two concurrent adds both read [vol-0], both chose VlmNr=1; one won, the other was rejected FAIL_EXISTS_VLM_DFN. The operator's second intended volume vanished — and when the CLI is backgrounded the failure surfaces no usable error at all (both vd c appear to succeed; the RD is left one volume short).

Fixes:

Move smallest-free allocation into the store as VolumeDefinitionStore.CreateAutoNumbered, re-deriving the hole inside the conflict-retry loop so the loser re-reads [vol-0, vol-1] and lands at vol-2.
Read the RD live (uncached) via mgr.GetAPIReader() on each retry — the informer-cached client trails a just-committed write, so retries against the cache re-derived the same number and the retry budget exhausted.
Post-commit verification: re-read live and confirm the assigned number actually persisted; a stale-read clobber forces a retry. The create now either persists the volume or retries — it can never return success having silently lost it.

2. Data-plane seed/promote race (satellite/DRBD) — the second volume wedged Inconsistent

The per-volume create-md + winner-election GI seed (ensurePerVolumeMetadata) was gated by parsing the on-disk .res, which the FSM dispatch's renderResFile preamble pre-populates with the new volume's block — so on a concurrent add the gate flipped false, the seed was skipped, and the new volume came up with no metadata and no elected SyncSource. Gate now keys off the race-free MetadataCreated signal; the idempotent per-volume HasMD probe inside is the sole authority for what needs create-md.
On a ≥3-diskful RD the per-volume winner seed can still lose a race with its own bring-up, leaving the volume Inconsistent on every replica. Added a late-add-promote self-heal: the deterministic lowest-node-id replica force-primaries to become the SyncSource — but only when the RD is past day0 (a local UpToDate sibling volume exists) and every peer is fully connected (the lowest-id election needs complete peer info). These two gates make it impossible to misfire on a normal day0 bootstrap or to elect multiple promoters on a partial peer set — the split-brain a first cut of this self-heal caused, caught in stand validation and fixed before merge.

Data safety

The fixes do not change what gets seeded — the per-volume seed vetoes (PeerHasData / SkipInitialSync / AnyConnectedPeerHasDataForVolume) are untouched, so no replica is ever declared UpToDate-over-empty and no wrong-direction sync is possible. The day0 skip-initial-sync contract (BUG-028/#121) and single/sequential late adds (Bug 384) are preserved. The late-add-promote relies on every replica sharing the volume's deterministic day0 current-UUID, so primary --force mints no unrelated UUID; the peer-has-data veto keeps it from ever force-priming over a real data-bearing peer (Bug 342 guard).

Testing

L1: REST lost-update (in-memory + k8s envtest concurrency, distinct numbers under genuine 409s); satellite seed-gate runs despite a pre-rendered .res; late-add-promote probe (8 cases incl. day0-no-fire and partial-connectivity-no-fire) + reconciler wiring. go test ./... and go test -race ./internal/controller/ ./pkg/satellite/... green; golangci-lint 0 issues.
L6: tests/e2e/cli-matrix/multi-volume-late-vd-create.sh runs the two late vd c concurrently, asserts 3 VolumeDefinitions (no drop) + all-9 UpToDate + kernel-truth.
L7: new tests/operator-harness/replay/vd-late-concurrent-no-drop.yaml (+ a vd_count await kind).
Stand (big), final fix: day0 3-replica autoplace PASS ×2 (no split-brain), day0 skip-init PASS, single late-add PASS, concurrent late-add PASS ×3. Satellite logs confirm the self-heal fires on a single node only, with zero StandAlone / unrelated-data events.

Note

Concurrent late vd-add is fundamentally racy upstream too, but the previous behaviour — a permanently-unusable volume with no error surfaced — was unacceptable. After this change each concurrent add either lands and converges, or fails loudly; it never silently wedges.

Summary by CodeRabbit

Release Notes

Bug Fixes
- Prevented lost/duplicate volume-number allocations by moving auto-numbering into the store with live-state verification and adding retry/verification safeguards.
- Fixed concurrent volume-definition spec updates to avoid clobbering volume definitions added during races.
- Improved Bug-048 late-add recovery with safer promote gating and automated invalidation for stalled late-add resyncs.
Tests
- Added concurrency and Bug-048 regression coverage, plus e2e/harness assertions to catch lost updates.
Documentation
- Updated harness documentation with the new vd_count await assertion and clarified the await.kind contract.

Two back-to-back number-less `linstor vd c <rd>` calls against an RD that already carries vol-0 used to drop the second add: each request picked the smallest free VolumeNumber in a REST-side List BEFORE the store Create, so both read [vol-0], both chose VlmNr=1, the loser was rejected FAIL_EXISTS_VLM_DFN, and the operator's second intended volume silently vanished (only vol-0 + vol-1 landed). When the loser backgrounds its CLI the failure surfaces no usable error at all. Move the smallest-free allocation INTO the store layer as VolumeDefinitionStore.CreateAutoNumbered, which re-derives the hole on every conflict-retry attempt (k8s: allocate inside RetryOnConflict under the apiserver resourceVersion guard; inmemory: under the write lock). The read of the existing set and the append are now atomic with respect to a racing add, so the loser re-reads [vol-0, vol-1] and lands at vol-2. The explicit-number path keeps the plain Create (its retry already converges and a genuine duplicate must still 409). The DRBD per-volume day0/winner-election seed path (BUG-028/384 seedFreshVolumes/isLateAddWinner) is untouched and already converges concurrent late adds correctly once both volumes exist with distinct numbers — verified on stand by forcing distinct --vlmnr. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

L6 cli-matrix multi-volume-late-vd-create.sh: take PR #162's corrected Status-based detection (status_disk_state instead of the null .vlms projection) and make the two late `vd c` calls run CONCURRENTLY (background + wait) so they exercise the auto-assign lost-update window. Add the BUG-048 wire-level assertion: the RD must carry exactly 3 VolumeDefinitions after the two concurrent adds — a dropped add leaves only 2 and surfaces a spurious "volume definition 1 already exists". L7 replay vd-late-create-concurrent-no-lost-update.yaml: codify the post-fix invariant (vol-0 up, two back-to-back late adds → vd_count==3 + all_uptodate). Adds a `vd_count` await kind to the harness lib that asserts the exact VolumeDefinition count on an RD. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

… not .res (BUG-048) The per-volume create-md + winner-election GI seed for a late-added volume (ensurePerVolumeMetadata) was gated by hasLateAddedVolume(resPath, dr) — comparing the desired volume set against the `volume N {` blocks in the on-disk .res. That .res is re-rendered by the FSM dispatch's renderResFile preamble on every adjust/up, so on a concurrent two-VD add an earlier reconcile's adjust pre-populated the new volume's block BEFORE this gate read the file. The gate then saw "no late volume", SKIPPED the metadata+seed pass entirely, and the new volume came up via the kernel adjust with no metadata and no elected SyncSource — Inconsistent on every replica, permanently wedged (the operator-reported BUG-048 symptom: replication:Established, resync-suspended:no, no SyncSource). It hit ~half of concurrent late two-VD adds; single and sequential adds, where the create-md pass and the render do not interleave, were unaffected. Gate on the race-free dr.GetMetadataCreated() (this RD is past first activation) instead. ensurePerVolumeMetadata is already per-volume idempotent — it probes `drbdadm dump-md <rd>/<vol>` and only create-md's + seeds the volumes that genuinely lack metadata (len(freshlyCreated)==0 → early return), so running it on every post-activation diskful reconcile is a no-op on converged state (cheap per-volume dump-md probe) and never re-stamps an existing volume's GI/bitmap. The winner election and every seed veto (PeerHasData / SkipInitialSync / AnyConnectedPeerHasDataForVolume) are unchanged, so day0 skip-init (BUG-028) and single/sequential late adds (Bug 384) are preserved. The now-unused .res-parsing hasLateAddedVolume helper is removed. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…BUG-048) The store-level atomic VolumeNumber allocation still dropped the second of two concurrent `vd c` in production: fetchRD reads through the manager's informer-CACHED client, which trails a just-committed write by the watch round-trip. So when the winner's Update lands, the loser's conflict-retry re-reads the STALE cached RD, re-derives the same smallest-free number, its Update 409s again, and retry.DefaultRetry (~5 short steps) exhausts before the cache catches up — the second volume is silently lost (observed on stand: vd_count=2 after two concurrent number-less `vd c`). CreateAutoNumbered now reads the RD through a direct, UNCACHED API reader (mgr.GetAPIReader(), plumbed via store NewWithAPIReader) on every retry attempt, so each retry sees the winner's committed VolumeDefinition and advances to the next free number. The retry budget is widened to 12 steps so a burst of concurrent creates still converges. nil apiReader (in-memory / unit stores with no informer) falls back to the cached client unchanged. Adds a k8s-store concurrency conformance test (N concurrent CreateAutoNumbered against a real envtest apiserver → distinct numbers, none dropped). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…UG-048) Two 1G volumes added concurrently on a 3-diskful RD run up to 6 fresh initial syncs at once on the loop substrate, so convergence routinely exceeds the single-volume 240s budget (observed vol-2 still Outdated — converging, not the Inconsistent wedge — at 240s, UpToDate moments later). 360s removes the flake while staying far inside the wedge discriminator. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…(BUG-048) The race-free metadata gate eliminated the 2-replica concurrent late-add wedge, but on a ≥3-diskful RD the per-volume winner seed can still race its own bring-up: the late volume occasionally comes up Inconsistent on EVERY diskful replica with no SyncSource and no peer holding data — the operator-reported wedge (replication:Established, resync-suspended:no, no SyncSource). Neither existing self-heal recovers it: maybeRecovery Promote is gated on the dispatcher's auto-primary (suppressed on the initialized RD every late-add lands on) and on the local being already UpToDate (it is Inconsistent here); maybeSoloPromote requires zero peers. Add a dedicated kernel-truth self-heal. Adm.NeedsLateAddPromote fires ONLY when a local diskful volume is Inconsistent, NO connected peer exposes committed data for it (no real SyncSource to wait for), none is actively resyncing, no replica is Primary, and this node holds the lowest node-id among the volume's diskful replicas — so exactly one deterministic node force-primaries. maybeLateAddPromote runs it on the steady-state reconcile WITHOUT the auto-primary gate (that is precisely what is missing on the late-add path), throttled by the shared recoveryPromoteThrottle. Data-safety: a fresh late-added volume is seeded at the deterministic day0 current-UUID on every replica before bring-up, so primary --force mints no unrelated UUID; the Inconsistent peers SyncTarget from this now-Primary source. The peer-has-data veto preserves the Bug 342 unrelated-data guard (a relocate target never force-primaries over a data-bearing peer), and the active-resync + peer-Primary vetoes prevent churn. Self-limiting: the predicate stops holding once the peers converge. Unit-pinned in needs_late_add_promote_test.go (fires on the wedge; vetoes on peer-data / active-resync / peer-Primary / lower-id / all-UpToDate). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

Reconciler-level pins that maybeLateAddPromote force-primaries the lowest-id replica on the late-add wedge WITHOUT an auto-primary flag, skips diskless replicas, and vetoes when a peer holds data. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

Defence-in-depth against the silent lost-update. The optimistic-lock Update normally guarantees no racing writer clobbered the allocation — a stale resourceVersion 409s and the loop retries. But under heavy apiserver/etcd load a follower read can return a resourceVersion that is already superseded yet still accepted on Update, so two concurrent auto-numbered creates can both return success while only one VolumeDefinition lands (the operator-visible silent drop: both `linstor vd c` exit 0, the RD has one fewer volume than asked). After the Update commits, re-read live and confirm the assigned VolumeNumber is actually present; if it vanished, surface a synthetic Conflict so the whole allocate+append re-runs against the corrected state. The create now either persists the volume or retries — it can never return success having silently lost it. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…limit The replay-runner prefixes 'replay-' and suffixes a random token; the 40-char workflow name pushed the generated RD name past DRBD's resource-name length cap, so 'resource-definition create' failed with exit 10 before the scenario even started. Rename to vd-late-concurrent-no-drop so the generated name fits. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

Stand validation caught a serious regression in the late-add-promote self-heal: NeedsLateAddPromote could not distinguish a normal day0 bootstrap (every volume transiently Inconsistent, no SyncSource yet — the normal fresh state) from the BUG-048 late-add wedge, so it MISFIRED on a fresh RD's vol-0. Worse, its lowest-node-id election counted only CONNECTED peers; at day0 t+1s peers are still StandAlone/Connecting, so each node saw a partial peer set, several concluded "I am lowest", and two non-lowest nodes force-primaried ~37ms apart → split-brain on the baseline volume (observed: worker-1 + worker-3 both promoted, worker-2 the true lowest did not). Add two split-brain-safety gates: 1. Require a local UpToDate sibling volume — proves the RD is PAST first activation. A pure day0 bootstrap has NO UpToDate volume, so the predicate can never fire there (the normal winner-election + auto-promote own day0). 2. Require EVERY peer connection to be fully Connected — the lowest-node-id election is only sound with COMPLETE peer information. Any StandAlone/Connecting peer defers the self-heal, so every node computes the election over the same full set and EXACTLY ONE (the true lowest id) ever promotes. Unit tests previously only modelled fully-Connected peers (so they passed while live partial-connectivity behaviour broke); add explicit pins for the day0-all-Inconsistent and partial-connectivity cases, and give the veto fixtures a UpToDate sibling so they exercise the intended veto rather than gate 1. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

When the concurrent seed races and a late volume comes up Inconsistent on every replica, the late-add-promote self-heal recovers it, but the detect + recoveryPromoteThrottle + force-primary + resync chain runs past the bare 240s convergence budget on a contended stand. 360s keeps the recovery path from flaking while staying well inside the wedge discriminator. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

coderabbitai · 2026-06-14T10:36:02Z

Warning

Review limit reached

@kvaps, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 59 minutes and 19 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f1300c55-1739-4ce8-b708-ec003354585a

📥 Commits

Reviewing files that changed from the base of the PR and between 477fc3e and e743c3a.

📒 Files selected for processing (2)

pkg/store/k8s/volume_definitions.go
tests/integration/harness/manager.go

📝 Walkthrough

Walkthrough

Fixes BUG-048 (concurrent VolumeNumber auto-assignment lost-updates) by moving allocation from REST into an atomic store-layer CreateAutoNumbered operation with K8s uncached-reader retries and post-commit verification. Fixes BUG-332 (late-added DRBD volumes wedged Inconsistent across replicas) by adding NeedsLateAddPromote and LateAddResyncKickVolumes kernel-truth gates and wiring corresponding reconciler self-heals (maybeLateAddPromote, maybeKickLateAddResync), replacing a stale .res-parsing heuristic with metadata-condition gating. Includes comprehensive regression tests across store, REST, DRBD, satellite, and e2e layers.

Changes

BUG-048: Concurrent VolumeNumber Auto-Numbering

Layer / File(s)	Summary
Store interface and K8s uncached-reader implementation `pkg/store/store.go`, `pkg/store/k8s/volume_definitions.go`, `pkg/store/k8s/k8s.go`	`VolumeDefinitionStore` gains `CreateAutoNumbered` contract; K8s `volumeDefinitions` extends struct with optional `apiReader`; implements `CreateAutoNumbered` with smallest-free allocation, conflict/not-found retry loop, `smallestFreeVolumeNumber`, `fetchRDLive`, and `verifyVolumeLanded` post-commit verification; schema import added.
Store constructors and apiserver wiring `pkg/store/k8s/k8s.go`, `cmd/apiserver/main.go`	`New` delegates to `NewWithAPIReader`; `NewWithAPIReader` wires provided `apiReader` into `volumeDefinitions`; apiserver `main` constructs store with `mgr.GetAPIReader()` to enable uncached live reads during conflict retries.
In-memory store and REST refactor `pkg/store/inmemory_volume_definition.go`, `pkg/rest/volume_definitions.go`	In-memory store implements atomic smallest-free allocation under write-lock; REST `handleVDCreate` detects omitted `volume_number`, validates explicit numbers, routes all persistence through `persistNewVolumeDefinition` helper with centralized conflict handling and `CreateAutoNumbered`/`Create` dispatch; legacy REST-side `autoAssignVolumeNumber` removed.
Controller merge-patch conflict protection `internal/controller/resource_controller.go`, `internal/controller/bug_048_vd_drop_minor_patch_test.go`	`patchRDVolumeMinors` documents Bug-048 and conditionally embeds `metadata.resourceVersion` in merge-patch body; regression tests assert stale-snapshot conflict rejection and fresh-snapshot minor stamping without clobbering appended volumes.
Store, REST, K8s, and harness regression tests `pkg/store/storetest/storetest.go`, `pkg/store/k8s/k8s_test.go`, `pkg/rest/volume_definitions_test.go`, `tests/operator-harness/lib.sh`, `tests/operator-harness/replay-runner.sh`, `CLAUDE.md`, `tests/operator-harness/replay/vd-late-concurrent-no-drop.yaml`	Storetest adds sequential and hole-reuse `CreateAutoNumbered` contract checks; K8s envtest concurrently creates and validates distinct allocations; REST test asserts two simultaneous omitted-number POSTs yield `{0,1,2}`; harness adds `vd_count` assertion (counts VDs via linstor CLI), documents it in replay-runner and CLAUDE.md; vd-late-concurrent-no-drop.yaml replay scenario asserts `vd_count==3` after concurrent late creates with all-uptodate convergence.

BUG-332 + BUG-048: Late-Add Promote and Resync-Kick Self-Heals

Layer / File(s)	Summary
DRBD late-add helpers `pkg/drbd/drbdadm.go`, `pkg/drbd/drbdsetup_show.go`	`Adm.InvalidateVolume` shells out `drbdadm invalidate` for single volume; `Adm.MintLateAddSource` probes drbdsetup JSON, collects local `Inconsistent` minors, disconnects peers, clears minors via `drbdsetup new-current-uuid --clear-bitmap`, and reconnects via `drbdadm adjust`; `drbdsetupStatusDevice` schema adds `Minor` field to capture kernel per-device minor.
DRBD NeedsLateAddPromote and LateAddResyncKickVolumes gates `pkg/drbd/drbdadm.go`	`Adm.NeedsLateAddPromote` probes drbdsetup JSON enforcing no existing Primary, day0 exit (local UpToDate exists), full peer connectivity, at least one local Inconsistent volume, and per-volume vetoes (no committed peer data, no active resync, this node is lowest diskful node-id); `Adm.LateAddResyncKickVolumes` returns locally Inconsistent volumes eligible for invalidate gated on day0, full peer connectivity, UpToDate peer sources, and absence of active resync (via `peerDeviceResyncInProgress` checking replication state and `resync-suspended`).
DRBD gate unit tests `pkg/drbd/needs_late_add_promote_test.go`, `pkg/drbd/needs_late_add_resync_kick_test.go`	`NeedsLateAddPromote` covers 12+ cases including positive genuine-wedge and negatives for not-lowest-node-id, peer-data SyncTarget veto, active-resync SyncSource veto, Primary veto, day0 safety-gate, partial-connectivity, all-UpToDate no-op, lower-id peer bring-up, diskless witness. `LateAddResyncKickVolumes` covers positive partial-stall and post-promote-dependency, and negatives for no-UpToDate-peer, active-SyncTarget, multi-peer in-progress, outdated-local, UpToDate-local, day0, Primary veto, peer-not-connected, converged no-op, probe-failure.
Reconciler metadata gate fix and dwell tracking `pkg/satellite/reconciler.go`	Removes stale `hasLateAddedVolume` .res-parsing; re-gates `ensurePerVolumeMetadata` on `dr.GetMetadataCreated()`; adds `firstLateAddStallAt` map for in-memory stall-dwell tracking and `lateAddStallDwell` constant (20s); `NewReconciler` initializes `firstLateAddStallAt`.
Reconciler self-heal pipeline and late-add methods `pkg/satellite/reconciler.go`	`maybePromoteSelfHeals` invokes `maybeLateAddPromote` (gates on diskless, Adm, `NeedsLateAddPromote`, executes `MintLateAddSource` and secondary) and `maybeKickLateAddResync` (gates on `NeedsLateAddResyncKick`, enforces dwell before invalidate, throttles per-volume kick, clears dwell when stall clears) after `maybeRecoveryPromote` but before `maybeSoloPromote`.
Reconciler late-add self-heal tests `pkg/satellite/maybe_late_add_promote_test.go`, `pkg/satellite/maybe_kick_late_add_resync_test.go`, `pkg/satellite/reconciler_drbd_test.go`	`maybe_late_add_promote_test.go` covers wedged-late-volume fire (disconnect/clear-bitmap/adjust), diskless skip, peer-data skip; `maybe_kick_late_add_resync_test.go` covers stalled-volume dwell-gated fire, dwell reset on stall-clear, diskless skip, active-SyncTarget skip, throttle enforcement; `reconciler_drbd_test.go` adds race test asserting per-volume create-md still runs despite pre-rendered `.res` volume block when metadata condition triggers.
CLI matrix e2e scenario `tests/e2e/cli-matrix/multi-volume-late-vd-create.sh`	Updates comments documenting Bug-332 and Bug-048, resolves 3 diskful nodes, waits for vol-0 UpToDate via per-node `status_disk_state` polling, executes concurrent vol-1 and vol-2 late creates, asserts `vd_count==3` to detect dropped allocations, extends convergence wait to 360s, replaces `kubectl debug` + `drbdadm status` with `on_node drbdsetup status` kernel-truth check that fails for Diskless volumes 1 or 2.

Sequence Diagram(s)

sequenceDiagram
  rect rgba(173, 216, 230, 0.5)
    Note over RESTHandler,K8sAPI: BUG-048: Concurrent auto-numbering via store
    participant RESTHandler as REST handleVDCreate
    participant persistVD as persistNewVolumeDefinition
    participant K8sStore as volumeDefinitions.CreateAutoNumbered
    participant K8sAPI as K8s API (uncached reader)
  end

  RESTHandler->>persistVD: autoNumber=true, vd
  persistVD->>K8sStore: CreateAutoNumbered(ctx, rdName, vd)
  loop conflict/not-found retry
    K8sStore->>K8sAPI: GET ResourceDefinition (fetchRDLive, uncached)
    K8sAPI-->>K8sStore: latest RD + inline VDs
    K8sStore->>K8sStore: smallestFreeVolumeNumber(existing)
    K8sStore->>K8sAPI: Update RD with appended VD
    alt conflict
      K8sAPI-->>K8sStore: 409 Conflict → retry
    else success
      K8sAPI-->>K8sStore: 200 OK
      K8sStore->>K8sAPI: GET RD live (verifyVolumeLanded)
      alt assigned VD present
        K8sAPI-->>K8sStore: confirmed
      else lost-update detected
        K8sAPI-->>K8sStore: synthetic conflict → retry
      end
    end
  end
  K8sStore-->>persistVD: allocatedVolumeNumber
  persistVD-->>RESTHandler: success

sequenceDiagram
  rect rgba(255, 218, 185, 0.5)
    Note over Reconciler,DRBDKernel: BUG-332: Late-add promote and resync-kick self-heals
    participant Reconciler as satellite.Reconciler
    participant maybeLateAdd as maybeLateAddPromote / maybeKickLateAddResync
    participant NeedsGate as Adm.NeedsLateAddPromote / NeedsLateAddResyncKick
    participant drbdsetup as drbdsetup status --json
    participant DRBDKernel as DRBD Kernel
  end

  Reconciler->>maybeLateAdd: maybePromoteSelfHeals(ctx, dr, diskless)
  alt diskless=true or Adm=nil
    maybeLateAdd-->>Reconciler: skip
  else
    maybeLateAdd->>NeedsGate: NeedsLateAddPromote(ctx, resource)
    NeedsGate->>drbdsetup: drbdsetup status --json
    drbdsetup-->>NeedsGate: JSON status
    NeedsGate->>NeedsGate: localHasUpToDateVolume + allPeersConnected + lateAddVolumeNeedsLocalPromote
    NeedsGate-->>maybeLateAdd: true/false
    alt needs promote
      maybeLateAdd->>DRBDKernel: drbdadm primary --force (runAutoPromote)
      DRBDKernel-->>maybeLateAdd: ok
      maybeLateAdd->>DRBDKernel: drbdadm secondary
    end
    maybeLateAdd->>NeedsGate: NeedsLateAddResyncKick(ctx, resource)
    NeedsGate->>drbdsetup: drbdsetup status --json
    drbdsetup-->>NeedsGate: JSON status
    NeedsGate-->>maybeLateAdd: stalled resync detected
    alt stall beyond dwell && not throttled
      maybeLateAdd->>DRBDKernel: drbdadm invalidate resource/volume
      DRBDKernel-->>maybeLateAdd: ok
    end
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

cozystack/blockstor#120: Both PRs modify the satellite reconciliation self-heal pipeline in pkg/satellite/reconciler.go around maybePromoteSelfHeals (main PR adds late-add promote and late-add resync-kick, while retrieved PR introduces the maybeSoloPromote structure that the main PR extends).
cozystack/blockstor#151: Both PRs modify the REST volume-definition create flow in pkg/rest/volume_definitions.go, specifically handleVDCreate and parent RD/VD lookup handling for cache-lag safety around concurrent allocation.
cozystack/blockstor#162: Both PRs touch the same e2e release-gate scenario tests/e2e/cli-matrix/multi-volume-late-vd-create.sh, updating late volume-add assertions and convergence logic to expose and validate BUG-048 and BUG-332 fixes.

Poem

🐇 Hop, hop, the numbers raced,
Two volumes chasing the same space!
Now the store locks down the lane,
No lost-update sneaks again.
Late-adds wedged? Force-primary's key—
The lowest node-id sets them free!
And when resync stalls, we kick it right—
Back to sync before the night! 🌟

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically summarizes the main change: fixing a critical bug (BUG-048) where concurrent volume definition adds would cause a lost update.
Docstring Coverage	✅ Passed	Docstring coverage is 85.94% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/bug-048-concurrent-late-vd-wedge

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request addresses BUG-048, resolving a control-plane availability issue where concurrent late volume additions (number-less creations) against an initialized ResourceDefinition could result in lost updates or volumes permanently wedged in an Inconsistent state. The fix moves the auto-assign VolumeNumber logic into the store layer (CreateAutoNumbered) to make the allocation atomic within the conflict-retry loop using an uncached API reader, and introduces a dedicated NeedsLateAddPromote self-heal mechanism to unstick wedged late-added volumes. A critical correctness issue was identified in the Kubernetes store implementation: if the uncached API reader is nil (such as in test environments), the post-commit verification will query the stale informer cache, falsely detect a conflict, and retry, which leads to duplicate volume allocations and orphaned volumes. It is recommended to skip this verification when the API reader is unavailable.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-14T10:37:54Z

+func (s *volumeDefinitions) verifyVolumeLanded(ctx context.Context, rdName string, assigned int32) error {
+	rd, err := s.fetchRDLive(ctx, rdName)
+	if err != nil {
+		return err
+	}
+
+	for i := range rd.Spec.VolumeDefinitions {
+		if rd.Spec.VolumeDefinitions[i].VolumeNumber == assigned {
+			return nil
+		}
+	}
+
+	return apierrors.NewConflict(
+		schema.GroupResource{Group: crdv1alpha1.GroupVersion.Group, Resource: "resourcedefinitions"},
+		Name(rdName),
+		errors.Newf("volume %d did not persist (concurrent clobber); retrying allocation", assigned),
+	)
+}


Critical Correctness Bug: Potential Orphaned/Duplicate Volume Creation due to Cache Lag

If s.apiReader is nil (which is the default fallback when the store is initialized via storek8s.New(mgr.GetClient())), fetchRDLive falls back to the informer-cached client. This introduces a severe race condition that can lead to orphaned/duplicate volumes being silently created in etcd:

First Attempt: CreateAutoNumbered successfully updates the ResourceDefinition in etcd to add vol-1.

Verification: verifyVolumeLanded is called and fetches the RD. Because apiReader is nil, it reads from the stale informer cache, which has not yet synced the update. It fails to find vol-1 and returns a synthetic Conflict error.

Retry: The loop retries. If the informer cache has synced by the time the next iteration fetches the RD, it will now see [vol-0, vol-1].

Duplicate Allocation: smallestFreeVolumeNumber will allocate vol-2 instead of vol-1.

Second Commit: The Update succeeds, committing vol-2 to etcd.

Result: Both vol-1 and vol-2 are created in etcd, but only vol-2 is returned to the caller. vol-1 is permanently leaked/orphaned.

To prevent this, we must skip the post-commit verification if s.apiReader is nil.

func (s *volumeDefinitions) verifyVolumeLanded(ctx context.Context, rdName string, assigned int32) error { if s.apiReader == nil { return nil } rd, err := s.fetchRDLive(ctx, rdName) if err != nil { return err } for i := range rd.Spec.VolumeDefinitions { if rd.Spec.VolumeDefinitions[i].VolumeNumber == assigned { return nil } } return apierrors.NewConflict( schema.GroupResource{Group: crdv1alpha1.GroupVersion.Group, Resource: "resourcedefinitions"}, Name(rdName), errors.Newf("volume %d did not persist (concurrent clobber); retrying allocation", assigned), ) }

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/satellite/reconciler.go`:
- Around line 2094-2095: The per-volume metadata gate at line 2094 in the
ensurePerVolumeMetadata call only checks dr.GetMetadataCreated(), but this fails
during migration/stamper-lag windows when the marker exists but the Condition is
false, causing the branch to be skipped. Since line 2146 already uses
.md-created as a fallback source of truth for post-first-activation, restore
this same fallback check in the gate condition at line 2094 by adding an OR
condition that also checks for the .md-created fallback alongside
dr.GetMetadataCreated(), ensuring late-added volumes do not miss create-md and
avoid the diskless wedge path.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 605ed738-d790-4cda-b3cb-ae0044a24200

📥 Commits

Reviewing files that changed from the base of the PR and between 015374d and 39f148b.

📒 Files selected for processing (19)

CLAUDE.md
cmd/apiserver/main.go
pkg/drbd/drbdadm.go
pkg/drbd/needs_late_add_promote_test.go
pkg/rest/volume_definitions.go
pkg/rest/volume_definitions_test.go
pkg/satellite/maybe_late_add_promote_test.go
pkg/satellite/reconciler.go
pkg/satellite/reconciler_drbd_test.go
pkg/store/inmemory_volume_definition.go
pkg/store/k8s/k8s.go
pkg/store/k8s/k8s_test.go
pkg/store/k8s/volume_definitions.go
pkg/store/store.go
pkg/store/storetest/storetest.go
tests/e2e/cli-matrix/multi-volume-late-vd-create.sh
tests/operator-harness/lib.sh
tests/operator-harness/replay-runner.sh
tests/operator-harness/replay/vd-late-concurrent-no-drop.yaml

coderabbitai · 2026-06-14T10:45:56Z

+	if !diskless && dr.GetMetadataCreated() &&
 		!r.isDisklessToDiskfulFlip(ctx, dr, diskless) {


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restore .md-created fallback in the per-volume metadata gate.

At Line 2094, ensurePerVolumeMetadata is gated only by dr.GetMetadataCreated(). But Line 2146 already treats .md-created as the fallback source of truth for “post-first-activation.” In a migration/stamper-lag window (marker exists, Condition false), this branch is skipped and late-added volumes can miss create-md, reintroducing the diskless wedge path.

💡 Suggested fix

- if !diskless && dr.GetMetadataCreated() && + _, mdStatErr := os.Stat(mdMarkerPath) + postFirstActivation := dr.GetMetadataCreated() || mdStatErr == nil + if !diskless && postFirstActivation && !r.isDisklessToDiskfulFlip(ctx, dr, diskless) { err = r.ensurePerVolumeMetadata(ctx, dr, devices, diskless) if err != nil { return err } }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/satellite/reconciler.go` around lines 2094 - 2095, The per-volume metadata gate at line 2094 in the ensurePerVolumeMetadata call only checks dr.GetMetadataCreated(), but this fails during migration/stamper-lag windows when the marker exists but the Condition is false, causing the branch to be skipped. Since line 2146 already uses .md-created as a fallback source of truth for post-first-activation, restore this same fallback check in the gate condition at line 2094 by adding an OR condition that also checks for the .md-created fallback alongside dr.GetMetadataCreated(), ensuring late-added volumes do not miss create-md and avoid the diskless wedge path.

The silent VD-drop core. ensureRDVolumeMinors reads the RD, allocates DRBD minors for any volume still missing one, then writes them back via patchRDVolumeMinors — a JSON merge-patch on spec.volumeDefinitions. RFC 7386 merge-patch REPLACES an array wholesale, so the write stamps the ENTIRE volumeDefinitions list from the snapshot the reconciler read at the top of the function. A concurrent number-less `linstor vd c` that appended a new volume between that read and the patch is silently overwritten away — and both `vd c` return success because the clobber is an async reconciler write, not their own create. This is the operator-visible "vdCount short by one, both vd c rc0, zero conflict/retry in the logs" failure. The code claimed "optimistic concurrency" in a comment but never set metadata.resourceVersion, so the patch could never 409 and the caller's IsConflict re-read branch was dead. Embed the snapshot's metadata.resourceVersion in the merge-patch body (the same precondition client.MergeFromWithOptimisticLock builds for typed patches): a racing VD append moves the RD past the snapshot, the apiserver rejects the wholesale replace with Conflict, the caller re-reads, and the next reconcile re-allocates minors against the now-complete list — the appended volume survives. A missing resourceVersion (RD freshly built in a unit path) degrades to the prior blind patch rather than failing. L1: a stale-snapshot patch racing an appended volume must 409 (and not drop it); a fresh-snapshot patch must still commit. The clobber test fails pre-fix (the blind patch returns nil and drops vol-1). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…G-048) The ≥3-replica convergence wedge. The late-add-promote lowest-node-id election (lateAddVolumeNeedsLocalPromote) counted a lower-id peer as a competing promoter ONLY when it observed that peer's volume as Inconsistent. A freshly late-added volume on a lower-id diskful peer passes through Attaching / Negotiating / DUnknown while its kernel slot brings the new volume up — it has NOT yet settled to Inconsistent. The old code treated those transient states as "not a competing diskful promoter; ignore", so a higher-id node observing a lower-id peer mid-bring-up concluded "I am lowest" and force-primaried. Once the true-lowest peer finished bring-up it ALSO promoted: two nodes minted divergent current-UUIDs on the same late volume → it wedged PausedSyncS / StandAlone split-brain with no convergence. Stand-observed on 3-diskful: node-id 1 and node-id 0 both logged the late-add force-primary for the same late volume; the volume stuck Inconsistent on every replica past the budget. Treat a lower-id peer in Attaching / Negotiating / DUnknown (the transient diskful bring-up / connection-not-fully-negotiated states) as the rightful promoter and defer — only the true-lowest diskful node ever promotes, so exactly one SyncSource is seeded. A steady-state Diskless witness is still ignored (it never becomes a diskful promoter; deferring to it would deadlock), so the lowest DISKFUL replica still seeds. This change is downstream of the two day0 split-brain gates (localHasUpToDateVolume + allPeersConnected) added in e766d4d: at day0 no volume is UpToDate so gate 1 blocks the election entirely and this code is never reached — the split-brain leg is untouched. The change is strictly MORE conservative (defers more, never promotes more), so it can only reduce split-brain risk. L1: defer to a lower-id peer in each transient bring-up state (fails pre-fix); still promote over a lower-id Diskless witness; the true-lowest still promotes despite a higher-id peer mid-bring-up. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…licas (BUG-048) The residual ≥3-replica concurrent late-vd-add convergence wedge. When two volumes are added back-to-back to an already-initialized 3-diskful RD, the last volume occasionally enters a DRBD resync that STALLS and never feeds every peer — a SyncSource is elected (syncSeen=1) but the volume sits Inconsistent/Outdated on two of three replicas with out-of-sync frozen at the full size. Two stand-observed signatures (both on the last concurrently-added volume of a 3-diskful RD): - dependency deadlock: a peer-device is PausedSyncS with resync-suspended:dependency (DRBD serialised this volume's resync behind a sibling that already finished, but the dependency latch never cleared) while partner peers wait in WFBitMapT/ resync-suspended:peer for a bitmap exchange the paused source never starts. oos frozen at full forever. - partial stall: one peer reaches UpToDate (a real SyncSource) but a second peer stays WFBitMapT/Outdated and never finalises. Neither existing self-heal recovers it: maybeLateAddPromote mints a source only when NONE exists (all-Inconsistent, no peer data) and a force-primary does not clear a resync-suspended:dependency latch; maybeRecoveryPromote requires the local already-UpToDate. Add a dedicated kernel-truth self-heal. Adm.NeedsLateAddResyncKick fires ONLY when no replica is Primary, the RD is past day0 (a local UpToDate volume exists), every peer is Connected, and a peer-device is in a stalled resync (PausedSync*/WFBitMap* with a non-"no" resync-suspended). maybeKickLateAddResync then disconnect+adjusts the resource, re-running the GI handshake so DRBD restarts the resync from the correct direction. Data-safe: disconnect/reconnect never mutates data and DRBD re-derives the sync direction from generation-UUIDs/bitmaps. A 45s stall-dwell gate ensures only a GENUINELY frozen resync is kicked, never the normal transient volume-resync serialization (a few seconds of PausedSyncS/ dependency while a sibling finishes). Throttled by the shared recoveryPromoteThrottle and self-limiting (once converged no stalled peer-device remains). Skipped on diskless replicas. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

coderabbitai

♻️ Duplicate comments (1)

pkg/satellite/reconciler.go (1)
2121-2123: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the .md-created fallback in this gate.

Line 2121 now keys this branch only on dr.GetMetadataCreated(), but Lines 2172-2173 still treat .md-created as the fallback source of truth for firstActivation. In the upgrade/stamper-lag window where the marker exists and the Condition is still false, this skips ensurePerVolumeMetadata here and also skips the first-activation path below, so a late-added volume still misses create-md and comes up unusable.
💡 Proposed fix
-	if !diskless && dr.GetMetadataCreated() &&
+	_, mdStatErr := os.Stat(mdMarkerPath)
+	postFirstActivation := dr.GetMetadataCreated() || mdStatErr == nil
+	if !diskless && postFirstActivation &&
 		!r.isDisklessToDiskfulFlip(ctx, dr, diskless) {
 		err = r.ensurePerVolumeMetadata(ctx, dr, devices, diskless)
 		if err != nil {
 			return err
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/satellite/reconciler.go` around lines 2121 - 2123, The condition at line
2121 that gates the ensurePerVolumeMetadata call currently only checks
dr.GetMetadataCreated(), but this creates a gap during upgrade scenarios where
the .md-created marker exists but the Condition hasn't been set yet. To fix
this, add a fallback check for the .md-created marker to the condition at line
2121, similar to how the firstActivation logic at lines 2172-2173 handles it.
This ensures the ensurePerVolumeMetadata execution and first-activation path
both trigger correctly even when the Condition is still false but the marker
exists, preventing late-added volumes from becoming unusable.

🧹 Nitpick comments (2)

pkg/satellite/maybe_kick_late_add_resync_test.go (1)

147-160: ⚡ Quick win

Tighten the diskless test to assert zero DRBD probing.

This test currently only checks that no disconnect ran. It should also assert drbdsetup status pvc-ka --json was not called, so diskless short-circuit behavior is pinned.

Suggested assertion

 	if slices.Contains(fx.CommandLines(), "drbdadm disconnect pvc-ka") {
 		t.Errorf("diskless replica must NOT be kicked, got: %v", fx.CommandLines())
 	}
+	if slices.Contains(fx.CommandLines(), kickStatusKey) {
+		t.Errorf("diskless replica must skip status probing, got: %v", fx.CommandLines())
+	}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/satellite/maybe_kick_late_add_resync_test.go` around lines 147 - 160, The
TestMaybeKickLateAddResync_SkipsDiskless test currently only verifies that the
disconnect command is not executed, but does not verify that DRBD probing is
skipped entirely for diskless replicas. Add an additional assertion after the
existing check that verifies the drbdsetup status command with --json flag was
not called on the FakeExec object, ensuring the diskless short-circuit behavior
properly avoids unnecessary DRBD probing operations.

pkg/drbd/needs_late_add_resync_kick_test.go (1)

187-207: ⚡ Quick win

Add an explicit peer-Primary veto regression test.

TestNeedsLateAddResyncKick_PrimaryVetoes currently exercises only local role:"Primary" (Line 191). The “Primary anywhere” safety contract is stronger; add a case where local is Secondary but a connection has peer-role:"Primary".

Suggested test addition

+func TestNeedsLateAddResyncKick_PeerPrimaryVetoes(t *testing.T) {
+	adm := admWithKickStatus(t, `[{
+	  "name":"pvc-kick","node-id":0,"role":"Secondary",
+	  "devices":[
+	    {"volume":0,"disk-state":"UpToDate"},
+	    {"volume":1,"disk-state":"Inconsistent"}
+	  ],
+	  "connections":[{
+	    "peer-node-id":1,"name":"n2","connection-state":"Connected","peer-role":"Primary",
+	    "peer_devices":[
+	      {"volume":1,"peer-disk-state":"Inconsistent","replication-state":"PausedSyncS","resync-suspended":"dependency"}
+	    ]
+	  }]
+	}]`)
+
+	if adm.NeedsLateAddResyncKick(t.Context(), "pvc-kick") {
+		t.Fatal("must NOT kick while any peer is Primary")
+	}
+}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/drbd/needs_late_add_resync_kick_test.go` around lines 187 - 207, The
TestNeedsLateAddResyncKick_PrimaryVetoes function currently only tests the local
Primary role scenario (as seen in the "role":"Primary" on line 191). Add a new
test case to verify the veto also works when a peer has Primary role instead of
the local node. Create an additional test function or extend the existing test
to validate that NeedsLateAddResyncKick returns false when the local node is
Secondary but a peer connection has peer-role:"Primary", ensuring the "Primary
anywhere" safety contract is enforced for both local and peer Primary scenarios.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@pkg/satellite/reconciler.go`:
- Around line 2121-2123: The condition at line 2121 that gates the
ensurePerVolumeMetadata call currently only checks dr.GetMetadataCreated(), but
this creates a gap during upgrade scenarios where the .md-created marker exists
but the Condition hasn't been set yet. To fix this, add a fallback check for the
.md-created marker to the condition at line 2121, similar to how the
firstActivation logic at lines 2172-2173 handles it. This ensures the
ensurePerVolumeMetadata execution and first-activation path both trigger
correctly even when the Condition is still false but the marker exists,
preventing late-added volumes from becoming unusable.

---

Nitpick comments:
In `@pkg/drbd/needs_late_add_resync_kick_test.go`:
- Around line 187-207: The TestNeedsLateAddResyncKick_PrimaryVetoes function
currently only tests the local Primary role scenario (as seen in the
"role":"Primary" on line 191). Add a new test case to verify the veto also works
when a peer has Primary role instead of the local node. Create an additional
test function or extend the existing test to validate that
NeedsLateAddResyncKick returns false when the local node is Secondary but a peer
connection has peer-role:"Primary", ensuring the "Primary anywhere" safety
contract is enforced for both local and peer Primary scenarios.

In `@pkg/satellite/maybe_kick_late_add_resync_test.go`:
- Around line 147-160: The TestMaybeKickLateAddResync_SkipsDiskless test
currently only verifies that the disconnect command is not executed, but does
not verify that DRBD probing is skipped entirely for diskless replicas. Add an
additional assertion after the existing check that verifies the drbdsetup status
command with --json flag was not called on the FakeExec object, ensuring the
diskless short-circuit behavior properly avoids unnecessary DRBD probing
operations.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 063be8c9-5106-44fa-acd8-8280a5e0cd53

📥 Commits

Reviewing files that changed from the base of the PR and between 035915d and c4271d1.

📒 Files selected for processing (5)

pkg/drbd/drbdadm.go
pkg/drbd/needs_late_add_promote_test.go
pkg/drbd/needs_late_add_resync_kick_test.go
pkg/satellite/maybe_kick_late_add_resync_test.go
pkg/satellite/reconciler.go

…-048 replay The vd-late-concurrent-no-drop replay's all_uptodate budget comment now references both BUG-048 convergence self-heals — the late-add-promote (mints a source when none exists) and the new late-add resync-kick (restarts a stalled/paused resync that never feeds all peers) — so the 360s budget rationale stays accurate after the resync-kick fix. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…nvalidate, not disconnect (BUG-048) The first resync-kick iteration used `disconnect`+`adjust` to re-run the DRBD handshake. On-stand that made the wedge WORSE: every replica of a fresh late-added volume shares the deterministic day0 current-UUID, so the post-reconnect GI handshake concludes "same generation, no resync needed" and ABANDONS the dirty bitmap — the connection went StandAlone / Established with out-of-sync frozen at the full size, never converging. Replace the connection re-handshake with a per-volume local `drbdadm invalidate <res>/<vol>`: it explicitly discards the empty, non-authoritative LOCAL copy of the one stalled volume and forces a full SyncTarget FROM an UpToDate peer — working WITH the bitmap instead of re-deriving sync direction from the ambiguous day0 GI. Adm.LateAddResyncKickVolumes returns the local volumes that are below UpToDate, have a connected UpToDate peer to pull from, and are not already being actively SyncTarget-pulled — only when no replica is Primary, the RD is past day0, and every peer is Connected. maybeKickLateAddResync invalidates each. This covers the partial-stall signature (a peer reached UpToDate but this replica is stuck WFBitMapT/Outdated) and the post-promote dependency case (maybeLateAddPromote mints the source for the all-Inconsistent case, then invalidate re-pulls the stragglers). Data-safe: only a non-authoritative below-UpToDate local copy is ever discarded, and only when an UpToDate peer exists as the source. The 45s stall-dwell + recoveryPromoteThrottle still gate it so the normal transient volume-resync serialization is never disrupted. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…a time, back off on active resync) (BUG-048) The first invalidate iteration flapped on-stand: it invalidated every stalled volume at once and re-fired while a triggered resync was still running, so out-of-sync dropped to 0, the kick re-invalidated, and oos jumped back to full — the volume never settled UpToDate within the convergence budget. Three refinements make the kick idempotent and non-flapping: - invalidate ONLY a genuinely-stuck Inconsistent local volume; Outdated is now excluded (it is the transient post-resync / mid-handshake state a just-invalidated volume passes through — re-invalidating it re-dirties a converging copy). - back off whenever ANY peer-device on the volume has a resync actively in progress in either direction (peerDeviceResyncInProgress) — so once an invalidate triggers a SyncTarget the kick stops until it finishes; only a genuinely stalled (paused/suspended) handshake re-qualifies. - invalidate ONE volume per kick (lowest-numbered stalled), since DRBD serialises resyncs per connection — invalidating several at once re-creates the dependency-serialization that wedges. Also shortens the stall-dwell 45s→20s: the in-progress back-off is now the primary anti-flap guard, so the dwell only needs a small settling margin, and a genuinely frozen wedge recovers well inside the convergence budget. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…ry --force (BUG-048) The all-Inconsistent late-add wedge (no replica ever won the volume's UpToDate seed) was never actually recovered. maybeLateAddPromote ran `drbdadm primary --force <res>`, but the kernel REJECTS it with "(-16) Disk state is lower than outdated" whenever any volume of the resource is Inconsistent — stand-observed: the promote retried every throttle window and the volume stayed Inconsistent on all 3 replicas past the convergence budget (out-of-sync frozen at full). `primary` is also resource-wide, so one Inconsistent late volume blocks promoting the whole resource even though the sibling volumes are UpToDate. Replace the force-primary with MintLateAddSource, which makes THIS node an UpToDate source per-volume the way DRBD's own skip-initial-sync does: disconnect the resource, run `drbdsetup new-current-uuid --clear-bitmap <minor>` for each LOCAL Inconsistent volume (flips it UpToDate with a clean bitmap), then `drbdadm adjust` to reconnect — the Inconsistent peers then SyncTarget from the freshly-UpToDate local volumes. Verified on-stand: the clear-bitmap flips the volume UpToDate and the peers begin converging, where primary --force was permanently rejected. Data-safe: gated by NeedsLateAddPromote (lowest node-id, NO peer holds committed data, no Primary), so exactly one deterministic node mints the source and the minted copy is byte-equivalent to the equally-empty peers (day0 lineage). Only Inconsistent volumes are cleared; UpToDate siblings are untouched. Adds the device minor to the parsed drbdsetup status so the clear-bitmap can target the specific stuck volume. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

pkg/drbd/needs_late_add_promote_test.go (1)

343-412: ⚡ Quick win

Add a failure-path test for MintLateAddSource reconnect-on-error behavior.

Current coverage pins success/no-op paths, but not the branch where new-current-uuid --clear-bitmap fails and drbdadm adjust must still run before returning error.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/drbd/needs_late_add_promote_test.go` around lines 343 - 412, Add a new
test function after TestMintLateAddSource_NoopWhenNothingInconsistent that
verifies the error recovery behavior of MintLateAddSource. The test should set
up a FakeExec that returns an error for one of the new-current-uuid clear-bitmap
commands (e.g., for minor 1001), then call adm.MintLateAddSource and verify
that: (1) the function returns an error, (2) drbdadm adjust was still executed
after the failed command (to restore the connection), and (3) the command
sequence shows the disconnect, the failed clear-bitmap attempt, and the adjust
command completing in order. This test ensures that the reconnect-on-error
cleanup path is properly exercised.

pkg/satellite/maybe_late_add_promote_test.go (1)

120-122: ⚡ Quick win

Harden skip-path assertions to catch unintended mint side effects.

Line 120 and Line 150 only guard against drbdadm disconnect pvc-la. A regression could still issue new-current-uuid --clear-bitmap or drbdadm adjust and these tests would still pass.

Suggested assertion tightening

 if slices.Contains(fx.CommandLines(), "drbdadm disconnect pvc-la") {
   t.Errorf("diskless replica must NOT mint a source, got: %v", fx.CommandLines())
 }
+if slices.Contains(fx.CommandLines(), "drbdsetup new-current-uuid --clear-bitmap 1002") ||
+	slices.Contains(fx.CommandLines(), "drbdadm adjust pvc-la") {
+	t.Errorf("diskless replica must NOT run mint sub-steps, got: %v", fx.CommandLines())
+}

 if slices.Contains(fx.CommandLines(), "drbdadm disconnect pvc-la") {
   t.Errorf("must NOT mint a source when a peer holds data, got: %v", fx.CommandLines())
 }
+if slices.Contains(fx.CommandLines(), "drbdsetup new-current-uuid --clear-bitmap 1002") ||
+	slices.Contains(fx.CommandLines(), "drbdadm adjust pvc-la") {
+	t.Errorf("must NOT run mint sub-steps when peer has committed data, got: %v", fx.CommandLines())
+}

Also applies to: 150-152

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/satellite/maybe_late_add_promote_test.go` around lines 120 - 122, The
assertions in the diskless replica test at lines 120-122 and 150-152 only check
for the specific command `drbdadm disconnect pvc-la`, but this is insufficient
to catch all unintended mint side effects. Enhance both assertions to verify
that none of the following commands that indicate a mint operation are present
in the command lines: `new-current-uuid --clear-bitmap`, `drbdadm adjust`, and
the current `drbdadm disconnect pvc-la`. Use a more comprehensive check (such as
iterating through multiple command patterns or using a helper function) to
ensure that if any of these mint-related commands are issued, the test will fail
and catch the regression.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/drbd/drbdadm.go`:
- Around line 1759-1761: The comment documentation for local volume state
handling in LateAddResyncKickVolumes incorrectly states that both Inconsistent
and Outdated can qualify for discarding, but the actual implementation in
lateAddVolumeNeedsInvalidate intentionally excludes Outdated volumes. Update the
comment to reference only Inconsistent as the qualifying state for local volumes
that can be discarded, removing the reference to Outdated to align with the
actual implementation. This documentation issue appears at multiple locations in
the code and must be corrected at all occurrences.
- Around line 554-557: The error handling after the json.Unmarshal call in the
MintLateAddSource function currently combines the check for parse errors and
empty status in a single condition, but errors.Wrapf(nil, ...) returns nil when
err is nil, causing empty status results to be silently treated as success.
Split the two checks so that when len(status) == 0 but err == nil, you
explicitly return an error with an appropriate message (such as "empty drbdsetup
status for resource") rather than relying on the combined error wrapping which
will return nil and mask the invalid probe result.

---

Nitpick comments:
In `@pkg/drbd/needs_late_add_promote_test.go`:
- Around line 343-412: Add a new test function after
TestMintLateAddSource_NoopWhenNothingInconsistent that verifies the error
recovery behavior of MintLateAddSource. The test should set up a FakeExec that
returns an error for one of the new-current-uuid clear-bitmap commands (e.g.,
for minor 1001), then call adm.MintLateAddSource and verify that: (1) the
function returns an error, (2) drbdadm adjust was still executed after the
failed command (to restore the connection), and (3) the command sequence shows
the disconnect, the failed clear-bitmap attempt, and the adjust command
completing in order. This test ensures that the reconnect-on-error cleanup path
is properly exercised.

In `@pkg/satellite/maybe_late_add_promote_test.go`:
- Around line 120-122: The assertions in the diskless replica test at lines
120-122 and 150-152 only check for the specific command `drbdadm disconnect
pvc-la`, but this is insufficient to catch all unintended mint side effects.
Enhance both assertions to verify that none of the following commands that
indicate a mint operation are present in the command lines: `new-current-uuid
--clear-bitmap`, `drbdadm adjust`, and the current `drbdadm disconnect pvc-la`.
Use a more comprehensive check (such as iterating through multiple command
patterns or using a helper function) to ensure that if any of these mint-related
commands are issued, the test will fail and catch the regression.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dcd10d7b-4265-4810-bc7a-110d47dae305

📥 Commits

Reviewing files that changed from the base of the PR and between c4271d1 and 477fc3e.

📒 Files selected for processing (8)

pkg/drbd/drbdadm.go
pkg/drbd/drbdsetup_show.go
pkg/drbd/needs_late_add_promote_test.go
pkg/drbd/needs_late_add_resync_kick_test.go
pkg/satellite/maybe_kick_late_add_resync_test.go
pkg/satellite/maybe_late_add_promote_test.go
pkg/satellite/reconciler.go
tests/operator-harness/replay/vd-late-concurrent-no-drop.yaml

🚧 Files skipped from review as they are similar to previous changes (2)

pkg/satellite/reconciler.go
tests/operator-harness/replay/vd-late-concurrent-no-drop.yaml

coderabbitai · 2026-06-14T16:13:41Z

+	err = json.Unmarshal(out, &status)
+	if err != nil || len(status) == 0 {
+		return nil, errors.Wrapf(err, "parse drbdsetup status %s", resource)
+	}


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In github.com/cockroachdb/errors, what is the behavior of errors.Wrapf(nil, "msg")? Does it return nil?

💡 Result:

In the github.com/cockroachdb/errors library, calling errors.Wrapf(nil, "msg") returns nil [1][2]. The library's documentation explicitly states that all wrapper constructors, including Wrapf, can be safely applied to a nil error because they behave as no-ops in that case [1][3]. This allows developers to use these functions directly on error return values without needing to explicitly check for nil first [1][2].

Citations:

1: https://github.com/cockroachdb/errors/blob/master/README.md

2: https://pkg.go.dev/github.com/cockroachdb/errors

3: https://github.com/cockroachdb/errors/tree/v1.2.4

🏁 Script executed:

# Check if the file exists and examine the context around lines 554-557 wc -l pkg/drbd/drbdadm.go

Repository: cozystack/blockstor

Length of output: 88

🏁 Script executed:

# Read the specific lines in context sed -n '545,565p' pkg/drbd/drbdadm.go

Repository: cozystack/blockstor

Length of output: 699

🏁 Script executed:

# Check for SPDX license header at the top of the file head -5 pkg/drbd/drbdadm.go

Repository: cozystack/blockstor

Length of output: 146

🏁 Script executed:

# Search for the MintLateAddSource function to understand context grep -n "MintLateAddSource" pkg/drbd/drbdadm.go

Repository: cozystack/blockstor

Length of output: 232

🏁 Script executed:

# Search for similar patterns where json.Unmarshal errors are combined with length checks rg 'json\.Unmarshal.*\|\|.*len\(' pkg/drbd/ -A 2

Repository: cozystack/blockstor

Length of output: 45

Handle empty drbdsetup status as an explicit error in MintLateAddSource.

json.Unmarshal success with an empty status currently shares the same branch as parse errors. When len(status) == 0 but err == nil, the condition err != nil || len(status) == 0 is true, but errors.Wrapf(nil, "...") returns nil, resulting in the function returning (nil, nil) (success). This silent no-op masks an invalid probe result. Split these checks and return an explicit error when len(status) == 0.

Suggested fix

err = json.Unmarshal(out, &status) - if err != nil || len(status) == 0 { - return nil, errors.Wrapf(err, "parse drbdsetup status %s", resource) - } + if err != nil { + return nil, errors.Wrapf(err, "parse drbdsetup status %s", resource) + } + if len(status) == 0 { + return nil, errors.Newf("parse drbdsetup status %s: empty status payload", resource) + }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/drbd/drbdadm.go` around lines 554 - 557, The error handling after the json.Unmarshal call in the MintLateAddSource function currently combines the check for parse errors and empty status in a single condition, but errors.Wrapf(nil, ...) returns nil when err is nil, causing empty status results to be silently treated as success. Split the two checks so that when len(status) == 0 but err == nil, you explicitly return an error with an appropriate message (such as "empty drbdsetup status for resource") rather than relying on the combined error wrapping which will return nil and mask the invalid probe result.

coderabbitai · 2026-06-14T16:13:41Z

+//   - the LOCAL volume is below UpToDate (Inconsistent/Outdated) — only a
+//     non-authoritative local copy is ever discarded, never an UpToDate
+//     one;


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align LateAddResyncKickVolumes docs with implementation.

The top-level comment says local Inconsistent/Outdated can qualify, but lateAddVolumeNeedsInvalidate intentionally excludes Outdated. Please update the comment to match actual behavior.

Also applies to: 1829-1844

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/drbd/drbdadm.go` around lines 1759 - 1761, The comment documentation for local volume state handling in LateAddResyncKickVolumes incorrectly states that both Inconsistent and Outdated can qualify for discarding, but the actual implementation in lateAddVolumeNeedsInvalidate intentionally excludes Outdated volumes. Update the comment to reference only Inconsistent as the qualifying state for local volumes that can be discarded, removing the reference to Outdated to align with the actual implementation. This documentation issue appears at multiple locations in the code and must be corrected at all occurrences.

# Conflicts: # tests/e2e/cli-matrix/multi-volume-late-vd-create.sh

…048 de-regress) CreateAutoNumbered's post-commit verification (verifyVolumeLanded) re-reads the RD to confirm the just-written VolumeNumber landed and forces a retry on a synthetic Conflict when it is absent. That guard is only sound when the re-read is LIVE: with no uncached reader wired it falls back to the informer cache, which trails the Update just committed, so it routinely fails to observe its own freshly written volume, surfaces a FALSE conflict, and the retry re-derives the next free number off the lagging cache and appends a SECOND volume. A single auto-numbered create could leave several phantom VolumeDefinitions behind — `vd d 0` then leaves a stray, and a late `vd c` lands 4 VDs instead of 1. Skip the post-commit verification when no live reader is available: a successful optimistic-locked Update is already the apiserver's authoritative confirmation, and there is nothing trustworthy to verify against a stale cache. Production wires mgr.GetAPIReader(), so the live verification path — the actual BUG-048 lost-update guard — is preserved untouched. Also align the integration harness store wiring with cmd/apiserver/main.go (NewWithAPIReader + mgr.GetAPIReader()). The harness used the plain New (no live reader), so CreateAutoNumbered's cache-lag re-read 409-stormed against the RD reconciler's concurrent RD writes; the slow create then made the linstor client re-POST, producing duplicate auto-numbered VDs. Production never hit this because the apiserver always wires the direct reader. The store-layer BUG-048 concurrency guarantee is unchanged: concurrent auto-numbered creates still converge to distinct VolumeNumbers via the optimistic-lock 409 retry (TestK8sVolumeDefinitionConcurrentAutoNumber), and the live post-commit verification still guards the heavy-load lost-update on the production path. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

Andrei Kvapil (kvaps) and others added 11 commits June 14, 2026 09:58

gemini-code-assist Bot reviewed Jun 14, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 14, 2026

View reviewed changes

Andrei Kvapil (kvaps) and others added 3 commits June 14, 2026 16:13

coderabbitai Bot reviewed Jun 14, 2026

View reviewed changes

Andrei Kvapil (kvaps) and others added 4 commits June 14, 2026 18:03

coderabbitai Bot reviewed Jun 14, 2026

View reviewed changes

Merge remote-tracking branch 'origin/main' into HEAD

8850ae3

# Conflicts: # tests/e2e/cli-matrix/multi-volume-late-vd-create.sh

Andrei Kvapil (kvaps) mentioned this pull request Jun 14, 2026

docs(known-issues): document BUG-048 concurrent late vd-create limitation #165

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(rest,store,satellite): converge concurrent late vd-adds (BUG-048)#164

fix(rest,store,satellite): converge concurrent late vd-adds (BUG-048)#164
Andrei Kvapil (kvaps) wants to merge 20 commits into
mainfrom
fix/bug-048-concurrent-late-vd-wedge

Andrei Kvapil (kvaps) commented Jun 14, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 14, 2026 •

edited

Loading

Review limit reached

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 14, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 14, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 14, 2026

Uh oh!

coderabbitai Bot Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		if !diskless && dr.GetMetadataCreated() &&
		!r.isDisklessToDiskfulFlip(ctx, dr, diskless) {

Conversation

Andrei Kvapil (kvaps) commented Jun 14, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. Control-plane lost-update (REST/store) — the second volume was silently dropped

2. Data-plane seed/promote race (satellite/DRBD) — the second volume wedged Inconsistent

Data safety

Testing

Note

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 14, 2026

Choose a reason for hiding this comment

Critical Correctness Bug: Potential Orphaned/Duplicate Volume Creation due to Cache Lag

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Andrei Kvapil (kvaps) commented Jun 14, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 14, 2026 •

edited

Loading