test(harness): release-gate cleanup — cli-matrix/replay/burnin/cli-parity (batch 4) + unmask BUG-048 by kvaps · Pull Request #162 · cozystack/blockstor

Andrei Kvapil (kvaps) · 2026-06-14T06:25:13Z

Final test-debt cleanup for the release gate. Test/harness/docs only — no product code. Each fix re-confirmed on stand big.

bug-278-skipdisk-autoclear: cell stamped SkipDisk on a HEALTHY replica then read it back, racing the product's correct SkipDiskClearer auto-clear. Restructured into a two-sided contract (persists on Diskless; auto-clears on healthy after satellite restart). 3× PASS.
drbd-options-hierarchy-controller: SIGPIPE (exit 141) from ... | head under pipefail. Switched to mapfile + wait_diskful_count. 3× PASS.
multi-volume-late-vd-create: detection parsed always-null .vlms[].state.disk_state; switched to CRD-backed status_disk_state/wait_diskful_count. ESCALATION: corrected detection UNMASKS a real product bug (BUG-048) — concurrent late volume-adds wedge the 2nd volume Inconsistent forever (no SyncSource). Left unmasked so the gate blocks on it. Single/sequential adds converge fine.
replay/snap-create-multiple-group-consistency: YAML used positional colon args (CLI rejects, argparse exit 2) → BUG-046 L7 coverage was dead. Fixed to create-multiple <snap> -r <rd>.... Replay 3× PASS.
burnin-blockstor: md5 check didn't verify dd exit → EAGAIN zero-byte read = md5("") false mismatch. Now guards dd PIPESTATUS + byte count, skips compare on transient read failure. 12/12 no false alarms.
cli-parity-refresh: normalizer extended to collapse ASCII-table border-width drift (pool-name length) + upstream-only DfltRscGrp PlaceCount render → rg l / rg l --pastable now PARITY. Residual sp l divergence whitelisted (row 03, genuine ENV topology + FILE_THIN envelope).

Product bug surfaced (BUG-048, tracked separately)

Concurrent late volume-adds (Bug-332 repro): the 2nd volume wedges Inconsistent on every replica (replication:Established, resync-suspended:no, no SyncSource) → resync never bootstraps → permanently unusable. Single/sequential adds fine. Reproduced lvm-thin + FILE_THIN on big. Unmasked, not fixed here.

Validation

bash -n + shellcheck -x clean; cells 1/2/4 + replay 3× PASS on big; burnin 12/12; cli-parity rg-l → PARITY, sp-l whitelisted.

Summary by CodeRabbit

Bug Fixes
- Improved block storage test resilience to transient device read failures.
- Updated snapshot creation test for CLI argument format compatibility.
Tests
- Enhanced SkipDisk auto-clear behavior validation.
- Strengthened multi-volume uptodate detection testing.
- Reduced CLI parity testing false-positive diffs from formatting variations.
- Improved DRBD options extraction robustness in tests.
Documentation
- Updated CLI parity known deltas documentation.

…r contract The cell stamped DrbdOptions/SkipDisk=True on a HEALTHY UpToDate replica and read the prop back as a setup precondition. But the product's SkipDiskClearer correctly auto-clears the observer-owned stamp on every reconcile of a healthy slot (HasDisklessVolume==false), so the satellite wiped the stamp between the SSA apply and the read-back — the precondition raced the product's correct behaviour and aborted with "stamp did not land". The stamped state on a healthy disk is transient by design. Restructure into a two-sided contract: A. Persistence (positive control): stamp on a genuinely-Diskless replica, where the auto-clear gate is intentionally closed, and assert the stamp survives — proving the clearer is selective. B. Auto-clear: stamp on a healthy replica, restart its satellite (the Talos reattach shape), and poll for the stamp's ABSENCE (the steady state) plus disk UpToDate, instead of reading back the doomed stamp. No product behaviour changed. Validated on stand "big". Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…it loop The cell picked a diskful node via `linstor_diskful_nodes | head -1`: head closed the pipe after the first node while the helper's internal `while read` loop was still emitting the rest, so under `set -o pipefail` the pipeline exited 141 and `set -e` aborted the cell before the C1/C2 assertions ran. The replay twin of this scenario passes, confirming the product is correct. Consume the helper via mapfile over a process substitution (no pipe) — the convention every other cell follows — and use the existing wait_diskful_count helper for the convergence wait. Also harden show_max_buffers: capture drbdsetup output into a variable before the awk `... exit`, so the early pipe close can never SIGPIPE the writer. Validated on stand "big". Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…e runs The replay called `snapshot create-multiple` with positional colon args (`{{rd}}:grp bug046-multi-b:grp`), which linstor-client 1.27.1 rejects with argparse exit 2 (`-r`/--resource_names is required). The step never ran, so BUG-046's L7 group-consistency coverage was dead. Use the correct CLI form: the snapshot name is a single positional and the resource names go to `-r`. The snapshot name must come BEFORE `-r`, otherwise argparse's variadic `-r A B grp` greedily consumes `grp` and leaves no positional ("too few arguments"). Mirrors the multi-resource form the cli-matrix cell drives. Validated PASS on stand "big". Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…lse alarms The md5-consistency check compared primary vs peer md5 without checking dd's exit code. Under churn dd can fail to open /dev/drbdN (EAGAIN) and read zero bytes; md5("") is a fixed digest, so the unguarded read raised a FALSE mismatch alarm that would drown out a real future divergence. Guard each read with dd's exit status (PIPESTATUS) and the byte count (must equal the full 1 MiB); the md5 is still piped straight from dd into md5sum so binary NUL bytes are never routed through a shell variable. On a transient read failure the iteration emits a sentinel and its compare is SKIPPED (neither PASS nor FAIL) rather than reported as a mismatch. Validated on stand "big" (12/12 iterations pass, no false alarms). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…p-l topology The pool-name normalizer cleared n l / r l / v l / vd l false diffs but left rg l / rg l --pastable as WIRE_SHAPE: after the pool token is collapsed to <POOL>, two artefacts survived `diff -w`: - ASCII-table border width tracks the longest cell, so BS's pool name `stand` (5) rendered one dash wider than the oracle's `pool` (4) on every +----+ / |====| rule line (dashes are not whitespace). - The built-in DfltRscGrp row prints `PlaceCount: N` on upstream but a blank cell on BS — a render choice over the same default. Extend normalize_side_output to collapse border-char runs and strip the DfltRscGrp PlaceCount value on both sides. rg l (05) and rg l --pastable (20) now classify PARITY (verified against the live dev-stand raw artefacts). The residual `sp l` (03) divergence is genuine — the BS stand and the upstream oracle have different storage-pool topologies, and BS's FILE_THIN DTO leaves PoolName/SharedName blank where the oracle renders a backing path / shared name. Whitelisted as row 03 with justification rather than masked; sp l -m (79) and sp l --show-props (04) still catch real capacity/driver regressions. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…lms) The wait loops parsed `linstor --machine-readable resource list` for `.vlms[].state.disk_state`, but the blockstor apiserver leaves that array null in the golinstor JSON projection, so the jq always read empty and the loop timed out at the vol-0 stage even though `linstor r l` (which reads Resource.Status) showed every replica UpToDate. Switch the convergence checks to status_disk_state / wait_diskful_count (the CRD-backed `.status.volumes[].diskState`, the same reliable helpers the passing cells use), resolve the diskful node set via mapfile over a process substitution (no SIGPIPE), and probe the kernel-truth check via on_node drbdsetup instead of the same null-projection jq. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

coderabbitai · 2026-06-14T06:25:29Z

📝 Walkthrough

Walkthrough

This PR improves test reliability across multiple areas: the burnin script gains DRBD read-failure guards, three e2e CLI matrix tests are restructured with new polling helpers and pipefail-safe subprocess patterns, a snapshot replay YAML is updated for linstor-client 1.27.1, and the CLI parity normalization script gains two new sed rules to suppress false-positive diffs from table border drift and DfltRscGrp PlaceCount rendering.

Changes

CLI Parity Normalization and Allowlist

Layer / File(s)	Summary
`normalize_side_output` rules and allowlist entry `tests/operator-harness/cli-parity-refresh.sh`, `docs/cli-parity-known-deltas.md`	Adds documentation and two new `sed` transformations to collapse ASCII table border-width drift and strip `DfltRscGrp PlaceCount` from `SelectFilter` cells; whitelists the `sp l` `WIRE_SHAPE` delta (id `03`) as permanent and records the 2026-06-14 refresh history entry.

Test Infrastructure Robustness Improvements

Layer / File(s)	Summary
Burnin guarded md5 read `tests/burnin-blockstor.sh`	Rewrites primary and peer `dd` reads to validate full 1 MiB byte counts via redirected stderr; emits `READFAIL` on short/failed reads and skips the iteration instead of recording a false PASS/FAIL.
Bug 278 skipdisk autoclear: 3-node topology and polling helpers `tests/e2e/cli-matrix/bug-278-skipdisk-autoclear-after-reattach.sh`	Adds `stamp_skipdisk` and `skipdisk_prop` helpers, increases worker count to 3, adds a diskless `N3` persistence positive-control, and replaces fragile pre-read assertions with a steady-state absence poll and `wait_status_state` after satellite restart on `N2`.
drbd-options-hierarchy SIGPIPE-safe extraction `tests/e2e/cli-matrix/drbd-options-hierarchy-controller.sh`	Rewrites `show_max_buffers` to capture `drbdsetup` into a variable before `awk` via here-string; replaces piped `grep/head` node selection with `wait_diskful_count` and `mapfile` from process substitution.
multi-volume-late-vd-create: polling helper and kernel check `tests/e2e/cli-matrix/multi-volume-late-vd-create.sh`	Resolves three diskful nodes up front, adds `wait_all_replicas_volume_uptodate` polling helper with 240s timeouts and per-`(node, vol)` diagnostics, and replaces `kubectl debug`/`jq` kernel check with `DISKFUL_NODES[0]` + `on_node drbdsetup status`.
Snapshot create-multiple CLI arg format `tests/operator-harness/replay/snap-create-multiple-group-consistency.yaml`	Updates the `snapshot create-multiple` step to use linstor-client 1.27.1 positional snapshot name + `-r` flag form, removing colon-suffixed per-resource naming.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

cozystack/blockstor#161: Directly modifies the same docs/cli-parity-known-deltas.md allowlist and tests/operator-harness/cli-parity-refresh.sh normalization logic that this PR extends.

Poem

🐇 A bunny once read from a DRBD device,
Got zero bytes back — that wasn't so nice!
Now READFAIL skips gracefully past,
And false md5 failures are a thing of the past.
Three workers, new helpers, no SIGPIPE in sight —
The tests hop along with resilience and might! 🌿

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: test cleanup affecting cli-matrix, replay, burnin, and cli-parity tooling, plus unmasking BUG-048 as stated in objectives.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch test/harness-batch4-final

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces several test improvements, bug fixes, and test-harness updates across the test suite. Key changes include adding known CLI parity deltas, guarding dd reads in the burn-in test to handle transient failures, refactoring the bug-278 skipdisk test to include a persistence control, fixing SIGPIPE issues in DRBD options tests, and correcting the snapshot-create-multiple CLI syntax. The review feedback correctly identifies a critical issue in tests/burnin-blockstor.sh where PIPESTATUS[0] fails to capture the exit status of dd inside a command substitution, and provides actionable suggestions to enable pipefail inside the subshell and capture the exit status of the assignment.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-14T06:26:54Z

+        md5=\$(dd if=${DEV} bs=1M count=1 iflag=direct 2>/tmp/burnin-dd-primary.err | md5sum | awk '{print \$1}')
+        rc=\${PIPESTATUS[0]}


The exit status of dd inside the command substitution md5=\$(...) is not captured by \${PIPESTATUS[0]} in the outer remote shell. In Bash, PIPESTATUS in the outer shell only reflects the exit status of the command substitution itself (which is the exit status of the last command in the pipeline, awk, which is 0 even if dd fails). Furthermore, pipefail is not enabled in the remote bash -c session, so the subshell itself exits with 0 even if dd fails.\n\nTo reliably capture the exit status of dd without relying on PIPESTATUS in the outer shell, enable pipefail inside the subshell and capture the exit status of the assignment using \$?.

Suggested change

md5=\$(dd if=${DEV} bs=1M count=1 iflag=direct 2>/tmp/burnin-dd-primary.err | md5sum | awk '{print \$1}')

rc=\${PIPESTATUS[0]}

md5=\$(set -o pipefail; dd if=\${DEV} bs=1M count=1 iflag=direct 2>/tmp/burnin-dd-primary.err | md5sum | awk '{print \$1}')\n rc=\$?

gemini-code-assist · 2026-06-14T06:26:55Z

+        md5=\$(dd if=${DEV} bs=1M count=1 iflag=direct 2>/tmp/burnin-dd-peer.err | md5sum | awk '{print \$1}')
+        rc=\${PIPESTATUS[0]}


The exit status of dd inside the command substitution md5=\$(...) is not captured by \${PIPESTATUS[0]} in the outer remote shell. In Bash, PIPESTATUS in the outer shell only reflects the exit status of the command substitution itself (which is the exit status of the last command in the pipeline, awk, which is 0 even if dd fails). Furthermore, pipefail is not enabled in the remote bash -c session, so the subshell itself exits with 0 even if dd fails.\n\nTo reliably capture the exit status of dd without relying on PIPESTATUS in the outer shell, enable pipefail inside the subshell and capture the exit status of the assignment using \$?.

Suggested change

md5=\$(dd if=${DEV} bs=1M count=1 iflag=direct 2>/tmp/burnin-dd-peer.err | md5sum | awk '{print \$1}')

rc=\${PIPESTATUS[0]}

md5=\$(set -o pipefail; dd if=\${DEV} bs=1M count=1 iflag=direct 2>/tmp/burnin-dd-peer.err | md5sum | awk '{print \$1}')\n rc=\$?

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/burnin-blockstor.sh`:
- Around line 171-177: The test script can exit with success even when all
iterations fail with READFAIL because skipped iterations are not being tracked.
Add a skip counter variable that increments whenever a READFAIL or empty MD5 is
detected in the condition that checks PRIMARY_MD5 and PEER_MD5, then modify the
final exit gate at Line 192 to require that at least one successful compare
occurred (not just pass=0 fail=0) before returning success; sustained read
failures should not be treated as passing.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e4e82250-3791-4b4e-996a-ca6348e4588f

📥 Commits

Reviewing files that changed from the base of the PR and between 015374d and 7ea408e.

📒 Files selected for processing (7)

docs/cli-parity-known-deltas.md
tests/burnin-blockstor.sh
tests/e2e/cli-matrix/bug-278-skipdisk-autoclear-after-reattach.sh
tests/e2e/cli-matrix/drbd-options-hierarchy-controller.sh
tests/e2e/cli-matrix/multi-volume-late-vd-create.sh
tests/operator-harness/cli-parity-refresh.sh
tests/operator-harness/replay/snap-create-multiple-group-consistency.yaml

coderabbitai · 2026-06-14T06:32:16Z

+    if [[ "$PRIMARY_MD5" == "READFAIL" || "$PEER_MD5" == "READFAIL" \
+          || -z "$PRIMARY_MD5" || -z "$PEER_MD5" ]]; then
+        # Transient read failure on at least one side — neither PASS nor
+        # FAIL. A real mismatch is only credible when BOTH reads succeeded
+        # and returned the full 1 MiB.
+        echo "[$(date -u +%FT%TZ)] iter=$ITER SKIP: transient dd read failure (primary='$PRIMARY_MD5' peer='$PEER_MD5'); not comparing"
+    elif [[ "$PRIMARY_MD5" == "$PEER_MD5" ]]; then


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

READFAIL-only runs can incorrectly end as success.

READFAIL iterations are skipped but never escalated; with the current final gate (Line 192), sustained read failures can still exit 0 (pass=0 fail=0). Add skip/streak accounting and require at least one successful compare before returning success.

💡 Patch sketch

PASS=0 FAIL=0 +SKIP=0 +READFAIL_STREAK=0 ITER=0 @@ if [[ "$PRIMARY_MD5" == "READFAIL" || "$PEER_MD5" == "READFAIL" \ || -z "$PRIMARY_MD5" || -z "$PEER_MD5" ]]; then + SKIP=$((SKIP + 1)) + READFAIL_STREAK=$((READFAIL_STREAK + 1)) echo "[$(date -u +%FT%TZ)] iter=$ITER SKIP: transient dd read failure (primary='$PRIMARY_MD5' peer='$PEER_MD5'); not comparing" + if (( READFAIL_STREAK >= 10 )); then + FAIL=$((FAIL + 1)) + echo "[$(date -u +%FT%TZ)] iter=$ITER FAIL: persistent dd read failures ($READFAIL_STREAK in a row)" + fi elif [[ "$PRIMARY_MD5" == "$PEER_MD5" ]]; then PASS=$((PASS + 1)) + READFAIL_STREAK=0 else FAIL=$((FAIL + 1)) + READFAIL_STREAK=0 echo "[$(date -u +%FT%TZ)] iter=$ITER FAIL: md5 mismatch primary=$PRIMARY_MD5 peer=$PEER_MD5" fi @@ -echo "[$(date -u +%FT%TZ)] DONE iter=$ITER pass=$PASS fail=$FAIL" -[[ $FAIL -eq 0 ]] || exit 1 +echo "[$(date -u +%FT%TZ)] DONE iter=$ITER pass=$PASS fail=$FAIL skip=$SKIP" +[[ $FAIL -eq 0 && $PASS -gt 0 ]] || exit 1

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/burnin-blockstor.sh` around lines 171 - 177, The test script can exit with success even when all iterations fail with READFAIL because skipped iterations are not being tracked. Add a skip counter variable that increments whenever a READFAIL or empty MD5 is detected in the condition that checks PRIMARY_MD5 and PEER_MD5, then modify the final exit gate at Line 192 to require that at least one successful compare occurred (not just pass=0 fail=0) before returning success; sustained read failures should not be treated as passing.

L6 cli-matrix multi-volume-late-vd-create.sh: take PR #162's corrected Status-based detection (status_disk_state instead of the null .vlms projection) and make the two late `vd c` calls run CONCURRENTLY (background + wait) so they exercise the auto-assign lost-update window. Add the BUG-048 wire-level assertion: the RD must carry exactly 3 VolumeDefinitions after the two concurrent adds — a dropped add leaves only 2 and surfaces a spurious "volume definition 1 already exists". L7 replay vd-late-create-concurrent-no-lost-update.yaml: codify the post-fix invariant (vol-0 up, two back-to-back late adds → vd_count==3 + all_uptodate). Adds a `vd_count` await kind to the harness lib that asserts the exact VolumeDefinition count on an RD. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…#164) * fix(rest,store): close concurrent late vd-add lost-update race (BUG-048) Two back-to-back number-less `linstor vd c <rd>` calls against an RD that already carries vol-0 used to drop the second add: each request picked the smallest free VolumeNumber in a REST-side List BEFORE the store Create, so both read [vol-0], both chose VlmNr=1, the loser was rejected FAIL_EXISTS_VLM_DFN, and the operator's second intended volume silently vanished (only vol-0 + vol-1 landed). When the loser backgrounds its CLI the failure surfaces no usable error at all. Move the smallest-free allocation INTO the store layer as VolumeDefinitionStore.CreateAutoNumbered, which re-derives the hole on every conflict-retry attempt (k8s: allocate inside RetryOnConflict under the apiserver resourceVersion guard; inmemory: under the write lock). The read of the existing set and the append are now atomic with respect to a racing add, so the loser re-reads [vol-0, vol-1] and lands at vol-2. The explicit-number path keeps the plain Create (its retry already converges and a genuine duplicate must still 409). The DRBD per-volume day0/winner-election seed path (BUG-028/384 seedFreshVolumes/isLateAddWinner) is untouched and already converges concurrent late adds correctly once both volumes exist with distinct numbers — verified on stand by forcing distinct --vlmnr. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * test(harness): concurrent late vd-add catchers for BUG-048 (L6 + L7) L6 cli-matrix multi-volume-late-vd-create.sh: take PR #162's corrected Status-based detection (status_disk_state instead of the null .vlms projection) and make the two late `vd c` calls run CONCURRENTLY (background + wait) so they exercise the auto-assign lost-update window. Add the BUG-048 wire-level assertion: the RD must carry exactly 3 VolumeDefinitions after the two concurrent adds — a dropped add leaves only 2 and surfaces a spurious "volume definition 1 already exists". L7 replay vd-late-create-concurrent-no-lost-update.yaml: codify the post-fix invariant (vol-0 up, two back-to-back late adds → vd_count==3 + all_uptodate). Adds a `vd_count` await kind to the harness lib that asserts the exact VolumeDefinition count on an RD. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * fix(satellite): seed late-added volume race-free, off MetadataCreated not .res (BUG-048) The per-volume create-md + winner-election GI seed for a late-added volume (ensurePerVolumeMetadata) was gated by hasLateAddedVolume(resPath, dr) — comparing the desired volume set against the `volume N {` blocks in the on-disk .res. That .res is re-rendered by the FSM dispatch's renderResFile preamble on every adjust/up, so on a concurrent two-VD add an earlier reconcile's adjust pre-populated the new volume's block BEFORE this gate read the file. The gate then saw "no late volume", SKIPPED the metadata+seed pass entirely, and the new volume came up via the kernel adjust with no metadata and no elected SyncSource — Inconsistent on every replica, permanently wedged (the operator-reported BUG-048 symptom: replication:Established, resync-suspended:no, no SyncSource). It hit ~half of concurrent late two-VD adds; single and sequential adds, where the create-md pass and the render do not interleave, were unaffected. Gate on the race-free dr.GetMetadataCreated() (this RD is past first activation) instead. ensurePerVolumeMetadata is already per-volume idempotent — it probes `drbdadm dump-md <rd>/<vol>` and only create-md's + seeds the volumes that genuinely lack metadata (len(freshlyCreated)==0 → early return), so running it on every post-activation diskful reconcile is a no-op on converged state (cheap per-volume dump-md probe) and never re-stamps an existing volume's GI/bitmap. The winner election and every seed veto (PeerHasData / SkipInitialSync / AnyConnectedPeerHasDataForVolume) are unchanged, so day0 skip-init (BUG-028) and single/sequential late adds (Bug 384) are preserved. The now-unused .res-parsing hasLateAddedVolume helper is removed. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * fix(store,apiserver): read RD live (uncached) in CreateAutoNumbered (BUG-048) The store-level atomic VolumeNumber allocation still dropped the second of two concurrent `vd c` in production: fetchRD reads through the manager's informer-CACHED client, which trails a just-committed write by the watch round-trip. So when the winner's Update lands, the loser's conflict-retry re-reads the STALE cached RD, re-derives the same smallest-free number, its Update 409s again, and retry.DefaultRetry (~5 short steps) exhausts before the cache catches up — the second volume is silently lost (observed on stand: vd_count=2 after two concurrent number-less `vd c`). CreateAutoNumbered now reads the RD through a direct, UNCACHED API reader (mgr.GetAPIReader(), plumbed via store NewWithAPIReader) on every retry attempt, so each retry sees the winner's committed VolumeDefinition and advances to the next free number. The retry budget is widened to 12 steps so a burst of concurrent creates still converges. nil apiReader (in-memory / unit stores with no informer) falls back to the cached client unchanged. Adds a k8s-store concurrency conformance test (N concurrent CreateAutoNumbered against a real envtest apiserver → distinct numbers, none dropped). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * test(harness): widen concurrent late-vd convergence budget to 360s (BUG-048) Two 1G volumes added concurrently on a 3-diskful RD run up to 6 fresh initial syncs at once on the loop substrate, so convergence routinely exceeds the single-volume 240s budget (observed vol-2 still Outdated — converging, not the Inconsistent wedge — at 240s, UpToDate moments later). 360s removes the flake while staying far inside the wedge discriminator. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * fix(satellite,drbd): late-add-promote self-heal for ≥3-replica wedge (BUG-048) The race-free metadata gate eliminated the 2-replica concurrent late-add wedge, but on a ≥3-diskful RD the per-volume winner seed can still race its own bring-up: the late volume occasionally comes up Inconsistent on EVERY diskful replica with no SyncSource and no peer holding data — the operator-reported wedge (replication:Established, resync-suspended:no, no SyncSource). Neither existing self-heal recovers it: maybeRecovery Promote is gated on the dispatcher's auto-primary (suppressed on the initialized RD every late-add lands on) and on the local being already UpToDate (it is Inconsistent here); maybeSoloPromote requires zero peers. Add a dedicated kernel-truth self-heal. Adm.NeedsLateAddPromote fires ONLY when a local diskful volume is Inconsistent, NO connected peer exposes committed data for it (no real SyncSource to wait for), none is actively resyncing, no replica is Primary, and this node holds the lowest node-id among the volume's diskful replicas — so exactly one deterministic node force-primaries. maybeLateAddPromote runs it on the steady-state reconcile WITHOUT the auto-primary gate (that is precisely what is missing on the late-add path), throttled by the shared recoveryPromoteThrottle. Data-safety: a fresh late-added volume is seeded at the deterministic day0 current-UUID on every replica before bring-up, so primary --force mints no unrelated UUID; the Inconsistent peers SyncTarget from this now-Primary source. The peer-has-data veto preserves the Bug 342 unrelated-data guard (a relocate target never force-primaries over a data-bearing peer), and the active-resync + peer-Primary vetoes prevent churn. Self-limiting: the predicate stops holding once the peers converge. Unit-pinned in needs_late_add_promote_test.go (fires on the wedge; vetoes on peer-data / active-resync / peer-Primary / lower-id / all-UpToDate). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * test(satellite): pin maybeLateAddPromote self-heal wiring (BUG-048) Reconciler-level pins that maybeLateAddPromote force-primaries the lowest-id replica on the late-add wedge WITHOUT an auto-primary flag, skips diskless replicas, and vetoes when a peer holds data. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * fix(store): verify volume landed after auto-numbered create (BUG-048) Defence-in-depth against the silent lost-update. The optimistic-lock Update normally guarantees no racing writer clobbered the allocation — a stale resourceVersion 409s and the loop retries. But under heavy apiserver/etcd load a follower read can return a resourceVersion that is already superseded yet still accepted on Update, so two concurrent auto-numbered creates can both return success while only one VolumeDefinition lands (the operator-visible silent drop: both `linstor vd c` exit 0, the RD has one fewer volume than asked). After the Update commits, re-read live and confirm the assigned VolumeNumber is actually present; if it vanished, surface a synthetic Conflict so the whole allocate+append re-runs against the corrected state. The create now either persists the volume or retries — it can never return success having silently lost it. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * test(harness): shorten BUG-048 replay name to fit DRBD resource-name limit The replay-runner prefixes 'replay-' and suffixes a random token; the 40-char workflow name pushed the generated RD name past DRBD's resource-name length cap, so 'resource-definition create' failed with exit 10 before the scenario even started. Rename to vd-late-concurrent-no-drop so the generated name fits. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * fix(drbd): gate late-add-promote against day0 split-brain (BUG-048) Stand validation caught a serious regression in the late-add-promote self-heal: NeedsLateAddPromote could not distinguish a normal day0 bootstrap (every volume transiently Inconsistent, no SyncSource yet — the normal fresh state) from the BUG-048 late-add wedge, so it MISFIRED on a fresh RD's vol-0. Worse, its lowest-node-id election counted only CONNECTED peers; at day0 t+1s peers are still StandAlone/Connecting, so each node saw a partial peer set, several concluded "I am lowest", and two non-lowest nodes force-primaried ~37ms apart → split-brain on the baseline volume (observed: worker-1 + worker-3 both promoted, worker-2 the true lowest did not). Add two split-brain-safety gates: 1. Require a local UpToDate sibling volume — proves the RD is PAST first activation. A pure day0 bootstrap has NO UpToDate volume, so the predicate can never fire there (the normal winner-election + auto-promote own day0). 2. Require EVERY peer connection to be fully Connected — the lowest-node-id election is only sound with COMPLETE peer information. Any StandAlone/Connecting peer defers the self-heal, so every node computes the election over the same full set and EXACTLY ONE (the true lowest id) ever promotes. Unit tests previously only modelled fully-Connected peers (so they passed while live partial-connectivity behaviour broke); add explicit pins for the day0-all-Inconsistent and partial-connectivity cases, and give the veto fixtures a UpToDate sibling so they exercise the intended veto rather than gate 1. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * test(harness): give BUG-048 replay 360s for the self-heal recovery path When the concurrent seed races and a late volume comes up Inconsistent on every replica, the late-add-promote self-heal recovers it, but the detect + recoveryPromoteThrottle + force-primary + resync chain runs past the bare 240s convergence budget on a contended stand. 360s keeps the recovery path from flaking while staying well inside the wedge discriminator. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * fix(controller): optimistic-lock the RD volume-minor patch (BUG-048) The silent VD-drop core. ensureRDVolumeMinors reads the RD, allocates DRBD minors for any volume still missing one, then writes them back via patchRDVolumeMinors — a JSON merge-patch on spec.volumeDefinitions. RFC 7386 merge-patch REPLACES an array wholesale, so the write stamps the ENTIRE volumeDefinitions list from the snapshot the reconciler read at the top of the function. A concurrent number-less `linstor vd c` that appended a new volume between that read and the patch is silently overwritten away — and both `vd c` return success because the clobber is an async reconciler write, not their own create. This is the operator-visible "vdCount short by one, both vd c rc0, zero conflict/retry in the logs" failure. The code claimed "optimistic concurrency" in a comment but never set metadata.resourceVersion, so the patch could never 409 and the caller's IsConflict re-read branch was dead. Embed the snapshot's metadata.resourceVersion in the merge-patch body (the same precondition client.MergeFromWithOptimisticLock builds for typed patches): a racing VD append moves the RD past the snapshot, the apiserver rejects the wholesale replace with Conflict, the caller re-reads, and the next reconcile re-allocates minors against the now-complete list — the appended volume survives. A missing resourceVersion (RD freshly built in a unit path) degrades to the prior blind patch rather than failing. L1: a stale-snapshot patch racing an appended volume must 409 (and not drop it); a fresh-snapshot patch must still commit. The clobber test fails pre-fix (the blind patch returns nil and drops vol-1). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * fix(drbd): defer late-add-promote to a lower-id peer mid-bring-up (BUG-048) The ≥3-replica convergence wedge. The late-add-promote lowest-node-id election (lateAddVolumeNeedsLocalPromote) counted a lower-id peer as a competing promoter ONLY when it observed that peer's volume as Inconsistent. A freshly late-added volume on a lower-id diskful peer passes through Attaching / Negotiating / DUnknown while its kernel slot brings the new volume up — it has NOT yet settled to Inconsistent. The old code treated those transient states as "not a competing diskful promoter; ignore", so a higher-id node observing a lower-id peer mid-bring-up concluded "I am lowest" and force-primaried. Once the true-lowest peer finished bring-up it ALSO promoted: two nodes minted divergent current-UUIDs on the same late volume → it wedged PausedSyncS / StandAlone split-brain with no convergence. Stand-observed on 3-diskful: node-id 1 and node-id 0 both logged the late-add force-primary for the same late volume; the volume stuck Inconsistent on every replica past the budget. Treat a lower-id peer in Attaching / Negotiating / DUnknown (the transient diskful bring-up / connection-not-fully-negotiated states) as the rightful promoter and defer — only the true-lowest diskful node ever promotes, so exactly one SyncSource is seeded. A steady-state Diskless witness is still ignored (it never becomes a diskful promoter; deferring to it would deadlock), so the lowest DISKFUL replica still seeds. This change is downstream of the two day0 split-brain gates (localHasUpToDateVolume + allPeersConnected) added in e766d4d: at day0 no volume is UpToDate so gate 1 blocks the election entirely and this code is never reached — the split-brain leg is untouched. The change is strictly MORE conservative (defers more, never promotes more), so it can only reduce split-brain risk. L1: defer to a lower-id peer in each transient bring-up state (fails pre-fix); still promote over a lower-id Diskless witness; the true-lowest still promotes despite a higher-id peer mid-bring-up. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * fix(satellite,drbd): kick stalled late-add resync to converge all replicas (BUG-048) The residual ≥3-replica concurrent late-vd-add convergence wedge. When two volumes are added back-to-back to an already-initialized 3-diskful RD, the last volume occasionally enters a DRBD resync that STALLS and never feeds every peer — a SyncSource is elected (syncSeen=1) but the volume sits Inconsistent/Outdated on two of three replicas with out-of-sync frozen at the full size. Two stand-observed signatures (both on the last concurrently-added volume of a 3-diskful RD): - dependency deadlock: a peer-device is PausedSyncS with resync-suspended:dependency (DRBD serialised this volume's resync behind a sibling that already finished, but the dependency latch never cleared) while partner peers wait in WFBitMapT/ resync-suspended:peer for a bitmap exchange the paused source never starts. oos frozen at full forever. - partial stall: one peer reaches UpToDate (a real SyncSource) but a second peer stays WFBitMapT/Outdated and never finalises. Neither existing self-heal recovers it: maybeLateAddPromote mints a source only when NONE exists (all-Inconsistent, no peer data) and a force-primary does not clear a resync-suspended:dependency latch; maybeRecoveryPromote requires the local already-UpToDate. Add a dedicated kernel-truth self-heal. Adm.NeedsLateAddResyncKick fires ONLY when no replica is Primary, the RD is past day0 (a local UpToDate volume exists), every peer is Connected, and a peer-device is in a stalled resync (PausedSync*/WFBitMap* with a non-"no" resync-suspended). maybeKickLateAddResync then disconnect+adjusts the resource, re-running the GI handshake so DRBD restarts the resync from the correct direction. Data-safe: disconnect/reconnect never mutates data and DRBD re-derives the sync direction from generation-UUIDs/bitmaps. A 45s stall-dwell gate ensures only a GENUINELY frozen resync is kicked, never the normal transient volume-resync serialization (a few seconds of PausedSyncS/ dependency while a sibling finishes). Throttled by the shared recoveryPromoteThrottle and self-limiting (once converged no stalled peer-device remains). Skipped on diskless replicas. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * test(harness): document the late-add resync-kick self-heal in the BUG-048 replay The vd-late-concurrent-no-drop replay's all_uptodate budget comment now references both BUG-048 convergence self-heals — the late-add-promote (mints a source when none exists) and the new late-add resync-kick (restarts a stalled/paused resync that never feeds all peers) — so the 360s budget rationale stays accurate after the resync-kick fix. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * fix(satellite,drbd): recover stalled late-add resync via per-volume invalidate, not disconnect (BUG-048) The first resync-kick iteration used `disconnect`+`adjust` to re-run the DRBD handshake. On-stand that made the wedge WORSE: every replica of a fresh late-added volume shares the deterministic day0 current-UUID, so the post-reconnect GI handshake concludes "same generation, no resync needed" and ABANDONS the dirty bitmap — the connection went StandAlone / Established with out-of-sync frozen at the full size, never converging. Replace the connection re-handshake with a per-volume local `drbdadm invalidate <res>/<vol>`: it explicitly discards the empty, non-authoritative LOCAL copy of the one stalled volume and forces a full SyncTarget FROM an UpToDate peer — working WITH the bitmap instead of re-deriving sync direction from the ambiguous day0 GI. Adm.LateAddResyncKickVolumes returns the local volumes that are below UpToDate, have a connected UpToDate peer to pull from, and are not already being actively SyncTarget-pulled — only when no replica is Primary, the RD is past day0, and every peer is Connected. maybeKickLateAddResync invalidates each. This covers the partial-stall signature (a peer reached UpToDate but this replica is stuck WFBitMapT/Outdated) and the post-promote dependency case (maybeLateAddPromote mints the source for the all-Inconsistent case, then invalidate re-pulls the stragglers). Data-safe: only a non-authoritative below-UpToDate local copy is ever discarded, and only when an UpToDate peer exists as the source. The 45s stall-dwell + recoveryPromoteThrottle still gate it so the normal transient volume-resync serialization is never disrupted. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * fix(satellite,drbd): prevent late-add invalidate flap (one volume at a time, back off on active resync) (BUG-048) The first invalidate iteration flapped on-stand: it invalidated every stalled volume at once and re-fired while a triggered resync was still running, so out-of-sync dropped to 0, the kick re-invalidated, and oos jumped back to full — the volume never settled UpToDate within the convergence budget. Three refinements make the kick idempotent and non-flapping: - invalidate ONLY a genuinely-stuck Inconsistent local volume; Outdated is now excluded (it is the transient post-resync / mid-handshake state a just-invalidated volume passes through — re-invalidating it re-dirties a converging copy). - back off whenever ANY peer-device on the volume has a resync actively in progress in either direction (peerDeviceResyncInProgress) — so once an invalidate triggers a SyncTarget the kick stops until it finishes; only a genuinely stalled (paused/suspended) handshake re-qualifies. - invalidate ONE volume per kick (lowest-numbered stalled), since DRBD serialises resyncs per connection — invalidating several at once re-creates the dependency-serialization that wedges. Also shortens the stall-dwell 45s→20s: the in-progress back-off is now the primary anti-flap guard, so the dwell only needs a small settling margin, and a genuinely frozen wedge recovers well inside the convergence budget. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * fix(satellite,drbd): mint late-add source via clear-bitmap, not primary --force (BUG-048) The all-Inconsistent late-add wedge (no replica ever won the volume's UpToDate seed) was never actually recovered. maybeLateAddPromote ran `drbdadm primary --force <res>`, but the kernel REJECTS it with "(-16) Disk state is lower than outdated" whenever any volume of the resource is Inconsistent — stand-observed: the promote retried every throttle window and the volume stayed Inconsistent on all 3 replicas past the convergence budget (out-of-sync frozen at full). `primary` is also resource-wide, so one Inconsistent late volume blocks promoting the whole resource even though the sibling volumes are UpToDate. Replace the force-primary with MintLateAddSource, which makes THIS node an UpToDate source per-volume the way DRBD's own skip-initial-sync does: disconnect the resource, run `drbdsetup new-current-uuid --clear-bitmap <minor>` for each LOCAL Inconsistent volume (flips it UpToDate with a clean bitmap), then `drbdadm adjust` to reconnect — the Inconsistent peers then SyncTarget from the freshly-UpToDate local volumes. Verified on-stand: the clear-bitmap flips the volume UpToDate and the peers begin converging, where primary --force was permanently rejected. Data-safe: gated by NeedsLateAddPromote (lowest node-id, NO peer holds committed data, no Primary), so exactly one deterministic node mints the source and the minted copy is byte-equivalent to the equally-empty peers (day0 lineage). Only Inconsistent volumes are cleared; UpToDate siblings are untouched. Adds the device minor to the parsed drbdsetup status so the clear-bitmap can target the specific stuck volume. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * fix(store): make auto-numbered VD create idempotent under retry (BUG-048 de-regress) CreateAutoNumbered's post-commit verification (verifyVolumeLanded) re-reads the RD to confirm the just-written VolumeNumber landed and forces a retry on a synthetic Conflict when it is absent. That guard is only sound when the re-read is LIVE: with no uncached reader wired it falls back to the informer cache, which trails the Update just committed, so it routinely fails to observe its own freshly written volume, surfaces a FALSE conflict, and the retry re-derives the next free number off the lagging cache and appends a SECOND volume. A single auto-numbered create could leave several phantom VolumeDefinitions behind — `vd d 0` then leaves a stray, and a late `vd c` lands 4 VDs instead of 1. Skip the post-commit verification when no live reader is available: a successful optimistic-locked Update is already the apiserver's authoritative confirmation, and there is nothing trustworthy to verify against a stale cache. Production wires mgr.GetAPIReader(), so the live verification path — the actual BUG-048 lost-update guard — is preserved untouched. Also align the integration harness store wiring with cmd/apiserver/main.go (NewWithAPIReader + mgr.GetAPIReader()). The harness used the plain New (no live reader), so CreateAutoNumbered's cache-lag re-read 409-stormed against the RD reconciler's concurrent RD writes; the slow create then made the linstor client re-POST, producing duplicate auto-numbered VDs. Production never hit this because the apiserver always wires the direct reader. The store-layer BUG-048 concurrency guarantee is unchanged: concurrent auto-numbered creates still converge to distinct VolumeNumbers via the optimistic-lock 409 retry (TestK8sVolumeDefinitionConcurrentAutoNumber), and the live post-commit verification still guards the heavy-load lost-update on the production path. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> --------- Signed-off-by: Andrei Kvapil <kvapss@gmail.com> Co-authored-by: Claude <noreply@anthropic.com>

Andrei Kvapil (kvaps) and others added 6 commits June 14, 2026 08:35

gemini-code-assist Bot reviewed Jun 14, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 14, 2026

View reviewed changes

Andrei Kvapil (kvaps) merged commit f351504 into main Jun 14, 2026
37 of 39 checks passed

This was referenced Jun 14, 2026

fix(rest,store,satellite): converge concurrent late vd-adds (BUG-048) #164

Merged

docs(known-issues): document BUG-048 concurrent late vd-create limitation #165

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(harness): release-gate cleanup — cli-matrix/replay/burnin/cli-parity (batch 4) + unmask BUG-048#162

test(harness): release-gate cleanup — cli-matrix/replay/burnin/cli-parity (batch 4) + unmask BUG-048#162
Andrei Kvapil (kvaps) merged 6 commits into
mainfrom
test/harness-batch4-final

Andrei Kvapil (kvaps) commented Jun 14, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 14, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 14, 2026

Uh oh!

gemini-code-assist Bot Jun 14, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		md5=\$(dd if=${DEV} bs=1M count=1 iflag=direct 2>/tmp/burnin-dd-primary.err \| md5sum \| awk '{print \$1}')
		rc=\${PIPESTATUS[0]}

	md5=\$(dd if=${DEV} bs=1M count=1 iflag=direct 2>/tmp/burnin-dd-primary.err \| md5sum \| awk '{print \$1}')
	rc=\${PIPESTATUS[0]}
	md5=\$(set -o pipefail; dd if=\${DEV} bs=1M count=1 iflag=direct 2>/tmp/burnin-dd-primary.err \| md5sum \| awk '{print \$1}')\n rc=\$?

Conversation

Andrei Kvapil (kvaps) commented Jun 14, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Product bug surfaced (BUG-048, tracked separately)

Validation

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Andrei Kvapil (kvaps) commented Jun 14, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 14, 2026 •

edited

Loading