test(harness): re-create cli-matrix/replay/parity release-gate fixes (batch 3) by kvaps · Pull Request #161 · cozystack/blockstor

Andrei Kvapil (kvaps) · 2026-06-14T01:27:43Z

Harness/docs/stand-only fixes for the remaining release-gate sweep false-negatives. NO product code. All 13 cells validated on stand big 3× (PASS or clean SKIP).

cli-matrix cell fixes

vd-d-prune-resource-volumes-no-flap: wrong wire keys .rsc_flags/.vlms → .flags/.volumes.
snap-single-replica-zfs-restore-u464: single-peer wait_uptodate $3 unbound.
snap-create-inactive-replica-excluded: jq --argjson got node names → crash.
snap-create-multiple-group-consistency / snap-create-multiple-lifecycle: xxd absent on Talos → marker no-op (replaced).
snap-cross-node-consistency: 256MiB write to 256MiB DRBD dev → ENOSPC (net<gross); seed now 255MiB net → clean SKIP on FILE_THIN.
ps-cdp-zfs-roundtrip: linstor ps l python KeyError under pipefail in the skip-gate → clean SKIP when no free device.
toggle-disk-solo-last-copy: premature Spec.Initialized precondition (latches on write/promote, not bare UpToDate).
vd-resize-single-replica-no-restart: asserted DRBD net size ≥ gross VD size (off by metadata reserve).
vd-resize-with-diskless-peer: depended on empty diskless-peer device state.
sync-final-uptodate-transition: 1GiB sync throttled (~4.3MiB/s PausedSyncT) exceeded the 240s window → reduced test volume to 512MiB (window/assert untouched). Confirmed NOT an observer-projection gap.

Known-delta reclassification (intentional BS behavior, cells now assert the BS contract)

row 84: r-c-over-tiebreaker-skip-sync — tiebreaker→diskful promotion runs full SyncTarget (anyDiskfulPeerHasData) to avoid a UUID-mint StandAlone wedge.
row 85: r-l-conns-shapes 331.D — bare SyncSource/SyncTarget literal when OutOfSyncKib≤0.

Infra hardening

timeout-wrap the hang sources: assert_no_orphans lvs exec (lib.sh, hung ~2h), replay-runner exec, csi-sanity terminating sidecar.
cli-parity: normalize per-side pool names + reap leftover RGs (incl. csi StorageClass sc-<uuid> residue) before diffing so environment drift can't masquerade as divergence.

Summary by CodeRabbit

Documentation
- Updated CLI parity documentation to clarify intentional behavioral differences regarding disk promotion sync operations and state column rendering formats.
Tests
- Enhanced test reliability with timeout protections for test execution and improved snapshot testing across multiple scenarios.
- Updated Kubernetes job configuration for more robust CSI testing.

wait_uptodate is a 2-peer helper (rd, primary, peer, [vol]). - snap-create-inactive-replica-excluded passed a third NODE as the positional vol, which the kernel ground-truth path fed to jq via --argjson v <node> → jq aborted on a non-numeric value, failing setup. Wait UpToDate pairwise from the primary instead (one query on the primary covers all three replicas). - snap-single-replica-zfs-restore-u464 called it on a single-replica RD with no peer → $3 unbound under set -u. Poll the lone node's observer disk state with wait_disk_state instead. Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

xxd is not installed in the Talos satellite containers, so the marker-counter write (printf %016x | xxd -r -p) and read (head -c 8 | xxd -p) were no-ops: the writer's pipe failed and the || break killed the correlated-writer loop on its first iteration, so the cross-RD point-in-time parity assertions read a stale/zero counter and falsely passed. Emit the 8 big-endian counter bytes via the satellite's own bash printf '\xNN' (bash IS present and honours \xNN, unlike /bin/sh dash), and read them back with od -An -tx1 | tr -d ' \n' (same continuous-hex digest xxd -p produced). Affects snap-create-multiple-group-consistency and snap-create-multiple-lifecycle. Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…onsistency The cell seeded a full 256 MiB pattern (dd bs=1M count=256) onto a 256 MiB DRBD device. DRBD carves its internal metadata out of the same backing volume, so the net block device presents only ~255.8 MiB — the write overran the last block with ENOSPC and the seed failed before any snapshot was taken (Bug-351 fixture never exercised). Derive the writable size from the device's own net byte count (floor to whole MiB) and use that SEED_MIB for both the seed write and the per-node snapshot md5 read so the comparison stays apples-to-apples. Clean-SKIP if the net size can't be resolved. Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

The UpToDate-count wait read the resource-list machine-readable rows via the golinstor/upstream .rsc_flags / .vlms spellings, which do not exist on the blockstor wire (pkg/api/v1/resource.go: Flags json:"flags", Volumes json:"volumes"). .rsc_flags//[] excluded nothing and .vlms[]? selected nothing, so count_uptodate stayed 0 and the wait timed out even on a healthy RD. Switch to .flags / .volumes (same wire-key class as BUG-044). Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…KeyError The python 'linstor physical-storage list' render path raises a KeyError/AttributeError inside responses.py when a spare device has no size field (observed for a zero-size zram), exiting non-zero. Under set -o pipefail the no-free-device probe pipeline aborted the whole script as a spurious FAIL before the SKIP gate could fire, and the SKIP path then re-invoked the same throwing plain 'ps l'. Capture the machine-readable stdout into a variable first (|| true), parse it in isolation from the CLI exit status, and dump the machine-readable JSON (never the throwing plain render) on the SKIP path. No-free-device now SKIPs cleanly instead of FAILing. Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…olo cell RD.Spec.Initialized does NOT latch on bare UpToDate. The controller (ensureSkipInitSyncDecision) flips it true only when a diskful peer is PROVEN data-bearing — its DRBD CurrentGI observed AND differing from the deterministic day0 GI (anyProvenDataBearingDiskfulPeer / isDay0SeededDiskfulVolume). A freshly-created, never-written pair sits at the day0 GI forever, so the old "UpToDate ⇒ Initialized" precondition always died. Write real data on the primary (promote, dd 16 MiB past day0, demote) to advance the generation and trigger the latch — the same pattern u145-write-then-add-replica-syncs uses — then poll up to 60s for the async latch. The solo diskless→diskful toggle path under test is unchanged. Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…e in resize cells The DRBD kernel device is a few MiB smaller than the gross VolumeDefinition size (internal metadata reserve), and a flag-only diskless replica has an empty observer device-state row. - vd-resize-single-replica-no-restart asserted the kernel DRBD device reached the full gross 2 GiB (2097152 KiB); it tops out at ~2096156 KiB net, so the check failed on a perfectly-converged resize. Require within an 8 MiB slack of nominal (same slack assert_resize_converged uses). - vd-resize-with-diskless-peer treated the diskless peer's empty diskState as a hard FAIL; the observer legitimately omits the volumes[] row for a flag-only diskless replica. Gate on the DISKLESS flag + reject only explicit wedge states, accepting Diskless / UpToDate / empty. The load-bearing assertion (no backing LV on the diskless peer) is unchanged. Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

A kubectl exec into a satellite can block indefinitely when the pod's kernel/lvm path is wedged — a hung lvs against an lvm2 daemon on a FILE_THIN node stalled a full cli-matrix sweep for ~2h. Add a bs_exec helper that runs the exec under a hard wall-clock cap (timeout, 30s default; falls back to bare exec when timeout is absent) and route the hang-prone probes through it: - cli-matrix/lib.sh assert_no_orphans: drbdsetup status / .res test / lvs / zfs list per-satellite probes. - operator-harness/lib.sh await readers: pod_md5_invariant (md5sum), drbd_option (drbdsetup show), disk_state (drbdsetup status --json). A wedged exec now times out and is treated as 'nothing observed on that pod' (best-effort semantics already in force) instead of stalling the whole sweep. Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

linstor-csi is a non-terminating CSI gRPC server. As a plain Job container it kept the pod Running even after csi-sanity exited, so the Job never reached complete and the harness wait blocked for the full ~1h timeout. Move it to a Kubernetes native sidecar — an initContainer with restartPolicy: Always — which the kubelet terminates once the regular containers exit, so the Job completes on csi-sanity's exit code. Native sidecars are GA from k8s 1.29 (the stand runs 1.34). Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

… deltas Two cli-matrix cells asserted the UPSTREAM contract where blockstor deliberately diverges; flip them to assert the BS contract and document the deltas (rows 84, 85). - r-c-over-tiebreaker-skip-sync: a tiebreaker→diskful promotion runs a full SyncTarget, not skip-sync. This is intentional — the anyDiskfulPeerHasData gate (pkg/dispatcher) suppresses the auto-primary seed when a data-bearing diskful peer exists, so the promoted replica SyncTargets the real bytes instead of force-priming and minting an unrelated UUID that wedges the pair in StandAlone. The cell now asserts SyncTarget→UpToDate convergence (the transit is expected) plus the Bug 348 SyncSource shape. - r-l-conns-shapes sub-test D: a bare SyncSource/SyncTarget literal (no (NN%) suffix) is intentional when OutOfSyncKib<=0 — withSyncPercent short-circuits rather than emit a misleading (0%)/(100%). The cell now accepts a Sync* token bare OR with (NN%), and only flags the real regression (a terminal UpToDate carrying a (NN%) suffix). docs/cli-parity-known-deltas.md gains rows 84 and 85 with the product rationale and a refresh-history note. Harness + docs only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…ty-refresh Environment drift could masquerade as a BS↔upstream divergence in the parity diff: - Storage-pool names differ purely by env config (BS stand/lvm-thin/ zfs-thin/zfs-thick vs the oracle's pool), so an otherwise-identical sp l / r l / vd l row diffed solely on the pool column. Normalize every side's pool token to a canonical <POOL> on diff-only copies (raw artefacts stay verbatim) so only real shape divergence survives. - The csi-sanity Job leaves StorageClass-backed resource-groups named sc-<uuid> behind; they made BS's rg l carry rows the oracle never had. Reap that csi-sanity-shaped residue on both sides before seeding and diffing. Only sc-<uuid> RGs are reaped — the parity seed and any operator/default RG are left untouched. Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

… per-entry envelope The Phase 5 partial-failure check invoked the CLI with the colon rd:snap form, which this linstor client rejects with an argparse usage error (it wants -r RD... SNAP) naming neither RD, so the error-envelope assertion failed on a CLI-grammar artefact, not the apiserver contract. Mirror Phase 3: when the CLI form is rejected, drive the same good+missing batch through POST /v1/actions/snapshot/multi. The apiserver is best-effort per-entry: it returns 201 with a JSON array carrying a SUCCESS for the good RD and an ERROR naming the missing RD (object not found), creating the good-side snap. That per-entry error envelope is the acceptable shape — recognise it and only FAIL on SILENT partial success (one snap created with no error reported). Harness-only; no product change. The substantive Bug 353 assertions (shared GroupID, atomic SuspendIo phase advance with 0s slip, Ready, batch-scoped resume-io, cross-RD counter parity) already pass on big. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…an observer gap) The 3rd-replica initial SyncTarget intermittently overran the 240s wait. Investigation on big: the observer projection is correct throughout (it reports Inconsistent + SyncTarget/PausedSyncT and the out-of-sync trend accurately, then bare UpToDate + Established once drained), so this is NOT a BUG-045 observer-projection gap. DRBD throttles the resync to ~4.3 MiB/s under PausedSyncT rate control, so a full 1 GiB sync took ~255-270s and crossed the 240s edge even on a quiet stand. The day0 skip-sync does not always fire for the freshly added replica, so the cell does exercise a real full sync. Reduce the volume to 512 MiB: still a real, observable Inconsistent→SyncTarget→bare-UpToDate window (the Bug 329 transition under test), but the full sync now completes in ~120s, comfortably inside the unchanged 240s budget. No assertion weakening, no timeout inflation, no masking — the read already polls the right condition. Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

coderabbitai · 2026-06-14T01:27:59Z

📝 Walkthrough

Walkthrough

This PR registers two intentional CLI parity deltas (#84: tiebreaker→diskful full SyncTarget, #85: bare SyncSource/SyncTarget rendering), rewrites the associated e2e assertions to match those contracts, adds a bs_exec bounded-timeout helper across both test harness libs, replaces xxd with od/tr for Talos compatibility in snapshot scripts, fixes a range of DRBD/snapshot/resize e2e correctness issues, enhances the parity-refresh harness with output normalization and residue cleanup, and moves linstor-csi to a native sidecar initContainer in the CSI sanity Job manifest.

Changes

E2E Infrastructure and Parity Delta Test Suite Update

Layer / File(s)	Summary
bs_exec bounded timeout helper `tests/e2e/cli-matrix/lib.sh`, `tests/operator-harness/lib.sh`	Introduces `bs_exec`/`BS_EXEC_TIMEOUT` in both harness libs; wires `assert_no_orphans` probes (DRBD slot, `.res` file, LVM scan) and three in-pod assertion probes (`md5sum`, `drbdsetup show`, `drbdsetup status`) through it so kubectl execs are time-bounded.
Delta `#84`: tiebreaker→diskful full SyncTarget contract `docs/cli-parity-known-deltas.md`, `tests/e2e/cli-matrix/r-c-over-tiebreaker-skip-sync.sh`	Documents delta `#84` in the known-deltas file and rewrites the promotion test to expect a full SyncTarget traversal (240 s budget) instead of failing on skip-sync absence; adds delta-specific diagnostics on timeout.
Delta `#85`: SyncSource/SyncTarget rendering contract `docs/cli-parity-known-deltas.md`, `tests/e2e/cli-matrix/r-l-conns-shapes.sh`	Documents delta `#85` and updates Sub-test D to accept bare or `(NN%)`-suffixed `Sync*` states while treating `UpToDate(NN%)` as a Bug A regression.
xxd→od/tr Talos compatibility `tests/e2e/cli-matrix/snap-create-multiple-group-consistency.sh`, `tests/e2e/cli-matrix/snap-create-multiple-lifecycle.sh`	Replaces `xxd -r -p` (write) and `xxd -p` (read) with Bash `printf`/`sed` and `od -An -tx1 \| tr -d` pipelines across both snapshot scripts; also rewrites the partial-failure batch path in lifecycle to fall back to the apiserver REST endpoint when CLI rejects argument grammar.
Snapshot e2e correctness fixes `tests/e2e/cli-matrix/snap-create-inactive-replica-excluded.sh`, `tests/e2e/cli-matrix/snap-cross-node-consistency.sh`, `tests/e2e/cli-matrix/snap-single-replica-zfs-restore-u464.sh`	Splits three-node `wait_uptodate` into two sequential calls; derives `SEED_MIB` from resolved DRBD net device size for consistent seed/md5 comparison; switches single-replica readiness to `wait_disk_state` with unbound-peer arg.
DRBD/volume e2e robustness fixes `tests/e2e/cli-matrix/toggle-disk-solo-last-copy.sh`, `tests/e2e/cli-matrix/vd-d-prune-resource-volumes-no-flap.sh`, `tests/e2e/cli-matrix/vd-resize-single-replica-no-restart.sh`, `tests/e2e/cli-matrix/vd-resize-with-diskless-peer.sh`, `tests/e2e/cli-matrix/sync-final-uptodate-transition.sh`, `tests/e2e/cli-matrix/ps-cdp-zfs-roundtrip.sh`	Fixes `toggle-disk-solo-last-copy` Initialized latch (16 MiB write + 60 s poll); corrects `vd-d-prune` jq field names; adds `META_SLACK_KIB` threshold for resize assertion; introduces `diskless_state_ok` for empty-diskState tolerance; reduces `sync-final-uptodate-transition` volume from 1 GiB to 512 MiB; captures `ps_json` once with `\|\| true` to avoid pipefail abort.
CLI parity refresh harness improvements `tests/operator-harness/cli-parity-refresh.sh`	Adds `normalize_side_output` (pool-name token canonicalization) and `reap_residual_resource_groups` (deletes leftover `sc-*` RGs); wires normalization into `classify()` diff path and runs the reaper in `seed()` before each parity command.
linstor-csi initContainer sidecar `stand/csi-sanity-job.yaml`	Moves `linstor-csi` from the Job's `containers` list to a non-terminating `initContainers` entry (`restartPolicy: Always`), so the Job completion status follows `csi-sanity` alone.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

cozystack/blockstor#83: Introduces the tiebreaker/quorum controller logic whose SyncTarget and SyncSource rendering behavior is directly asserted by the #84/fix(placer): exclude INACTIVE replicas from place_count so autoplace refills active redundancy (Bug 393) #85 delta test updates in this PR.
cozystack/blockstor#120: Modifies the solo-promote diskful replica flow that toggle-disk-solo-last-copy.sh exercises; this PR adds the RD.Spec.Initialized latch polling that depends on that controller behavior.
cozystack/blockstor#92: Introduced the vd-d-prune-resource-volumes-no-flap.sh no-flap scenario; this PR fixes the jq field names (.flags/.volumes) needed for the test to correctly count UpToDate replicas in that scenario.

Poem

🐇 Hop, hop, the rabbit said,
No more xxd in our stead!
od and tr shall read our bytes,
bs_exec ends the hang-filled nights.
Delta #84 syncs full by design,
And every test now runs just fine! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 61.11% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main changes: test harness and CLI-matrix/replay/parity fixes for release-gate validation. It is specific, concise, and clearly conveys the scope (batch 3) without being vague or misleading.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch test/cli-matrix-harness-batch3

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces several robustness improvements, bug fixes, and test updates to the CLI parity and e2e test suites. Key changes include converting linstor-csi to a Kubernetes native sidecar in the sanity job, introducing a bounded bs_exec wrapper to prevent wedged commands from stalling test sweeps, replacing xxd with portable shell commands, and updating tests to align with intentional behavior (such as full-sync on tiebreaker promotion and bare Sync* literals). Feedback on the changes highlights two issues: a masked dd failure in toggle-disk-solo-last-copy.sh that could cause the test to hang, and a portability issue in cli-parity-refresh.sh where sed -i -E with \b word boundaries is incompatible with macOS.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-14T01:29:43Z

+on_node "$N1" sh -c "
+    D=${N1_DEV};
+    drbdsetup primary $RD --force 2>/dev/null || drbdadm primary --force $RD 2>/dev/null || true;
+    dd if=/dev/urandom of=\$D bs=1M count=16 oflag=direct 2>/dev/null;
+    sync;
+    drbdadm secondary $RD 2>/dev/null || true;
+" || die "precondition: write to $N1 DRBD device ($N1_DEV) failed"


The on_node command runs a subshell where dd's failure is masked. Because there is no set -e inside the sh -c string, and the final command (drbdadm secondary ... || true) is guaranteed to succeed, the overall on_node command will always return exit code 0 even if dd fails (e.g., due to an unsupported oflag=direct or write error). This masks the write failure and causes the test to hang/timeout for 60 seconds waiting for RD.Spec.Initialized to latch true, making debugging difficult.

We should capture dd's exit status and explicitly exit with it at the end of the subshell to ensure failures are propagated immediately.

Suggested change

on_node "$N1" sh -c "

D=${N1_DEV};

drbdsetup primary $RD --force 2>/dev/null || drbdadm primary --force $RD 2>/dev/null || true;

dd if=/dev/urandom of=\$D bs=1M count=16 oflag=direct 2>/dev/null;

sync;

drbdadm secondary $RD 2>/dev/null || true;

" || die "precondition: write to $N1 DRBD device ($N1_DEV) failed"

on_node "$N1" sh -c "

D=${N1_DEV};

drbdsetup primary $RD --force 2>/dev/null || drbdadm primary --force $RD 2>/dev/null || true;

dd if=/dev/urandom of=\$D bs=1M count=16 oflag=direct 2>/dev/null;

rc=\$?;

sync;

drbdadm secondary $RD 2>/dev/null || true;

exit \$rc;

" || die "precondition: write to $N1 DRBD device ($N1_DEV) failed"

gemini-code-assist · 2026-06-14T01:29:43Z

+    local p
+    for p in $pools; do
+        [[ -n "$p" ]] || continue
+        # Word-boundary replace so `stand` doesn't clobber `standby` etc.
+        sed -i -E "s/\\b${p}\\b/<POOL>/g" "$file" 2>/dev/null || true
+    done


Using sed -i -E with \b word boundaries is not portable to macOS (BSD sed). On macOS, sed -i expects a backup extension argument (otherwise it treats -E as the extension), and BSD sed does not support \b for word boundaries (it uses [[:<:]] and [[:>:]]). This causes the normalization to fail silently on macOS, leading to false-positive test failures due to unnormalized pool names.

Using perl -pi -e is highly portable and behaves identically on both macOS and Linux, supporting both in-place editing and \b word boundaries natively.

Suggested change

local p

for p in $pools; do

[[ -n "$p" ]] || continue

# Word-boundary replace so `stand` doesn't clobber `standby` etc.

sed -i -E "s/\\b${p}\\b/<POOL>/g" "$file" 2>/dev/null || true

done

local p

for p in $pools; do

[[ -n "$p" ]] || continue

# Word-boundary replace so `stand` doesn't clobber `standby` etc.

# perl is used for portable in-place word-boundary regex across macOS and Linux.

perl -pi -e "s/\\b\\Q${p}\\E\\b/<POOL>/g" "$file" 2>/dev/null || true

done

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/e2e/cli-matrix/snap-create-multiple-lifecycle.sh (1)

460-462: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use delta counts for BAD_SNAP validation, not absolute totals.

Line 460 computes pre_count, but Lines 501–520 validate only absolute post_count. If stale BAD_SNAP CRDs already exist, a silent partial success can evade the post_count == 1 failure branch and false-pass.

Proposed fix

 pre_count=$(kubectl get snapshots.blockstor.cozystack.io --no-headers 2>/dev/null \
     | awk -v s="$BAD_SNAP" '$1 ~ s {n++} END {print n+0}')
@@
     post_count=$(kubectl get snapshots.blockstor.cozystack.io --no-headers 2>/dev/null \
         | awk -v s="$BAD_SNAP" '$1 ~ s {n++} END {print n+0}')
+    created_count=$(( post_count - pre_count ))
@@
-    elif (( post_count == 1 )); then
+    elif (( created_count == 1 )); then
         echo "FAIL (Bug 353 partial-fail): create-multiple SILENTLY created 1/2 snaps when one RD was missing (no per-entry error reported)" >&2
@@
-    elif (( post_count == 0 )); then
+    elif (( created_count == 0 )); then
         echo "   → exit 0 but no CRD was created — apiserver fail-fast detected at validation time, acceptable"
     fi
 else
@@
     post_count=$(kubectl get snapshots.blockstor.cozystack.io --no-headers 2>/dev/null \
         | awk -v s="$BAD_SNAP" '$1 ~ s {n++} END {print n+0}')
-    if (( post_count > 0 )); then
+    created_count=$(( post_count - pre_count ))
+    if (( created_count > 0 )); then
         echo "   note: partial-fail rejected with $post_count orphan CRD(s); cleanup will reap them"
         "${LCTL[@]}" snapshot delete "$RD_A" "$BAD_SNAP" 2>/dev/null || true
     fi
 fi

Also applies to: 501-520

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/cli-matrix/snap-create-multiple-lifecycle.sh` around lines 460 -
462, The validation at lines 501-520 checks absolute post_count values against
fixed expectations, but pre_count is computed at line 460 and never used. This
allows false passes if stale BAD_SNAP snapshots already exist. Modify the
validation logic at lines 501-520 to compute and validate the delta (post_count
minus pre_count) instead of checking absolute counts, ensuring exactly one new
snapshot is created rather than just verifying a specific total count exists.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/e2e/cli-matrix/snap-cross-node-consistency.sh`:
- Around line 107-122: The net_bytes assignment using the on_node command can
cause the script to abort under set -euo pipefail if the probe command fails,
preventing the intended skip path. To fix this, add error handling to the
net_bytes assignment (around line 107) to ensure the command failure does not
abort the script. Use either a trailing || operator to suppress the error or
wrap the assignment with proper error handling so that if the probe fails,
net_bytes is set to an empty string, allowing the subsequent conditional checks
to properly evaluate and trigger the skip path when appropriate.

---

Outside diff comments:
In `@tests/e2e/cli-matrix/snap-create-multiple-lifecycle.sh`:
- Around line 460-462: The validation at lines 501-520 checks absolute
post_count values against fixed expectations, but pre_count is computed at line
460 and never used. This allows false passes if stale BAD_SNAP snapshots already
exist. Modify the validation logic at lines 501-520 to compute and validate the
delta (post_count minus pre_count) instead of checking absolute counts, ensuring
exactly one new snapshot is created rather than just verifying a specific total
count exists.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7beafefe-7557-40a5-a18b-307196549f7c

📥 Commits

Reviewing files that changed from the base of the PR and between 2b9b648 and fc55750.

📒 Files selected for processing (18)

docs/cli-parity-known-deltas.md
stand/csi-sanity-job.yaml
tests/e2e/cli-matrix/lib.sh
tests/e2e/cli-matrix/ps-cdp-zfs-roundtrip.sh
tests/e2e/cli-matrix/r-c-over-tiebreaker-skip-sync.sh
tests/e2e/cli-matrix/r-l-conns-shapes.sh
tests/e2e/cli-matrix/snap-create-inactive-replica-excluded.sh
tests/e2e/cli-matrix/snap-create-multiple-group-consistency.sh
tests/e2e/cli-matrix/snap-create-multiple-lifecycle.sh
tests/e2e/cli-matrix/snap-cross-node-consistency.sh
tests/e2e/cli-matrix/snap-single-replica-zfs-restore-u464.sh
tests/e2e/cli-matrix/sync-final-uptodate-transition.sh
tests/e2e/cli-matrix/toggle-disk-solo-last-copy.sh
tests/e2e/cli-matrix/vd-d-prune-resource-volumes-no-flap.sh
tests/e2e/cli-matrix/vd-resize-single-replica-no-restart.sh
tests/e2e/cli-matrix/vd-resize-with-diskless-peer.sh
tests/operator-harness/cli-parity-refresh.sh
tests/operator-harness/lib.sh

coderabbitai · 2026-06-14T01:35:52Z

+net_bytes=$(on_node "$N1" bash -c "
+    dev='$dev'
+    [ -n \"\$dev\" ] || dev=\$(ls -1 /dev/drbd* 2>/dev/null | grep -vE 'by-(res|disk)' | head -1)
+    blockdev --getsize64 \"\$dev\" 2>/dev/null
+" 2>/dev/null | tr -dc '0-9')
+SEED_MIB=$SIZE_MIB
+if [[ -n "$net_bytes" && "$net_bytes" -gt 0 ]]; then
+    net_mib=$(( net_bytes / 1024 / 1024 ))
+    if (( net_mib < SEED_MIB )); then
+        SEED_MIB=$net_mib
+    fi
+fi
+if (( SEED_MIB < 1 )); then
+    echo "SKIP (Bug 351): could not resolve a usable DRBD net device size on $N1 (shape unavailable)"
+    exit 0
+fi


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden the net-size probe failure path to avoid false FAIL/ENOSPC regressions.

Under set -euo pipefail, Line 107 can abort the script if the probe command fails, and when it returns empty, SEED_MIB remains SIZE_MIB, bypassing the intended skip path and risking the original full-gross write failure.

Proposed fix

-net_bytes=$(on_node "$N1" bash -c " +net_bytes=$(on_node "$N1" bash -c " dev='$dev' [ -n \"\$dev\" ] || dev=\$(ls -1 /dev/drbd* 2>/dev/null | grep -vE 'by-(res|disk)' | head -1) blockdev --getsize64 \"\$dev\" 2>/dev/null -" 2>/dev/null | tr -dc '0-9') -SEED_MIB=$SIZE_MIB +" 2>/dev/null | tr -dc '0-9' || true) +SEED_MIB=0 if [[ -n "$net_bytes" && "$net_bytes" -gt 0 ]]; then net_mib=$(( net_bytes / 1024 / 1024 )) - if (( net_mib < SEED_MIB )); then - SEED_MIB=$net_mib - fi + SEED_MIB=$net_mib + (( SEED_MIB > SIZE_MIB )) && SEED_MIB=$SIZE_MIB fi if (( SEED_MIB < 1 )); then echo "SKIP (Bug 351): could not resolve a usable DRBD net device size on $N1 (shape unavailable)" exit 0 fi

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

net_bytes=$(on_node "$N1" bash -c "

dev='$dev'

[ -n \"\$dev\" ] || dev=\$(ls -1 /dev/drbd* 2>/dev/null | grep -vE 'by-(res|disk)' | head -1)

blockdev --getsize64 \"\$dev\" 2>/dev/null

" 2>/dev/null | tr -dc '0-9')

SEED_MIB=$SIZE_MIB

if [[ -n "$net_bytes" && "$net_bytes" -gt 0 ]]; then

net_mib=$(( net_bytes / 1024 / 1024 ))

if (( net_mib < SEED_MIB )); then

SEED_MIB=$net_mib

fi

fi

if (( SEED_MIB < 1 )); then

echo "SKIP (Bug 351): could not resolve a usable DRBD net device size on $N1 (shape unavailable)"

exit 0

fi

net_bytes=$(on_node "$N1" bash -c "

dev='$dev'

[ -n \"\$dev\" ] || dev=\$(ls -1 /dev/drbd* 2>/dev/null | grep -vE 'by-(res|disk)' | head -1)

blockdev --getsize64 \"\$dev\" 2>/dev/null

" 2>/dev/null | tr -dc '0-9' || true)

SEED_MIB=0

if [[ -n "$net_bytes" && "$net_bytes" -gt 0 ]]; then

net_mib=$(( net_bytes / 1024 / 1024 ))

SEED_MIB=$net_mib

(( SEED_MIB > SIZE_MIB )) && SEED_MIB=$SIZE_MIB

fi

if (( SEED_MIB < 1 )); then

echo "SKIP (Bug 351): could not resolve a usable DRBD net device size on $N1 (shape unavailable)"

exit 0

fi

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/e2e/cli-matrix/snap-cross-node-consistency.sh` around lines 107 - 122, The net_bytes assignment using the on_node command can cause the script to abort under set -euo pipefail if the probe command fails, preventing the intended skip path. To fix this, add error handling to the net_bytes assignment (around line 107) to ensure the command failure does not abort the script. Use either a trailing || operator to suppress the error or wrap the assignment with proper error handling so that if the probe fails, net_bytes is set to an empty string, allowing the subsequent conditional checks to properly evaluate and trigger the skip path when appropriate.

Resolve conflict in tests/e2e/cli-matrix/vd-d-prune-resource-volumes-no-flap.sh: keep the complete wire-key fix (.flags + .volumes[]) from this branch, which subsumes the partial .flags-only fix that landed on main via #158. Both sides intended the same correction; this branch's version is the strict superset. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

Andrei Kvapil (kvaps) and others added 13 commits June 14, 2026 02:37

gemini-code-assist Bot reviewed Jun 14, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 14, 2026

View reviewed changes

Andrei Kvapil (kvaps) merged commit 015374d into main Jun 14, 2026
15 checks passed

Andrei Kvapil (kvaps) deleted the test/cli-matrix-harness-batch3 branch June 14, 2026 03:39

coderabbitai Bot mentioned this pull request Jun 14, 2026

test(harness): release-gate cleanup — cli-matrix/replay/burnin/cli-parity (batch 4) + unmask BUG-048 #162

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(harness): re-create cli-matrix/replay/parity release-gate fixes (batch 3)#161

test(harness): re-create cli-matrix/replay/parity release-gate fixes (batch 3)#161
Andrei Kvapil (kvaps) merged 14 commits into
mainfrom
test/cli-matrix-harness-batch3

Andrei Kvapil (kvaps) commented Jun 14, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 14, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 14, 2026

Uh oh!

gemini-code-assist Bot Jun 14, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Andrei Kvapil (kvaps) commented Jun 14, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

cli-matrix cell fixes

Known-delta reclassification (intentional BS behavior, cells now assert the BS contract)

Infra hardening

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Andrei Kvapil (kvaps) commented Jun 14, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 14, 2026 •

edited

Loading