Skip to content

test(harness): re-create cli-matrix/replay/parity release-gate fixes (batch 3)#161

Merged
Andrei Kvapil (kvaps) merged 14 commits into
mainfrom
test/cli-matrix-harness-batch3
Jun 14, 2026
Merged

test(harness): re-create cli-matrix/replay/parity release-gate fixes (batch 3)#161
Andrei Kvapil (kvaps) merged 14 commits into
mainfrom
test/cli-matrix-harness-batch3

Conversation

@kvaps

@kvaps Andrei Kvapil (kvaps) commented Jun 14, 2026

Copy link
Copy Markdown
Member

Harness/docs/stand-only fixes for the remaining release-gate sweep false-negatives. NO product code. All 13 cells validated on stand big 3× (PASS or clean SKIP).

cli-matrix cell fixes

  • vd-d-prune-resource-volumes-no-flap: wrong wire keys .rsc_flags/.vlms.flags/.volumes.
  • snap-single-replica-zfs-restore-u464: single-peer wait_uptodate $3 unbound.
  • snap-create-inactive-replica-excluded: jq --argjson got node names → crash.
  • snap-create-multiple-group-consistency / snap-create-multiple-lifecycle: xxd absent on Talos → marker no-op (replaced).
  • snap-cross-node-consistency: 256MiB write to 256MiB DRBD dev → ENOSPC (net<gross); seed now 255MiB net → clean SKIP on FILE_THIN.
  • ps-cdp-zfs-roundtrip: linstor ps l python KeyError under pipefail in the skip-gate → clean SKIP when no free device.
  • toggle-disk-solo-last-copy: premature Spec.Initialized precondition (latches on write/promote, not bare UpToDate).
  • vd-resize-single-replica-no-restart: asserted DRBD net size ≥ gross VD size (off by metadata reserve).
  • vd-resize-with-diskless-peer: depended on empty diskless-peer device state.
  • sync-final-uptodate-transition: 1GiB sync throttled (~4.3MiB/s PausedSyncT) exceeded the 240s window → reduced test volume to 512MiB (window/assert untouched). Confirmed NOT an observer-projection gap.

Known-delta reclassification (intentional BS behavior, cells now assert the BS contract)

  • row 84: r-c-over-tiebreaker-skip-sync — tiebreaker→diskful promotion runs full SyncTarget (anyDiskfulPeerHasData) to avoid a UUID-mint StandAlone wedge.
  • row 85: r-l-conns-shapes 331.D — bare SyncSource/SyncTarget literal when OutOfSyncKib≤0.

Infra hardening

  • timeout-wrap the hang sources: assert_no_orphans lvs exec (lib.sh, hung ~2h), replay-runner exec, csi-sanity terminating sidecar.
  • cli-parity: normalize per-side pool names + reap leftover RGs (incl. csi StorageClass sc-<uuid> residue) before diffing so environment drift can't masquerade as divergence.

Summary by CodeRabbit

  • Documentation

    • Updated CLI parity documentation to clarify intentional behavioral differences regarding disk promotion sync operations and state column rendering formats.
  • Tests

    • Enhanced test reliability with timeout protections for test execution and improved snapshot testing across multiple scenarios.
    • Updated Kubernetes job configuration for more robust CSI testing.

Andrei Kvapil (kvaps) and others added 13 commits June 14, 2026 02:37
wait_uptodate is a 2-peer helper (rd, primary, peer, [vol]).

- snap-create-inactive-replica-excluded passed a third NODE as the
  positional vol, which the kernel ground-truth path fed to jq via
  --argjson v <node> → jq aborted on a non-numeric value, failing
  setup. Wait UpToDate pairwise from the primary instead (one query
  on the primary covers all three replicas).
- snap-single-replica-zfs-restore-u464 called it on a single-replica
  RD with no peer → $3 unbound under set -u. Poll the lone node's
  observer disk state with wait_disk_state instead.

Harness-only; no product change.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
xxd is not installed in the Talos satellite containers, so the
marker-counter write (printf %016x | xxd -r -p) and read
(head -c 8 | xxd -p) were no-ops: the writer's pipe failed and the
|| break killed the correlated-writer loop on its first iteration,
so the cross-RD point-in-time parity assertions read a stale/zero
counter and falsely passed.

Emit the 8 big-endian counter bytes via the satellite's own bash
printf '\xNN' (bash IS present and honours \xNN, unlike /bin/sh
dash), and read them back with od -An -tx1 | tr -d ' \n' (same
continuous-hex digest xxd -p produced). Affects
snap-create-multiple-group-consistency and
snap-create-multiple-lifecycle.

Harness-only; no product change.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…onsistency

The cell seeded a full 256 MiB pattern (dd bs=1M count=256) onto a
256 MiB DRBD device. DRBD carves its internal metadata out of the
same backing volume, so the net block device presents only ~255.8
MiB — the write overran the last block with ENOSPC and the seed
failed before any snapshot was taken (Bug-351 fixture never
exercised).

Derive the writable size from the device's own net byte count
(floor to whole MiB) and use that SEED_MIB for both the seed write
and the per-node snapshot md5 read so the comparison stays
apples-to-apples. Clean-SKIP if the net size can't be resolved.

Harness-only; no product change.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
The UpToDate-count wait read the resource-list machine-readable rows
via the golinstor/upstream .rsc_flags / .vlms spellings, which do
not exist on the blockstor wire (pkg/api/v1/resource.go: Flags
json:"flags", Volumes json:"volumes"). .rsc_flags//[] excluded
nothing and .vlms[]? selected nothing, so count_uptodate stayed 0
and the wait timed out even on a healthy RD. Switch to .flags /
.volumes (same wire-key class as BUG-044).

Harness-only; no product change.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…KeyError

The python 'linstor physical-storage list' render path raises a
KeyError/AttributeError inside responses.py when a spare device has
no size field (observed for a zero-size zram), exiting non-zero.
Under set -o pipefail the no-free-device probe pipeline aborted the
whole script as a spurious FAIL before the SKIP gate could fire, and
the SKIP path then re-invoked the same throwing plain 'ps l'.

Capture the machine-readable stdout into a variable first (|| true),
parse it in isolation from the CLI exit status, and dump the
machine-readable JSON (never the throwing plain render) on the SKIP
path. No-free-device now SKIPs cleanly instead of FAILing.

Harness-only; no product change.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…olo cell

RD.Spec.Initialized does NOT latch on bare UpToDate. The controller
(ensureSkipInitSyncDecision) flips it true only when a diskful peer is
PROVEN data-bearing — its DRBD CurrentGI observed AND differing from
the deterministic day0 GI (anyProvenDataBearingDiskfulPeer /
isDay0SeededDiskfulVolume). A freshly-created, never-written pair sits
at the day0 GI forever, so the old "UpToDate ⇒ Initialized"
precondition always died.

Write real data on the primary (promote, dd 16 MiB past day0, demote)
to advance the generation and trigger the latch — the same pattern
u145-write-then-add-replica-syncs uses — then poll up to 60s for the
async latch. The solo diskless→diskful toggle path under test is
unchanged.

Harness-only; no product change.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…e in resize cells

The DRBD kernel device is a few MiB smaller than the gross
VolumeDefinition size (internal metadata reserve), and a flag-only
diskless replica has an empty observer device-state row.

- vd-resize-single-replica-no-restart asserted the kernel DRBD device
  reached the full gross 2 GiB (2097152 KiB); it tops out at ~2096156
  KiB net, so the check failed on a perfectly-converged resize. Require
  within an 8 MiB slack of nominal (same slack assert_resize_converged
  uses).
- vd-resize-with-diskless-peer treated the diskless peer's empty
  diskState as a hard FAIL; the observer legitimately omits the
  volumes[] row for a flag-only diskless replica. Gate on the DISKLESS
  flag + reject only explicit wedge states, accepting Diskless /
  UpToDate / empty. The load-bearing assertion (no backing LV on the
  diskless peer) is unchanged.

Harness-only; no product change.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
A kubectl exec into a satellite can block indefinitely when the pod's
kernel/lvm path is wedged — a hung lvs against an lvm2 daemon on a
FILE_THIN node stalled a full cli-matrix sweep for ~2h. Add a bs_exec
helper that runs the exec under a hard wall-clock cap (timeout, 30s
default; falls back to bare exec when timeout is absent) and route the
hang-prone probes through it:

- cli-matrix/lib.sh assert_no_orphans: drbdsetup status / .res test /
  lvs / zfs list per-satellite probes.
- operator-harness/lib.sh await readers: pod_md5_invariant (md5sum),
  drbd_option (drbdsetup show), disk_state (drbdsetup status --json).

A wedged exec now times out and is treated as 'nothing observed on
that pod' (best-effort semantics already in force) instead of stalling
the whole sweep.

Harness-only; no product change.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
linstor-csi is a non-terminating CSI gRPC server. As a plain Job
container it kept the pod Running even after csi-sanity exited, so the
Job never reached complete and the harness wait blocked for the full
~1h timeout. Move it to a Kubernetes native sidecar — an initContainer
with restartPolicy: Always — which the kubelet terminates once the
regular containers exit, so the Job completes on csi-sanity's exit
code. Native sidecars are GA from k8s 1.29 (the stand runs 1.34).

Harness-only; no product change.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
… deltas

Two cli-matrix cells asserted the UPSTREAM contract where blockstor
deliberately diverges; flip them to assert the BS contract and document
the deltas (rows 84, 85).

- r-c-over-tiebreaker-skip-sync: a tiebreaker→diskful promotion runs a
  full SyncTarget, not skip-sync. This is intentional — the
  anyDiskfulPeerHasData gate (pkg/dispatcher) suppresses the
  auto-primary seed when a data-bearing diskful peer exists, so the
  promoted replica SyncTargets the real bytes instead of force-priming
  and minting an unrelated UUID that wedges the pair in StandAlone. The
  cell now asserts SyncTarget→UpToDate convergence (the transit is
  expected) plus the Bug 348 SyncSource shape.
- r-l-conns-shapes sub-test D: a bare SyncSource/SyncTarget literal (no
  (NN%) suffix) is intentional when OutOfSyncKib<=0 — withSyncPercent
  short-circuits rather than emit a misleading (0%)/(100%). The cell now
  accepts a Sync* token bare OR with (NN%), and only flags the real
  regression (a terminal UpToDate carrying a (NN%) suffix).

docs/cli-parity-known-deltas.md gains rows 84 and 85 with the product
rationale and a refresh-history note. Harness + docs only; no product
change.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…ty-refresh

Environment drift could masquerade as a BS↔upstream divergence in the
parity diff:

- Storage-pool names differ purely by env config (BS stand/lvm-thin/
  zfs-thin/zfs-thick vs the oracle's pool), so an otherwise-identical
  sp l / r l / vd l row diffed solely on the pool column. Normalize
  every side's pool token to a canonical <POOL> on diff-only copies
  (raw artefacts stay verbatim) so only real shape divergence survives.
- The csi-sanity Job leaves StorageClass-backed resource-groups named
  sc-<uuid> behind; they made BS's rg l carry rows the oracle never
  had. Reap that csi-sanity-shaped residue on both sides before seeding
  and diffing. Only sc-<uuid> RGs are reaped — the parity seed and any
  operator/default RG are left untouched.

Harness-only; no product change.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
… per-entry envelope

The Phase 5 partial-failure check invoked the CLI with the colon
rd:snap form, which this linstor client rejects with an argparse usage
error (it wants -r RD... SNAP) naming neither RD, so the error-envelope
assertion failed on a CLI-grammar artefact, not the apiserver contract.
Mirror Phase 3: when the CLI form is rejected, drive the same
good+missing batch through POST /v1/actions/snapshot/multi.

The apiserver is best-effort per-entry: it returns 201 with a JSON
array carrying a SUCCESS for the good RD and an ERROR naming the missing
RD (object not found), creating the good-side snap. That per-entry
error envelope is the acceptable shape — recognise it and only FAIL on
SILENT partial success (one snap created with no error reported).

Harness-only; no product change. The substantive Bug 353 assertions
(shared GroupID, atomic SuspendIo phase advance with 0s slip, Ready,
batch-scoped resume-io, cross-RD counter parity) already pass on big.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…an observer gap)

The 3rd-replica initial SyncTarget intermittently overran the 240s
wait. Investigation on big: the observer projection is correct
throughout (it reports Inconsistent + SyncTarget/PausedSyncT and the
out-of-sync trend accurately, then bare UpToDate + Established once
drained), so this is NOT a BUG-045 observer-projection gap. DRBD
throttles the resync to ~4.3 MiB/s under PausedSyncT rate control, so a
full 1 GiB sync took ~255-270s and crossed the 240s edge even on a
quiet stand. The day0 skip-sync does not always fire for the freshly
added replica, so the cell does exercise a real full sync.

Reduce the volume to 512 MiB: still a real, observable
Inconsistent→SyncTarget→bare-UpToDate window (the Bug 329 transition
under test), but the full sync now completes in ~120s, comfortably
inside the unchanged 240s budget. No assertion weakening, no timeout
inflation, no masking — the read already polls the right condition.

Harness-only; no product change.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai

coderabbitai Bot commented Jun 14, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR registers two intentional CLI parity deltas (#84: tiebreaker→diskful full SyncTarget, #85: bare SyncSource/SyncTarget rendering), rewrites the associated e2e assertions to match those contracts, adds a bs_exec bounded-timeout helper across both test harness libs, replaces xxd with od/tr for Talos compatibility in snapshot scripts, fixes a range of DRBD/snapshot/resize e2e correctness issues, enhances the parity-refresh harness with output normalization and residue cleanup, and moves linstor-csi to a native sidecar initContainer in the CSI sanity Job manifest.

Changes

E2E Infrastructure and Parity Delta Test Suite Update

Layer / File(s) Summary
bs_exec bounded timeout helper
tests/e2e/cli-matrix/lib.sh, tests/operator-harness/lib.sh
Introduces bs_exec/BS_EXEC_TIMEOUT in both harness libs; wires assert_no_orphans probes (DRBD slot, .res file, LVM scan) and three in-pod assertion probes (md5sum, drbdsetup show, drbdsetup status) through it so kubectl execs are time-bounded.
Delta #84: tiebreaker→diskful full SyncTarget contract
docs/cli-parity-known-deltas.md, tests/e2e/cli-matrix/r-c-over-tiebreaker-skip-sync.sh
Documents delta #84 in the known-deltas file and rewrites the promotion test to expect a full SyncTarget traversal (240 s budget) instead of failing on skip-sync absence; adds delta-specific diagnostics on timeout.
Delta #85: SyncSource/SyncTarget rendering contract
docs/cli-parity-known-deltas.md, tests/e2e/cli-matrix/r-l-conns-shapes.sh
Documents delta #85 and updates Sub-test D to accept bare or (NN%)-suffixed Sync* states while treating UpToDate(NN%) as a Bug A regression.
xxd→od/tr Talos compatibility
tests/e2e/cli-matrix/snap-create-multiple-group-consistency.sh, tests/e2e/cli-matrix/snap-create-multiple-lifecycle.sh
Replaces xxd -r -p (write) and xxd -p (read) with Bash printf/sed and od -An -tx1 | tr -d pipelines across both snapshot scripts; also rewrites the partial-failure batch path in lifecycle to fall back to the apiserver REST endpoint when CLI rejects argument grammar.
Snapshot e2e correctness fixes
tests/e2e/cli-matrix/snap-create-inactive-replica-excluded.sh, tests/e2e/cli-matrix/snap-cross-node-consistency.sh, tests/e2e/cli-matrix/snap-single-replica-zfs-restore-u464.sh
Splits three-node wait_uptodate into two sequential calls; derives SEED_MIB from resolved DRBD net device size for consistent seed/md5 comparison; switches single-replica readiness to wait_disk_state with unbound-peer arg.
DRBD/volume e2e robustness fixes
tests/e2e/cli-matrix/toggle-disk-solo-last-copy.sh, tests/e2e/cli-matrix/vd-d-prune-resource-volumes-no-flap.sh, tests/e2e/cli-matrix/vd-resize-single-replica-no-restart.sh, tests/e2e/cli-matrix/vd-resize-with-diskless-peer.sh, tests/e2e/cli-matrix/sync-final-uptodate-transition.sh, tests/e2e/cli-matrix/ps-cdp-zfs-roundtrip.sh
Fixes toggle-disk-solo-last-copy Initialized latch (16 MiB write + 60 s poll); corrects vd-d-prune jq field names; adds META_SLACK_KIB threshold for resize assertion; introduces diskless_state_ok for empty-diskState tolerance; reduces sync-final-uptodate-transition volume from 1 GiB to 512 MiB; captures ps_json once with || true to avoid pipefail abort.
CLI parity refresh harness improvements
tests/operator-harness/cli-parity-refresh.sh
Adds normalize_side_output (pool-name token canonicalization) and reap_residual_resource_groups (deletes leftover sc-* RGs); wires normalization into classify() diff path and runs the reaper in seed() before each parity command.
linstor-csi initContainer sidecar
stand/csi-sanity-job.yaml
Moves linstor-csi from the Job's containers list to a non-terminating initContainers entry (restartPolicy: Always), so the Job completion status follows csi-sanity alone.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐇 Hop, hop, the rabbit said,
No more xxd in our stead!
od and tr shall read our bytes,
bs_exec ends the hang-filled nights.
Delta #84 syncs full by design,
And every test now runs just fine! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 61.11% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: test harness and CLI-matrix/replay/parity fixes for release-gate validation. It is specific, concise, and clearly conveys the scope (batch 3) without being vague or misleading.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch test/cli-matrix-harness-batch3

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several robustness improvements, bug fixes, and test updates to the CLI parity and e2e test suites. Key changes include converting linstor-csi to a Kubernetes native sidecar in the sanity job, introducing a bounded bs_exec wrapper to prevent wedged commands from stalling test sweeps, replacing xxd with portable shell commands, and updating tests to align with intentional behavior (such as full-sync on tiebreaker promotion and bare Sync* literals). Feedback on the changes highlights two issues: a masked dd failure in toggle-disk-solo-last-copy.sh that could cause the test to hang, and a portability issue in cli-parity-refresh.sh where sed -i -E with \b word boundaries is incompatible with macOS.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +83 to +89
on_node "$N1" sh -c "
D=${N1_DEV};
drbdsetup primary $RD --force 2>/dev/null || drbdadm primary --force $RD 2>/dev/null || true;
dd if=/dev/urandom of=\$D bs=1M count=16 oflag=direct 2>/dev/null;
sync;
drbdadm secondary $RD 2>/dev/null || true;
" || die "precondition: write to $N1 DRBD device ($N1_DEV) failed"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The on_node command runs a subshell where dd's failure is masked. Because there is no set -e inside the sh -c string, and the final command (drbdadm secondary ... || true) is guaranteed to succeed, the overall on_node command will always return exit code 0 even if dd fails (e.g., due to an unsupported oflag=direct or write error). This masks the write failure and causes the test to hang/timeout for 60 seconds waiting for RD.Spec.Initialized to latch true, making debugging difficult.

We should capture dd's exit status and explicitly exit with it at the end of the subshell to ensure failures are propagated immediately.

Suggested change
on_node "$N1" sh -c "
D=${N1_DEV};
drbdsetup primary $RD --force 2>/dev/null || drbdadm primary --force $RD 2>/dev/null || true;
dd if=/dev/urandom of=\$D bs=1M count=16 oflag=direct 2>/dev/null;
sync;
drbdadm secondary $RD 2>/dev/null || true;
" || die "precondition: write to $N1 DRBD device ($N1_DEV) failed"
on_node "$N1" sh -c "
D=${N1_DEV};
drbdsetup primary $RD --force 2>/dev/null || drbdadm primary --force $RD 2>/dev/null || true;
dd if=/dev/urandom of=\$D bs=1M count=16 oflag=direct 2>/dev/null;
rc=\$?;
sync;
drbdadm secondary $RD 2>/dev/null || true;
exit \$rc;
" || die "precondition: write to $N1 DRBD device ($N1_DEV) failed"

Comment on lines +128 to +133
local p
for p in $pools; do
[[ -n "$p" ]] || continue
# Word-boundary replace so `stand` doesn't clobber `standby` etc.
sed -i -E "s/\\b${p}\\b/<POOL>/g" "$file" 2>/dev/null || true
done

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using sed -i -E with \b word boundaries is not portable to macOS (BSD sed). On macOS, sed -i expects a backup extension argument (otherwise it treats -E as the extension), and BSD sed does not support \b for word boundaries (it uses [[:<:]] and [[:>:]]). This causes the normalization to fail silently on macOS, leading to false-positive test failures due to unnormalized pool names.

Using perl -pi -e is highly portable and behaves identically on both macOS and Linux, supporting both in-place editing and \b word boundaries natively.

Suggested change
local p
for p in $pools; do
[[ -n "$p" ]] || continue
# Word-boundary replace so `stand` doesn't clobber `standby` etc.
sed -i -E "s/\\b${p}\\b/<POOL>/g" "$file" 2>/dev/null || true
done
local p
for p in $pools; do
[[ -n "$p" ]] || continue
# Word-boundary replace so `stand` doesn't clobber `standby` etc.
# perl is used for portable in-place word-boundary regex across macOS and Linux.
perl -pi -e "s/\\b\\Q${p}\\E\\b/<POOL>/g" "$file" 2>/dev/null || true
done

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/e2e/cli-matrix/snap-create-multiple-lifecycle.sh (1)

460-462: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use delta counts for BAD_SNAP validation, not absolute totals.

Line 460 computes pre_count, but Lines 501–520 validate only absolute post_count. If stale BAD_SNAP CRDs already exist, a silent partial success can evade the post_count == 1 failure branch and false-pass.

Proposed fix
 pre_count=$(kubectl get snapshots.blockstor.cozystack.io --no-headers 2>/dev/null \
     | awk -v s="$BAD_SNAP" '$1 ~ s {n++} END {print n+0}')
@@
     post_count=$(kubectl get snapshots.blockstor.cozystack.io --no-headers 2>/dev/null \
         | awk -v s="$BAD_SNAP" '$1 ~ s {n++} END {print n+0}')
+    created_count=$(( post_count - pre_count ))
@@
-    elif (( post_count == 1 )); then
+    elif (( created_count == 1 )); then
         echo "FAIL (Bug 353 partial-fail): create-multiple SILENTLY created 1/2 snaps when one RD was missing (no per-entry error reported)" >&2
@@
-    elif (( post_count == 0 )); then
+    elif (( created_count == 0 )); then
         echo "   → exit 0 but no CRD was created — apiserver fail-fast detected at validation time, acceptable"
     fi
 else
@@
     post_count=$(kubectl get snapshots.blockstor.cozystack.io --no-headers 2>/dev/null \
         | awk -v s="$BAD_SNAP" '$1 ~ s {n++} END {print n+0}')
-    if (( post_count > 0 )); then
+    created_count=$(( post_count - pre_count ))
+    if (( created_count > 0 )); then
         echo "   note: partial-fail rejected with $post_count orphan CRD(s); cleanup will reap them"
         "${LCTL[@]}" snapshot delete "$RD_A" "$BAD_SNAP" 2>/dev/null || true
     fi
 fi

Also applies to: 501-520

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/cli-matrix/snap-create-multiple-lifecycle.sh` around lines 460 -
462, The validation at lines 501-520 checks absolute post_count values against
fixed expectations, but pre_count is computed at line 460 and never used. This
allows false passes if stale BAD_SNAP snapshots already exist. Modify the
validation logic at lines 501-520 to compute and validate the delta (post_count
minus pre_count) instead of checking absolute counts, ensuring exactly one new
snapshot is created rather than just verifying a specific total count exists.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/e2e/cli-matrix/snap-cross-node-consistency.sh`:
- Around line 107-122: The net_bytes assignment using the on_node command can
cause the script to abort under set -euo pipefail if the probe command fails,
preventing the intended skip path. To fix this, add error handling to the
net_bytes assignment (around line 107) to ensure the command failure does not
abort the script. Use either a trailing || operator to suppress the error or
wrap the assignment with proper error handling so that if the probe fails,
net_bytes is set to an empty string, allowing the subsequent conditional checks
to properly evaluate and trigger the skip path when appropriate.

---

Outside diff comments:
In `@tests/e2e/cli-matrix/snap-create-multiple-lifecycle.sh`:
- Around line 460-462: The validation at lines 501-520 checks absolute
post_count values against fixed expectations, but pre_count is computed at line
460 and never used. This allows false passes if stale BAD_SNAP snapshots already
exist. Modify the validation logic at lines 501-520 to compute and validate the
delta (post_count minus pre_count) instead of checking absolute counts, ensuring
exactly one new snapshot is created rather than just verifying a specific total
count exists.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7beafefe-7557-40a5-a18b-307196549f7c

📥 Commits

Reviewing files that changed from the base of the PR and between 2b9b648 and fc55750.

📒 Files selected for processing (18)
  • docs/cli-parity-known-deltas.md
  • stand/csi-sanity-job.yaml
  • tests/e2e/cli-matrix/lib.sh
  • tests/e2e/cli-matrix/ps-cdp-zfs-roundtrip.sh
  • tests/e2e/cli-matrix/r-c-over-tiebreaker-skip-sync.sh
  • tests/e2e/cli-matrix/r-l-conns-shapes.sh
  • tests/e2e/cli-matrix/snap-create-inactive-replica-excluded.sh
  • tests/e2e/cli-matrix/snap-create-multiple-group-consistency.sh
  • tests/e2e/cli-matrix/snap-create-multiple-lifecycle.sh
  • tests/e2e/cli-matrix/snap-cross-node-consistency.sh
  • tests/e2e/cli-matrix/snap-single-replica-zfs-restore-u464.sh
  • tests/e2e/cli-matrix/sync-final-uptodate-transition.sh
  • tests/e2e/cli-matrix/toggle-disk-solo-last-copy.sh
  • tests/e2e/cli-matrix/vd-d-prune-resource-volumes-no-flap.sh
  • tests/e2e/cli-matrix/vd-resize-single-replica-no-restart.sh
  • tests/e2e/cli-matrix/vd-resize-with-diskless-peer.sh
  • tests/operator-harness/cli-parity-refresh.sh
  • tests/operator-harness/lib.sh

Comment on lines +107 to +122
net_bytes=$(on_node "$N1" bash -c "
dev='$dev'
[ -n \"\$dev\" ] || dev=\$(ls -1 /dev/drbd* 2>/dev/null | grep -vE 'by-(res|disk)' | head -1)
blockdev --getsize64 \"\$dev\" 2>/dev/null
" 2>/dev/null | tr -dc '0-9')
SEED_MIB=$SIZE_MIB
if [[ -n "$net_bytes" && "$net_bytes" -gt 0 ]]; then
net_mib=$(( net_bytes / 1024 / 1024 ))
if (( net_mib < SEED_MIB )); then
SEED_MIB=$net_mib
fi
fi
if (( SEED_MIB < 1 )); then
echo "SKIP (Bug 351): could not resolve a usable DRBD net device size on $N1 (shape unavailable)"
exit 0
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden the net-size probe failure path to avoid false FAIL/ENOSPC regressions.

Under set -euo pipefail, Line 107 can abort the script if the probe command fails, and when it returns empty, SEED_MIB remains SIZE_MIB, bypassing the intended skip path and risking the original full-gross write failure.

Proposed fix
-net_bytes=$(on_node "$N1" bash -c "
+net_bytes=$(on_node "$N1" bash -c "
     dev='$dev'
     [ -n \"\$dev\" ] || dev=\$(ls -1 /dev/drbd* 2>/dev/null | grep -vE 'by-(res|disk)' | head -1)
     blockdev --getsize64 \"\$dev\" 2>/dev/null
-" 2>/dev/null | tr -dc '0-9')
-SEED_MIB=$SIZE_MIB
+" 2>/dev/null | tr -dc '0-9' || true)
+SEED_MIB=0
 if [[ -n "$net_bytes" && "$net_bytes" -gt 0 ]]; then
     net_mib=$(( net_bytes / 1024 / 1024 ))
-    if (( net_mib < SEED_MIB )); then
-        SEED_MIB=$net_mib
-    fi
+    SEED_MIB=$net_mib
+    (( SEED_MIB > SIZE_MIB )) && SEED_MIB=$SIZE_MIB
 fi
 if (( SEED_MIB < 1 )); then
     echo "SKIP (Bug 351): could not resolve a usable DRBD net device size on $N1 (shape unavailable)"
     exit 0
 fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
net_bytes=$(on_node "$N1" bash -c "
dev='$dev'
[ -n \"\$dev\" ] || dev=\$(ls -1 /dev/drbd* 2>/dev/null | grep -vE 'by-(res|disk)' | head -1)
blockdev --getsize64 \"\$dev\" 2>/dev/null
" 2>/dev/null | tr -dc '0-9')
SEED_MIB=$SIZE_MIB
if [[ -n "$net_bytes" && "$net_bytes" -gt 0 ]]; then
net_mib=$(( net_bytes / 1024 / 1024 ))
if (( net_mib < SEED_MIB )); then
SEED_MIB=$net_mib
fi
fi
if (( SEED_MIB < 1 )); then
echo "SKIP (Bug 351): could not resolve a usable DRBD net device size on $N1 (shape unavailable)"
exit 0
fi
net_bytes=$(on_node "$N1" bash -c "
dev='$dev'
[ -n \"\$dev\" ] || dev=\$(ls -1 /dev/drbd* 2>/dev/null | grep -vE 'by-(res|disk)' | head -1)
blockdev --getsize64 \"\$dev\" 2>/dev/null
" 2>/dev/null | tr -dc '0-9' || true)
SEED_MIB=0
if [[ -n "$net_bytes" && "$net_bytes" -gt 0 ]]; then
net_mib=$(( net_bytes / 1024 / 1024 ))
SEED_MIB=$net_mib
(( SEED_MIB > SIZE_MIB )) && SEED_MIB=$SIZE_MIB
fi
if (( SEED_MIB < 1 )); then
echo "SKIP (Bug 351): could not resolve a usable DRBD net device size on $N1 (shape unavailable)"
exit 0
fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/cli-matrix/snap-cross-node-consistency.sh` around lines 107 - 122,
The net_bytes assignment using the on_node command can cause the script to abort
under set -euo pipefail if the probe command fails, preventing the intended skip
path. To fix this, add error handling to the net_bytes assignment (around line
107) to ensure the command failure does not abort the script. Use either a
trailing || operator to suppress the error or wrap the assignment with proper
error handling so that if the probe fails, net_bytes is set to an empty string,
allowing the subsequent conditional checks to properly evaluate and trigger the
skip path when appropriate.

Resolve conflict in tests/e2e/cli-matrix/vd-d-prune-resource-volumes-no-flap.sh:
keep the complete wire-key fix (.flags + .volumes[]) from this branch, which
subsumes the partial .flags-only fix that landed on main via #158. Both sides
intended the same correction; this branch's version is the strict superset.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps Andrei Kvapil (kvaps) merged commit 015374d into main Jun 14, 2026
15 checks passed
@kvaps Andrei Kvapil (kvaps) deleted the test/cli-matrix-harness-batch3 branch June 14, 2026 03:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant