test(harness): re-create cli-matrix/replay/parity release-gate fixes (batch 3)#161
Conversation
wait_uptodate is a 2-peer helper (rd, primary, peer, [vol]). - snap-create-inactive-replica-excluded passed a third NODE as the positional vol, which the kernel ground-truth path fed to jq via --argjson v <node> → jq aborted on a non-numeric value, failing setup. Wait UpToDate pairwise from the primary instead (one query on the primary covers all three replicas). - snap-single-replica-zfs-restore-u464 called it on a single-replica RD with no peer → $3 unbound under set -u. Poll the lone node's observer disk state with wait_disk_state instead. Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
xxd is not installed in the Talos satellite containers, so the marker-counter write (printf %016x | xxd -r -p) and read (head -c 8 | xxd -p) were no-ops: the writer's pipe failed and the || break killed the correlated-writer loop on its first iteration, so the cross-RD point-in-time parity assertions read a stale/zero counter and falsely passed. Emit the 8 big-endian counter bytes via the satellite's own bash printf '\xNN' (bash IS present and honours \xNN, unlike /bin/sh dash), and read them back with od -An -tx1 | tr -d ' \n' (same continuous-hex digest xxd -p produced). Affects snap-create-multiple-group-consistency and snap-create-multiple-lifecycle. Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…onsistency The cell seeded a full 256 MiB pattern (dd bs=1M count=256) onto a 256 MiB DRBD device. DRBD carves its internal metadata out of the same backing volume, so the net block device presents only ~255.8 MiB — the write overran the last block with ENOSPC and the seed failed before any snapshot was taken (Bug-351 fixture never exercised). Derive the writable size from the device's own net byte count (floor to whole MiB) and use that SEED_MIB for both the seed write and the per-node snapshot md5 read so the comparison stays apples-to-apples. Clean-SKIP if the net size can't be resolved. Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
The UpToDate-count wait read the resource-list machine-readable rows via the golinstor/upstream .rsc_flags / .vlms spellings, which do not exist on the blockstor wire (pkg/api/v1/resource.go: Flags json:"flags", Volumes json:"volumes"). .rsc_flags//[] excluded nothing and .vlms[]? selected nothing, so count_uptodate stayed 0 and the wait timed out even on a healthy RD. Switch to .flags / .volumes (same wire-key class as BUG-044). Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…KeyError The python 'linstor physical-storage list' render path raises a KeyError/AttributeError inside responses.py when a spare device has no size field (observed for a zero-size zram), exiting non-zero. Under set -o pipefail the no-free-device probe pipeline aborted the whole script as a spurious FAIL before the SKIP gate could fire, and the SKIP path then re-invoked the same throwing plain 'ps l'. Capture the machine-readable stdout into a variable first (|| true), parse it in isolation from the CLI exit status, and dump the machine-readable JSON (never the throwing plain render) on the SKIP path. No-free-device now SKIPs cleanly instead of FAILing. Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…olo cell RD.Spec.Initialized does NOT latch on bare UpToDate. The controller (ensureSkipInitSyncDecision) flips it true only when a diskful peer is PROVEN data-bearing — its DRBD CurrentGI observed AND differing from the deterministic day0 GI (anyProvenDataBearingDiskfulPeer / isDay0SeededDiskfulVolume). A freshly-created, never-written pair sits at the day0 GI forever, so the old "UpToDate ⇒ Initialized" precondition always died. Write real data on the primary (promote, dd 16 MiB past day0, demote) to advance the generation and trigger the latch — the same pattern u145-write-then-add-replica-syncs uses — then poll up to 60s for the async latch. The solo diskless→diskful toggle path under test is unchanged. Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…e in resize cells The DRBD kernel device is a few MiB smaller than the gross VolumeDefinition size (internal metadata reserve), and a flag-only diskless replica has an empty observer device-state row. - vd-resize-single-replica-no-restart asserted the kernel DRBD device reached the full gross 2 GiB (2097152 KiB); it tops out at ~2096156 KiB net, so the check failed on a perfectly-converged resize. Require within an 8 MiB slack of nominal (same slack assert_resize_converged uses). - vd-resize-with-diskless-peer treated the diskless peer's empty diskState as a hard FAIL; the observer legitimately omits the volumes[] row for a flag-only diskless replica. Gate on the DISKLESS flag + reject only explicit wedge states, accepting Diskless / UpToDate / empty. The load-bearing assertion (no backing LV on the diskless peer) is unchanged. Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
A kubectl exec into a satellite can block indefinitely when the pod's kernel/lvm path is wedged — a hung lvs against an lvm2 daemon on a FILE_THIN node stalled a full cli-matrix sweep for ~2h. Add a bs_exec helper that runs the exec under a hard wall-clock cap (timeout, 30s default; falls back to bare exec when timeout is absent) and route the hang-prone probes through it: - cli-matrix/lib.sh assert_no_orphans: drbdsetup status / .res test / lvs / zfs list per-satellite probes. - operator-harness/lib.sh await readers: pod_md5_invariant (md5sum), drbd_option (drbdsetup show), disk_state (drbdsetup status --json). A wedged exec now times out and is treated as 'nothing observed on that pod' (best-effort semantics already in force) instead of stalling the whole sweep. Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
linstor-csi is a non-terminating CSI gRPC server. As a plain Job container it kept the pod Running even after csi-sanity exited, so the Job never reached complete and the harness wait blocked for the full ~1h timeout. Move it to a Kubernetes native sidecar — an initContainer with restartPolicy: Always — which the kubelet terminates once the regular containers exit, so the Job completes on csi-sanity's exit code. Native sidecars are GA from k8s 1.29 (the stand runs 1.34). Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
… deltas Two cli-matrix cells asserted the UPSTREAM contract where blockstor deliberately diverges; flip them to assert the BS contract and document the deltas (rows 84, 85). - r-c-over-tiebreaker-skip-sync: a tiebreaker→diskful promotion runs a full SyncTarget, not skip-sync. This is intentional — the anyDiskfulPeerHasData gate (pkg/dispatcher) suppresses the auto-primary seed when a data-bearing diskful peer exists, so the promoted replica SyncTargets the real bytes instead of force-priming and minting an unrelated UUID that wedges the pair in StandAlone. The cell now asserts SyncTarget→UpToDate convergence (the transit is expected) plus the Bug 348 SyncSource shape. - r-l-conns-shapes sub-test D: a bare SyncSource/SyncTarget literal (no (NN%) suffix) is intentional when OutOfSyncKib<=0 — withSyncPercent short-circuits rather than emit a misleading (0%)/(100%). The cell now accepts a Sync* token bare OR with (NN%), and only flags the real regression (a terminal UpToDate carrying a (NN%) suffix). docs/cli-parity-known-deltas.md gains rows 84 and 85 with the product rationale and a refresh-history note. Harness + docs only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…ty-refresh Environment drift could masquerade as a BS↔upstream divergence in the parity diff: - Storage-pool names differ purely by env config (BS stand/lvm-thin/ zfs-thin/zfs-thick vs the oracle's pool), so an otherwise-identical sp l / r l / vd l row diffed solely on the pool column. Normalize every side's pool token to a canonical <POOL> on diff-only copies (raw artefacts stay verbatim) so only real shape divergence survives. - The csi-sanity Job leaves StorageClass-backed resource-groups named sc-<uuid> behind; they made BS's rg l carry rows the oracle never had. Reap that csi-sanity-shaped residue on both sides before seeding and diffing. Only sc-<uuid> RGs are reaped — the parity seed and any operator/default RG are left untouched. Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
… per-entry envelope The Phase 5 partial-failure check invoked the CLI with the colon rd:snap form, which this linstor client rejects with an argparse usage error (it wants -r RD... SNAP) naming neither RD, so the error-envelope assertion failed on a CLI-grammar artefact, not the apiserver contract. Mirror Phase 3: when the CLI form is rejected, drive the same good+missing batch through POST /v1/actions/snapshot/multi. The apiserver is best-effort per-entry: it returns 201 with a JSON array carrying a SUCCESS for the good RD and an ERROR naming the missing RD (object not found), creating the good-side snap. That per-entry error envelope is the acceptable shape — recognise it and only FAIL on SILENT partial success (one snap created with no error reported). Harness-only; no product change. The substantive Bug 353 assertions (shared GroupID, atomic SuspendIo phase advance with 0s slip, Ready, batch-scoped resume-io, cross-RD counter parity) already pass on big. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…an observer gap) The 3rd-replica initial SyncTarget intermittently overran the 240s wait. Investigation on big: the observer projection is correct throughout (it reports Inconsistent + SyncTarget/PausedSyncT and the out-of-sync trend accurately, then bare UpToDate + Established once drained), so this is NOT a BUG-045 observer-projection gap. DRBD throttles the resync to ~4.3 MiB/s under PausedSyncT rate control, so a full 1 GiB sync took ~255-270s and crossed the 240s edge even on a quiet stand. The day0 skip-sync does not always fire for the freshly added replica, so the cell does exercise a real full sync. Reduce the volume to 512 MiB: still a real, observable Inconsistent→SyncTarget→bare-UpToDate window (the Bug 329 transition under test), but the full sync now completes in ~120s, comfortably inside the unchanged 240s budget. No assertion weakening, no timeout inflation, no masking — the read already polls the right condition. Harness-only; no product change. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
📝 WalkthroughWalkthroughThis PR registers two intentional CLI parity deltas ( ChangesE2E Infrastructure and Parity Delta Test Suite Update
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces several robustness improvements, bug fixes, and test updates to the CLI parity and e2e test suites. Key changes include converting linstor-csi to a Kubernetes native sidecar in the sanity job, introducing a bounded bs_exec wrapper to prevent wedged commands from stalling test sweeps, replacing xxd with portable shell commands, and updating tests to align with intentional behavior (such as full-sync on tiebreaker promotion and bare Sync* literals). Feedback on the changes highlights two issues: a masked dd failure in toggle-disk-solo-last-copy.sh that could cause the test to hang, and a portability issue in cli-parity-refresh.sh where sed -i -E with \b word boundaries is incompatible with macOS.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| on_node "$N1" sh -c " | ||
| D=${N1_DEV}; | ||
| drbdsetup primary $RD --force 2>/dev/null || drbdadm primary --force $RD 2>/dev/null || true; | ||
| dd if=/dev/urandom of=\$D bs=1M count=16 oflag=direct 2>/dev/null; | ||
| sync; | ||
| drbdadm secondary $RD 2>/dev/null || true; | ||
| " || die "precondition: write to $N1 DRBD device ($N1_DEV) failed" |
There was a problem hiding this comment.
The on_node command runs a subshell where dd's failure is masked. Because there is no set -e inside the sh -c string, and the final command (drbdadm secondary ... || true) is guaranteed to succeed, the overall on_node command will always return exit code 0 even if dd fails (e.g., due to an unsupported oflag=direct or write error). This masks the write failure and causes the test to hang/timeout for 60 seconds waiting for RD.Spec.Initialized to latch true, making debugging difficult.
We should capture dd's exit status and explicitly exit with it at the end of the subshell to ensure failures are propagated immediately.
| on_node "$N1" sh -c " | |
| D=${N1_DEV}; | |
| drbdsetup primary $RD --force 2>/dev/null || drbdadm primary --force $RD 2>/dev/null || true; | |
| dd if=/dev/urandom of=\$D bs=1M count=16 oflag=direct 2>/dev/null; | |
| sync; | |
| drbdadm secondary $RD 2>/dev/null || true; | |
| " || die "precondition: write to $N1 DRBD device ($N1_DEV) failed" | |
| on_node "$N1" sh -c " | |
| D=${N1_DEV}; | |
| drbdsetup primary $RD --force 2>/dev/null || drbdadm primary --force $RD 2>/dev/null || true; | |
| dd if=/dev/urandom of=\$D bs=1M count=16 oflag=direct 2>/dev/null; | |
| rc=\$?; | |
| sync; | |
| drbdadm secondary $RD 2>/dev/null || true; | |
| exit \$rc; | |
| " || die "precondition: write to $N1 DRBD device ($N1_DEV) failed" |
| local p | ||
| for p in $pools; do | ||
| [[ -n "$p" ]] || continue | ||
| # Word-boundary replace so `stand` doesn't clobber `standby` etc. | ||
| sed -i -E "s/\\b${p}\\b/<POOL>/g" "$file" 2>/dev/null || true | ||
| done |
There was a problem hiding this comment.
Using sed -i -E with \b word boundaries is not portable to macOS (BSD sed). On macOS, sed -i expects a backup extension argument (otherwise it treats -E as the extension), and BSD sed does not support \b for word boundaries (it uses [[:<:]] and [[:>:]]). This causes the normalization to fail silently on macOS, leading to false-positive test failures due to unnormalized pool names.
Using perl -pi -e is highly portable and behaves identically on both macOS and Linux, supporting both in-place editing and \b word boundaries natively.
| local p | |
| for p in $pools; do | |
| [[ -n "$p" ]] || continue | |
| # Word-boundary replace so `stand` doesn't clobber `standby` etc. | |
| sed -i -E "s/\\b${p}\\b/<POOL>/g" "$file" 2>/dev/null || true | |
| done | |
| local p | |
| for p in $pools; do | |
| [[ -n "$p" ]] || continue | |
| # Word-boundary replace so `stand` doesn't clobber `standby` etc. | |
| # perl is used for portable in-place word-boundary regex across macOS and Linux. | |
| perl -pi -e "s/\\b\\Q${p}\\E\\b/<POOL>/g" "$file" 2>/dev/null || true | |
| done |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/e2e/cli-matrix/snap-create-multiple-lifecycle.sh (1)
460-462:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse delta counts for BAD_SNAP validation, not absolute totals.
Line 460 computes
pre_count, but Lines 501–520 validate only absolutepost_count. If staleBAD_SNAPCRDs already exist, a silent partial success can evade thepost_count == 1failure branch and false-pass.Proposed fix
pre_count=$(kubectl get snapshots.blockstor.cozystack.io --no-headers 2>/dev/null \ | awk -v s="$BAD_SNAP" '$1 ~ s {n++} END {print n+0}') @@ post_count=$(kubectl get snapshots.blockstor.cozystack.io --no-headers 2>/dev/null \ | awk -v s="$BAD_SNAP" '$1 ~ s {n++} END {print n+0}') + created_count=$(( post_count - pre_count )) @@ - elif (( post_count == 1 )); then + elif (( created_count == 1 )); then echo "FAIL (Bug 353 partial-fail): create-multiple SILENTLY created 1/2 snaps when one RD was missing (no per-entry error reported)" >&2 @@ - elif (( post_count == 0 )); then + elif (( created_count == 0 )); then echo " → exit 0 but no CRD was created — apiserver fail-fast detected at validation time, acceptable" fi else @@ post_count=$(kubectl get snapshots.blockstor.cozystack.io --no-headers 2>/dev/null \ | awk -v s="$BAD_SNAP" '$1 ~ s {n++} END {print n+0}') - if (( post_count > 0 )); then + created_count=$(( post_count - pre_count )) + if (( created_count > 0 )); then echo " note: partial-fail rejected with $post_count orphan CRD(s); cleanup will reap them" "${LCTL[@]}" snapshot delete "$RD_A" "$BAD_SNAP" 2>/dev/null || true fi fiAlso applies to: 501-520
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/e2e/cli-matrix/snap-create-multiple-lifecycle.sh` around lines 460 - 462, The validation at lines 501-520 checks absolute post_count values against fixed expectations, but pre_count is computed at line 460 and never used. This allows false passes if stale BAD_SNAP snapshots already exist. Modify the validation logic at lines 501-520 to compute and validate the delta (post_count minus pre_count) instead of checking absolute counts, ensuring exactly one new snapshot is created rather than just verifying a specific total count exists.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/e2e/cli-matrix/snap-cross-node-consistency.sh`:
- Around line 107-122: The net_bytes assignment using the on_node command can
cause the script to abort under set -euo pipefail if the probe command fails,
preventing the intended skip path. To fix this, add error handling to the
net_bytes assignment (around line 107) to ensure the command failure does not
abort the script. Use either a trailing || operator to suppress the error or
wrap the assignment with proper error handling so that if the probe fails,
net_bytes is set to an empty string, allowing the subsequent conditional checks
to properly evaluate and trigger the skip path when appropriate.
---
Outside diff comments:
In `@tests/e2e/cli-matrix/snap-create-multiple-lifecycle.sh`:
- Around line 460-462: The validation at lines 501-520 checks absolute
post_count values against fixed expectations, but pre_count is computed at line
460 and never used. This allows false passes if stale BAD_SNAP snapshots already
exist. Modify the validation logic at lines 501-520 to compute and validate the
delta (post_count minus pre_count) instead of checking absolute counts, ensuring
exactly one new snapshot is created rather than just verifying a specific total
count exists.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7beafefe-7557-40a5-a18b-307196549f7c
📒 Files selected for processing (18)
docs/cli-parity-known-deltas.mdstand/csi-sanity-job.yamltests/e2e/cli-matrix/lib.shtests/e2e/cli-matrix/ps-cdp-zfs-roundtrip.shtests/e2e/cli-matrix/r-c-over-tiebreaker-skip-sync.shtests/e2e/cli-matrix/r-l-conns-shapes.shtests/e2e/cli-matrix/snap-create-inactive-replica-excluded.shtests/e2e/cli-matrix/snap-create-multiple-group-consistency.shtests/e2e/cli-matrix/snap-create-multiple-lifecycle.shtests/e2e/cli-matrix/snap-cross-node-consistency.shtests/e2e/cli-matrix/snap-single-replica-zfs-restore-u464.shtests/e2e/cli-matrix/sync-final-uptodate-transition.shtests/e2e/cli-matrix/toggle-disk-solo-last-copy.shtests/e2e/cli-matrix/vd-d-prune-resource-volumes-no-flap.shtests/e2e/cli-matrix/vd-resize-single-replica-no-restart.shtests/e2e/cli-matrix/vd-resize-with-diskless-peer.shtests/operator-harness/cli-parity-refresh.shtests/operator-harness/lib.sh
| net_bytes=$(on_node "$N1" bash -c " | ||
| dev='$dev' | ||
| [ -n \"\$dev\" ] || dev=\$(ls -1 /dev/drbd* 2>/dev/null | grep -vE 'by-(res|disk)' | head -1) | ||
| blockdev --getsize64 \"\$dev\" 2>/dev/null | ||
| " 2>/dev/null | tr -dc '0-9') | ||
| SEED_MIB=$SIZE_MIB | ||
| if [[ -n "$net_bytes" && "$net_bytes" -gt 0 ]]; then | ||
| net_mib=$(( net_bytes / 1024 / 1024 )) | ||
| if (( net_mib < SEED_MIB )); then | ||
| SEED_MIB=$net_mib | ||
| fi | ||
| fi | ||
| if (( SEED_MIB < 1 )); then | ||
| echo "SKIP (Bug 351): could not resolve a usable DRBD net device size on $N1 (shape unavailable)" | ||
| exit 0 | ||
| fi |
There was a problem hiding this comment.
Harden the net-size probe failure path to avoid false FAIL/ENOSPC regressions.
Under set -euo pipefail, Line 107 can abort the script if the probe command fails, and when it returns empty, SEED_MIB remains SIZE_MIB, bypassing the intended skip path and risking the original full-gross write failure.
Proposed fix
-net_bytes=$(on_node "$N1" bash -c "
+net_bytes=$(on_node "$N1" bash -c "
dev='$dev'
[ -n \"\$dev\" ] || dev=\$(ls -1 /dev/drbd* 2>/dev/null | grep -vE 'by-(res|disk)' | head -1)
blockdev --getsize64 \"\$dev\" 2>/dev/null
-" 2>/dev/null | tr -dc '0-9')
-SEED_MIB=$SIZE_MIB
+" 2>/dev/null | tr -dc '0-9' || true)
+SEED_MIB=0
if [[ -n "$net_bytes" && "$net_bytes" -gt 0 ]]; then
net_mib=$(( net_bytes / 1024 / 1024 ))
- if (( net_mib < SEED_MIB )); then
- SEED_MIB=$net_mib
- fi
+ SEED_MIB=$net_mib
+ (( SEED_MIB > SIZE_MIB )) && SEED_MIB=$SIZE_MIB
fi
if (( SEED_MIB < 1 )); then
echo "SKIP (Bug 351): could not resolve a usable DRBD net device size on $N1 (shape unavailable)"
exit 0
fi📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| net_bytes=$(on_node "$N1" bash -c " | |
| dev='$dev' | |
| [ -n \"\$dev\" ] || dev=\$(ls -1 /dev/drbd* 2>/dev/null | grep -vE 'by-(res|disk)' | head -1) | |
| blockdev --getsize64 \"\$dev\" 2>/dev/null | |
| " 2>/dev/null | tr -dc '0-9') | |
| SEED_MIB=$SIZE_MIB | |
| if [[ -n "$net_bytes" && "$net_bytes" -gt 0 ]]; then | |
| net_mib=$(( net_bytes / 1024 / 1024 )) | |
| if (( net_mib < SEED_MIB )); then | |
| SEED_MIB=$net_mib | |
| fi | |
| fi | |
| if (( SEED_MIB < 1 )); then | |
| echo "SKIP (Bug 351): could not resolve a usable DRBD net device size on $N1 (shape unavailable)" | |
| exit 0 | |
| fi | |
| net_bytes=$(on_node "$N1" bash -c " | |
| dev='$dev' | |
| [ -n \"\$dev\" ] || dev=\$(ls -1 /dev/drbd* 2>/dev/null | grep -vE 'by-(res|disk)' | head -1) | |
| blockdev --getsize64 \"\$dev\" 2>/dev/null | |
| " 2>/dev/null | tr -dc '0-9' || true) | |
| SEED_MIB=0 | |
| if [[ -n "$net_bytes" && "$net_bytes" -gt 0 ]]; then | |
| net_mib=$(( net_bytes / 1024 / 1024 )) | |
| SEED_MIB=$net_mib | |
| (( SEED_MIB > SIZE_MIB )) && SEED_MIB=$SIZE_MIB | |
| fi | |
| if (( SEED_MIB < 1 )); then | |
| echo "SKIP (Bug 351): could not resolve a usable DRBD net device size on $N1 (shape unavailable)" | |
| exit 0 | |
| fi |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/e2e/cli-matrix/snap-cross-node-consistency.sh` around lines 107 - 122,
The net_bytes assignment using the on_node command can cause the script to abort
under set -euo pipefail if the probe command fails, preventing the intended skip
path. To fix this, add error handling to the net_bytes assignment (around line
107) to ensure the command failure does not abort the script. Use either a
trailing || operator to suppress the error or wrap the assignment with proper
error handling so that if the probe fails, net_bytes is set to an empty string,
allowing the subsequent conditional checks to properly evaluate and trigger the
skip path when appropriate.
Resolve conflict in tests/e2e/cli-matrix/vd-d-prune-resource-volumes-no-flap.sh: keep the complete wire-key fix (.flags + .volumes[]) from this branch, which subsumes the partial .flags-only fix that landed on main via #158. Both sides intended the same correction; this branch's version is the strict superset. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Harness/docs/stand-only fixes for the remaining release-gate sweep false-negatives. NO product code. All 13 cells validated on stand big 3× (PASS or clean SKIP).
cli-matrix cell fixes
.rsc_flags/.vlms→.flags/.volumes.wait_uptodate$3 unbound.--argjsongot node names → crash.xxdabsent on Talos → marker no-op (replaced).linstor ps lpython KeyError under pipefail in the skip-gate → clean SKIP when no free device.Spec.Initializedprecondition (latches on write/promote, not bare UpToDate).Known-delta reclassification (intentional BS behavior, cells now assert the BS contract)
SyncSource/SyncTargetliteral when OutOfSyncKib≤0.Infra hardening
assert_no_orphanslvs exec (lib.sh, hung ~2h), replay-runner exec, csi-sanity terminating sidecar.sc-<uuid>residue) before diffing so environment drift can't masquerade as divergence.Summary by CodeRabbit
Documentation
Tests