cozystack · Andrei Kvapil (kvaps) · Jun 13, 2026 · Jun 13, 2026 · Jun 13, 2026 · Jun 13, 2026
diff --git a/docs/cli-parity-known-deltas.md b/docs/cli-parity-known-deltas.md
@@ -50,6 +50,7 @@ Row IDs match the command-catalogue indexes used by `cli-parity-refresh.sh` (see
 | 81 | `rd l` / `rd s` (Layers column / `layer_data` on unplaced RDs) | WIRE_SHAPE | permanent | Bug 349: the python CLI's `rd l` Layers column `,`-joins `layer_data[].type`, but blockstor's k8s store persists only `Spec.LayerStack` — so the read path re-synthesises a flat `layer_data: [{"type":"DRBD"},{"type":"STORAGE"}]` on EVERY RD GET / list response (`stampRDLayerDataFromStack`, pkg/rest/resource_definitions.go), falling back to the upstream default stack `DRBD,STORAGE` when the stored stack is empty, so the column never renders blank. Upstream LINSTOR 1.33.2 emits `layer_data` on RD responses only once DRBD layer data actually exists (resources placed → port/secret allocated), and then with a volatile `data` payload (`peer_slots`, `port`, `secret`, …) blockstor never synthesises; a bare `rd c` and an RD with only a volume-definition both omit the field entirely (verified live against the dev-stand oracle 2026-06-12). Operator-visible effect: BS renders the Layers column for unplaced RDs where upstream renders an empty cell — BLOCKSTOR_SUPERSET; once resources exist both sides list the same `type` sequence (the CLI renders only the types, never upstream's `data` payload). linstor-csi / piraeus-operator do not gate on RD `layer_data`. The L3 oracle-replay harness tolerates exactly this field on RD-shaped objects (scoped per-field drop in tests/contract/normalize.go keyed on `resource_group_name`; resources' `layer_object` / volumes' `layer_data_list` still diff). Pinned by `tests/contract/normalize_test.go::TestNormalizeRDLayerDataDropped` (L1) + the 4 RD traces in `tests/contract/testdata/oracle/` (L3). |
 
 | 82 | `rd clone` data plane (`use_zfs_clone` vs `zfs send\|recv`) | BEHAVIOR | permanent | Bug-020. Upstream LINSTOR clones a VD-bearing RD by internal snapshot + either `zfs clone` (when the request carries `use_zfs_clone=true`, golinstor v0.58+/linstor-csi) or `zfs send \| zfs recv` (default, fully independent copy). blockstor's clone routes through the snapshot-restore machinery: internal snapshot `clone-<target>` + `BlockstorRestoreFromSnapshot` marker, whose ZFS provider materialises the target with `zfs clone` (cross-node placements use the existing send/recv restore path). Consequences accepted: (a) `use_zfs_clone=true` — the linstor-csi case — gets exactly the requested semantics; (b) `use_zfs_clone=false`/absent ALSO lands on the snapshot-clone path instead of an independent full copy, so same-node clone targets stay dependent on the origin snapshot (the snapshot is visible in `linstor s l` and must outlive the clone); (c) sources on non-snapshot-capable (thick) pools refuse the clone with an actionable envelope where upstream would full-copy. Pinned by `pkg/rest/clone_use_zfs_clone_bug020_test.go` (L1) + `tests/integration` Group J `CSICreateVolumeFromClone` (Tier 2). |
+| 83 | `rg spawn-resources` on an over-committed RG (place_count > available nodes) | BEHAVIOR | permanent | Corner D1. Upstream LINSTOR fails `spawn-resources` SHORT when the RG's place_count exceeds the placeable node count: it returns `FAIL_NOT_ENOUGH_NODES` (ret_code 996, "Not enough available nodes") and places NOTHING. blockstor instead takes a DEFERRED, best-effort autoplace path: it spawns the RD + VDs, places as many diskful replicas as the topology allows (e.g. 3 of 7 on a 3-node cluster), and surfaces the shortfall as an INFO in the SUCCESS envelope (`resource definition spawned, autoplace deferred: <rd>: not enough candidate storage pools: placed N of M`, exit 0). The `RGRebalanceReconciler` (`internal/controller/rg_rebalance_controller.go`) then tops the replica count back up additively once more nodes appear. The over-commit `rg create` itself is ACCEPTED by both controllers (parity — never an early create-time error). Rationale: the deferred path lets the CSI external-provisioner retry loop converge on a created RD instead of hard-failing on a cluster that is one node away from satisfying the request, and is strictly additive (scale-down stays an explicit `r d`). Switching spawn back to the upstream 996 short-fail would be a deliberate API change, not a silent regression. Pinned by `pkg/rest/spawn_test.go::TestSpawnImpossiblePlacementReturnsActionableError` + `TestSpawnPartialFlagAllowsShortPlacement` (L1) and `tests/e2e/cli-matrix/rg-c-overcommit-spawn-defers.sh` (L6). |
 
 ## Open (block merge until addressed)
 

diff --git a/tests/e2e/cli-matrix/rg-c-overcommit-spawn-defers.sh b/tests/e2e/cli-matrix/rg-c-overcommit-spawn-defers.sh
@@ -0,0 +1,160 @@
+#!/usr/bin/env bash
+#
+# usage: rg-c-overcommit-spawn-defers.sh WORK_DIR
+#
+# L6 cli-matrix cell — corner-case campaign D1
+# (UG9 linstor-administration.adoc ~916-931).
+#
+# Over-commit contract on blockstor (DELIBERATE divergence from upstream
+# LINSTOR — see docs/cli-parity-known-deltas.md row 83):
+#
+#   - `resource-group create --place-count 7` on a 3-node cluster is
+#     ACCEPTED (parity with upstream — the shortfall is never an early
+#     create-time error; size the RG for a cluster that will grow).
+#
+#   - `resource-group spawn-resources` on that over-committed RG does
+#     NOT fail short the way upstream LINSTOR does (upstream returns
+#     FAIL_NOT_ENOUGH_NODES / ret_code 996 and places nothing). BS
+#     instead takes a DEFERRED, best-effort autoplace path: it spawns
+#     the RD + VDs, places as many diskful replicas as the topology
+#     allows (3 of 7 here), and surfaces the shortfall as an INFO in the
+#     success envelope ("autoplace deferred: <rd>: not enough candidate
+#     storage pools: placed 3 of 7"), exit 0. The RGRebalanceReconciler
+#     then tops the replica count back up additively once more nodes
+#     appear (internal/controller/rg_rebalance_controller.go).
+#
+# Reproduction (3-node cluster, --place-count 7):
+#
+#   $ linstor resource-group create rg7 --place-count 7 -s stand
+#   SUCCESS                                       <-- accepted, no early fail
+#   $ linstor resource-group spawn-resources rg7 rg7r 32M
+#   SUCCESS                                       <-- exit 0, deferred
+#     resource definition spawned, autoplace deferred: rg7r:
+#     not enough candidate storage pools: placed 3 of 7
+#
+# Why the contract flipped here (this cell used to be
+# rg-c-overcommit-spawn-fails and asserted the upstream 996 short-fail):
+# the 996 short-fail path is upstream-faithful but BS deliberately chose
+# the deferred best-effort path so the CSI external-provisioner retry
+# loop sees a created RD it can converge on, not a hard failure on a
+# cluster that is one node away from satisfying the request. The
+# behaviour is pinned at the unit tier by
+# pkg/rest/spawn_test.go::TestSpawnImpossiblePlacementReturnsActionableError
+# (asserts the "autoplace deferred" + "placed N of M" envelope and that
+# the RD survives) — switching spawn back to writeAutoplaceShortfall
+# would be a deliberate API change, NOT a silent regression, so this
+# cell pins the deferred contract end-to-end through the real CLI.
+#
+# What a REAL regression would look like (and this cell would catch):
+#   - the over-committed create being REJECTED early (breaks "size for
+#     growth"), or
+#   - spawn placing MORE diskful than the topology allows (over-place),
+#     or zero diskful when at least 3 nodes can host (under-place), or
+#   - spawn returning a non-zero exit (the old upstream short-fail
+#     creeping back in without the deliberate-API-change ceremony).
+#
+# Unit pin: pkg/rest/spawn_test.go::TestSpawnImpossiblePlacementReturnsActionableError
+# + pkg/rest/bug_367_rg_place_count_validation_test.go (place_count
+# input validation).
+
+set -euo pipefail
+
+WORK_DIR=${1:?work_dir required}
+export KUBECONFIG="$WORK_DIR/kubeconfig"
+
+SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
+# shellcheck source=lib.sh
+source "$SCRIPT_DIR/lib.sh"
+
+require_workers 3
+
+linstor_cli_setup
+
+RG=cli-matrix-d1-rg
+RD=cli-matrix-d1-rd
+POOL=${POOL:-stand}
+
+cleanup() {
+    "${LCTL[@]}" resource-definition delete "$RD" >/dev/null 2>&1 || true
+    "${LCTL[@]}" resource-group delete "$RG" >/dev/null 2>&1 || true
+    assert_no_orphans "$RD"
+    linstor_cli_teardown
+}
+trap cleanup EXIT
+
+echo ">> [D1] over-commit: rg create --place-count 7 on a 3-node cluster MUST be accepted"
+if ! "${LCTL[@]}" resource-group create "$RG" --place-count 7 --storage-pool="$POOL"; then
+    echo "FAIL (D1 regression): rg create --place-count 7 was REJECTED — BS accepts it" >&2
+    exit 1
+fi
+
+# Confirm the RG persisted with place_count=7.
+pc=$("${LCTL[@]}" --machine-readable resource-group list --resource-groups "$RG" 2>/dev/null \
+    | jq -r '[.[][]? | .select_filter.place_count] | .[0] // empty' 2>/dev/null || echo "")
+if [[ "$pc" != "7" ]]; then
+    echo "FAIL (D1): RG persisted place_count=$pc, want 7" >&2
+    "${LCTL[@]}" resource-group list --resource-groups "$RG" 2>&1 | tail -20 >&2
+    exit 1
+fi
+echo ">> RG created with place_count=7 (over-committed, accepted) — OK"
+
+echo ">> [D1] spawn MUST succeed with a DEFERRED best-effort autoplace (BS delta, not the upstream 996 short-fail)"
+out_file=$(mktemp)
+if ! "${LCTL[@]}" resource-group spawn-resources "$RG" "$RD" 32M >"$out_file" 2>&1; then
+    echo "FAIL (D1 regression): spawn of an over-committed RG returned NON-ZERO" >&2
+    echo "  BS contract is deferred best-effort autoplace (exit 0), not the upstream 'Not enough nodes' short-fail." >&2
+    cat "$out_file" >&2
+    rm -f "$out_file"
+    exit 1
+fi
+
+# The success envelope must carry the deferred-autoplace shortfall INFO.
+if ! grep -qiE 'autoplace deferred|not enough candidate storage pools|placed [0-9]+ of [0-9]+' "$out_file"; then
+    echo "FAIL (D1): spawn succeeded but without the deferred-autoplace shortfall envelope" >&2
+    echo "  expected an 'autoplace deferred: ... placed N of M' INFO surfacing the partial placement." >&2
+    cat "$out_file" >&2
+    rm -f "$out_file"
+    exit 1
+fi
+echo ">> spawn succeeded with the deferred-autoplace shortfall envelope — OK"
+cat "$out_file"
+rm -f "$out_file"
+
+# The RD must survive (the CSI retry loop converges on it).
+if ! kubectl get "resourcedefinitions.blockstor.cozystack.io/${RD}" >/dev/null 2>&1; then
+    echo "FAIL (D1): deferred spawn did not leave the RD behind — CSI retry loop has nothing to converge on" >&2
+    exit 1
+fi
+
+# Best-effort placement: at least one diskful, never more than the
+# topology allows (3 nodes). Count diskful (non-DISKLESS / non-TIE_BREAKER)
+# Resource CRDs for the RD, polling briefly so the placer has landed.
+nodes=3
+diskful=0
+for _ in $(seq 1 15); do
+    diskful=$(kubectl get resources.blockstor.cozystack.io -o json 2>/dev/null \
+        | RD="$RD" python3 -c "import json,sys,os
+d=json.load(sys.stdin)
+rd=os.environ['RD']
+n=0
+for it in d.get('items',[]):
+    if it.get('spec',{}).get('resourceDefinitionName')!=rd: continue
+    flags=it.get('spec',{}).get('flags',[]) or []
+    if 'DISKLESS' in flags or 'TIE_BREAKER' in flags: continue
+    n+=1
+print(n)" 2>/dev/null || echo 0)
+    [[ "$diskful" -ge 1 ]] && break
+    sleep 2
+done
+
+if [[ "$diskful" -lt 1 ]]; then
+    echo "FAIL (D1): deferred spawn placed ZERO diskful replicas on a cluster that can host $nodes" >&2
+    exit 1
+fi
+if [[ "$diskful" -gt "$nodes" ]]; then
+    echo "FAIL (D1): deferred spawn OVER-placed $diskful diskful on a $nodes-node cluster" >&2
+    exit 1
+fi
+echo ">> deferred spawn placed $diskful diskful (best-effort, <= $nodes nodes) — OK"
+
+echo ">> rg-c-overcommit-spawn-defers OK (D1 pinned: create accepts pc=7, spawn defers best-effort with exit 0)"
diff --git a/tests/e2e/cli-matrix/rg-c-overcommit-spawn-fails.sh b/tests/e2e/cli-matrix/rg-c-overcommit-spawn-fails.sh