cozystack · Andrei Kvapil (kvaps) · Jun 13, 2026 · Jun 13, 2026 · Jun 13, 2026 · Jun 13, 2026
diff --git a/.golangci.yml b/.golangci.yml
@@ -378,6 +378,9 @@ linters:
       - linters:
           - godox
         path: pkg/dispatcher/
+      - linters:
+          - godox
+        path: pkg/placer/
       - linters:
           - godox
         path: pkg/version/

diff --git a/docs/cli-parity-known-deltas.md b/docs/cli-parity-known-deltas.md
@@ -49,7 +49,7 @@ Row IDs match the command-catalogue indexes used by `cli-parity-refresh.sh` (see
 | 80 | `r td <dst> <rd> --migrate-from <src>` (disklessOnRemaining on the vacated source) | BEHAVIOR | permanent | Upstream-issue U435 (P2, "No diskless replica recreated on source after `--migrate-from` despite disklessOnRemaining"). RG `disklessOnRemaining` IS implemented in blockstor — the placer's `placeDisklessOnRemaining` (`pkg/placer/placer.go`) fans a DISKLESS witness onto every healthy node not already hosting a replica, on the AUTOPLACE path (pinned by `pkg/placer/placer_test.go::TestPlaceDisklessOnRemaining`). The divergence is scoped strictly to the `--migrate-from` path: the `ResourceMigrationReconciler` (`internal/controller/resource_migration_controller.go`) prunes the source's DISKFUL Resource CRD once the destination is UpToDate but does NOT itself re-stamp a `disklessOnRemaining` witness on the vacated source — it relies on the auto-tiebreaker machinery instead. On a cluster where the post-migration replica set has even parity (e.g. the canonical 2-diskful result), `ensureTiebreaker` immediately RE-OCCUPIES the vacated source node with a DISKLESS `TIE_BREAKER` Resource — the same node-presence outcome `disklessOnRemaining` intends (every node retains DRBD-network presence / a quorum voter; see row 78). The witness's FLAG differs (`TIE_BREAKER` vs a plain `disklessOnRemaining` DISKLESS stub) but the operator-visible effect — the source node is left with a diskless replica, not "nothing" (the U435 symptom) — holds. Where parity is odd (no auto-TB needed), the vacated node is genuinely left bare, matching upstream's pre-fix behaviour for that case. CSI/Cozystack never sets `disklessOnRemaining` on its RGs (it relies on auto-TB for quorum), so the migrate-path gap is operator-invisible for the target workload. The auto-TB re-occupation of the migrate source is pinned by `tests/operator-harness/replay/toggle-disk-migrate-from.yaml` (`assert-src-removed` awaits `replica_diskless` on the source node) + `tests/e2e/cli-matrix/toggle-disk-migrate-from.sh`. No fix required; documented so a future `disklessOnRemaining + --migrate-from` report maps here. |
 | 81 | `rd l` / `rd s` (Layers column / `layer_data` on unplaced RDs) | WIRE_SHAPE | permanent | Bug 349: the python CLI's `rd l` Layers column `,`-joins `layer_data[].type`, but blockstor's k8s store persists only `Spec.LayerStack` — so the read path re-synthesises a flat `layer_data: [{"type":"DRBD"},{"type":"STORAGE"}]` on EVERY RD GET / list response (`stampRDLayerDataFromStack`, pkg/rest/resource_definitions.go), falling back to the upstream default stack `DRBD,STORAGE` when the stored stack is empty, so the column never renders blank. Upstream LINSTOR 1.33.2 emits `layer_data` on RD responses only once DRBD layer data actually exists (resources placed → port/secret allocated), and then with a volatile `data` payload (`peer_slots`, `port`, `secret`, …) blockstor never synthesises; a bare `rd c` and an RD with only a volume-definition both omit the field entirely (verified live against the dev-stand oracle 2026-06-12). Operator-visible effect: BS renders the Layers column for unplaced RDs where upstream renders an empty cell — BLOCKSTOR_SUPERSET; once resources exist both sides list the same `type` sequence (the CLI renders only the types, never upstream's `data` payload). linstor-csi / piraeus-operator do not gate on RD `layer_data`. The L3 oracle-replay harness tolerates exactly this field on RD-shaped objects (scoped per-field drop in tests/contract/normalize.go keyed on `resource_group_name`; resources' `layer_object` / volumes' `layer_data_list` still diff). Pinned by `tests/contract/normalize_test.go::TestNormalizeRDLayerDataDropped` (L1) + the 4 RD traces in `tests/contract/testdata/oracle/` (L3). |
 
-| 82 | `rd clone` data plane (`use_zfs_clone` vs `zfs send\|recv`) | BEHAVIOR | permanent | Bug-020. Upstream LINSTOR clones a VD-bearing RD by internal snapshot + either `zfs clone` (when the request carries `use_zfs_clone=true`, golinstor v0.58+/linstor-csi) or `zfs send \| zfs recv` (default, fully independent copy). blockstor's clone routes through the snapshot-restore machinery: internal snapshot `clone-<target>` + `BlockstorRestoreFromSnapshot` marker, whose ZFS provider materialises the target with `zfs clone` (cross-node placements use the existing send/recv restore path). Consequences accepted: (a) `use_zfs_clone=true` — the linstor-csi case — gets exactly the requested semantics; (b) `use_zfs_clone=false`/absent ALSO lands on the snapshot-clone path instead of an independent full copy, so same-node clone targets stay dependent on the origin snapshot (the snapshot is visible in `linstor s l` and must outlive the clone); (c) sources on non-snapshot-capable (thick) pools refuse the clone with an actionable envelope where upstream would full-copy. Pinned by `pkg/rest/clone_use_zfs_clone_bug020_test.go` (L1) + `tests/integration` Group J `CSICreateVolumeFromClone` (Tier 2). |
+| 82 | `rd clone` data plane (`use_zfs_clone` vs `zfs send\|recv`) | BEHAVIOR | permanent | Bug-020. Upstream LINSTOR clones a VD-bearing RD by internal snapshot + either `zfs clone` (when the request carries `use_zfs_clone=true`, golinstor v0.58+/linstor-csi) or `zfs send \| zfs recv` (default, fully independent copy). blockstor's clone routes through the snapshot-restore machinery: internal snapshot `clone-<target>` + `BlockstorRestoreFromSnapshot` marker, whose ZFS provider materialises the target with `zfs clone`. Bug 038: clone/restore replicas are stamped on the snapshot-holding nodes in the SOURCE replica's storage pool (upstream restore semantics, oracle-verified), and the placer refuses to land later replicas of a restore-marked RD on a different backend — the snapshot stream formats are not interchangeable (FILE_THIN stream into `zfs recv` loops on bad magic). Upstream `rd clone` likewise operates on the source's own pools (it refuses unsupported source pools: "Clone source contains unsupported storage pools"). Consequences accepted: (a) `use_zfs_clone=true` — the linstor-csi case — gets exactly the requested semantics; (b) `use_zfs_clone=false`/absent ALSO lands on the snapshot-clone path instead of an independent full copy, so same-node clone targets stay dependent on the origin snapshot (the snapshot is visible in `linstor s l` and must outlive the clone); (c) sources on non-snapshot-capable (thick) pools refuse the clone with an actionable envelope where upstream would full-copy. Pinned by `pkg/rest/clone_use_zfs_clone_bug020_test.go` + `pkg/rest/clone_placement_bug038_test.go` + `pkg/placer/restore_source_constraints_bug038_test.go` (L1) + `tests/integration` Group J `CSICreateVolumeFromClone` (Tier 2). |
 | 83 | `rg spawn-resources` on an over-committed RG (place_count > available nodes) | BEHAVIOR | permanent | Corner D1. Upstream LINSTOR fails `spawn-resources` SHORT when the RG's place_count exceeds the placeable node count: it returns `FAIL_NOT_ENOUGH_NODES` (ret_code 996, "Not enough available nodes") and places NOTHING. blockstor instead takes a DEFERRED, best-effort autoplace path: it spawns the RD + VDs, places as many diskful replicas as the topology allows (e.g. 3 of 7 on a 3-node cluster), and surfaces the shortfall as an INFO in the SUCCESS envelope (`resource definition spawned, autoplace deferred: <rd>: not enough candidate storage pools: placed N of M`, exit 0). The `RGRebalanceReconciler` (`internal/controller/rg_rebalance_controller.go`) then tops the replica count back up additively once more nodes appear. The over-commit `rg create` itself is ACCEPTED by both controllers (parity — never an early create-time error). Rationale: the deferred path lets the CSI external-provisioner retry loop converge on a created RD instead of hard-failing on a cluster that is one node away from satisfying the request, and is strictly additive (scale-down stays an explicit `r d`). Switching spawn back to the upstream 996 short-fail would be a deliberate API change, not a silent regression. Pinned by `pkg/rest/spawn_test.go::TestSpawnImpossiblePlacementReturnsActionableError` + `TestSpawnPartialFlagAllowsShortPlacement` (L1) and `tests/e2e/cli-matrix/rg-c-overcommit-spawn-defers.sh` (L6). |
 
 | 07 | `rd l --resource-definitions <rd>` (Layers column on a placed RD) | WIRE_SHAPE | permanent | Same `stampRDLayerDataFromStack` behaviour as row 81 — the flag-qualified `rd l --resource-definitions <rd>` catalogue cell (harness index 07) the bare `rd l` whitelist string does not literally cover. BS re-synthesises `layer_data: [{"type":"DRBD"},{"type":"STORAGE"}]` from `Spec.LayerStack` on every RD read, so the python CLI's Layers column renders `DRBD,STORAGE` even for a fresh `parity-rd` with only a volume-definition; upstream 1.33.2 leaves the column blank until DRBD layer data is actually allocated. BLOCKSTOR_SUPERSET, operator-friendly, and linstor-csi / piraeus-operator do not gate on RD `layer_data`. See row 81 for the full rationale and the L1/L3 pins (`stampRDLayerDataFromStack`, `tests/contract/normalize_test.go::TestNormalizeRDLayerDataDropped`). |

diff --git a/pkg/placer/placer.go b/pkg/placer/placer.go
@@ -159,6 +159,13 @@ func New(st store.Store) *Placer {
 // the migration semantic: even if 2 replicas exist, if one is on an
 // evicted node the placer will create a third on a healthy peer.
 func (p *Placer) Place(ctx context.Context, rdName string, filter *apiv1.AutoSelectFilter) (int, int, error) {
+	// Bug 038: an RD born from a snapshot restore / clone must never
+	// gain a diskful replica on a foreign backend or a snapshot-less
+	// node — enforce the restore-source constraints HERE so every
+	// Place caller (REST autoplace, RG apply, RG rebalance, node
+	// replacement) honours them, not just the REST edge.
+	filter = p.constrainFilterToRestoreSource(ctx, rdName, filter)
+
 	plan, err := p.buildPlan(ctx, rdName, filter)
 	if err != nil {
 		return 0, 0, err
@@ -1734,3 +1741,117 @@ func tupleKey(tuple map[string]string) string {
 
 	return buf.String()
 }
+
+// restoreFromSnapshotProp is the RD prop the snapshot-restore /
+// clone REST handlers stamp on a restored target
+// (`<srcRD>:<snapName>`). The placer keys its Bug 038 restore-source
+// constraints on it; the literal mirrors pkg/rest and pkg/dispatcher.
+const restoreFromSnapshotProp = "BlockstorRestoreFromSnapshot"
+
+// constrainFilterToRestoreSource applies the restore-source BACKEND
+// pin for an RD that was materialised from a snapshot
+// (BlockstorRestoreFromSnapshot prop): candidate pools are pinned to
+// the source replica's ProviderKind. Returns the caller's filter
+// untouched when the RD carries no restore marker (the overwhelmingly
+// common case).
+//
+// Bug 038 (release gate): the REST autoplace handler applied the
+// provider pin at the wire edge, but the controller-side Place callers
+// (ResourceGroup apply, RG rebalance, node replacement) fed the placer
+// the RAW RG SelectFilter. On a clone whose source lived on a FILE_THIN
+// pool, the RG reconciler's pass landed the target replica on a ZFS
+// pool — the satellite then piped the FILE_THIN snapshot stream into
+// `zfs recv`, which looped forever on `cannot receive: invalid stream
+// (bad magic number)` and the clone never reached UpToDate. Enforcing
+// the BACKEND pin inside Place closes the cross-backend gap for every
+// current and future caller.
+//
+// The provider pin overrides any caller-supplied ProviderList (same
+// contract as the REST autoplace handler): snapshot streams of
+// different backends are not interchangeable, so a wider caller list
+// can never be honoured safely.
+//
+// We deliberately DO NOT pin NodeNameList to the snapshot's nodes here.
+// An earlier revision did, but that broke legitimate STAGED cross-node
+// bring-up: after the node-local restore lands on the snapshot nodes,
+// an operator (or linstor-csi) adds a further replica on a fresh node
+// via `rd ap`, and the satellite populates it over the wire — DRBD
+// network resync from the restored peer (or the CrossNodeFetcher
+// `zfs send | recv` ship path). Hard-pinning the candidate set to the
+// snapshot's nodes made the controller-side RG reconciler refuse to
+// ever place that cross-node replica. The same-backend pin is what
+// prevents the bad-magic loop; node selection stays the operator's /
+// autoplacer's call, and the REST `--node-name` Bug 397 guard already
+// rejects an explicit diskful replica on a snapshot-less node up front.
+//
+// Every lookup is best-effort: when the marker is malformed, the
+// snapshot is gone, or the source has no diskful replica left, the
+// filter passes through unchanged and the regular placement gates
+// apply (satellite-side Bug 397 blank-fallback seeding still protects
+// data integrity in those edge cases).
+func (p *Placer) constrainFilterToRestoreSource(ctx context.Context, rdName string, filter *apiv1.AutoSelectFilter) *apiv1.AutoSelectFilter {
+	rd, err := p.store.ResourceDefinitions().Get(ctx, rdName)
+	if err != nil {
+		return filter
+	}
+
+	stamp := rd.Props[restoreFromSnapshotProp]
+	if stamp == "" {
+		return filter
+	}
+
+	srcRD, _, ok := strings.Cut(stamp, ":")
+	if !ok || srcRD == "" {
+		return filter
+	}
+
+	kind := SourceProviderKind(ctx, p.store, srcRD)
+	if kind == "" {
+		return filter
+	}
+
+	// Copy so the constraint never leaks into the caller's filter
+	// (RG reconcilers reuse their filter across sibling RDs).
+	out := *filter
+	out.ProviderList = []string{kind}
+
+	return &out
+}
+
+// SourceProviderKind returns the ProviderKind of the pool backing the
+// named RD's first non-Diskless replica, or "" when none is
+// resolvable. Shared by the placer's restore-source constraint and
+// the REST autoplace handler's clone provider pin — `zfs send` and
+// dd/lvm payloads are not interchangeable, so a clone target must
+// stay on the source's backend.
+func SourceProviderKind(ctx context.Context, st store.Store, srcRDName string) string {
+	resList, err := st.Resources().ListByDefinition(ctx, srcRDName)
+	if err != nil {
+		return ""
+	}
+
+	for i := range resList {
+		res := &resList[i]
+		if slices.Contains(res.Flags, apiv1.ResourceFlagDiskless) {
+			continue
+		}
+
+		pool := res.Props[storPoolNameProp]
+		if pool == "" {
+			continue
+		}
+
+		sp, err := st.StoragePools().Get(ctx, res.NodeName, pool)
+		if err != nil {
+			continue
+		}
+
+		if sp.ProviderKind == "" || sp.ProviderKind == apiv1.StoragePoolKindDiskless {
+			continue
+		}
+
+		return sp.ProviderKind
+	}
+
+	return ""
+}