Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .golangci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -378,6 +378,9 @@ linters:
- linters:
- godox
path: pkg/dispatcher/
- linters:
- godox
path: pkg/placer/
- linters:
- godox
path: pkg/version/
Expand Down
2 changes: 1 addition & 1 deletion docs/cli-parity-known-deltas.md
Original file line number Diff line number Diff line change
Expand Up @@ -49,7 +49,7 @@ Row IDs match the command-catalogue indexes used by `cli-parity-refresh.sh` (see
| 80 | `r td <dst> <rd> --migrate-from <src>` (disklessOnRemaining on the vacated source) | BEHAVIOR | permanent | Upstream-issue U435 (P2, "No diskless replica recreated on source after `--migrate-from` despite disklessOnRemaining"). RG `disklessOnRemaining` IS implemented in blockstor — the placer's `placeDisklessOnRemaining` (`pkg/placer/placer.go`) fans a DISKLESS witness onto every healthy node not already hosting a replica, on the AUTOPLACE path (pinned by `pkg/placer/placer_test.go::TestPlaceDisklessOnRemaining`). The divergence is scoped strictly to the `--migrate-from` path: the `ResourceMigrationReconciler` (`internal/controller/resource_migration_controller.go`) prunes the source's DISKFUL Resource CRD once the destination is UpToDate but does NOT itself re-stamp a `disklessOnRemaining` witness on the vacated source — it relies on the auto-tiebreaker machinery instead. On a cluster where the post-migration replica set has even parity (e.g. the canonical 2-diskful result), `ensureTiebreaker` immediately RE-OCCUPIES the vacated source node with a DISKLESS `TIE_BREAKER` Resource — the same node-presence outcome `disklessOnRemaining` intends (every node retains DRBD-network presence / a quorum voter; see row 78). The witness's FLAG differs (`TIE_BREAKER` vs a plain `disklessOnRemaining` DISKLESS stub) but the operator-visible effect — the source node is left with a diskless replica, not "nothing" (the U435 symptom) — holds. Where parity is odd (no auto-TB needed), the vacated node is genuinely left bare, matching upstream's pre-fix behaviour for that case. CSI/Cozystack never sets `disklessOnRemaining` on its RGs (it relies on auto-TB for quorum), so the migrate-path gap is operator-invisible for the target workload. The auto-TB re-occupation of the migrate source is pinned by `tests/operator-harness/replay/toggle-disk-migrate-from.yaml` (`assert-src-removed` awaits `replica_diskless` on the source node) + `tests/e2e/cli-matrix/toggle-disk-migrate-from.sh`. No fix required; documented so a future `disklessOnRemaining + --migrate-from` report maps here. |
| 81 | `rd l` / `rd s` (Layers column / `layer_data` on unplaced RDs) | WIRE_SHAPE | permanent | Bug 349: the python CLI's `rd l` Layers column `,`-joins `layer_data[].type`, but blockstor's k8s store persists only `Spec.LayerStack` — so the read path re-synthesises a flat `layer_data: [{"type":"DRBD"},{"type":"STORAGE"}]` on EVERY RD GET / list response (`stampRDLayerDataFromStack`, pkg/rest/resource_definitions.go), falling back to the upstream default stack `DRBD,STORAGE` when the stored stack is empty, so the column never renders blank. Upstream LINSTOR 1.33.2 emits `layer_data` on RD responses only once DRBD layer data actually exists (resources placed → port/secret allocated), and then with a volatile `data` payload (`peer_slots`, `port`, `secret`, …) blockstor never synthesises; a bare `rd c` and an RD with only a volume-definition both omit the field entirely (verified live against the dev-stand oracle 2026-06-12). Operator-visible effect: BS renders the Layers column for unplaced RDs where upstream renders an empty cell — BLOCKSTOR_SUPERSET; once resources exist both sides list the same `type` sequence (the CLI renders only the types, never upstream's `data` payload). linstor-csi / piraeus-operator do not gate on RD `layer_data`. The L3 oracle-replay harness tolerates exactly this field on RD-shaped objects (scoped per-field drop in tests/contract/normalize.go keyed on `resource_group_name`; resources' `layer_object` / volumes' `layer_data_list` still diff). Pinned by `tests/contract/normalize_test.go::TestNormalizeRDLayerDataDropped` (L1) + the 4 RD traces in `tests/contract/testdata/oracle/` (L3). |

| 82 | `rd clone` data plane (`use_zfs_clone` vs `zfs send\|recv`) | BEHAVIOR | permanent | Bug-020. Upstream LINSTOR clones a VD-bearing RD by internal snapshot + either `zfs clone` (when the request carries `use_zfs_clone=true`, golinstor v0.58+/linstor-csi) or `zfs send \| zfs recv` (default, fully independent copy). blockstor's clone routes through the snapshot-restore machinery: internal snapshot `clone-<target>` + `BlockstorRestoreFromSnapshot` marker, whose ZFS provider materialises the target with `zfs clone` (cross-node placements use the existing send/recv restore path). Consequences accepted: (a) `use_zfs_clone=true` — the linstor-csi case — gets exactly the requested semantics; (b) `use_zfs_clone=false`/absent ALSO lands on the snapshot-clone path instead of an independent full copy, so same-node clone targets stay dependent on the origin snapshot (the snapshot is visible in `linstor s l` and must outlive the clone); (c) sources on non-snapshot-capable (thick) pools refuse the clone with an actionable envelope where upstream would full-copy. Pinned by `pkg/rest/clone_use_zfs_clone_bug020_test.go` (L1) + `tests/integration` Group J `CSICreateVolumeFromClone` (Tier 2). |
| 82 | `rd clone` data plane (`use_zfs_clone` vs `zfs send\|recv`) | BEHAVIOR | permanent | Bug-020. Upstream LINSTOR clones a VD-bearing RD by internal snapshot + either `zfs clone` (when the request carries `use_zfs_clone=true`, golinstor v0.58+/linstor-csi) or `zfs send \| zfs recv` (default, fully independent copy). blockstor's clone routes through the snapshot-restore machinery: internal snapshot `clone-<target>` + `BlockstorRestoreFromSnapshot` marker, whose ZFS provider materialises the target with `zfs clone`. Bug 038: clone/restore replicas are stamped on the snapshot-holding nodes in the SOURCE replica's storage pool (upstream restore semantics, oracle-verified), and the placer refuses to land later replicas of a restore-marked RD on a different backend — the snapshot stream formats are not interchangeable (FILE_THIN stream into `zfs recv` loops on bad magic). Upstream `rd clone` likewise operates on the source's own pools (it refuses unsupported source pools: "Clone source contains unsupported storage pools"). Consequences accepted: (a) `use_zfs_clone=true` — the linstor-csi case — gets exactly the requested semantics; (b) `use_zfs_clone=false`/absent ALSO lands on the snapshot-clone path instead of an independent full copy, so same-node clone targets stay dependent on the origin snapshot (the snapshot is visible in `linstor s l` and must outlive the clone); (c) sources on non-snapshot-capable (thick) pools refuse the clone with an actionable envelope where upstream would full-copy. Pinned by `pkg/rest/clone_use_zfs_clone_bug020_test.go` + `pkg/rest/clone_placement_bug038_test.go` + `pkg/placer/restore_source_constraints_bug038_test.go` (L1) + `tests/integration` Group J `CSICreateVolumeFromClone` (Tier 2). |
| 83 | `rg spawn-resources` on an over-committed RG (place_count > available nodes) | BEHAVIOR | permanent | Corner D1. Upstream LINSTOR fails `spawn-resources` SHORT when the RG's place_count exceeds the placeable node count: it returns `FAIL_NOT_ENOUGH_NODES` (ret_code 996, "Not enough available nodes") and places NOTHING. blockstor instead takes a DEFERRED, best-effort autoplace path: it spawns the RD + VDs, places as many diskful replicas as the topology allows (e.g. 3 of 7 on a 3-node cluster), and surfaces the shortfall as an INFO in the SUCCESS envelope (`resource definition spawned, autoplace deferred: <rd>: not enough candidate storage pools: placed N of M`, exit 0). The `RGRebalanceReconciler` (`internal/controller/rg_rebalance_controller.go`) then tops the replica count back up additively once more nodes appear. The over-commit `rg create` itself is ACCEPTED by both controllers (parity — never an early create-time error). Rationale: the deferred path lets the CSI external-provisioner retry loop converge on a created RD instead of hard-failing on a cluster that is one node away from satisfying the request, and is strictly additive (scale-down stays an explicit `r d`). Switching spawn back to the upstream 996 short-fail would be a deliberate API change, not a silent regression. Pinned by `pkg/rest/spawn_test.go::TestSpawnImpossiblePlacementReturnsActionableError` + `TestSpawnPartialFlagAllowsShortPlacement` (L1) and `tests/e2e/cli-matrix/rg-c-overcommit-spawn-defers.sh` (L6). |

| 07 | `rd l --resource-definitions <rd>` (Layers column on a placed RD) | WIRE_SHAPE | permanent | Same `stampRDLayerDataFromStack` behaviour as row 81 — the flag-qualified `rd l --resource-definitions <rd>` catalogue cell (harness index 07) the bare `rd l` whitelist string does not literally cover. BS re-synthesises `layer_data: [{"type":"DRBD"},{"type":"STORAGE"}]` from `Spec.LayerStack` on every RD read, so the python CLI's Layers column renders `DRBD,STORAGE` even for a fresh `parity-rd` with only a volume-definition; upstream 1.33.2 leaves the column blank until DRBD layer data is actually allocated. BLOCKSTOR_SUPERSET, operator-friendly, and linstor-csi / piraeus-operator do not gate on RD `layer_data`. See row 81 for the full rationale and the L1/L3 pins (`stampRDLayerDataFromStack`, `tests/contract/normalize_test.go::TestNormalizeRDLayerDataDropped`). |
Expand Down
121 changes: 121 additions & 0 deletions pkg/placer/placer.go
Original file line number Diff line number Diff line change
Expand Up @@ -159,6 +159,13 @@ func New(st store.Store) *Placer {
// the migration semantic: even if 2 replicas exist, if one is on an
// evicted node the placer will create a third on a healthy peer.
func (p *Placer) Place(ctx context.Context, rdName string, filter *apiv1.AutoSelectFilter) (int, int, error) {
// Bug 038: an RD born from a snapshot restore / clone must never
// gain a diskful replica on a foreign backend or a snapshot-less
// node — enforce the restore-source constraints HERE so every
// Place caller (REST autoplace, RG apply, RG rebalance, node
// replacement) honours them, not just the REST edge.
filter = p.constrainFilterToRestoreSource(ctx, rdName, filter)

plan, err := p.buildPlan(ctx, rdName, filter)
if err != nil {
return 0, 0, err
Expand Down Expand Up @@ -1734,3 +1741,117 @@ func tupleKey(tuple map[string]string) string {

return buf.String()
}

// restoreFromSnapshotProp is the RD prop the snapshot-restore /
// clone REST handlers stamp on a restored target
// (`<srcRD>:<snapName>`). The placer keys its Bug 038 restore-source
// constraints on it; the literal mirrors pkg/rest and pkg/dispatcher.
const restoreFromSnapshotProp = "BlockstorRestoreFromSnapshot"

// constrainFilterToRestoreSource applies the restore-source BACKEND
// pin for an RD that was materialised from a snapshot
// (BlockstorRestoreFromSnapshot prop): candidate pools are pinned to
// the source replica's ProviderKind. Returns the caller's filter
// untouched when the RD carries no restore marker (the overwhelmingly
// common case).
//
// Bug 038 (release gate): the REST autoplace handler applied the
// provider pin at the wire edge, but the controller-side Place callers
// (ResourceGroup apply, RG rebalance, node replacement) fed the placer
// the RAW RG SelectFilter. On a clone whose source lived on a FILE_THIN
// pool, the RG reconciler's pass landed the target replica on a ZFS
// pool — the satellite then piped the FILE_THIN snapshot stream into
// `zfs recv`, which looped forever on `cannot receive: invalid stream
// (bad magic number)` and the clone never reached UpToDate. Enforcing
// the BACKEND pin inside Place closes the cross-backend gap for every
// current and future caller.
//
// The provider pin overrides any caller-supplied ProviderList (same
// contract as the REST autoplace handler): snapshot streams of
// different backends are not interchangeable, so a wider caller list
// can never be honoured safely.
//
// We deliberately DO NOT pin NodeNameList to the snapshot's nodes here.
// An earlier revision did, but that broke legitimate STAGED cross-node
// bring-up: after the node-local restore lands on the snapshot nodes,
// an operator (or linstor-csi) adds a further replica on a fresh node
// via `rd ap`, and the satellite populates it over the wire — DRBD
// network resync from the restored peer (or the CrossNodeFetcher
// `zfs send | recv` ship path). Hard-pinning the candidate set to the
// snapshot's nodes made the controller-side RG reconciler refuse to
// ever place that cross-node replica. The same-backend pin is what
// prevents the bad-magic loop; node selection stays the operator's /
// autoplacer's call, and the REST `--node-name` Bug 397 guard already
// rejects an explicit diskful replica on a snapshot-less node up front.
//
// Every lookup is best-effort: when the marker is malformed, the
// snapshot is gone, or the source has no diskful replica left, the
// filter passes through unchanged and the regular placement gates
// apply (satellite-side Bug 397 blank-fallback seeding still protects
// data integrity in those edge cases).
func (p *Placer) constrainFilterToRestoreSource(ctx context.Context, rdName string, filter *apiv1.AutoSelectFilter) *apiv1.AutoSelectFilter {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The filter parameter is dereferenced as out := *filter on line 1799. If filter is nil, this will cause a nil pointer dereference panic. To prevent this, add a defensive nil check at the beginning of the function.

func (p *Placer) constrainFilterToRestoreSource(ctx context.Context, rdName string, filter *apiv1.AutoSelectFilter) *apiv1.AutoSelectFilter {
	if filter == nil {
		return nil
	}

rd, err := p.store.ResourceDefinitions().Get(ctx, rdName)
if err != nil {
return filter
}

stamp := rd.Props[restoreFromSnapshotProp]
if stamp == "" {
return filter
}

srcRD, _, ok := strings.Cut(stamp, ":")
if !ok || srcRD == "" {
return filter
}

kind := SourceProviderKind(ctx, p.store, srcRD)
if kind == "" {
return filter
}

// Copy so the constraint never leaks into the caller's filter
// (RG reconcilers reuse their filter across sibling RDs).
out := *filter
out.ProviderList = []string{kind}

return &out
}

// SourceProviderKind returns the ProviderKind of the pool backing the
// named RD's first non-Diskless replica, or "" when none is
// resolvable. Shared by the placer's restore-source constraint and
// the REST autoplace handler's clone provider pin — `zfs send` and
// dd/lvm payloads are not interchangeable, so a clone target must
// stay on the source's backend.
func SourceProviderKind(ctx context.Context, st store.Store, srcRDName string) string {
resList, err := st.Resources().ListByDefinition(ctx, srcRDName)
if err != nil {
return ""
}

for i := range resList {
res := &resList[i]
if slices.Contains(res.Flags, apiv1.ResourceFlagDiskless) {
continue
}

pool := res.Props[storPoolNameProp]
if pool == "" {
continue
}

sp, err := st.StoragePools().Get(ctx, res.NodeName, pool)
if err != nil {
continue
}

if sp.ProviderKind == "" || sp.ProviderKind == apiv1.StoragePoolKindDiskless {
continue
}

return sp.ProviderKind
}

return ""
}
Loading