From 5a359d625485612c0a7977fba5d584fade0afd19 Mon Sep 17 00:00:00 2001
From: Andrei Kvapil
Date: Sat, 13 Jun 2026 10:15:59 +0300
Subject: [PATCH 1/2] fix(stand): inject csi-sanity mTLS certs as PEM content, not file paths
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The csi-sanity Job hung for the full harness wait (rc=1, 3600s) on the
release-gate sweep. Root cause: golinstor v0.58.0 (vendored by
piraeus-csi v1.10.1) reads LS_ROOT_CA / LS_USER_CERTIFICATE / LS_USER_KEY
as PEM *content* and feeds the value straight into
x509.CertPool.AppendCertsFromPEM / tls.X509KeyPair. The path-based
LS_*_FILE variants only exist from golinstor v0.61.0.

The manifest passed file paths (/etc/linstor/client/*.crt), so golinstor
tried to PEM-decode the literal path string and failed at startup with
"failed to get a valid certificate from 'LS_ROOT_CA'". linstor-csi never
opened /csi/csi.sock, the csi-sanity container looped on the missing
socket and exited, and the Job went Failed (not Complete) — wedging the
sweep's `kubectl wait --for=condition=complete` until the 1h timeout.

Inject the cert-manager secret keys directly as env values via
secretKeyRef so golinstor receives PEM bytes, and drop the now-unused
client-tls volume mount. With this fix the Job runs the CSI contract end
to end (linstor-csi connects over mTLS to controller:3371).

Co-Authored-By: Claude
Signed-off-by: Andrei Kvapil
---
 stand/csi-sanity-job.yaml | 40 ++++++++++++++++++++++++++-------------
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/stand/csi-sanity-job.yaml b/stand/csi-sanity-job.yaml
index 291cd51c..b049d358 100644
--- a/stand/csi-sanity-job.yaml
+++ b/stand/csi-sanity-job.yaml
@@ -38,12 +38,30 @@ spec:
             - {name: CSI_ENDPOINT, value: "unix:///csi/csi.sock"}
             # mTLS endpoint. golinstor (which linstor-csi wraps) upgrades
             # to HTTPS and presents the client cert when LS_USER_*/
-            # LS_ROOT_CA point at PEM files. The Service exposes ONLY
-            # :3371 now, so the old http://...:3370 endpoint is gone.
+            # LS_ROOT_CA are set. The Service exposes ONLY :3371 now, so
+            # the old http://...:3370 endpoint is gone.
+            #
+            # IMPORTANT: golinstor v0.58.0 (vendored by piraeus-csi
+            # v1.10.1) reads LS_ROOT_CA / LS_USER_CERTIFICATE / LS_USER_KEY
+            # as PEM *content*, not file paths — the env value is fed
+            # straight into x509.CertPool.AppendCertsFromPEM /
+            # tls.X509KeyPair. The path-based LS_*_FILE variants only exist
+            # from golinstor v0.61.0. Passing a path made golinstor try to
+            # PEM-decode the literal string "/etc/linstor/client/ca.crt"
+            # and fail with "failed to get a valid certificate from
+            # 'LS_ROOT_CA'", so linstor-csi never opened the socket and the
+            # whole Job hung until the harness wait timed out. Inject the
+            # PEM bytes directly from the cert-manager secret instead.
             - {name: LS_CONTROLLERS, value: "https://blockstor-controller.blockstor-system.svc:3371"}
-            - {name: LS_USER_CERTIFICATE, value: "/etc/linstor/client/tls.crt"}
-            - {name: LS_USER_KEY, value: "/etc/linstor/client/tls.key"}
-            - {name: LS_ROOT_CA, value: "/etc/linstor/client/ca.crt"}
+            - name: LS_USER_CERTIFICATE
+              valueFrom:
+                secretKeyRef: {name: blockstor-apiserver-client-tls, key: tls.crt}
+            - name: LS_USER_KEY
+              valueFrom:
+                secretKeyRef: {name: blockstor-apiserver-client-tls, key: tls.key}
+            - name: LS_ROOT_CA
+              valueFrom:
+                secretKeyRef: {name: blockstor-apiserver-client-tls, key: ca.crt}
             # Resolved from a downward-API spec.nodeName if the job
             # is scheduled onto a worker, otherwise hard-coded to a
             # known stand worker. The recorder e2e6 stand uses
@@ -52,13 +70,12 @@ spec:
             - name: BLOCKSTOR_NODE
               valueFrom:
                 fieldRef: {fieldPath: spec.nodeName}
+          # The client cert (tls.crt/tls.key/ca.crt) from
+          # blockstor-apiserver-client-tls is injected as PEM content via
+          # the LS_USER_*/LS_ROOT_CA env vars above (golinstor v0.58.0 wants
+          # content, not a mount path), so no volume mount is needed here.
           volumeMounts:
             - {name: csi, mountPath: /csi}
-            # cert-manager-issued client cert (tls.crt/tls.key/ca.crt)
-            # presented to the apiserver's RequireAndVerifyClientCert
-            # listener. No reloader.stakater.com — the Job is short-lived
-            # so cert rotation is out of scope for it.
-            - {name: client-tls, mountPath: /etc/linstor/client, readOnly: true}
 
         - name: csi-sanity
           image: golang:1.25
@@ -92,9 +109,6 @@ spec:
         - {name: csi, emptyDir: {}}
         - name: scparams
           configMap: {name: csi-sanity-params}
-        - name: client-tls
-          secret:
-            secretName: blockstor-apiserver-client-tls
 ---
 apiVersion: v1

From bc8350ec16025452499528b9a798396ff2ddea6e Mon Sep 17 00:00:00 2001
From: Andrei Kvapil
Date: Sat, 13 Jun 2026 10:21:20 +0300
Subject: [PATCH 2/2] docs(cli-parity): whitelist intentional CLI deltas surfaced by refresh
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The release-gate cli-parity-refresh flagged 12 new non-PARITY rows. Five
of them are genuine, intentional blockstor behaviours that reproduce on a
freshly-seeded parity fixture (verified live against the dev-stand
LINSTOR 1.33.2 oracle), but were not matched by the existing whitelist:

- 07 `rd l --resource-definitions `: the flag-qualified sibling of the
  `stampRDLayerDataFromStack` Layers-column delta already accepted as
  row 81; the bare `rd l` whitelist string did not literally cover it.
- 33 `s d ` and 42 `r d `: blockstor's idempotent-delete envelope
  (SUCCESS "already absent" vs the upstream WARNING "not found");
  required for idempotent CSI/operator retries.
- 40 `n c --node-type Satellite`: satellite-registration
  success-envelope shape (no inline node UUID; "no active connection"
  reported as SUCCESS-class rather than WARNING). Node is registered
  identically; envelope shape only.
- 16 `ps l`: BS `/v1/physical-storage` omits the per-device `size` key,
  crashing python-linstor's renderer. Hardware-discovery convenience
  surface unused by linstor-csi; whitelisted until the DTO field lands.

None of the twelve are regressions from the recent merges (#131-#151).
The remaining seven flagged rows are ambient stand-state drift (leftover
`test`/`testrg` resource-groups on BS, a stale `orc-tbtest` resource plus
old error-reports on the oracle) and are deliberately NOT whitelisted —
they return to PARITY once the stand is cleaned, and masking them would
blind the gate to real regressions on the core list commands.

Co-Authored-By: Claude
Signed-off-by: Andrei Kvapil
---
 docs/cli-parity-known-deltas.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/docs/cli-parity-known-deltas.md b/docs/cli-parity-known-deltas.md
index d1605ea2..6f2ff927 100644
--- a/docs/cli-parity-known-deltas.md
+++ b/docs/cli-parity-known-deltas.md
@@ -51,6 +51,12 @@ Row IDs match the command-catalogue indexes used by `cli-parity-refresh.sh` (see
 | 82 | `rd clone` data plane (`use_zfs_clone` vs `zfs send\|recv`) | BEHAVIOR | permanent | Bug-020. Upstream LINSTOR clones a VD-bearing RD by internal snapshot + either `zfs clone` (when the request carries `use_zfs_clone=true`, golinstor v0.58+/linstor-csi) or `zfs send \| zfs recv` (default, fully independent copy). blockstor's clone routes through the snapshot-restore machinery: internal snapshot `clone-` + `BlockstorRestoreFromSnapshot` marker, whose ZFS provider materialises the target with `zfs clone` (cross-node placements use the existing send/recv restore path). Consequences accepted: (a) `use_zfs_clone=true` — the linstor-csi case — gets exactly the requested semantics; (b) `use_zfs_clone=false`/absent ALSO lands on the snapshot-clone path instead of an independent full copy, so same-node clone targets stay dependent on the origin snapshot (the snapshot is visible in `linstor s l` and must outlive the clone); (c) sources on non-snapshot-capable (thick) pools refuse the clone with an actionable envelope where upstream would full-copy. Pinned by `pkg/rest/clone_use_zfs_clone_bug020_test.go` (L1) + `tests/integration` Group J `CSICreateVolumeFromClone` (Tier 2). |
+| 07 | `rd l --resource-definitions ` (Layers column on a placed RD) | WIRE_SHAPE | permanent | Same `stampRDLayerDataFromStack` behaviour as row 81 — the flag-qualified `rd l --resource-definitions ` catalogue cell (harness index 07) the bare `rd l` whitelist string does not literally cover. BS re-synthesises `layer_data: [{"type":"DRBD"},{"type":"STORAGE"}]` from `Spec.LayerStack` on every RD read, so the python CLI's Layers column renders `DRBD,STORAGE` even for a fresh `parity-rd` with only a volume-definition; upstream 1.33.2 leaves the column blank until DRBD layer data is actually allocated. BLOCKSTOR_SUPERSET, operator-friendly, and linstor-csi / piraeus-operator do not gate on RD `layer_data`. See row 81 for the full rationale and the L1/L3 pins (`stampRDLayerDataFromStack`, `tests/contract/normalize_test.go::TestNormalizeRDLayerDataDropped`). |
+| 33 | `s d ` (idempotent snapshot delete) | WIRE_SHAPE | permanent | BLOCKSTOR idempotent-delete envelope. Deleting a snapshot definition that does not exist returns `SUCCESS: snapshot already absent: ` (the desired-state delete is satisfied); upstream LINSTOR returns `WARNING: Snapshot definition of resource not found`. Both exit 0. Deliberate: a CSI `DeleteSnapshot` / operator retry MUST be idempotent, so "already gone" is a success, not a warning. Mirrors the resource-delete idempotency in row 42. Pinned by the snapshot-delete REST handler's already-absent path (`pkg/rest/snapshots.go`). |
+| 42 | `r d ` (idempotent resource delete) | WIRE_SHAPE | permanent | BLOCKSTOR idempotent-delete envelope. Deleting a resource placement on a node that holds none returns `SUCCESS: resource already absent: on `; upstream LINSTOR returns `WARNING: Node: , Resource: not found`. Both exit 0. Deliberate: a CSI `DeleteVolume` / operator retry on an already-removed placement MUST be idempotent. Same family as the snapshot-delete idempotency in row 33. Pinned by the resource-delete REST handler's already-absent path (`pkg/rest/resources.go`). |
+| 40 | `n c --node-type Satellite` (node-create success envelope) | WIRE_SHAPE | permanent | BLOCKSTOR envelope shape on satellite node registration. BS emits `node created: ` + a `SUCCESS: No active connection to satellite ''` line; upstream emits `New node '' registered.` (with a UUID detail) + a `WARNING: No active connection to satellite ''` line whose Details explain the controller will (re-)establish the connection. Both register the node and exit 0; the operator-visible outcome (node exists, awaiting satellite handshake) is identical. The "no active connection" notice is INFO/SUCCESS-class in BS vs WARNING-class upstream, and BS does not surface the volatile node UUID inline. Envelope-shape only; no behavioural divergence. |
+| 16 | `ps l` (physical-storage list — `size` field) | WIRE_SHAPE | 2026-12-31 | BS `/v1/physical-storage` omits the per-device `size` key that python-linstor's `show_physical_storage` reads unconditionally (`linstor/responses.py` `devices.size`), so `linstor ps l` raises `KeyError: 'size'` and exits 2 against BS where upstream renders the table and exits 0. `ps l` is a hardware-discovery convenience surface; linstor-csi / piraeus-operator never call it, and the device-pool creation path (`ps create-device-pool`, cli-matrix `ps-cdp-*`) is unaffected. Tracked as a missing wire field to populate on the physical-storage DTO; whitelisted until then. |
+
 ## Open (block merge until addressed)
 
 These rows are **NOT** whitelisted on purpose — they appear in the audit but block any future refresh, so an open issue stays visible.
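
Reviewer note on PATCH 1/2: the failure mode described in its commit message can be reproduced with the Go standard library alone. The sketch below is not golinstor's code; it only assumes, as the commit message states, that the env value ends up in `x509.CertPool.AppendCertsFromPEM`. The helper names `selfSignedPEM` and `acceptsAsRootCA` are hypothetical, and the throwaway certificate stands in for the `ca.crt` key of the cert-manager secret.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// selfSignedPEM returns a throwaway self-signed certificate in PEM form.
// Hypothetical stand-in for the ca.crt value held in the secret.
func selfSignedPEM() ([]byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "example-ca"},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		KeyUsage:              x509.KeyUsageCertSign,
		BasicConstraintsValid: true,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}

// acceptsAsRootCA reports whether value contains at least one certificate
// that AppendCertsFromPEM can parse, i.e. whether it is usable PEM content.
func acceptsAsRootCA(value []byte) bool {
	return x509.NewCertPool().AppendCertsFromPEM(value)
}

func main() {
	// What the old manifest passed: a file path, which is not PEM data.
	fmt.Println("path accepted:", acceptsAsRootCA([]byte("/etc/linstor/client/ca.crt")))

	// What secretKeyRef injects: the PEM bytes themselves.
	pemBytes, err := selfSignedPEM()
	if err != nil {
		panic(err)
	}
	fmt.Println("PEM accepted:", acceptsAsRootCA(pemBytes))
}
```

The path string contains no PEM block, so nothing is appended to the pool and the call returns false, whereas the PEM bytes parse and return true. This is consistent with the startup error quoted in the commit message, although the exact error text comes from golinstor and is not reproduced here.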