Skip to content

fix(clone): land clone/restore replicas on snapshot nodes in the source pool#153

Merged
Andrei Kvapil (kvaps) merged 10 commits into
mainfrom
fix/bug-038-rd-clone-cross-backend
Jun 13, 2026
Merged

fix(clone): land clone/restore replicas on snapshot nodes in the source pool#153
Andrei Kvapil (kvaps) merged 10 commits into
mainfrom
fix/bug-038-rd-clone-cross-backend

Conversation

@kvaps

@kvaps Andrei Kvapil (kvaps) commented Jun 13, 2026

Copy link
Copy Markdown
Member

Problem

rd clone of a deployed, VD-bearing source on a FILE_THIN pool could materialise a target replica on a ZFS pool. The target satellite then piped the source's FILE_THIN snapshot stream into zfs recv, which loops forever on cannot receive: invalid stream (bad magic number) — the clone replicas never reach UpToDate. Reproduced on two stands via tests/e2e/cli-matrix/rd-clone-vd-data-plane.sh.

Root cause

Two placement gaps combined:

  1. The snapshot-restore placement branch for the empty-node-list case (which the clone data plane reuses) deferred to the parent ResourceGroup's SelectFilter.PlaceCount. Under the stand-default empty-spec DfltRscGrp it placed zero replicas.
  2. The controller-side ResourceGroup reconcilers then ran the placer with the raw RG filter — no snapshot-node constraint, no provider pin — so the capacity-weighted ranking could pick a pool of a different backend. Snapshot stream formats are not interchangeable across backends, so the cross-node restore fed a foreign stream into the ZFS receiver.

Fix (upstream-faithful)

Verified against the live LINSTOR 1.33.2 oracle: an upstream restore without an explicit node list lands on exactly the snapshot-holding nodes, in the snapshot's own storage pool, and never consults the autoplacer; upstream rd clone likewise operates on the source's own pools (it refuses sources on pools it cannot clone with "Clone source contains unsupported storage pools").

  • pkg/rest: restore/clone with no explicit nodes now stamps one replica per snapshot-holding node, pinned to the source replica's storage pool — the clone data plane is node-local and same-backend by construction, independent of the RG's place_count.
  • pkg/placer: Place() now applies the restore-source constraints (snapshot nodes + source provider kind) internally, so every caller — REST autoplace, RG apply, RG rebalance, node replacement — honours them. The REST handler's provider lookup delegates to the shared placer helper.
  • harness: the cell aborted with cli_nodes[0]: unbound variable on failure (mapfile from a process substitution swallows the exit status), masking the real diagnostic; failures now surface the wait function's own message.
  • docs/cli-parity-known-deltas.md row 82 updated with the placement semantics.

Validation

  • L1: new regression tests in pkg/rest/clone_placement_bug038_test.go, pkg/placer/restore_source_constraints_bug038_test.go; updated restore-placement pins. go test ./... green, golangci-lint clean, -race green on satellite/placer/rest.
  • L6: cli-matrix/rd-clone-vd-data-plane.sh green 3x on stand big at this branch's build (both wire variants: plain CLI clone and raw REST use_zfs_clone=true, byte-level marker verified on every replica). cli-matrix/r-full-lifecycle.sh green (placer/autoplace ground-truth gate).
  • L7: replay/rd-clone-vd-data-plane.yaml PASS on stand big.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Clone operations now correctly place replicas on snapshot-holding nodes within the source storage pool.
    • Restore operations properly constrain placement to the source pool's backend type.
    • Improved handling of missing snapshots with bounded retry and graceful fallback behavior.
    • Fixed resource listing to correctly identify resources created through different paths.
  • New Features

    • Added bare restore mode for snapshot restores (leaves empty shell for operator placement).
    • Added eager placement mode for clone operations (places replicas immediately on snapshot nodes).
  • Tests

    • Added comprehensive regression test coverage for clone/restore placement constraints and fallback scenarios.

@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 515bfeff-c2fd-4130-a37e-07c456838565

📥 Commits

Reviewing files that changed from the base of the PR and between 24baa83 and 5afc651.

📒 Files selected for processing (2)
  • pkg/rest/bug_038_restore_no_handler_block_test.go
  • pkg/rest/snapshot_restore.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/rest/snapshot_restore.go

📝 Walkthrough

Walkthrough

This PR completes Bug 038 implementation by adding backend pinning for snapshot clones/restores, satellite requeue defense for missing snapshots, a store label-lookup fix for unlabeled resources, and REST handler non-blocking guarantees.

Changes

Bug 038 - Snapshot clone/restore backend pinning and satellite resilience

Layer / File(s) Summary
Placer restore-source backend constraint mechanism
pkg/placer/placer.go, pkg/placer/restore_source_constraints_bug038_test.go, .golangci.yml
Placer.Place now applies restore-source backend constraints by overriding the filter's ProviderList to the source pool's backend kind. New SourceProviderKind helper resolves the source pool backend by inspecting source resources. Tests verify constraint application, filter immutability, and free-placement behavior for unmarked RDs. Golangci exclusion disables godox warnings for placer path.
Clone eager placement on snapshot nodes
pkg/rest/rd_clone.go, pkg/rest/clone_placement_bug038_test.go
Clone path calls materializeRestoredRD with eagerPlace=true, stamping replicas synchronously on snapshot-holding nodes using per-node backend-matched pools. Integration test verifies clones land on snapshot nodes (w1, w2) in the source FILE_THIN pool (stand), not on higher-capacity decoy pools.
Restore bare-shell semantics with per-node pool selection
pkg/rest/snapshot_restore.go, pkg/rest/snapshot_restore_test.go
Snapshot restore distinguishes bare-restore (eagerPlace=false, empty shell for operator placement) from eager clone (eagerPlace=true, one-shot stamping on snap.Nodes). Removes ResourceGroup auto-placement. Per-node StorPoolName selection via storPoolsByNodeFromSourceRD maps source replicas to pools with fallback. Tests verify bare-restore leaves empty shell, autoplace honors backend pin, and decoy pools are avoided.
REST autoplace refactoring to shared provider-kind resolution
pkg/rest/autoplace.go
resolveCloneSourceProviderKind now delegates to placer.SourceProviderKind helper, simplifying REST code by reusing the shared placer implementation.
Satellite bounded requeue defense for missing snapshots
pkg/satellite/reconciler.go, pkg/satellite/bug_397_restore_seed_test.go, pkg/satellite/cross_cluster_ship_test.go, pkg/satellite/ship_dispatch_test.go
Satellite reconciler adds bounded requeue defense via restoreSnapshotMissingTries counter and restoreSnapshotMissingBudget limit. When RestoreVolumeFromSnapshot fails with ErrNotFound (and no cross-node fetcher), requeues up to budget passes before falling back to blank CreateVolume. Tests verify requeue continues within budget, then permits blank fallback only after exhaustion.
Store resource listing correctness fix for unlabeled resources
pkg/store/k8s/resources.go, pkg/store/k8s/list_by_definition_mixed_label_bug038_test.go
ListByDefinition now unconditionally lists all Resource CRDs and filters in-memory by ResourceDefinitionName, removing the label-selector fast path that silently omitted kubectl-apply or externally-created resources. Tests verify mixed labeled/unlabeled lists and props-only pool specs are correctly returned.
REST handler non-blocking promises for snapshot readiness
pkg/rest/bug_038_restore_no_handler_block_test.go
Adds regression tests ensuring clone and restore REST endpoints return promptly without blocking on snapshot readiness. Introduces httpPostWithDeadline helper (5s timeout) and seedRestoreSourceWithNeverReadySnapshot fixture. Tests verify handlers stamp replicas synchronously, meeting the non-blocking contract to avoid deadlocks when satellites are absent.

Documentation and test tooling updates

Layer / File(s) Summary
Documentation for clone/restore backend-pinning semantics
docs/cli-parity-known-deltas.md
Updated CLI parity delta row #82 to explicitly document clone/restore replicas stamped on snapshot-holding nodes in the source pool, with refusal of cross-backend placement.
Shell script error propagation in e2e clone test
tests/e2e/cli-matrix/rd-clone-vd-data-plane.sh
Both clone variants now capture wait_clone_replicas output into intermediate variables before mapfile, ensuring exit errors propagate correctly under set -euo pipefail.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • cozystack/blockstor#99: Introduces the same placer ProviderList override semantics for restore-source backend pinning that this PR tests and hardens.

Poem

🐰 Snapshots now find their way home,
With replicas placed where they're known,
Requeue budgets halt the endless wait,
Before the blank-create seals fate,
Backend-pinned, we hop with grace—
Safe clones in their rightful place! 🍀

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the primary change: ensuring clone/restore replicas are placed on snapshot-holding nodes within the source storage pool backend.
Docstring Coverage ✅ Passed Docstring coverage is 96.77% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/bug-038-rd-clone-cross-backend

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses Bug 038 by enforcing restore-source constraints directly within the placer's Place method and refactoring the REST snapshot-restore handlers to place replicas on all snapshot-holding nodes in the source's storage pool by default. This prevents cross-backend placement issues, such as FILE_THIN snapshots being restored onto ZFS pools. The changes are accompanied by comprehensive unit and integration tests, as well as a minor fix to bash scripting in the E2E tests. Feedback on the implementation points out a potential nil pointer dereference in constrainFilterToRestoreSource if the filter parameter is nil.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread pkg/placer/placer.go
// filter passes through unchanged and the regular placement gates
// apply (satellite-side Bug 397 blank-fallback seeding still protects
// data integrity in those edge cases).
func (p *Placer) constrainFilterToRestoreSource(ctx context.Context, rdName string, filter *apiv1.AutoSelectFilter) *apiv1.AutoSelectFilter {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The filter parameter is dereferenced as out := *filter on line 1799. If filter is nil, this will cause a nil pointer dereference panic. To prevent this, add a defensive nil check at the beginning of the function.

func (p *Placer) constrainFilterToRestoreSource(ctx context.Context, rdName string, filter *apiv1.AutoSelectFilter) *apiv1.AutoSelectFilter {
	if filter == nil {
		return nil
	}

Andrei Kvapil (kvaps) and others added 9 commits June 13, 2026 16:44
…e pool

Bug 038: the snapshot-restore (and rd-clone) placement branch for the
empty-node-list case deferred to the parent ResourceGroup's
SelectFilter.PlaceCount. Under the stand-default empty-spec DfltRscGrp
the restore placed zero replicas, leaving the real placement to the
controller-side RG reconcilers, whose unconstrained placer pass could
land a clone replica on a pool of a different backend. The satellite
then piped the source's FILE_THIN snapshot stream into zfs recv, which
looped forever on 'cannot receive: invalid stream (bad magic number)'
and the clone never reached UpToDate.

Mirror upstream LINSTOR (verified against the live linstor-oracle,
1.33.2): a restore without an explicit node list lands on exactly the
snapshot-holding nodes, in the source replica's storage pool, never
through the autoplacer. The clone data plane is therefore node-local
and same-backend by construction.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…e backend

Bug 038 defense in depth: the REST autoplace handler already pinned
candidate nodes/providers for snapshot-restored RDs, but the
controller-side Place callers (ResourceGroup apply, RG rebalance, node
replacement) fed the placer the raw RG SelectFilter and could place a
clone replica on a different backend. Enforce the restore-source
constraints inside Place() so every caller honours them; the REST
handler's provider lookup now delegates to the shared placer helper.

The constraint operates on a copy of the caller's filter, no-ops for
RDs without the BlockstorRestoreFromSnapshot marker, and falls through
unchanged when the marker/snapshot/source state is unresolvable.

.golangci.yml: pkg/placer/ joins the per-path godox exclusion list
(Bug-NNN cross-reference comments, same as the sibling packages).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…abort

mapfile fed from a process substitution swallows the producer's exit
status, so a wait_clone_replicas timeout left the nodes array empty and
the cell died on 'cli_nodes[0]: unbound variable' under set -u,
hiding the actual diagnostic. Capture the output through a plain
command substitution and fail loudly with the function's own message.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Bug 038 regression: the clone/restore data plane is a node-local
RestoreVolumeFromSnapshot, but the per-node snapshot materialises
asynchronously via the satellite SnapshotReconciler. Stamping the
restore-marked replicas synchronously on all snapshot nodes let a
replica reconcile before its co-located snapshot existed, hitting
ErrNotFound and a terminal blank CreateVolume that poisoned the
dataset and deadlocked every replica empty.

Gate placement: wait for the snapshot to report a non-zero per-node
CreateTimestamp on the target nodes before stamping. The
SnapshotReconciler stamps that only after the on-disk snapshot exists,
so the wait closes the race. Best-effort and bounded: a timeout (or a
legacy snapshot with no per-node tracking) falls through to placement,
with the satellite-side requeue as the backstop.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
The restore-source constraint moved into Place() (so RG apply / RG
rebalance / node-replacement callers honour it) must enforce ONLY the
same-backend provider pin, which is what prevents the FILE_THIN stream
into zfs recv bad-magic loop. An earlier revision also hard-pinned the
candidate node set to the snapshot's nodes; that broke legitimate
staged cross-node bring-up — the cross-node restore lanes add a replica
on a fresh node and populate it over the wire, and the node-pin made
the controller-side reconcilers refuse to ever place it.

Drop the node-pin; keep the backend pin. Add a regression test proving
a restore-marked RD can gain a cross-node replica on a same-backend
pool, alongside the existing backend-pin and no-leak pins.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…ank create

Bug 038 defense-in-depth: the per-node snapshot materialises
asynchronously, so a node-local restore replica that reconciles a beat
early sees RestoreVolumeFromSnapshot return ErrNotFound. Falling
straight through to a blank CreateVolume is terminal — datasetExists
then short-circuits the real clone forever and the Bug 397 guard
SyncTargets every replica into an all-empty deadlock.

Requeue the reconcile up to a bounded budget to let the local snapshot
land before conceding to the blank fallback. A genuinely cross-node /
ship-only replica (snapshot never local, no fetcher) exhausts the
budget and degrades to the pre-fix DRBD-resync blank fallback. The
REST-side readiness gate is the primary fix; this is the backstop for
the residual CRD-status-vs-on-disk window.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Bug 038 root cause on stand: Resources().ListByDefinition trusted a
label-selector fast path and only fell back to a full scan when the
selector returned ZERO items, assuming a partial subset was impossible
because every REST writer sets the label. That breaks for a MIXED RD
whose diskful replicas were applied via kubectl (no label) while its
auto-tiebreaker witness was stamped by the controller through the
store (labelled): the selector returned only the labelled witness, the
fallback was skipped, and the unlabelled diskful replicas were hidden.

The snapshot-restore / clone handler reads this list to resolve the
source pool, so with the diskful replicas hidden it stamped the clone
replicas with an empty StorPoolName and the satellite failed every
reconcile with 'unknown storage pool' — the clone never converged.

Scan-and-filter on the authoritative Spec.ResourceDefinitionName. The
List is served from the controller-runtime informer cache, so the
in-memory filter costs the same as a label index but is correct for
labelled and unlabelled replicas alike.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
PR #153 changed the bare snapshot-restore handler to synchronously
stamp a replica on EVERY snapshot node. That fixed the clone verb (a
one-shot CSI op with no follow-up autoplace) but regressed the staged
cross-node restore workflow: the e2e restore lanes place the
data-bearing replica on a snapshot node first, wait UpToDate, then add
a cross-node replica that SyncTargets it. With both snapshot nodes
already eagerly placed, place_count was satisfied and the cross-node
replica was never created (stuck Connecting / no devicePath).

Split the placement policy by caller. The clone path (cloneWithData)
eager-places on the snapshot nodes in the source pool. The bare
restore handler leaves an empty shell when no --node-name is given, so
the operator / linstor-csi drives placement — exactly the pre-#153 /
upstream restore-then-scale-out contract. Explicit node lists are
still stamped verbatim on both paths.

Cross-backend protection is unaffected: the placer's restore-source
backend pin keeps the operator's follow-up autoplace on the source
backend, and the REST autoplace handler still constrains a
no-node-list autoplace of a restore-marked RD to the snapshot nodes.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps Andrei Kvapil (kvaps) force-pushed the fix/bug-038-rd-clone-cross-backend branch from d68397c to 24baa83 Compare June 13, 2026 15:26

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/satellite/reconciler.go (1)

1554-1564: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reset the missing-snapshot budget on any non-ErrNotFound outcome.

restoreSnapshotMissingTries is documented as a count of consecutive local-miss passes, but this branch only clears it on success. A sequence like ErrNotFound → ErrNotFound → transient restore error → ErrNotFound resumes at the old count and can hit the blank-create fallback early, recreating the empty-target path this backstop is supposed to avoid.

Proposed fix
 	if !errors.Is(err, storage.ErrNotFound) {
+		// Break the "consecutive local snapshot miss" streak on any other outcome.
+		r.clearRestoreSnapshotMissing(rdName, vol.GetVolumeNumber())
+
 		if err == nil {
 			// Local clone succeeded: this replica holds the snapshot's
 			// data. Clear any stale blank-fallback marker so the
 			// legitimate all-clone restore fast-path keeps its skip, and
 			// reset the Bug 038 not-yet-materialised requeue budget.
 			r.recordRestoreBlankFallback(rdName, vol.GetVolumeNumber(), false)
-			r.clearRestoreSnapshotMissing(rdName, vol.GetVolumeNumber())
 		}
 
 		return err //nolint:wrapcheck // caller wraps
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/satellite/reconciler.go` around lines 1554 - 1564, The branch that
handles non-ErrNotFound outcomes currently only calls
r.clearRestoreSnapshotMissing(...) when err == nil; change this so the
missing-snapshot counter is cleared on any non-ErrNotFound outcome (i.e., when
!errors.Is(err, storage.ErrNotFound)), not just on success. Concretely, keep
r.recordRestoreBlankFallback(...) inside the err == nil block, but move or add
r.clearRestoreSnapshotMissing(rdName, vol.GetVolumeNumber()) to the outer
!errors.Is(err, storage.ErrNotFound) branch so it runs for both success and
transient errors before returning err from the function that contains these
calls.
🧹 Nitpick comments (1)
pkg/rest/snapshot_restore_test.go (1)

214-224: ⚡ Quick win

Align test docstrings with the current bare-restore behavior.

These comments still describe eager replica stamping, but the test body now asserts empty-shell restore followed by explicit autoplace. Please update wording to avoid future misreads of the intended contract.

✏️ Suggested comment-only fix
-// retargeted by Bug 038: the restore no longer defers placement to
-// the autoplacer — it stamps replicas on the snapshot's nodes in the
-// source replica's pool up front (...)
+// retargeted by Bug 038: bare restore with no explicit nodes now
+// leaves an empty shell, and follow-up autoplace must honor the
+// restore-source backend pin (never crossing backend kind).

-// empty-nodes branch: no `--node-name` arguments → the handler stamps
-// one replica on EVERY node holding the snapshot (...)
+// empty-nodes branch: no `--node-name` arguments → the handler creates
+// an empty shell; placement is done explicitly via autoplace and must
+// stay pinned to the source backend.

Also applies to: 843-849

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/rest/snapshot_restore_test.go` around lines 214 - 224, The docstring for
TestSnapshotRestoreConstrainsProviderToSource is outdated: it describes eager
replica stamping on restore but the test now asserts an empty-shell restore
followed by an explicit autoplace; update the comment above that test (and the
similar block at lines 843-849) to describe the current behavior—that restores
create an empty-shell in the source pool and then require an explicit autoplace
to populate replicas—removing any language about stamping replicas eagerly or
deferring placement to the autoplacer.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@pkg/satellite/reconciler.go`:
- Around line 1554-1564: The branch that handles non-ErrNotFound outcomes
currently only calls r.clearRestoreSnapshotMissing(...) when err == nil; change
this so the missing-snapshot counter is cleared on any non-ErrNotFound outcome
(i.e., when !errors.Is(err, storage.ErrNotFound)), not just on success.
Concretely, keep r.recordRestoreBlankFallback(...) inside the err == nil block,
but move or add r.clearRestoreSnapshotMissing(rdName, vol.GetVolumeNumber()) to
the outer !errors.Is(err, storage.ErrNotFound) branch so it runs for both
success and transient errors before returning err from the function that
contains these calls.

---

Nitpick comments:
In `@pkg/rest/snapshot_restore_test.go`:
- Around line 214-224: The docstring for
TestSnapshotRestoreConstrainsProviderToSource is outdated: it describes eager
replica stamping on restore but the test now asserts an empty-shell restore
followed by an explicit autoplace; update the comment above that test (and the
similar block at lines 843-849) to describe the current behavior—that restores
create an empty-shell in the source pool and then require an explicit autoplace
to populate replicas—removing any language about stamping replicas eagerly or
deferring placement to the autoplacer.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c0bf53d6-2f17-4186-9475-662ba484be79

📥 Commits

Reviewing files that changed from the base of the PR and between 170751b and 24baa83.

📒 Files selected for processing (20)
  • .golangci.yml
  • docs/cli-parity-known-deltas.md
  • pkg/placer/placer.go
  • pkg/placer/restore_source_constraints_bug038_test.go
  • pkg/rest/autoplace.go
  • pkg/rest/clone_placement_bug038_test.go
  • pkg/rest/nodes_test.go
  • pkg/rest/rd_clone.go
  • pkg/rest/server.go
  • pkg/rest/snapshot_ready.go
  • pkg/rest/snapshot_ready_test.go
  • pkg/rest/snapshot_restore.go
  • pkg/rest/snapshot_restore_test.go
  • pkg/satellite/bug_397_restore_seed_test.go
  • pkg/satellite/cross_cluster_ship_test.go
  • pkg/satellite/reconciler.go
  • pkg/satellite/ship_dispatch_test.go
  • pkg/store/k8s/list_by_definition_mixed_label_bug038_test.go
  • pkg/store/k8s/resources.go
  • tests/e2e/cli-matrix/rd-clone-vd-data-plane.sh

The clone/restore rework added a snapshot-readiness gate that BLOCKED
inside the restore + clone HTTP handlers, polling for a non-zero per-node
CreateTimestamp before stamping the restore-marked replicas. In an
envtest harness there is no satellite to stamp that timestamp, so the
handler blocked until the client deadline — TestGroupG/SnapRestore
SnapshotHolderOnly failed with "context deadline exceeded". It is also a
real-world hazard: a failed / never-Ready snapshot would hang every
restore POST until the operator's timeout.

The not-yet-ready snapshot race is already absorbed on the satellite
side: materializeVolume requeues the reconcile (bounded budget, ~5s
backoff each) on RestoreVolumeFromSnapshot ErrNotFound before conceding
to a terminal blank CreateVolume, so the local @snap gets time to land.
That mechanism carries the race on its own — no upstream handler wait is
needed.

Remove the gate call from placeRestoredResources (the single chokepoint
for both the restore and clone paths), delete the now-dead
snapshot-ready helper + its tuning fields + tests, and add an
integration-shaped L1 regression that fails if the restore/clone POST
ever blocks on satellite-set snapshot status again (asserts the POST
returns its success envelope within a tight deadline with no satellite
present).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps Andrei Kvapil (kvaps) merged commit 2b9b648 into main Jun 13, 2026
27 of 28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant