fix(clone): land clone/restore replicas on snapshot nodes in the source pool by kvaps · Pull Request #153 · cozystack/blockstor

Andrei Kvapil (kvaps) · 2026-06-13T01:44:53Z

Problem

rd clone of a deployed, VD-bearing source on a FILE_THIN pool could materialise a target replica on a ZFS pool. The target satellite then piped the source's FILE_THIN snapshot stream into zfs recv, which loops forever on cannot receive: invalid stream (bad magic number) — the clone replicas never reach UpToDate. Reproduced on two stands via tests/e2e/cli-matrix/rd-clone-vd-data-plane.sh.

Root cause

Two placement gaps combined:

The snapshot-restore placement branch for the empty-node-list case (which the clone data plane reuses) deferred to the parent ResourceGroup's SelectFilter.PlaceCount. Under the stand-default empty-spec DfltRscGrp it placed zero replicas.
The controller-side ResourceGroup reconcilers then ran the placer with the raw RG filter — no snapshot-node constraint, no provider pin — so the capacity-weighted ranking could pick a pool of a different backend. Snapshot stream formats are not interchangeable across backends, so the cross-node restore fed a foreign stream into the ZFS receiver.

Fix (upstream-faithful)

Verified against the live LINSTOR 1.33.2 oracle: an upstream restore without an explicit node list lands on exactly the snapshot-holding nodes, in the snapshot's own storage pool, and never consults the autoplacer; upstream rd clone likewise operates on the source's own pools (it refuses sources on pools it cannot clone with "Clone source contains unsupported storage pools").

pkg/rest: restore/clone with no explicit nodes now stamps one replica per snapshot-holding node, pinned to the source replica's storage pool — the clone data plane is node-local and same-backend by construction, independent of the RG's place_count.
pkg/placer: Place() now applies the restore-source constraints (snapshot nodes + source provider kind) internally, so every caller — REST autoplace, RG apply, RG rebalance, node replacement — honours them. The REST handler's provider lookup delegates to the shared placer helper.
harness: the cell aborted with cli_nodes[0]: unbound variable on failure (mapfile from a process substitution swallows the exit status), masking the real diagnostic; failures now surface the wait function's own message.
docs/cli-parity-known-deltas.md row 82 updated with the placement semantics.

Validation

L1: new regression tests in pkg/rest/clone_placement_bug038_test.go, pkg/placer/restore_source_constraints_bug038_test.go; updated restore-placement pins. go test ./... green, golangci-lint clean, -race green on satellite/placer/rest.
L6: cli-matrix/rd-clone-vd-data-plane.sh green 3x on stand big at this branch's build (both wire variants: plain CLI clone and raw REST use_zfs_clone=true, byte-level marker verified on every replica). cli-matrix/r-full-lifecycle.sh green (placer/autoplace ground-truth gate).
L7: replay/rd-clone-vd-data-plane.yaml PASS on stand big.

Summary by CodeRabbit

Release Notes

Bug Fixes
- Clone operations now correctly place replicas on snapshot-holding nodes within the source storage pool.
- Restore operations properly constrain placement to the source pool's backend type.
- Improved handling of missing snapshots with bounded retry and graceful fallback behavior.
- Fixed resource listing to correctly identify resources created through different paths.
New Features
- Added bare restore mode for snapshot restores (leaves empty shell for operator placement).
- Added eager placement mode for clone operations (places replicas immediately on snapshot nodes).
Tests
- Added comprehensive regression test coverage for clone/restore placement constraints and fallback scenarios.

coderabbitai · 2026-06-13T01:45:01Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 515bfeff-c2fd-4130-a37e-07c456838565

📥 Commits

Reviewing files that changed from the base of the PR and between 24baa83 and 5afc651.

📒 Files selected for processing (2)

pkg/rest/bug_038_restore_no_handler_block_test.go
pkg/rest/snapshot_restore.go

🚧 Files skipped from review as they are similar to previous changes (1)

pkg/rest/snapshot_restore.go

📝 Walkthrough

Walkthrough

This PR completes Bug 038 implementation by adding backend pinning for snapshot clones/restores, satellite requeue defense for missing snapshots, a store label-lookup fix for unlabeled resources, and REST handler non-blocking guarantees.

Changes

Bug 038 - Snapshot clone/restore backend pinning and satellite resilience

Layer / File(s)	Summary
Placer restore-source backend constraint mechanism `pkg/placer/placer.go`, `pkg/placer/restore_source_constraints_bug038_test.go`, `.golangci.yml`	Placer.Place now applies restore-source backend constraints by overriding the filter's ProviderList to the source pool's backend kind. New SourceProviderKind helper resolves the source pool backend by inspecting source resources. Tests verify constraint application, filter immutability, and free-placement behavior for unmarked RDs. Golangci exclusion disables godox warnings for placer path.
Clone eager placement on snapshot nodes `pkg/rest/rd_clone.go`, `pkg/rest/clone_placement_bug038_test.go`	Clone path calls materializeRestoredRD with eagerPlace=true, stamping replicas synchronously on snapshot-holding nodes using per-node backend-matched pools. Integration test verifies clones land on snapshot nodes (`w1`, `w2`) in the source FILE_THIN pool (`stand`), not on higher-capacity decoy pools.
Restore bare-shell semantics with per-node pool selection `pkg/rest/snapshot_restore.go`, `pkg/rest/snapshot_restore_test.go`	Snapshot restore distinguishes bare-restore (eagerPlace=false, empty shell for operator placement) from eager clone (eagerPlace=true, one-shot stamping on snap.Nodes). Removes ResourceGroup auto-placement. Per-node StorPoolName selection via storPoolsByNodeFromSourceRD maps source replicas to pools with fallback. Tests verify bare-restore leaves empty shell, autoplace honors backend pin, and decoy pools are avoided.
REST autoplace refactoring to shared provider-kind resolution `pkg/rest/autoplace.go`	resolveCloneSourceProviderKind now delegates to placer.SourceProviderKind helper, simplifying REST code by reusing the shared placer implementation.
Satellite bounded requeue defense for missing snapshots `pkg/satellite/reconciler.go`, `pkg/satellite/bug_397_restore_seed_test.go`, `pkg/satellite/cross_cluster_ship_test.go`, `pkg/satellite/ship_dispatch_test.go`	Satellite reconciler adds bounded requeue defense via restoreSnapshotMissingTries counter and restoreSnapshotMissingBudget limit. When RestoreVolumeFromSnapshot fails with ErrNotFound (and no cross-node fetcher), requeues up to budget passes before falling back to blank CreateVolume. Tests verify requeue continues within budget, then permits blank fallback only after exhaustion.
Store resource listing correctness fix for unlabeled resources `pkg/store/k8s/resources.go`, `pkg/store/k8s/list_by_definition_mixed_label_bug038_test.go`	ListByDefinition now unconditionally lists all Resource CRDs and filters in-memory by ResourceDefinitionName, removing the label-selector fast path that silently omitted kubectl-apply or externally-created resources. Tests verify mixed labeled/unlabeled lists and props-only pool specs are correctly returned.
REST handler non-blocking promises for snapshot readiness `pkg/rest/bug_038_restore_no_handler_block_test.go`	Adds regression tests ensuring clone and restore REST endpoints return promptly without blocking on snapshot readiness. Introduces httpPostWithDeadline helper (5s timeout) and seedRestoreSourceWithNeverReadySnapshot fixture. Tests verify handlers stamp replicas synchronously, meeting the non-blocking contract to avoid deadlocks when satellites are absent.

Documentation and test tooling updates

Layer / File(s)	Summary
Documentation for clone/restore backend-pinning semantics `docs/cli-parity-known-deltas.md`	Updated CLI parity delta row `#82` to explicitly document clone/restore replicas stamped on snapshot-holding nodes in the source pool, with refusal of cross-backend placement.
Shell script error propagation in e2e clone test `tests/e2e/cli-matrix/rd-clone-vd-data-plane.sh`	Both clone variants now capture wait_clone_replicas output into intermediate variables before mapfile, ensuring exit errors propagate correctly under set -euo pipefail.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

cozystack/blockstor#99: Introduces the same placer ProviderList override semantics for restore-source backend pinning that this PR tests and hardens.

Poem

🐰 Snapshots now find their way home,
With replicas placed where they're known,
Requeue budgets halt the endless wait,
Before the blank-create seals fate,
Backend-pinned, we hop with grace—
Safe clones in their rightful place! 🍀

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the primary change: ensuring clone/restore replicas are placed on snapshot-holding nodes within the source storage pool backend.
Docstring Coverage	✅ Passed	Docstring coverage is 96.77% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/bug-038-rd-clone-cross-backend

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request addresses Bug 038 by enforcing restore-source constraints directly within the placer's Place method and refactoring the REST snapshot-restore handlers to place replicas on all snapshot-holding nodes in the source's storage pool by default. This prevents cross-backend placement issues, such as FILE_THIN snapshots being restored onto ZFS pools. The changes are accompanied by comprehensive unit and integration tests, as well as a minor fix to bash scripting in the E2E tests. Feedback on the implementation points out a potential nil pointer dereference in constrainFilterToRestoreSource if the filter parameter is nil.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-13T01:46:19Z

+// filter passes through unchanged and the regular placement gates
+// apply (satellite-side Bug 397 blank-fallback seeding still protects
+// data integrity in those edge cases).
+func (p *Placer) constrainFilterToRestoreSource(ctx context.Context, rdName string, filter *apiv1.AutoSelectFilter) *apiv1.AutoSelectFilter {


The filter parameter is dereferenced as out := *filter on line 1799. If filter is nil, this will cause a nil pointer dereference panic. To prevent this, add a defensive nil check at the beginning of the function.

func (p *Placer) constrainFilterToRestoreSource(ctx context.Context, rdName string, filter *apiv1.AutoSelectFilter) *apiv1.AutoSelectFilter { if filter == nil { return nil }

…e pool Bug 038: the snapshot-restore (and rd-clone) placement branch for the empty-node-list case deferred to the parent ResourceGroup's SelectFilter.PlaceCount. Under the stand-default empty-spec DfltRscGrp the restore placed zero replicas, leaving the real placement to the controller-side RG reconcilers, whose unconstrained placer pass could land a clone replica on a pool of a different backend. The satellite then piped the source's FILE_THIN snapshot stream into zfs recv, which looped forever on 'cannot receive: invalid stream (bad magic number)' and the clone never reached UpToDate. Mirror upstream LINSTOR (verified against the live linstor-oracle, 1.33.2): a restore without an explicit node list lands on exactly the snapshot-holding nodes, in the source replica's storage pool, never through the autoplacer. The clone data plane is therefore node-local and same-backend by construction. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…e backend Bug 038 defense in depth: the REST autoplace handler already pinned candidate nodes/providers for snapshot-restored RDs, but the controller-side Place callers (ResourceGroup apply, RG rebalance, node replacement) fed the placer the raw RG SelectFilter and could place a clone replica on a different backend. Enforce the restore-source constraints inside Place() so every caller honours them; the REST handler's provider lookup now delegates to the shared placer helper. The constraint operates on a copy of the caller's filter, no-ops for RDs without the BlockstorRestoreFromSnapshot marker, and falls through unchanged when the marker/snapshot/source state is unresolvable. .golangci.yml: pkg/placer/ joins the per-path godox exclusion list (Bug-NNN cross-reference comments, same as the sibling packages). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…abort mapfile fed from a process substitution swallows the producer's exit status, so a wait_clone_replicas timeout left the nodes array empty and the cell died on 'cli_nodes[0]: unbound variable' under set -u, hiding the actual diagnostic. Capture the output through a plain command substitution and fail loudly with the function's own message. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

Bug 038 regression: the clone/restore data plane is a node-local RestoreVolumeFromSnapshot, but the per-node snapshot materialises asynchronously via the satellite SnapshotReconciler. Stamping the restore-marked replicas synchronously on all snapshot nodes let a replica reconcile before its co-located snapshot existed, hitting ErrNotFound and a terminal blank CreateVolume that poisoned the dataset and deadlocked every replica empty. Gate placement: wait for the snapshot to report a non-zero per-node CreateTimestamp on the target nodes before stamping. The SnapshotReconciler stamps that only after the on-disk snapshot exists, so the wait closes the race. Best-effort and bounded: a timeout (or a legacy snapshot with no per-node tracking) falls through to placement, with the satellite-side requeue as the backstop. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

The restore-source constraint moved into Place() (so RG apply / RG rebalance / node-replacement callers honour it) must enforce ONLY the same-backend provider pin, which is what prevents the FILE_THIN stream into zfs recv bad-magic loop. An earlier revision also hard-pinned the candidate node set to the snapshot's nodes; that broke legitimate staged cross-node bring-up — the cross-node restore lanes add a replica on a fresh node and populate it over the wire, and the node-pin made the controller-side reconcilers refuse to ever place it. Drop the node-pin; keep the backend pin. Add a regression test proving a restore-marked RD can gain a cross-node replica on a same-backend pool, alongside the existing backend-pin and no-leak pins. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…ank create Bug 038 defense-in-depth: the per-node snapshot materialises asynchronously, so a node-local restore replica that reconciles a beat early sees RestoreVolumeFromSnapshot return ErrNotFound. Falling straight through to a blank CreateVolume is terminal — datasetExists then short-circuits the real clone forever and the Bug 397 guard SyncTargets every replica into an all-empty deadlock. Requeue the reconcile up to a bounded budget to let the local snapshot land before conceding to the blank fallback. A genuinely cross-node / ship-only replica (snapshot never local, no fetcher) exhausts the budget and degrades to the pre-fix DRBD-resync blank fallback. The REST-side readiness gate is the primary fix; this is the backstop for the residual CRD-status-vs-on-disk window. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

Bug 038 root cause on stand: Resources().ListByDefinition trusted a label-selector fast path and only fell back to a full scan when the selector returned ZERO items, assuming a partial subset was impossible because every REST writer sets the label. That breaks for a MIXED RD whose diskful replicas were applied via kubectl (no label) while its auto-tiebreaker witness was stamped by the controller through the store (labelled): the selector returned only the labelled witness, the fallback was skipped, and the unlabelled diskful replicas were hidden. The snapshot-restore / clone handler reads this list to resolve the source pool, so with the diskful replicas hidden it stamped the clone replicas with an empty StorPoolName and the satellite failed every reconcile with 'unknown storage pool' — the clone never converged. Scan-and-filter on the authoritative Spec.ResourceDefinitionName. The List is served from the controller-runtime informer cache, so the in-memory filter costs the same as a label index but is correct for labelled and unlabelled replicas alike. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

PR #153 changed the bare snapshot-restore handler to synchronously stamp a replica on EVERY snapshot node. That fixed the clone verb (a one-shot CSI op with no follow-up autoplace) but regressed the staged cross-node restore workflow: the e2e restore lanes place the data-bearing replica on a snapshot node first, wait UpToDate, then add a cross-node replica that SyncTargets it. With both snapshot nodes already eagerly placed, place_count was satisfied and the cross-node replica was never created (stuck Connecting / no devicePath). Split the placement policy by caller. The clone path (cloneWithData) eager-places on the snapshot nodes in the source pool. The bare restore handler leaves an empty shell when no --node-name is given, so the operator / linstor-csi drives placement — exactly the pre-#153 / upstream restore-then-scale-out contract. Explicit node lists are still stamped verbatim on both paths. Cross-backend protection is unaffected: the placer's restore-source backend pin keeps the operator's follow-up autoplace on the source backend, and the REST autoplace handler still constrains a no-node-list autoplace of a restore-marked RD to the snapshot nodes. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/satellite/reconciler.go (1)

1554-1564: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reset the missing-snapshot budget on any non-ErrNotFound outcome.

restoreSnapshotMissingTries is documented as a count of consecutive local-miss passes, but this branch only clears it on success. A sequence like ErrNotFound → ErrNotFound → transient restore error → ErrNotFound resumes at the old count and can hit the blank-create fallback early, recreating the empty-target path this backstop is supposed to avoid.

Proposed fix

 	if !errors.Is(err, storage.ErrNotFound) {
+		// Break the "consecutive local snapshot miss" streak on any other outcome.
+		r.clearRestoreSnapshotMissing(rdName, vol.GetVolumeNumber())
+
 		if err == nil {
 			// Local clone succeeded: this replica holds the snapshot's
 			// data. Clear any stale blank-fallback marker so the
 			// legitimate all-clone restore fast-path keeps its skip, and
 			// reset the Bug 038 not-yet-materialised requeue budget.
 			r.recordRestoreBlankFallback(rdName, vol.GetVolumeNumber(), false)
-			r.clearRestoreSnapshotMissing(rdName, vol.GetVolumeNumber())
 		}
 
 		return err //nolint:wrapcheck // caller wraps
 	}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/satellite/reconciler.go` around lines 1554 - 1564, The branch that
handles non-ErrNotFound outcomes currently only calls
r.clearRestoreSnapshotMissing(...) when err == nil; change this so the
missing-snapshot counter is cleared on any non-ErrNotFound outcome (i.e., when
!errors.Is(err, storage.ErrNotFound)), not just on success. Concretely, keep
r.recordRestoreBlankFallback(...) inside the err == nil block, but move or add
r.clearRestoreSnapshotMissing(rdName, vol.GetVolumeNumber()) to the outer
!errors.Is(err, storage.ErrNotFound) branch so it runs for both success and
transient errors before returning err from the function that contains these
calls.

🧹 Nitpick comments (1)

pkg/rest/snapshot_restore_test.go (1)

214-224: ⚡ Quick win

Align test docstrings with the current bare-restore behavior.

These comments still describe eager replica stamping, but the test body now asserts empty-shell restore followed by explicit autoplace. Please update wording to avoid future misreads of the intended contract.

✏️ Suggested comment-only fix

-// retargeted by Bug 038: the restore no longer defers placement to
-// the autoplacer — it stamps replicas on the snapshot's nodes in the
-// source replica's pool up front (...)
+// retargeted by Bug 038: bare restore with no explicit nodes now
+// leaves an empty shell, and follow-up autoplace must honor the
+// restore-source backend pin (never crossing backend kind).

-// empty-nodes branch: no `--node-name` arguments → the handler stamps
-// one replica on EVERY node holding the snapshot (...)
+// empty-nodes branch: no `--node-name` arguments → the handler creates
+// an empty shell; placement is done explicitly via autoplace and must
+// stay pinned to the source backend.

Also applies to: 843-849

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/rest/snapshot_restore_test.go` around lines 214 - 224, The docstring for
TestSnapshotRestoreConstrainsProviderToSource is outdated: it describes eager
replica stamping on restore but the test now asserts an empty-shell restore
followed by an explicit autoplace; update the comment above that test (and the
similar block at lines 843-849) to describe the current behavior—that restores
create an empty-shell in the source pool and then require an explicit autoplace
to populate replicas—removing any language about stamping replicas eagerly or
deferring placement to the autoplacer.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@pkg/satellite/reconciler.go`:
- Around line 1554-1564: The branch that handles non-ErrNotFound outcomes
currently only calls r.clearRestoreSnapshotMissing(...) when err == nil; change
this so the missing-snapshot counter is cleared on any non-ErrNotFound outcome
(i.e., when !errors.Is(err, storage.ErrNotFound)), not just on success.
Concretely, keep r.recordRestoreBlankFallback(...) inside the err == nil block,
but move or add r.clearRestoreSnapshotMissing(rdName, vol.GetVolumeNumber()) to
the outer !errors.Is(err, storage.ErrNotFound) branch so it runs for both
success and transient errors before returning err from the function that
contains these calls.

---

Nitpick comments:
In `@pkg/rest/snapshot_restore_test.go`:
- Around line 214-224: The docstring for
TestSnapshotRestoreConstrainsProviderToSource is outdated: it describes eager
replica stamping on restore but the test now asserts an empty-shell restore
followed by an explicit autoplace; update the comment above that test (and the
similar block at lines 843-849) to describe the current behavior—that restores
create an empty-shell in the source pool and then require an explicit autoplace
to populate replicas—removing any language about stamping replicas eagerly or
deferring placement to the autoplacer.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c0bf53d6-2f17-4186-9475-662ba484be79

📥 Commits

Reviewing files that changed from the base of the PR and between 170751b and 24baa83.

📒 Files selected for processing (20)

.golangci.yml
docs/cli-parity-known-deltas.md
pkg/placer/placer.go
pkg/placer/restore_source_constraints_bug038_test.go
pkg/rest/autoplace.go
pkg/rest/clone_placement_bug038_test.go
pkg/rest/nodes_test.go
pkg/rest/rd_clone.go
pkg/rest/server.go
pkg/rest/snapshot_ready.go
pkg/rest/snapshot_ready_test.go
pkg/rest/snapshot_restore.go
pkg/rest/snapshot_restore_test.go
pkg/satellite/bug_397_restore_seed_test.go
pkg/satellite/cross_cluster_ship_test.go
pkg/satellite/reconciler.go
pkg/satellite/ship_dispatch_test.go
pkg/store/k8s/list_by_definition_mixed_label_bug038_test.go
pkg/store/k8s/resources.go
tests/e2e/cli-matrix/rd-clone-vd-data-plane.sh

@snap

The clone/restore rework added a snapshot-readiness gate that BLOCKED inside the restore + clone HTTP handlers, polling for a non-zero per-node CreateTimestamp before stamping the restore-marked replicas. In an envtest harness there is no satellite to stamp that timestamp, so the handler blocked until the client deadline — TestGroupG/SnapRestore SnapshotHolderOnly failed with "context deadline exceeded". It is also a real-world hazard: a failed / never-Ready snapshot would hang every restore POST until the operator's timeout. The not-yet-ready snapshot race is already absorbed on the satellite side: materializeVolume requeues the reconcile (bounded budget, ~5s backoff each) on RestoreVolumeFromSnapshot ErrNotFound before conceding to a terminal blank CreateVolume, so the local @snap gets time to land. That mechanism carries the race on its own — no upstream handler wait is needed. Remove the gate call from placeRestoredResources (the single chokepoint for both the restore and clone paths), delete the now-dead snapshot-ready helper + its tuning fields + tests, and add an integration-shaped L1 regression that fails if the restore/clone POST ever blocks on satellite-set snapshot status again (asserts the POST returns its success envelope within a tight deadline with no satellite present). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

gemini-code-assist Bot reviewed Jun 13, 2026

View reviewed changes

Andrei Kvapil (kvaps) and others added 9 commits June 13, 2026 16:44

docs(parity): record same-pool clone placement semantics in delta row 82

fa6a8da

Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

Andrei Kvapil (kvaps) force-pushed the fix/bug-038-rd-clone-cross-backend branch from d68397c to 24baa83 Compare June 13, 2026 15:26

coderabbitai Bot reviewed Jun 13, 2026

View reviewed changes

Andrei Kvapil (kvaps) merged commit 2b9b648 into main Jun 13, 2026
27 of 28 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(clone): land clone/restore replicas on snapshot nodes in the source pool#153

fix(clone): land clone/restore replicas on snapshot nodes in the source pool#153
Andrei Kvapil (kvaps) merged 10 commits into
mainfrom
fix/bug-038-rd-clone-cross-backend

Andrei Kvapil (kvaps) commented Jun 13, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 13, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Andrei Kvapil (kvaps) commented Jun 13, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root cause

Fix (upstream-faithful)

Validation

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Andrei Kvapil (kvaps) commented Jun 13, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 13, 2026 •

edited

Loading