Skip to content

fix(snapshot): make consistency-group snapshot atomic via coordinated suspend barrier (BUG-046)#160

Merged
Andrei Kvapil (kvaps) merged 3 commits into
mainfrom
fix/bug-046-consistency-group-snapshot-atomic
Jun 14, 2026
Merged

fix(snapshot): make consistency-group snapshot atomic via coordinated suspend barrier (BUG-046)#160
Andrei Kvapil (kvaps) merged 3 commits into
mainfrom
fix/bug-046-consistency-group-snapshot-atomic

Conversation

@kvaps

Copy link
Copy Markdown
Member

Summary

linstor snapshot create-multiple (consistency-group snapshot across multiple RDs) was not point-in-time atomic. The apiserver stamped a shared GroupID + SuspendIO, but the controller never coordinated phase ENTRY across siblings, so they entered SuspendIO ~15s apart (budget ≤5s). A DB's data + WAL volumes on separate RDs were captured at different instants → group restore could be inconsistent.

Root cause (two staggering points, both per-sibling)

  1. Suspend entry: store stamped SuspendIO at Create time; apiserver creates sibling CRDs sequentially → each satellite froze the instant its own CRD landed (seconds apart).
  2. Take/resume: Phase 1→2 take and Phase 2→3 resume flipped only self per sibling → staggered take/resume reopened the write window before all snapshots were captured.

Fix — coordinated group barrier (non-empty GroupID)

  • Store no longer stamps SuspendIO at Create; apiserver stamps Spec.GroupSize on every sibling.
  • Controller holds the group at Phase 0 until every member is observable (groupAssembled), then opens SuspendIO on EVERY sibling in one reconcile pass (suspendGroup).
  • take/resume/abort fan out across the whole group in one pass (flipGroup) → take-all and resume-all atomic vs the writer.
  • Single-snapshot (empty GroupID) path unchanged.

No-stuck-suspended safety (deadlock-class)

All-suspended-before-any-snapshot (take still gates on every node's suspend ack); refuse-rather-than-freeze (non-UpToDate group resumes without freezing); bounded suspend (existing deadline force-aborts + resumes all); abort cascades resume-all on any per-node Failed; legacy GroupSize-missing falls back to observed-siblings count (no hang).

Tests

  • L1 snapshot_bug_046_test.go: barrier enters all together, holds until assembled, take/resume fan-out, abort/resume-all, legacy fallback + store/apiserver pins. go test ./..., -race, lint green.
  • L7 replay/snap-create-multiple-group-consistency.yaml PASS on stand big.
  • Stand big 3×: suspend slip 0.087/0.072/0.053s (≤5s), cross-RD counter diff=0 (point-in-time proven).

Harness note

cli-matrix/snap-create-multiple-lifecycle has pre-existing harness bugs (dd oflag=direct bs=8 on a block dev zeroes the marker; Phase 5 colon CLI form) addressed separately; the product behavior was verified directly + via a working buffered writer.

Andrei Kvapil (kvaps) and others added 3 commits June 14, 2026 02:36
…oup snapshots (BUG-046)

`snapshot create-multiple` (consistency-group snapshot across multiple
RDs) was not point-in-time atomic. The store stamped Spec.SuspendIO=true
at Create time, and the apiserver creates the sibling CRDs sequentially,
so each sibling's satellite started `drbdsetup suspend-io` the instant
its own CRD landed. On a busy stand the group's volumes froze ~15s apart
— far over the <=5s consistency budget — so a DB's data and WAL volumes
on separate RDs were captured at different instants and a group restore
could be inconsistent.

Add a controller-side suspend barrier for grouped snapshots: the store
no longer stamps SuspendIO=true for a grouped Snapshot (GroupID set);
instead the controller holds the whole group at Phase 0 until every
member is observable (new Spec.GroupSize denominator stamped by the
apiserver), then opens SuspendIO=true on every sibling in a single
reconcile pass so they all enter suspend within one controller cycle.
This bounds the suspend-entry slip far under budget while keeping the
existing all-suspended-before-any-take and resume-on-failure guarantees.

No-stuck-suspended safety is preserved: the barrier checks UpToDate
before freezing anything (refuse rather than suspend for an impossible
snapshot), the existing suspend/take deadline and abort cascade still
clear SuspendIO across the whole batch on any failure, and a legacy
grouped Snapshot with no GroupSize falls back to the observed-siblings
count so it can never hang on a missing denominator.

The single-snapshot (Bug-351) path is unchanged — empty GroupID keeps
the Create-time SuspendIO stamp and the self-only phase walks.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
L1: assert handleSnapshotCreateMulti stamps a shared GroupID + GroupSize
on every batch entry, and that the store defers SuspendIO for grouped
snapshots (no Create-time freeze) while keeping it for single snapshots.

L7: snap-create-multiple-group-consistency replay YAML codifies the
operator `snapshot create-multiple` sequence + the no-stuck-suspended /
no-orphans convergence contract for a two-RD consistency group.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…end entry (BUG-046)

Stand validation surfaced a second staggering point: with only the
suspend entry group-coordinated, the Phase 1->2 take promotion and the
Phase 2->3 resume still fired per-sibling at staggered reconcile times
(~15s apart observed), because each sibling flipped only its own
Spec.TakeSnapshot when the controller happened to reconcile it. A
sibling that resumed I/O while another was still mid-take reopened the
write window before every snapshot was captured.

Fan every grouped phase transition out across the whole batch in one
reconcile pass (flipGroup): take-all together, resume-all together. The
suspend barrier's suspendGroup is now the SuspendIO=true special case of
the same helper. Single-snapshot snapshots still flip self only.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai

coderabbitai Bot commented Jun 14, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@kvaps, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 36 minutes and 35 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4a9a4d3d-f11c-48f9-af6d-f0b07da6da09

📥 Commits

Reviewing files that changed from the base of the PR and between 2b9b648 and 4c4fc14.

📒 Files selected for processing (12)
  • api/v1alpha1/snapshot_types.go
  • config/crd/bases/blockstor.cozystack.io_snapshots.yaml
  • internal/controller/snapshot_bug_046_test.go
  • internal/controller/snapshot_controller.go
  • internal/controller/snapshot_timeout.go
  • pkg/api/v1/snapshot.go
  • pkg/rest/snapshot_multi.go
  • pkg/rest/snapshot_multi_test.go
  • pkg/store/k8s/snapshots.go
  • pkg/store/k8s/snapshots_internal_test.go
  • pkg/store/k8s/typed_field_carry_across_test.go
  • tests/operator-harness/replay/snap-create-multiple-group-consistency.yaml
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/bug-046-consistency-group-snapshot-atomic

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a coordinated suspend barrier for grouped snapshots to ensure point-in-time consistency across multiple volume resources, addressing Bug 046 and Bug-353. It adds a GroupSize field to track batch size and ensures that all siblings in a group enter the suspend, take, and resume phases together in a single reconcile pass. Feedback on these changes highlights two key areas for improvement: first, implementing an assembly timeout check to prevent an infinite requeue loop if a group is never fully assembled; second, optimizing the replica status check to avoid redundant API list calls by pre-fetching resources.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +331 to +337
if !groupAssembled(snap, siblings) {
logger.V(1).Info("snapshot group still assembling; holding suspend barrier",
"group_id", snap.Spec.GroupID,
"observed", len(siblings), "expected", snap.Spec.GroupSize)

return ctrl.Result{RequeueAfter: groupAssembleRequeueAfter}, nil
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If one or more siblings in a grouped snapshot batch fail to be created (e.g., due to a validation error or store failure in the API server), the group will never be fully assembled. Because SuspendIO is false during assembly, the group is not considered to be in the suspend phase, meaning the 2-minute snapshotSuspendDeadline is never armed. This causes the remaining siblings to hang in Phase 0 forever, continuously requeuing every 1 second via groupAssembleRequeueAfter. Adding an assembly timeout check using suspendDeadlineExceeded prevents this infinite loop and safely aborts the incomplete group.

	if !groupAssembled(snap, siblings) {
		if suspendDeadlineExceeded(snap, time.Now()) {
			logger.Info("aborting snapshot group: assembly deadline exceeded",
				"group_id", snap.Spec.GroupID,
				"observed", len(siblings), "expected", snap.Spec.GroupSize)
			return r.abortGroupWithReason(ctx, siblings,
				"snapshot aborted: group assembly timed out; not all siblings were created")
		}

		logger.V(1).Info("snapshot group still assembling; holding suspend barrier",
			"group_id", snap.Spec.GroupID,
			"observed", len(siblings), "expected", snap.Spec.GroupSize)

		return ctrl.Result{RequeueAfter: groupAssembleRequeueAfter}, nil
	}

return ctrl.Result{RequeueAfter: groupAssembleRequeueAfter}, nil
}

ok, offending, err := r.allTargetedReplicasUpToDate(ctx, siblings)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Calling allTargetedReplicasUpToDate results in listing all Resource objects in the cluster for each sibling in the group. Since siblingReplicasUpToDate performs a r.List(ctx, &resList) call internally, this leads to $O(N)$ API list calls per reconcile pass (where $N$ is the number of siblings). To optimize this and avoid unnecessary API server load, consider listing the resources once in allTargetedReplicasUpToDate and passing the pre-fetched list down to siblingReplicasUpToDate.

@kvaps Andrei Kvapil (kvaps) merged commit d1d583f into main Jun 14, 2026
15 checks passed
@kvaps Andrei Kvapil (kvaps) deleted the fix/bug-046-consistency-group-snapshot-atomic branch June 14, 2026 01:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant