fix(snapshot): make consistency-group snapshot atomic via coordinated suspend barrier (BUG-046)#160
Conversation
…oup snapshots (BUG-046) `snapshot create-multiple` (consistency-group snapshot across multiple RDs) was not point-in-time atomic. The store stamped Spec.SuspendIO=true at Create time, and the apiserver creates the sibling CRDs sequentially, so each sibling's satellite started `drbdsetup suspend-io` the instant its own CRD landed. On a busy stand the group's volumes froze ~15s apart — far over the <=5s consistency budget — so a DB's data and WAL volumes on separate RDs were captured at different instants and a group restore could be inconsistent. Add a controller-side suspend barrier for grouped snapshots: the store no longer stamps SuspendIO=true for a grouped Snapshot (GroupID set); instead the controller holds the whole group at Phase 0 until every member is observable (new Spec.GroupSize denominator stamped by the apiserver), then opens SuspendIO=true on every sibling in a single reconcile pass so they all enter suspend within one controller cycle. This bounds the suspend-entry slip far under budget while keeping the existing all-suspended-before-any-take and resume-on-failure guarantees. No-stuck-suspended safety is preserved: the barrier checks UpToDate before freezing anything (refuse rather than suspend for an impossible snapshot), the existing suspend/take deadline and abort cascade still clear SuspendIO across the whole batch on any failure, and a legacy grouped Snapshot with no GroupSize falls back to the observed-siblings count so it can never hang on a missing denominator. The single-snapshot (Bug-351) path is unchanged — empty GroupID keeps the Create-time SuspendIO stamp and the self-only phase walks. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
L1: assert handleSnapshotCreateMulti stamps a shared GroupID + GroupSize on every batch entry, and that the store defers SuspendIO for grouped snapshots (no Create-time freeze) while keeping it for single snapshots. L7: snap-create-multiple-group-consistency replay YAML codifies the operator `snapshot create-multiple` sequence + the no-stuck-suspended / no-orphans convergence contract for a two-RD consistency group. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…end entry (BUG-046) Stand validation surfaced a second staggering point: with only the suspend entry group-coordinated, the Phase 1->2 take promotion and the Phase 2->3 resume still fired per-sibling at staggered reconcile times (~15s apart observed), because each sibling flipped only its own Spec.TakeSnapshot when the controller happened to reconcile it. A sibling that resumed I/O while another was still mid-take reopened the write window before every snapshot was captured. Fan every grouped phase transition out across the whole batch in one reconcile pass (flipGroup): take-all together, resume-all together. The suspend barrier's suspendGroup is now the SuspendIO=true special case of the same helper. Single-snapshot snapshots still flip self only. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
|
Warning Review limit reached
More reviews will be available in 36 minutes and 35 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (12)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a coordinated suspend barrier for grouped snapshots to ensure point-in-time consistency across multiple volume resources, addressing Bug 046 and Bug-353. It adds a GroupSize field to track batch size and ensures that all siblings in a group enter the suspend, take, and resume phases together in a single reconcile pass. Feedback on these changes highlights two key areas for improvement: first, implementing an assembly timeout check to prevent an infinite requeue loop if a group is never fully assembled; second, optimizing the replica status check to avoid redundant API list calls by pre-fetching resources.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if !groupAssembled(snap, siblings) { | ||
| logger.V(1).Info("snapshot group still assembling; holding suspend barrier", | ||
| "group_id", snap.Spec.GroupID, | ||
| "observed", len(siblings), "expected", snap.Spec.GroupSize) | ||
|
|
||
| return ctrl.Result{RequeueAfter: groupAssembleRequeueAfter}, nil | ||
| } |
There was a problem hiding this comment.
If one or more siblings in a grouped snapshot batch fail to be created (e.g., due to a validation error or store failure in the API server), the group will never be fully assembled. Because SuspendIO is false during assembly, the group is not considered to be in the suspend phase, meaning the 2-minute snapshotSuspendDeadline is never armed. This causes the remaining siblings to hang in Phase 0 forever, continuously requeuing every 1 second via groupAssembleRequeueAfter. Adding an assembly timeout check using suspendDeadlineExceeded prevents this infinite loop and safely aborts the incomplete group.
if !groupAssembled(snap, siblings) {
if suspendDeadlineExceeded(snap, time.Now()) {
logger.Info("aborting snapshot group: assembly deadline exceeded",
"group_id", snap.Spec.GroupID,
"observed", len(siblings), "expected", snap.Spec.GroupSize)
return r.abortGroupWithReason(ctx, siblings,
"snapshot aborted: group assembly timed out; not all siblings were created")
}
logger.V(1).Info("snapshot group still assembling; holding suspend barrier",
"group_id", snap.Spec.GroupID,
"observed", len(siblings), "expected", snap.Spec.GroupSize)
return ctrl.Result{RequeueAfter: groupAssembleRequeueAfter}, nil
}| return ctrl.Result{RequeueAfter: groupAssembleRequeueAfter}, nil | ||
| } | ||
|
|
||
| ok, offending, err := r.allTargetedReplicasUpToDate(ctx, siblings) |
There was a problem hiding this comment.
Calling allTargetedReplicasUpToDate results in listing all Resource objects in the cluster for each sibling in the group. Since siblingReplicasUpToDate performs a r.List(ctx, &resList) call internally, this leads to allTargetedReplicasUpToDate and passing the pre-fetched list down to siblingReplicasUpToDate.
Summary
linstor snapshot create-multiple(consistency-group snapshot across multiple RDs) was not point-in-time atomic. The apiserver stamped a shared GroupID + SuspendIO, but the controller never coordinated phase ENTRY across siblings, so they entered SuspendIO ~15s apart (budget ≤5s). A DB's data + WAL volumes on separate RDs were captured at different instants → group restore could be inconsistent.Root cause (two staggering points, both per-sibling)
Fix — coordinated group barrier (non-empty GroupID)
Spec.GroupSizeon every sibling.No-stuck-suspended safety (deadlock-class)
All-suspended-before-any-snapshot (take still gates on every node's suspend ack); refuse-rather-than-freeze (non-UpToDate group resumes without freezing); bounded suspend (existing deadline force-aborts + resumes all); abort cascades resume-all on any per-node Failed; legacy GroupSize-missing falls back to observed-siblings count (no hang).
Tests
Harness note
cli-matrix/snap-create-multiple-lifecycle has pre-existing harness bugs (dd oflag=direct bs=8 on a block dev zeroes the marker; Phase 5 colon CLI form) addressed separately; the product behavior was verified directly + via a working buffered writer.