fix: reduce checkpoint lock hold time with staged publish #16

PhantomInTheWire wants to merge 1 commit into master
Conversation
- Split checkpoint into stage/publish phases outside collection RwLock
- Add revision tracking to detect concurrent writes during flush
- Staged checkpoint rollback no longer restores active WAL bytes
- WAL reset failure returns error instead of silent success
- Two-phase manifest publish preserves prior manifest on reset failure
- Add regression test for WAL reset failure recoverability

P0: checkpoint blocks all reads and writes
📝 Walkthrough

The PR restructures checkpointing into distinct staging and publishing phases, introduces atomic revision tracking to detect mutations during checkpoint, replaces manifest cleanup with an explicit public API, and updates delete-by-filter to report whether deletions occurred. Staged checkpoints support rollback on publish failures.
Sequence Diagram

```mermaid
sequenceDiagram
participant Client
participant Collection
participant ChkptLock as Checkpoint Lock
participant StageChk as stage_checkpoint
participant PublishChk as publish_checkpoint
participant Rollback as Rollback
Client->>Collection: checkpoint()
Collection->>ChkptLock: acquire()
ChkptLock-->>Collection: lock acquired
Collection->>Collection: snapshot revision & state
Collection->>StageChk: stage_checkpoint(cloned_state)
StageChk->>StageChk: seal segment, rebuild indexes
StageChk->>StageChk: capture rollback data
StageChk->>StageChk: write checkpoint data
StageChk-->>Collection: StagedCheckpoint
alt revision matches snapshot
Collection->>PublishChk: publish_checkpoint(staged)
PublishChk->>PublishChk: write manifest
PublishChk->>PublishChk: reset WAL
PublishChk->>PublishChk: remove stale manifests
PublishChk-->>Collection: Ok(runtime)
Collection->>Collection: update runtime
else revision changed (concurrent mutation)
Collection->>Rollback: discard_staged_checkpoint(staged)
Rollback->>Rollback: restore via captured rollback
Rollback-->>Collection: Ok(())
end
Collection->>ChkptLock: release()
Collection-->>Client: result
```
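To make the flow concrete, here is a minimal Rust sketch of the locking pattern the diagram describes. Everything in it is illustrative: `CollectionState`, `StagedCheckpoint`, and the stage/publish/discard stubs stand in for the real code in `lib.rs` and `checkpoint_service.rs`, which differs in detail.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Mutex, RwLock};

#[derive(Clone, Default)]
struct CollectionState; // placeholder for the real collection state

struct StagedCheckpoint; // placeholder for staged files + rollback data

struct Collection {
    state: RwLock<CollectionState>,
    revision: AtomicU64,        // bumped by every insert/upsert/update/delete
    checkpoint_lock: Mutex<()>, // serializes checkpoints, not reads/writes
}

impl Collection {
    fn apply_write(&self) {
        let _state = self.state.write().unwrap();
        // ...apply the mutation, then record it in the revision counter...
        self.revision.fetch_add(1, Ordering::Release);
    }

    fn checkpoint(&self) -> Result<(), String> {
        let _ckpt = self.checkpoint_lock.lock().unwrap();

        // Snapshot state and revision under a short read lock.
        let (snapshot, seen) = {
            let state = self.state.read().unwrap();
            ((*state).clone(), self.revision.load(Ordering::Acquire))
        };

        // Expensive staging I/O runs with no collection lock held.
        let staged = stage_checkpoint(snapshot)?;

        // Re-acquire the lock only to validate and publish.
        let _state = self.state.write().unwrap();
        if self.revision.load(Ordering::Acquire) == seen {
            publish_checkpoint(staged) // write manifest, reset WAL, clean up
        } else {
            // A concurrent write invalidated the staged files.
            discard_staged_checkpoint(staged)
        }
    }
}

// Stubs standing in for the checkpoint_service.rs functions.
fn stage_checkpoint(_s: CollectionState) -> Result<StagedCheckpoint, String> {
    Ok(StagedCheckpoint) // seal segment, rebuild indexes, write data files
}
fn publish_checkpoint(_c: StagedCheckpoint) -> Result<(), String> {
    Ok(())
}
fn discard_staged_checkpoint(_c: StagedCheckpoint) -> Result<(), String> {
    Ok(()) // restore via captured rollback data
}
```

The point of the split is that the expensive staging I/O happens without the collection write lock; the lock is re-taken only for the cheap revision compare and the publish step, which is what reduces the lock hold time the PR title refers to.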
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🧹 Nitpick comments (3)
crates/garuda-engine/tests/test_collection_recovery_crash_semantics.rs (1)
282-285: Test depends on the internal tmp-filename of `reset_wal`; add a comment.

Creating a directory at `data.wal.tmp` blocks `reset_wal` only because that is the exact temp path it uses for its atomic rename. If `reset_wal`'s tmp naming ever changes, this test will silently stop exercising the failure path (the `is_err()` assertion would simply flip without notice). A short comment makes the implementation dependency explicit and easier to update.

♻️ Suggested clarifying comment

```diff
 let wal_path = segment_wal_path(&collection.path(), WRITING_SEGMENT_ID);
+// `reset_wal` writes to `<wal>.tmp` and atomically renames it over the WAL file.
+// Pre-creating a directory at that path forces the temp-file write to fail and
+// exercises the publish-phase WAL-reset failure branch.
 fs::create_dir(wal_path.with_file_name("data.wal.tmp")).expect("block wal temp file");
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/garuda-engine/tests/test_collection_recovery_crash_semantics.rs` around lines 282-285: Add a brief inline comment above the test's creation of the "data.wal.tmp" directory explaining that this intentionally blocks `reset_wal`'s atomic rename by matching its current temporary filename (the test uses `segment_wal_path(...)` and `wal_path.with_file_name("data.wal.tmp")`); note that if `reset_wal`'s tmp naming convention changes, the test must be updated, because the `assert!(collection.flush().is_err())` relies on that exact temp name to exercise the failure path.

crates/garuda-engine/src/checkpoint_service.rs (2)
56-78: Two-phase publish design looks correct, but error handling and cleanup observability can be tightened.

Two notes on this function:
1. Lost original error on rollback failure (lines 62-70). Same pattern as in `stage_checkpoint`: if `write_checkpoint_manifest` or `reset_wal` fails and `rollback.restore()?` also fails, the propagated error is the rollback error, not the originating manifest/WAL-reset error. Preserve the primary error and surface the restore failure separately.

2. Silent cleanup errors (lines 72-75). Four `let _ = …` calls discard `Result<(), Status>`. For correctness this is fine (the next checkpoint will retry cleanup), but failures here become invisible — over time disk could accumulate stale manifests/snapshots/segment dirs without any signal. Consider routing these through a logger so operators can detect persistent cleanup failures.

The overall ordering (manifest → WAL reset → cleanup) correctly preserves the prior manifest on reset failure because `capture_checkpoint_files` captures the new manifest path as `Missing` and rollback will remove it.
♻️ Suggested change for the error-handling part

```diff
-if let Err(status) = write_checkpoint_manifest(&state, &version_manager) {
-    rollback.restore()?;
-    return Err(status);
-}
-
-if let Err(status) = reset_wal(&state.path, WRITING_SEGMENT_ID) {
-    rollback.restore()?;
-    return Err(status);
-}
+if let Err(status) = write_checkpoint_manifest(&state, &version_manager) {
+    if let Err(restore_err) = rollback.restore() {
+        // TODO: log restore_err
+        return Err(restore_err);
+    }
+    return Err(status);
+}
+
+if let Err(status) = reset_wal(&state.path, WRITING_SEGMENT_ID) {
+    if let Err(restore_err) = rollback.restore() {
+        // TODO: log restore_err
+        return Err(restore_err);
+    }
+    return Err(status);
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/garuda-engine/src/checkpoint_service.rs` around lines 56-78: When publish_checkpoint encounters an error from write_checkpoint_manifest or reset_wal, preserve that original error while still attempting rollback: call rollback.restore() but if it fails, return a composite result that surfaces the primary error (from write_checkpoint_manifest or reset_wal) and also logs or attaches the rollback.restore() failure instead of replacing it; specifically update the error paths around write_checkpoint_manifest and reset_wal to capture the first Err(status) into a variable (e.g., primary_err), call rollback.restore() and if that returns Err(restore_err) return or log a combined error (primary_err with restore_err as context) while still returning the primary failure in the API. Also stop swallowing results from remove_stale_manifests, remove_old_snapshots, and remove_stale_segment_dirs: log any Err returned from these calls (referencing version_manager.remove_stale_manifests, remove_old_snapshots, and remove_stale_segment_dirs) so persistent cleanup failures are visible to operators.
41-47: Original persist error is dropped if `rollback.restore()` itself fails.

When `write_checkpoint_data_files` fails and `rollback.restore()?` then also fails, the `?` propagates the restore error and the original `status` is silently lost. That makes diagnosing the root cause much harder. Consider preserving the original error and logging the rollback failure instead.
♻️ Suggested change

```diff
 if let Err(status) = persist_result {
-    rollback.restore()?;
+    if let Err(restore_err) = rollback.restore() {
+        // TODO: log restore_err alongside the primary failure
+        return Err(restore_err);
+    }
     return Err(status);
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/garuda-engine/src/checkpoint_service.rs` around lines 41-47: The current flow loses the original persist error when rollback.restore() fails; update the persist handling in the checkpoint routine so that after `let persist_result = write_checkpoint_data_files(&state);`, an `Err(status)` branch calls rollback.restore(), captures any error instead of using `?` to propagate it, logs or attaches the rollback failure, and ensures write_checkpoint_data_files' original status is returned (or wrapped) so the original error is not dropped; use symbols capture_checkpoint_files, write_checkpoint_data_files, rollback, rollback.restore, persist_result and status to locate the code.
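Both nitpicks describe the same pattern, so one sketch covers them. `Status` and `Rollback` below are stand-ins for the crate's real types; note the suggested diffs above return the restore error instead, and either choice works as long as the other error is logged.

```rust
// `Status` and `Rollback` are stand-ins for the crate's real types.
#[derive(Debug)]
struct Status(String);

struct Rollback;
impl Rollback {
    fn restore(&self) -> Result<(), Status> {
        Ok(()) // the real restore puts captured files back
    }
}

// Preserve the primary publish error; the rollback failure is logged
// rather than allowed to replace it.
fn run_publish_step(step: Result<(), Status>, rollback: &Rollback) -> Result<(), Status> {
    if let Err(primary_err) = step {
        if let Err(restore_err) = rollback.restore() {
            eprintln!("rollback failed after {:?}: {:?}", primary_err, restore_err);
        }
        return Err(primary_err);
    }
    Ok(())
}
```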
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ab7a5a82-640c-457b-8a03-e080c95bc3c6
📒 Files selected for processing (5)
- crates/garuda-engine/src/checkpoint_service.rs
- crates/garuda-engine/src/lib.rs
- crates/garuda-engine/src/write_service.rs
- crates/garuda-engine/tests/test_collection_recovery_crash_semantics.rs
- crates/garuda-storage/src/version.rs
```rust
    }

    discard_staged_checkpoint(staged)
```
🔴 flush() silently returns Ok(()) when checkpoint is discarded due to concurrent write
When checkpoint() detects a concurrent write (revision changed between read and publish), it discards the staged checkpoint and returns the result of discard_staged_checkpoint(staged), which is Ok(()) on successful rollback. This means flush() returns success to the caller without having actually persisted a checkpoint.
The race window is: (1) checkpoint() reads state and revision under read lock at lib.rs:262-266, (2) releases the read lock, (3) performs expensive staging I/O, (4) re-acquires write lock at lib.rs:270, (5) finds revision changed at lib.rs:271. Any concurrent insert/upsert/update/delete/delete_by_filter that completes between steps 2 and 4 bumps the revision, causing the staged checkpoint to be silently discarded. Data remains only WAL-backed, violating the caller's expectation that flush() guarantees a clean checkpoint.
Prompt for agents
In Collection::checkpoint() (lib.rs:257-279), when the revision check at line 271 fails (meaning a concurrent write occurred), the method falls through to discard_staged_checkpoint(staged) which returns Ok(()) on success. This means flush() returns Ok(()) without having created a checkpoint.
The fix should ensure flush() either succeeds in creating a checkpoint or signals failure. Two possible approaches:
1. Retry: After discarding, loop back to re-read state and re-stage. Add a bounded retry count to prevent infinite loops if writes keep arriving. On exhaustion, return an error.
2. Return an error: Instead of propagating Ok(()) from discard_staged_checkpoint, return an Err status indicating the checkpoint was invalidated by concurrent writes. The caller can then decide whether to retry.
Approach 1 is more user-friendly since flush() would eventually succeed. The retry loop should discard the staged checkpoint, re-read the state, and re-stage. The checkpoint_lock is already held, so no other checkpoint can interfere.
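A standalone sketch of approach 1's bounded retry follows; every helper here is a stand-in for the real lib.rs code, and `MAX_CHECKPOINT_RETRIES` is a hypothetical constant.

```rust
// Every name below is a stand-in for the real lib.rs code, and
// MAX_CHECKPOINT_RETRIES is a hypothetical constant.
struct Snapshot;
struct Staged;

fn snapshot_state_and_revision() -> (Snapshot, u64) { (Snapshot, 0) }
fn current_revision() -> u64 { 0 }
fn stage_checkpoint(_s: Snapshot) -> Result<Staged, String> { Ok(Staged) }
fn publish_checkpoint(_c: Staged) -> Result<(), String> { Ok(()) }
fn discard_staged_checkpoint(_c: Staged) -> Result<(), String> { Ok(()) }

const MAX_CHECKPOINT_RETRIES: usize = 3;

// Assumes the checkpoint lock is already held, so no other checkpoint can
// interleave with this loop; only regular writers can bump the revision.
fn checkpoint_with_retry() -> Result<(), String> {
    for _ in 0..MAX_CHECKPOINT_RETRIES {
        let (snapshot, seen) = snapshot_state_and_revision();
        let staged = stage_checkpoint(snapshot)?;
        if current_revision() == seen {
            return publish_checkpoint(staged);
        }
        // A concurrent write landed during staging; discard and retry.
        discard_staged_checkpoint(staged)?;
    }
    Err("checkpoint invalidated by concurrent writes after retries".into())
}
```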
```rust
if let Err(status) = reset_wal(&state.path, WRITING_SEGMENT_ID) {
    rollback.restore()?;
    return Err(status);
}
```
🔴 WAL backup removed from rollback but rollback now runs after WAL may have been modified
The PR removes the WAL file from capture_checkpoint_files (old checkpoint_service.rs:204-207) but introduces a new rollback path in publish_checkpoint at checkpoint_service.rs:67-70 that executes after reset_wal may have partially succeeded. reset_wal calls write_file_atomically (io.rs:45-61), which performs rename_path then sync_directory. If the rename succeeds but sync fails, write_file_atomically returns Err, triggering rollback.restore() — which removes the new manifest and restores data files, but cannot restore the WAL (no backup). The result is: old manifest (restored) + old data files (restored) + empty WAL = potential data loss on recovery, since WAL entries for unflushed writes are gone.
Failure sequence
1. `publish_checkpoint` writes the manifest successfully.
2. `reset_wal` → `write_file_atomically` → rename succeeds (WAL is now empty) → `sync_directory` fails → returns `Err`.
3. `rollback.restore()` removes the new manifest and restores data files, but the WAL remains empty.
4. On recovery: old manifest + old data + empty WAL → pending writes lost.
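For reference, a minimal sketch of the tmp-write/rename/dir-sync pattern the sequence above walks through, assuming the real `write_file_atomically` in io.rs follows the same shape (its exact signature and helper names are not shown in this PR):

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

fn write_file_atomically(path: &Path, bytes: &[u8]) -> io::Result<()> {
    // Temp file named "<file>.tmp" next to the destination (matches the
    // "data.wal.tmp" name the regression test relies on).
    let mut tmp_name = path
        .file_name()
        .map(|n| n.to_os_string())
        .ok_or(io::ErrorKind::InvalidInput)?;
    tmp_name.push(".tmp");
    let tmp = path.with_file_name(tmp_name);

    // 1. Write and flush the new contents to the temp file.
    let mut f = File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_all()?;

    // 2. Atomically replace the destination.
    fs::rename(&tmp, path)?;

    // 3. Make the rename durable by syncing the parent directory. If this
    //    step fails, Err is returned even though the destination already
    //    holds the new contents: exactly the partial-success window the
    //    comment above describes.
    let dir = path.parent().ok_or(io::ErrorKind::InvalidInput)?;
    File::open(dir)?.sync_all()?;
    Ok(())
}
```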
Prompt for agents
In publish_checkpoint (checkpoint_service.rs:56-78), when reset_wal fails at line 67-70, the rollback is called. However, the WAL file is no longer backed up in capture_checkpoint_files (it was removed in this PR). If reset_wal's write_file_atomically partially succeeds (rename ok, sync fails), the WAL is already overwritten with empty content, but the rollback cannot restore it.
Two possible fixes:
1. Re-add the WAL backup: In capture_checkpoint_files, add back the capture_path for segment_wal_path. This ensures the rollback can restore the WAL if reset_wal partially modifies it.
2. Don't rollback on WAL reset failure: After the manifest is successfully written, treat WAL reset failure as non-critical (similar to the old code which did `if reset_wal(...).is_err() { return Ok(()); }`). The manifest and data files are already committed, so the checkpoint is complete. The stale WAL entries would be handled by is_redundant_wal_op during recovery.
Approach 2 is simpler and matches the old behavior, but approach 1 is more conservative.
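For approach 1, a standalone sketch of the backup-before-reset idea; the file naming and helpers here are illustrative, not the actual `capture_checkpoint_files` API, which works on a rollback structure not shown in this excerpt.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Copy the live WAL aside before reset_wal touches it, so a partial failure
// (rename landed, directory sync failed) can be undone.
fn backup_wal(wal: &Path) -> io::Result<PathBuf> {
    let backup = wal.with_extension("wal.bak"); // e.g. data.wal -> data.wal.bak
    fs::copy(wal, &backup)?;
    Ok(backup)
}

// On rollback, put the captured bytes back and drop the backup.
fn restore_wal(wal: &Path, backup: &Path) -> io::Result<()> {
    fs::copy(backup, wal)?;
    fs::remove_file(backup)?;
    Ok(())
}
```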
Summary
P0 Issue
Checkpoints block all collection reads and writes because they hold the collection write lock while sealing/rebuilding/writing all files.