
fix: reduce checkpoint lock hold time with staged publish #16

Open

PhantomInTheWire wants to merge 1 commit into master from codex/p0-async-checkpoint

Conversation

@PhantomInTheWire (Owner) commented Apr 26, 2026

Summary

  • Split checkpoint into stage/publish phases outside collection RwLock to reduce blocking
  • Add revision tracking to detect concurrent writes during flush
  • Staged checkpoint rollback no longer restores active WAL bytes (preserves concurrent writes)
  • WAL reset failure returns error instead of silent success
  • Two-phase manifest publish preserves prior manifest on reset failure
  • Add regression test for WAL reset failure recoverability

P0 Issue

Checkpoints block all collection reads and writes because they hold the collection write lock while sealing/rebuilding/writing all files.
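For illustration, here is a minimal sketch of the intended flow. stage_checkpoint, publish_checkpoint, and discard_staged_checkpoint are the real function names touched by this PR, but every type and signature below is a simplified stand-in, not the actual garuda-engine API.

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Mutex, RwLock};

// Placeholder types: only the locking and revision-check pattern is meant to
// match the PR description; the real Collection and StagedCheckpoint differ.
struct StagedCheckpoint {
    based_on_revision: u64,
}

struct Collection {
    data: RwLock<Vec<String>>,   // stands in for the collection runtime
    revision: AtomicU64,         // bumped by every successful mutation
    checkpoint_lock: Mutex<()>,  // serializes checkpoints against each other
}

impl Collection {
    fn insert(&self, doc: String) {
        self.data.write().unwrap().push(doc);
        self.revision.fetch_add(1, Ordering::Release);
    }

    fn checkpoint(&self) -> Result<(), String> {
        let _ckpt = self.checkpoint_lock.lock().unwrap();

        // 1. Snapshot state and revision under a short read lock.
        let (snapshot, rev) = {
            let data = self.data.read().unwrap();
            (data.clone(), self.revision.load(Ordering::Acquire))
        };

        // 2. Stage: the expensive work (seal segment, rebuild indexes, write
        //    data files) happens here with no collection lock held.
        let staged = StagedCheckpoint { based_on_revision: rev };
        drop(snapshot);

        // 3. Publish only if nothing mutated while staging; otherwise discard.
        let _data = self.data.write().unwrap();
        if self.revision.load(Ordering::Acquire) == staged.based_on_revision {
            Ok(()) // write manifest, reset WAL, remove stale manifests ...
        } else {
            Ok(()) // discard_staged_checkpoint: remove staged files, keep WAL intact
        }
    }
}

fn main() {
    let c = Collection {
        data: RwLock::new(Vec::new()),
        revision: AtomicU64::new(0),
        checkpoint_lock: Mutex::new(()),
    };
    c.insert("doc-1".into());
    c.checkpoint().unwrap();
}

The key point is that readers and writers only contend with the short snapshot and publish steps, never with the staging I/O.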



Summary by CodeRabbit

  • Bug Fixes

    • Checkpoint failures now roll back staged data and preserve the prior manifest, improving crash recovery for interrupted checkpoints.
    • Revision tracking detects concurrent writes during flush, so a stale staged checkpoint is discarded rather than published.
    • WAL reset failures now return an error instead of reporting silent success.
  • Tests

    • Added regression test validating data preservation during checkpoint recovery scenarios.

@coderabbitai coderabbitai Bot commented Apr 26, 2026

📝 Walkthrough

The PR restructures checkpointing into distinct staging and publishing phases, introduces atomic revision tracking to detect mutations during checkpoint, replaces manifest cleanup with explicit public API, and updates delete-by-filter to report whether deletions occurred. Staged checkpoints support rollback on publish failures.

Changes

Checkpoint Staging & Publishing (crates/garuda-engine/src/checkpoint_service.rs)
Splits checkpoint into stage_checkpoint (seals segment, rebuilds indexes, captures rollback data, writes data) and publish_checkpoint (writes manifest, resets WAL, cleans stale manifests). New StagedCheckpoint struct holds intermediate state. discard_staged_checkpoint exposes rollback restoration. WAL reset moved from the data-write phase to the publish phase.

Concurrency Control & Checkpoint Orchestration (crates/garuda-engine/src/lib.rs)
Adds an atomic revision counter and a checkpoint_lock mutex to Collection. Mutation operations now increment the revision conditionally. checkpoint() acquires the lock, snapshots the revision and runtime, stages work, and publishes only if the revision still matches; otherwise it discards the staged checkpoint to handle concurrent mutations.

Delete Operation Return Type (crates/garuda-engine/src/write_service.rs)
apply_delete_by_filter return type changed from Result<(), Status> to Result<bool, Status> to indicate whether any documents were matched and deleted (see the sketch after this table).

Manifest Lifecycle Management (crates/garuda-storage/src/version.rs)
remove_stale_manifests changed from private to pub, enabling explicit caller control. write_manifest no longer auto-triggers stale manifest cleanup.

Crash Recovery Test (crates/garuda-engine/tests/test_collection_recovery_crash_semantics.rs)
New regression test validates recovery when collection.flush() fails due to WAL reset issues. Injects a temporary WAL file to trigger the error and verifies post-recovery document visibility.
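The delete-by-filter change ties into the revision counter: a boolean return lets the caller bump the revision only when a delete actually removed something. The following is a hypothetical, self-contained sketch of that pattern; the real apply_delete_by_filter in write_service.rs takes the engine's own state and filter types.

use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for apply_delete_by_filter returning whether anything was deleted.
fn apply_delete_by_filter<F: Fn(&u64) -> bool>(docs: &mut Vec<u64>, filter: F) -> Result<bool, String> {
    let before = docs.len();
    docs.retain(|doc| !filter(doc));
    Ok(docs.len() != before) // true only if at least one document was removed
}

// Bump the revision only when the delete actually changed state, so a no-op
// delete does not needlessly invalidate an in-flight staged checkpoint.
fn delete_by_filter(docs: &mut Vec<u64>, revision: &AtomicU64) -> Result<(), String> {
    if apply_delete_by_filter(docs, |doc| *doc % 2 == 0)? {
        revision.fetch_add(1, Ordering::Release);
    }
    Ok(())
}

fn main() {
    let mut docs = vec![1, 2, 3, 4];
    let revision = AtomicU64::new(0);
    delete_by_filter(&mut docs, &revision).unwrap();
    assert_eq!(docs, vec![1, 3]);
    assert_eq!(revision.load(Ordering::Acquire), 1);
}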

Sequence Diagram

sequenceDiagram
    participant Client
    participant Collection
    participant ChkptLock as Checkpoint Lock
    participant StageChk as stage_checkpoint
    participant PublishChk as publish_checkpoint
    participant Rollback as Rollback

    Client->>Collection: checkpoint()
    Collection->>ChkptLock: acquire()
    ChkptLock-->>Collection: lock acquired
    Collection->>Collection: snapshot revision & state
    Collection->>StageChk: stage_checkpoint(cloned_state)
    StageChk->>StageChk: seal segment, rebuild indexes
    StageChk->>StageChk: capture rollback data
    StageChk->>StageChk: write checkpoint data
    StageChk-->>Collection: StagedCheckpoint
    
    alt revision matches snapshot
        Collection->>PublishChk: publish_checkpoint(staged)
        PublishChk->>PublishChk: write manifest
        PublishChk->>PublishChk: reset WAL
        PublishChk->>PublishChk: remove stale manifests
        PublishChk-->>Collection: Ok(runtime)
        Collection->>Collection: update runtime
    else revision changed (concurrent mutation)
        Collection->>Rollback: discard_staged_checkpoint(staged)
        Rollback->>Rollback: restore via captured rollback
        Rollback-->>Collection: Ok(())
    end
    
    Collection->>ChkptLock: release()
    Collection-->>Client: result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • Feat/storage #4 — Overlapping modifications to delete-by-filter return type, checkpoint/flush flow control, and WAL/checkpoint publishing behavior.
  • Refactor/cleanup #5 — Overlapping refactoring of checkpoint_service functions, CollectionRuntime/segment management, and WAL checkpoint lifecycle.

Poem

🐰 A checkpoint staged with care and grace,
Revision locks the publish race,
When mutations skip between the phases,
Rollback whispers through the traces,
Atomically we hop, hop, hop! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage ⚠️ Warning
Docstring coverage is 10.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Description Check ✅ Passed
Check skipped because CodeRabbit's high-level summary is enabled.

Title Check ✅ Passed
The title directly and concisely describes the main change: restructuring checkpoint operations into staged and publish phases to reduce lock contention.

Linked Issues Check ✅ Passed
Check skipped because no linked issues were found for this pull request.

Out of Scope Changes Check ✅ Passed
Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (3)
crates/garuda-engine/tests/test_collection_recovery_crash_semantics.rs (1)

282-285: Test depends on the internal tmp-filename of reset_wal; add a comment.

Creating a directory at data.wal.tmp blocks reset_wal only because that is the exact temp path it uses for its atomic-rename. If reset_wal's tmp naming ever changes, this test will silently stop exercising the failure path (the is_err() assertion would simply flip without notice). A short comment makes the implementation dependency explicit and easier to update.

♻️ Suggested clarifying comment
     let wal_path = segment_wal_path(&collection.path(), WRITING_SEGMENT_ID);
+    // `reset_wal` writes to `<wal>.tmp` and atomically renames it over the WAL file.
+    // Pre-creating a directory at that path forces the temp-file write to fail and
+    // exercises the publish-phase WAL-reset failure branch.
     fs::create_dir(wal_path.with_file_name("data.wal.tmp")).expect("block wal temp file");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-engine/tests/test_collection_recovery_crash_semantics.rs`
around lines 282-285: add a brief inline comment above the test's creation of
the "data.wal.tmp" directory explaining that this intentionally blocks
reset_wal's atomic-rename by matching its current temporary filename (the test
uses segment_wal_path(...) and wal_path.with_file_name("data.wal.tmp")); note
that if reset_wal's tmp naming convention changes the test must be updated
because the assert!(collection.flush().is_err()) relies on that exact temp name
to exercise the failure path.
crates/garuda-engine/src/checkpoint_service.rs (2)

56-78: Two-phase publish design looks correct, but error handling and cleanup observability can be tightened.

Two notes on this function:

  1. Lost original error on rollback failure (lines 62-70). Same pattern as in stage_checkpoint: if write_checkpoint_manifest or reset_wal fails and rollback.restore()? also fails, the propagated error is the rollback error, not the originating manifest/WAL-reset error. Preserve the primary error and surface the restore failure separately.

  2. Silent cleanup errors (lines 72-75). Four let _ = … calls discard Result<(), Status>. For correctness this is fine (next checkpoint will retry cleanup), but failures here become invisible — over time disk could accumulate stale manifests/snapshots/segment dirs without any signal. Consider routing these through a logger so operators can detect persistent cleanup failures.

The overall ordering (manifest → WAL reset → cleanup) correctly preserves the prior manifest on reset failure because capture_checkpoint_files captures the new manifest path as Missing and rollback will remove it.

♻️ Suggested change for the error-handling part
-    if let Err(status) = write_checkpoint_manifest(&state, &version_manager) {
-        rollback.restore()?;
-        return Err(status);
-    }
-
-    if let Err(status) = reset_wal(&state.path, WRITING_SEGMENT_ID) {
-        rollback.restore()?;
-        return Err(status);
-    }
+    if let Err(status) = write_checkpoint_manifest(&state, &version_manager) {
+        if let Err(_restore_err) = rollback.restore() {
+            // TODO: log _restore_err; the primary `status` below is still returned
+        }
+        return Err(status);
+    }
+
+    if let Err(status) = reset_wal(&state.path, WRITING_SEGMENT_ID) {
+        if let Err(_restore_err) = rollback.restore() {
+            // TODO: log _restore_err; the primary `status` below is still returned
+        }
+        return Err(status);
+    }
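For the cleanup-observability point (item 2 above), a minimal sketch of what routing the `let _ = …` results through a logger could look like. Helper signatures are placeholders, and eprintln! stands in for whatever logging facade the crate actually uses.

use std::path::Path;

// Hypothetical stand-ins for the real cleanup helpers (remove_stale_manifests
// lives in garuda-storage's version.rs; the snapshot helper is illustrative).
fn remove_stale_manifests(_dir: &Path) -> Result<(), String> { Ok(()) }
fn remove_old_snapshots(_dir: &Path) -> Result<(), String> { Ok(()) }

// Instead of discarding the Results, surface each failure so operators can
// notice persistent cleanup problems; the next checkpoint still retries.
fn cleanup_after_publish(dir: &Path) {
    if let Err(err) = remove_stale_manifests(dir) {
        eprintln!("checkpoint cleanup: failed to remove stale manifests: {err}");
    }
    if let Err(err) = remove_old_snapshots(dir) {
        eprintln!("checkpoint cleanup: failed to remove old snapshots: {err}");
    }
}

fn main() {
    cleanup_after_publish(Path::new("/tmp/collection"));
}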
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-engine/src/checkpoint_service.rs` around lines 56-78: when
publish_checkpoint encounters an error from write_checkpoint_manifest or
reset_wal, preserve that original error while still attempting rollback: call
rollback.restore() but if it fails, return a composite result that surfaces the
primary error (from write_checkpoint_manifest or reset_wal) and also logs or
attaches the rollback.restore() failure instead of replacing it; specifically
update the error paths around write_checkpoint_manifest and reset_wal to capture
the first Err(status) into a variable (e.g., primary_err), call
rollback.restore() and if that returns Err(restore_err) return or log a combined
error (primary_err with restore_err as context) while still returning the
primary failure in the API. Also stop swallowing results from
remove_stale_manifests, remove_old_snapshots, and remove_stale_segment_dirs: log
any Err returned from these calls (referencing
version_manager.remove_stale_manifests, remove_old_snapshots, and
remove_stale_segment_dirs) so persistent cleanup failures are visible to
operators.

41-47: Original persist error is dropped if rollback.restore() itself fails.

When write_checkpoint_data_files fails and rollback.restore()? then also fails, the ? propagates the restore error and the original status is silently lost. That makes diagnosing the root cause much harder. Consider preserving the original error and logging the rollback failure instead.

♻️ Suggested change
     if let Err(status) = persist_result {
-        rollback.restore()?;
+        if let Err(_restore_err) = rollback.restore() {
+            // TODO: log _restore_err; keep returning the original `status` below
+        }
         return Err(status);
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/garuda-engine/src/checkpoint_service.rs` around lines 41-47: the
current flow loses the original persist error when rollback.restore() fails;
update the persist handling in the checkpoint routine so that after let
persist_result = write_checkpoint_data_files(&state); if let Err(status) =
persist_result { let restore_err = rollback.restore().err(); if let Some(re) =
restore_err { process or log the rollback failure (including re) but return the
original status (or attach restore_err as context) } else { return Err(status) }
} — i.e., call rollback.restore(), capture any error instead of using ? to
propagate it, log or attach the rollback failure, and ensure
write_checkpoint_data_files' original status is returned (or wrapped) so the
original error is not dropped; use symbols capture_checkpoint_files,
write_checkpoint_data_files, rollback, rollback.restore, persist_result and
status to locate the code.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ab7a5a82-640c-457b-8a03-e080c95bc3c6

📥 Commits

Reviewing files that changed from the base of the PR and between 1d36c34 and 70d2057.

📒 Files selected for processing (5)
  • crates/garuda-engine/src/checkpoint_service.rs
  • crates/garuda-engine/src/lib.rs
  • crates/garuda-engine/src/write_service.rs
  • crates/garuda-engine/tests/test_collection_recovery_crash_semantics.rs
  • crates/garuda-storage/src/version.rs

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 2 potential issues.

View 5 additional findings in Devin Review.


Comment on lines +276 to +278
}

discard_staged_checkpoint(staged)

🔴 flush() silently returns Ok(()) when checkpoint is discarded due to concurrent write

When checkpoint() detects a concurrent write (revision changed between read and publish), it discards the staged checkpoint and returns the result of discard_staged_checkpoint(staged), which is Ok(()) on successful rollback. This means flush() returns success to the caller without having actually persisted a checkpoint.

The race window is: (1) checkpoint() reads state and revision under read lock at lib.rs:262-266, (2) releases the read lock, (3) performs expensive staging I/O, (4) re-acquires write lock at lib.rs:270, (5) finds revision changed at lib.rs:271. Any concurrent insert/upsert/update/delete/delete_by_filter that completes between steps 2 and 4 bumps the revision, causing the staged checkpoint to be silently discarded. Data remains only WAL-backed, violating the caller's expectation that flush() guarantees a clean checkpoint.

Prompt for agents
In Collection::checkpoint() (lib.rs:257-279), when the revision check at line 271 fails (meaning a concurrent write occurred), the method falls through to discard_staged_checkpoint(staged) which returns Ok(()) on success. This means flush() returns Ok(()) without having created a checkpoint.

The fix should ensure flush() either succeeds in creating a checkpoint or signals failure. Two possible approaches:

1. Retry: After discarding, loop back to re-read state and re-stage. Add a bounded retry count to prevent infinite loops if writes keep arriving. On exhaustion, return an error.

2. Return an error: Instead of propagating Ok(()) from discard_staged_checkpoint, return an Err status indicating the checkpoint was invalidated by concurrent writes. The caller can then decide whether to retry.

Approach 1 is more user-friendly since flush() would eventually succeed. The retry loop should discard the staged checkpoint, re-read the state, and re-stage. The checkpoint_lock is already held, so no other checkpoint can interfere.
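A minimal, self-contained sketch of the bounded-retry approach follows. All types and helper methods are placeholders; the real Collection, StagedCheckpoint, and checkpoint helpers in lib.rs and checkpoint_service.rs have different shapes.

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

struct StagedCheckpoint { based_on: u64 }
struct Collection { revision: AtomicU64, checkpoint_lock: Mutex<()> }

impl Collection {
    fn snapshot_revision(&self) -> u64 { self.revision.load(Ordering::Acquire) }
    fn stage(&self) -> Result<StagedCheckpoint, String> {
        Ok(StagedCheckpoint { based_on: self.snapshot_revision() })
    }
    fn publish(&self, _s: StagedCheckpoint) -> Result<(), String> { Ok(()) }
    fn discard(&self, _s: StagedCheckpoint) -> Result<(), String> { Ok(()) }

    // Approach 1: bounded retry instead of silently returning Ok(()) after a discard.
    fn checkpoint(&self) -> Result<(), String> {
        const MAX_RETRIES: usize = 3;
        let _guard = self.checkpoint_lock.lock().unwrap();
        for _ in 0..MAX_RETRIES {
            let staged = self.stage()?; // expensive staging I/O, no collection lock held
            if self.snapshot_revision() == staged.based_on {
                return self.publish(staged); // revision unchanged: safe to publish
            }
            self.discard(staged)?; // concurrent write landed: drop staged files, retry
        }
        Err("checkpoint repeatedly invalidated by concurrent writes".into())
    }
}

fn main() {
    let c = Collection { revision: AtomicU64::new(0), checkpoint_lock: Mutex::new(()) };
    c.checkpoint().unwrap();
}

Because the checkpoint lock is held across retries, only writer traffic (not another checkpoint) can invalidate a staged attempt, so a small retry bound is usually enough.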

Comment on lines +67 to +70
    if let Err(status) = reset_wal(&state.path, WRITING_SEGMENT_ID) {
        rollback.restore()?;
        return Err(status);
    }

🔴 WAL backup removed from rollback but rollback now runs after WAL may have been modified

The PR removes the WAL file from capture_checkpoint_files (old checkpoint_service.rs:204-207) but introduces a new rollback path in publish_checkpoint at checkpoint_service.rs:67-70 that executes after reset_wal may have partially succeeded. reset_wal calls write_file_atomically (io.rs:45-61), which performs rename_path then sync_directory. If the rename succeeds but sync fails, write_file_atomically returns Err, triggering rollback.restore() — which removes the new manifest and restores data files, but cannot restore the WAL (no backup). The result is: old manifest (restored) + old data files (restored) + empty WAL = potential data loss on recovery, since WAL entries for unflushed writes are gone.

Failure sequence
  1. publish_checkpoint writes manifest successfully
  2. reset_wal → write_file_atomically → rename succeeds (WAL is now empty) → sync_directory fails → returns Err
  3. rollback.restore() removes new manifest, restores data files, but WAL remains empty
  4. On recovery: old manifest + old data + empty WAL → pending writes lost
Prompt for agents
In publish_checkpoint (checkpoint_service.rs:56-78), when reset_wal fails at line 67-70, the rollback is called. However, the WAL file is no longer backed up in capture_checkpoint_files (it was removed in this PR). If reset_wal's write_file_atomically partially succeeds (rename ok, sync fails), the WAL is already overwritten with empty content, but the rollback cannot restore it.

Two possible fixes:

1. Re-add the WAL backup: In capture_checkpoint_files, add back the capture_path for segment_wal_path. This ensures the rollback can restore the WAL if reset_wal partially modifies it.

2. Don't rollback on WAL reset failure: After the manifest is successfully written, treat WAL reset failure as non-critical (similar to the old code which did `if reset_wal(...).is_err() { return Ok(()); }`). The manifest and data files are already committed, so the checkpoint is complete. The stale WAL entries would be handled by is_redundant_wal_op during recovery.

Approach 2 is simpler and matches the old behavior, but approach 1 is more conservative.
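A minimal sketch of approach 2, with placeholder helpers (the real publish_checkpoint signature differs): the manifest write remains the commit point, a failure before it rolls back, and a WAL reset failure after it is surfaced without rolling back.

use std::path::Path;

// Placeholder helpers; the real write_checkpoint_manifest / reset_wal live in
// checkpoint_service.rs and take the engine's own state types.
fn write_checkpoint_manifest(_dir: &Path) -> Result<(), String> { Ok(()) }
fn reset_wal(_dir: &Path) -> Result<(), String> { Ok(()) }
fn rollback_restore() -> Result<(), String> { Ok(()) }

fn publish_checkpoint(dir: &Path) -> Result<(), String> {
    if let Err(status) = write_checkpoint_manifest(dir) {
        rollback_restore()?;           // manifest not committed yet: safe to roll back
        return Err(status);
    }
    if let Err(status) = reset_wal(dir) {
        // The checkpoint is already committed via the manifest; recovery can skip
        // the now-redundant WAL entries, so do NOT roll back here, only report.
        eprintln!("WAL reset failed after publish: {status}");
    }
    Ok(())
}

fn main() {
    publish_checkpoint(Path::new("/tmp/collection")).unwrap();
}

This mirrors the old behavior of tolerating a post-publish WAL reset failure, but with the failure made visible; approach 1 (re-adding the WAL backup to capture_checkpoint_files) stays the more conservative option.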
