Skip to content

fuzz: cover deferred writing in chanmon_consistency#4465

Open
joostjager wants to merge 2 commits intolightningdevkit:mainfrom
joostjager:chain-mon-internal-deferred-writes-with-fuzz
Open

fuzz: cover deferred writing in chanmon_consistency#4465
joostjager wants to merge 2 commits intolightningdevkit:mainfrom
joostjager:chain-mon-internal-deferred-writes-with-fuzz

Conversation

@joostjager
Copy link
Copy Markdown
Contributor

@joostjager joostjager commented Mar 6, 2026

Adds fuzz coverage for #4351

@ldk-reviews-bot
Copy link
Copy Markdown

ldk-reviews-bot commented Mar 6, 2026

👋 I see @wpaulino was un-assigned.
If you'd like another reviewer assignment, please click here.

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.10%. Comparing base (75ac90a) to head (af7e8ee).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4465      +/-   ##
==========================================
- Coverage   86.12%   86.10%   -0.02%     
==========================================
  Files         157      157              
  Lines      108824   108824              
  Branches   108824   108824              
==========================================
- Hits        93727    93707      -20     
- Misses      12480    12499      +19     
- Partials     2617     2618       +1     
Flag Coverage Δ
tests 86.10% <ø> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@joostjager joostjager force-pushed the chain-mon-internal-deferred-writes-with-fuzz branch from 51afc25 to 0aebb10 Compare March 19, 2026 07:51
@joostjager
Copy link
Copy Markdown
Contributor Author

Rebased

@joostjager
Copy link
Copy Markdown
Contributor Author

Will hold off on this until pre-existing fuzz failures #4496 and #4472 are in

@joostjager joostjager force-pushed the chain-mon-internal-deferred-writes-with-fuzz branch from 0aebb10 to 6274ba0 Compare March 20, 2026 17:52
@joostjager
Copy link
Copy Markdown
Contributor Author

Prerequisites are in, rebased.

@TheBlueMatt
Copy link
Copy Markdown
Collaborator

This LGTM, why is it draft?

@joostjager
Copy link
Copy Markdown
Contributor Author

I first wanted to do a serious local run, but then it turned out there are so many pre-existing fuzz failures that it is hard to see what's new. I've bisected the failures to the various PRs that introduced them.

@joostjager
Copy link
Copy Markdown
Contributor Author

joostjager commented Mar 22, 2026

Although for this PR I could just see if anything pops up that doesnt repro on main. Will do that.

@joostjager
Copy link
Copy Markdown
Contributor Author

joostjager commented Mar 23, 2026

Several newly introduced fuzz failures to address:

7084a32219ffa61b1919192119197084a32219ff4400001d228012ffab
10fa0040801170704040198040111911191f1f1170bbff1170ffb3b2b6
10000150804211be707040198011191f1119401f401f0aa97210b6ff25
10000150804211707040198011191f11191f1a4040a91f7210b6ff25

@joostjager
Copy link
Copy Markdown
Contributor Author

Zoomed in on one of those sequences, and it seems it is reproducible with another string without deferred mode too.

0270801109191109191f1f10b6ff

@joostjager joostjager force-pushed the chain-mon-internal-deferred-writes-with-fuzz branch 2 times, most recently from c6da16e to d4bf3e0 Compare April 15, 2026 04:36
@joostjager
Copy link
Copy Markdown
Contributor Author

Dependency: #4520

@joostjager joostjager self-assigned this Apr 16, 2026
@joostjager joostjager force-pushed the chain-mon-internal-deferred-writes-with-fuzz branch from d4bf3e0 to a798d74 Compare May 6, 2026 11:42
@joostjager
Copy link
Copy Markdown
Contributor Author

joostjager commented May 7, 2026

Interestingly the non-deferred mode reproducer 0270801109191109191f1f10b6ff doesn't fail on main anymore, with #4520 not yet merged. Bisected the fixing PR to #4529 (@tankyleo).

It seems the increased headroom made the bug unobservable to the fuzzer?

I ran the fuzzer on this branch overnight, and no failures were found.

@joostjager
Copy link
Copy Markdown
Contributor Author

I'll try to add a new invariant to the fuzzer so it catches the problem in a more robust way.

@joostjager
Copy link
Copy Markdown
Contributor Author

Invariant addition: #4601

@joostjager joostjager marked this pull request as ready for review May 7, 2026 13:23
@ldk-reviews-bot ldk-reviews-bot requested a review from wpaulino May 7, 2026 13:23
@joostjager
Copy link
Copy Markdown
Contributor Author

Will rebase this after #4571

@joostjager joostjager removed the request for review from wpaulino May 7, 2026 13:23
Comment thread fuzz/src/chanmon_consistency.rs Outdated
Comment on lines +366 to +367
/// This simulates the pattern of snapshotting the pending count, persisting the
/// `ChannelManager`, then flushing the queued monitor writes.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: The docstring says this "simulates the pattern of snapshotting the pending count, persisting the ChannelManager, then flushing the queued monitor writes." However, in the actual fuzz loop, the flush happens during event processing (called from release_pending_monitor_events), while the ChannelManager is serialized at the end of each loop iteration (lines 3023-3031) — i.e., after the flush, not before.

This means the fuzzer always tests the scenario where the ChannelManager snapshot captures post-flush state. The more interesting crash scenario — serializing the ChannelManager while monitor writes are still queued, then crashing before the flush — is not directly exercised by this ordering. (The fuzzer does partially cover stale-monitor restarts through the use_old_mons selection in reload_node, but that's a different axis.)

Consider either updating the docstring to reflect what actually happens, or restructuring so the ChannelManager is serialized at the flush call-site to match the documented pattern.

Comment thread fuzz/src/chanmon_consistency.rs Outdated
Comment on lines +379 to +386
if persister_res == chain::ChannelMonitorUpdateStatus::Completed {
for (_channel_id, state) in self.latest_monitors.lock().unwrap().iter_mut() {
if let Some((id, data)) = state.pending_monitors.drain(..).last() {
state.persisted_monitor_id = id;
state.persisted_monitor = data;
}
}
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When persister_res == Completed, this drains pending_monitors for all channels, not just those whose operations were included in the current flush(count) call. This is correct today because update_ret is constant for the lifetime of a node (set at creation/reload, never mutated by the mon_style fuzz commands at 0x00-0x06). That invariant ensures:

  • If update_ret == Completed: every prior flush also used Completed, so there are no "orphaned" pending entries from an earlier InProgress flush.
  • If update_ret == InProgress: this branch is never taken.

But this is a subtle invariant — if a future change allows update_ret to be toggled mid-run (e.g., fuzz commands updating the live persister), this would silently promote monitors whose corresponding channel_monitor_updated calls were never made by the flush. Consider adding a comment documenting this assumption, or tightening the logic to only promote monitors that correspond to channels with operations in the flushed batch.

@ldk-claude-review-bot
Copy link
Copy Markdown
Collaborator

ldk-claude-review-bot commented May 7, 2026

After a thorough re-review of every hunk in this diff, I have no new issues to report.

Correction of prior review: My previous inline comments (at lines 414-421 and 452) flagged a bug in mark_completion_finished calling mark_persisted, claiming it removes older pending_monitors entries that "will never be acknowledged with ChainMonitor::channel_monitor_updated." Upon deeper analysis, this was incorrect. The PR intentionally separates two concerns into two distinct lists:

  • pending_monitors (line 322): restart candidates — which old monitors can safely be reloaded
  • pending_monitor_completions (line 324): ChainMonitor acknowledgment tracking — which updates still need channel_monitor_updated calls

mark_persisted (line 344) only cleans up pending_monitors via retain, and does not touch pending_monitor_completions. So when completing update 5 out-of-order (while update 3 is still pending), update 3 is correctly removed from pending_monitors (can't restart with it after telling LDK update 5 is done), but remains in pending_monitor_completions for future channel_monitor_updated(3) acknowledgment. Confirmed by ChainMonitor::channel_monitor_updated at chainmonitor.rs:783 which uses retain(|update_id| *update_id != completed_update_id) — only removing the specific ID.

The other prior observations about docstring accuracy (line 367) and the Completed drain-all subtlety (line 386) remain valid as minor notes but are not bugs.

No issues found.

@joostjager joostjager force-pushed the chain-mon-internal-deferred-writes-with-fuzz branch from a798d74 to 20275f6 Compare May 8, 2026 14:30
Comment on lines +414 to 421
fn mark_update_completed(
&self, channel_id: ChannelId, monitor_id: u64, serialized_monitor: Vec<u8>,
) {
if let Some(state) = self.latest_monitors.lock().unwrap().get_mut(&channel_id) {
// Once LDK acknowledges update N as completed, any older pending monitor blob is fully
// superseded and must not be offered back on restart.
state.mark_persisted(monitor_id, serialized_monitor);
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: mark_persisted removes all pending entries with id <= monitor_id from the shadow's pending_monitors (via retain), but only the single selected entry was acknowledged with ChainMonitor::channel_monitor_updated. The older entries that are removed will never be acknowledged.

ChainMonitor::channel_monitor_updated removes only the specific completed_update_id from its internal pending_monitor_updates list, and only emits MonitorEvent::Completed when all pending updates have been individually acknowledged.

Regression from old code: The old complete_monitor_update only removed the selected entry (via remove(0), remove(1), or pop()) and left other entries intact for future completion. The new code's reuse of mark_persisted here also cleans up older entries that still need ChainMonitor acknowledgment.

Trigger scenario:

  1. persistence_style = InProgress, two updates accumulate: pending_monitors = [(3, data3), (5, data5)]
  2. Fuzz command 0xf2/0xf6/0xfa/0xfe calls complete_monitor_update(Last)
  3. take_pending(Last) removes (5, data5)pending_monitors = [(3, data3)]
  4. channel_monitor_updated(5) — ChainMonitor removes 5, still has 3 pending
  5. mark_persisted(5, data5)retain removes (3, data3)pending_monitors = []
  6. Shadow has no record of update 3, ChainMonitor still has 3 pending forever
  7. MonitorEvent::Completed is never emitted → channel operations blocked → process_all_events panics at 100 iterations

Fix: mark_update_completed should only update persisted_monitor_id/persisted_monitor without removing other pending entries. The retain cleanup in mark_persisted is correct for track_monitor_update (persister callback context) but wrong here (explicit completion context). Consider splitting the logic or adding a dedicated method that doesn't retain-prune.

joostjager added 2 commits May 8, 2026 17:06
Replace the test persister with harness-owned monitor persistence.

Track completed and in-progress monitor snapshots for reloads.

Keep completion obligations separate from restart candidates.

Completing a newer monitor must not drop older callbacks.
Enable deferred ChainMonitor writes in chanmon_consistency.

Checkpoint the ChannelManager before flushing captured monitor writes.

Treat checkpoint-only progress as progress during settle_all.
@joostjager joostjager force-pushed the chain-mon-internal-deferred-writes-with-fuzz branch from 20275f6 to af7e8ee Compare May 8, 2026 15:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

4 participants