fuzz: cover deferred writing in chanmon_consistency#4465
joostjager wants to merge 2 commits into lightningdevkit:main
Conversation
👋 I see @wpaulino was un-assigned.
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #4465      +/-  ##
==========================================
- Coverage   86.12%   86.10%   -0.02%
==========================================
  Files         157      157
  Lines      108824   108824
  Branches   108824   108824
==========================================
- Hits        93727    93707      -20
- Misses      12480    12499      +19
- Partials     2617     2618       +1
==========================================
```
Force-pushed 51afc25 to 0aebb10

Rebased
Force-pushed 0aebb10 to 6274ba0

Prerequisites are in, rebased.
This LGTM, why is it draft?
I first wanted to do a serious local run, but then it turned out there are so many pre-existing fuzz failures that it is hard to see what's new. I've bisected the failures to the various PRs that introduced them.
Although for this PR, I could just see if anything pops up that doesn't repro on main. Will do that.
Several newly introduced fuzz failures to address:
Zoomed in on one of those sequences, and it seems to be reproducible with another input string without deferred mode too.
Force-pushed c6da16e to d4bf3e0
Dependency: #4520
Force-pushed d4bf3e0 to a798d74
Interestingly, the non-deferred mode reproducer no longer triggers. It seems the increased headroom made the bug unobservable to the fuzzer? I ran the fuzzer on this branch overnight, and no failures were found.
I'll try to add a new invariant to the fuzzer so it catches the problem in a more robust way.
Invariant addition: #4601
Will rebase this after #4571
```rust
/// This simulates the pattern of snapshotting the pending count, persisting the
/// `ChannelManager`, then flushing the queued monitor writes.
```
Nit: The docstring says this "simulates the pattern of snapshotting the pending count, persisting the ChannelManager, then flushing the queued monitor writes." However, in the actual fuzz loop, the flush happens during event processing (called from release_pending_monitor_events), while the ChannelManager is serialized at the end of each loop iteration (lines 3023-3031) — i.e., after the flush, not before.
This means the fuzzer always tests the scenario where the ChannelManager snapshot captures post-flush state. The more interesting crash scenario — serializing the ChannelManager while monitor writes are still queued, then crashing before the flush — is not directly exercised by this ordering. (The fuzzer does partially cover stale-monitor restarts through the use_old_mons selection in reload_node, but that's a different axis.)
Consider either updating the docstring to reflect what actually happens, or restructuring so the ChannelManager is serialized at the flush call-site to match the documented pattern.
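To make the ordering concrete, here is a minimal, self-contained sketch of why the snapshot/persist/flush order matters. All names (`MonitorQueue`, `flush`) are hypothetical, not the fuzzer's actual types: the point is that writes queued after the count snapshot must survive the flush untouched.

```rust
// Hypothetical model of a counted monitor-write queue; not LDK or fuzzer API.
struct MonitorQueue {
    // (monitor_id, serialized blob) pairs awaiting a flush.
    pending: Vec<(u64, Vec<u8>)>,
}

impl MonitorQueue {
    // Snapshot the number of writes queued at this instant.
    fn pending_count(&self) -> usize {
        self.pending.len()
    }

    // Flush only the first `count` writes; anything queued after the snapshot
    // (e.g. while the ChannelManager was being serialized) stays pending.
    fn flush(&mut self, count: usize) -> Vec<(u64, Vec<u8>)> {
        self.pending.drain(..count).collect()
    }
}
```

Under this model, the crash scenario the comment describes corresponds to serializing the manager between `pending_count()` and `flush(count)`: the snapshot then reflects pre-flush state, which the current loop ordering never produces.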
```rust
if persister_res == chain::ChannelMonitorUpdateStatus::Completed {
	for (_channel_id, state) in self.latest_monitors.lock().unwrap().iter_mut() {
		if let Some((id, data)) = state.pending_monitors.drain(..).last() {
			state.persisted_monitor_id = id;
			state.persisted_monitor = data;
		}
	}
}
```
When `persister_res == Completed`, this drains `pending_monitors` for all channels, not just those whose operations were included in the current `flush(count)` call. This is correct today because `update_ret` is constant for the lifetime of a node (set at creation/reload, never mutated by the `mon_style` fuzz commands at 0x00-0x06). That invariant ensures:

- If `update_ret == Completed`: every prior flush also used `Completed`, so there are no "orphaned" pending entries from an earlier `InProgress` flush.
- If `update_ret == InProgress`: this branch is never taken.

But this is a subtle invariant: if a future change allows `update_ret` to be toggled mid-run (e.g., fuzz commands updating the live persister), this would silently promote monitors whose corresponding `channel_monitor_updated` calls were never made by the flush. Consider adding a comment documenting this assumption, or tightening the logic to only promote monitors that correspond to channels with operations in the flushed batch.
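A tightened promotion could scope the drain to the channels actually covered by the flushed batch. A rough, self-contained sketch follows; `ChannelState` and `promote_flushed` are illustrative names, not the harness's real API:

```rust
use std::collections::HashMap;

// Illustrative per-channel shadow state; not the fuzzer's actual struct.
#[derive(Default)]
struct ChannelState {
    pending_monitors: Vec<(u64, Vec<u8>)>,
    persisted_monitor_id: u64,
}

// Promote only the channels whose operations were part of this flush(count)
// call, leaving every other channel's pending entries untouched.
fn promote_flushed(states: &mut HashMap<u32, ChannelState>, flushed_channels: &[u32]) {
    for chan in flushed_channels {
        if let Some(state) = states.get_mut(chan) {
            if let Some((id, _data)) = state.pending_monitors.drain(..).last() {
                state.persisted_monitor_id = id;
            }
        }
    }
}
```

With this shape, toggling `update_ret` mid-run could no longer promote a monitor whose `channel_monitor_updated` call was never made, because unflushed channels are simply skipped.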
After a thorough re-review of every hunk in this diff, I have no new issues to report.

Correction of prior review: my previous inline comments (at lines 414-421 and 452) flagged a bug in [...]

The other prior observations about docstring accuracy (line 367) and the [...]

No issues found.
Force-pushed a798d74 to 20275f6
```rust
fn mark_update_completed(
	&self, channel_id: ChannelId, monitor_id: u64, serialized_monitor: Vec<u8>,
) {
	if let Some(state) = self.latest_monitors.lock().unwrap().get_mut(&channel_id) {
		// Once LDK acknowledges update N as completed, any older pending monitor blob is fully
		// superseded and must not be offered back on restart.
		state.mark_persisted(monitor_id, serialized_monitor);
	}
}
```
Bug: mark_persisted removes all pending entries with id <= monitor_id from the shadow's pending_monitors (via retain), but only the single selected entry was acknowledged with ChainMonitor::channel_monitor_updated. The older entries that are removed will never be acknowledged.
ChainMonitor::channel_monitor_updated removes only the specific completed_update_id from its internal pending_monitor_updates list, and only emits MonitorEvent::Completed when all pending updates have been individually acknowledged.
Regression from old code: The old complete_monitor_update only removed the selected entry (via remove(0), remove(1), or pop()) and left other entries intact for future completion. The new code's reuse of mark_persisted here also cleans up older entries that still need ChainMonitor acknowledgment.
Trigger scenario:

1. `persistence_style = InProgress`, two updates accumulate: `pending_monitors = [(3, data3), (5, data5)]`
2. Fuzz command `0xf2`/`0xf6`/`0xfa`/`0xfe` calls `complete_monitor_update(Last)`
3. `take_pending(Last)` removes `(5, data5)` → `pending_monitors = [(3, data3)]`
4. `channel_monitor_updated(5)`: ChainMonitor removes 5, still has 3 pending
5. `mark_persisted(5, data5)`: `retain` removes `(3, data3)` → `pending_monitors = []`
6. The shadow has no record of update 3, and ChainMonitor has 3 pending forever
7. `MonitorEvent::Completed` is never emitted → channel operations blocked → `process_all_events` panics at 100 iterations
Fix: mark_update_completed should only update persisted_monitor_id/persisted_monitor without removing other pending entries. The retain cleanup in mark_persisted is correct for track_monitor_update (persister callback context) but wrong here (explicit completion context). Consider splitting the logic or adding a dedicated method that doesn't retain-prune.
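The two contexts can be contrasted with a tiny model. The type and method names below are illustrative, not the fuzzer's real code: in the persister-callback path the write of `id` supersedes every older blob, so pruning is safe; in the explicit-completion path only `id` was acknowledged, so older entries must keep their completion obligation.

```rust
// Illustrative shadow of one channel's monitor persistence; not the fuzzer's types.
struct Shadow {
    pending_monitors: Vec<(u64, Vec<u8>)>,
    persisted_monitor_id: u64,
}

impl Shadow {
    // Persister-callback context: writing `id` implies every older blob was
    // superseded on disk, so retain-pruning entries with id <= `id` is correct.
    fn mark_persisted(&mut self, id: u64, _data: Vec<u8>) {
        self.pending_monitors.retain(|(pid, _)| *pid > id);
        self.persisted_monitor_id = id;
    }

    // Explicit-completion context: only `id` was acknowledged via
    // channel_monitor_updated, so older pending entries must not be pruned;
    // they still need their own acknowledgment later.
    fn mark_update_completed(&mut self, id: u64, _data: Vec<u8>) {
        self.persisted_monitor_id = id;
        // Deliberately no retain: an older entry like (3, data3) stays pending.
    }
}
```

In the trigger scenario above, after `take_pending(Last)` removes update 5, the explicit-completion path leaves update 3 pending for its eventual `channel_monitor_updated(3)` call instead of silently dropping it.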
Replace the test persister with harness-owned monitor persistence. Track completed and in-progress monitor snapshots for reloads. Keep completion obligations separate from restart candidates. Completing a newer monitor must not drop older callbacks.
Enable deferred ChainMonitor writes in chanmon_consistency. Checkpoint the ChannelManager before flushing captured monitor writes. Treat checkpoint-only progress as progress during settle_all.
Force-pushed 20275f6 to af7e8ee
Adds fuzz coverage for #4351