fix(auto-creation): batch merge journal writes#474
Conversation
f810e5d to
b786275
Compare
b786275 to
b4383c0
Compare
MotWakorb
left a comment
There was a problem hiding this comment.
Thanks for this, @Mirabis — and for the unusually clear write-up and repro on #473. The batching approach is the right lever for the WAL churn, the shared _build_journal_entry helper is clean, and the audit payload (batch_id, stream-ids-only before/after, scored-fuzzy provenance) is faithfully preserved. We ran a correctness review plus an SRE reliability pass and want to take this — with one required change and two test additions.
Required before merge
1. Guarantee the end-of-pipeline flush on the exception path (durability).
executor._flush_journal_buffer() in auto_creation_engine._process_streams sits just before return results with no try/finally. _process_streams has several await points after live merges have already executed against Dispatcharr (reorder, reconcile-orphans, EPG retry, probe). If any of those raises, the exception propagates up and the flush is skipped — so up to _journal_flush_threshold (100) buffered rows describing merges that actually happened are lost. That's exactly the aborted/bad-run scenario the per-merge journal (bd-0emgo.5) exists to make recoverable, and the old per-row commit didn't have this gap. Please wrap so the flush always runs:
try:
... # passes 2–6
return results
finally:
executor._flush_journal_buffer()_flush_journal_buffer already no-ops on an empty buffer and isolates its own failure, so a finally flush is safe on the exception path.
2. Regression test for #1 — a test where a later pipeline pass raises after a merge, asserting the buffered entry is still flushed.
3. Threshold-flush test — none of the current tests drive ≥100 merges to prove the automatic threshold flush fires and clears the buffer; please add one.
Minor
- The
_journal_mergedocstring still says failures are "swallowed (journal.log_entryalready does this internally)" — that path no longer callslog_entry; please reword to describe the buffer/flush model. A one-line note thattimestampis now stamped at flush time (not per-merge) would also help the next reader.
Everything else — correctness equivalence (incl. the falsy empty-dict case), credential safety, dry-run isolation, batch_id — checks out.
One heads-up for #473 itself: this PR meaningfully reduces WAL growth, but our reliability review found the ~15 GiB memory crash has a separate root cause (the in-memory execution_log + end-of-run json.dumps, on a disk-backed /config), so we'll address that separately — details in my comment on #473. Thanks again; genuinely good contribution.
Summary
journal.log_entries()to persist multiple journal rows in one transactionmerge_streamaudit rows inActionExecutorand flush them at threshold / pipeline completionbatch_id, stream-id-only before/after values, and scored-fuzzy provenanceFixes #473.
Why
Large auto-creation runs can perform many merge journal writes. Committing every merge row individually increases SQLite WAL churn and can produce very large WAL files during a single run. Batching keeps the same audit trail while reducing transaction frequency.
Validation
Passed:
Passed:
Local broader gate note:
The 7 local failures are outside this change:
backend/tests/unit/test_date_placeholders.py::test_expand_date_placeholders_formatis platform/libc-sensitive locally:%Qexpands toQon this macOS runner.backend/tests/routers/test_alert_methods.pySMTP cases fail because this local test registry exposes onlycreate_testandprobe_failures_threshold_test, notsmtp.Full
backend/tests/also reported E2E setup errors locally because those tests attempted to connect to a live server and receivedConnection refused.