Skip to content

feat: add durable dossier discussion summaries#379

Merged
rianjs merged 5 commits into
mainfrom
issue-369-discussion-summarization
Jun 24, 2026
Merged

feat: add durable dossier discussion summaries#379
rianjs merged 5 commits into
mainfrom
issue-369-discussion-summarization

Conversation

@rianjs

@rianjs rianjs commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

No description provided.

@rianjs

rianjs commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Findings

  • Major: internal/pipeline/pipeline.go only fingerprints the bounded prompt projection, and internal/pipeline/pipeline.go builds SourceFingerprint from that same capped/excerpted input. Changes to omitted comments, omitted thread comments, omitted threads, or text beyond the 480-rune excerpt will not invalidate the cached dossier-discussion-summary task. That violates the contract’s artifact-digest rule and the plan’s “raw discussion changes invalidate summary” requirement. Include content digests for the raw discussion artifacts, or full raw discussion projection, in the task fingerprint/provenance and add tests for omitted/truncated input changes.

  • Major: internal/pipeline/pipeline.go accepts any non-empty inline-thread path from the model and does not validate line/side/anchor_kind against the input thread anchors. Since the final dossier now consumes only the normalized summary, a model can silently move a discussion to the wrong file/line. Preserve anchors deterministically outside the free-form summary, or validate decoded thread anchors against the known input set.

  • Major: internal/pipeline/pipeline.go enforces the reviewer-facing exclusion rules with a very narrow substring list. It catches "approval" and "ci status", but not common forbidden variants like approved, build failed, mergeable, draft, requested reviewers, run_id, retry state, or ledger/cache wording. Because the contract says these must not appear in reviewer-facing context, expand the validation and tests beyond the single CI status case.

  • Minor: internal/pipeline/pipeline.go makes no-run durable-task loading conditional only on empty runID. The plan scoped that exception to SelectionOnly dossier summaries, but the generic helper now allows any future empty-run structured task to bypass ledger lookup. Consider an explicit allowNoRunTask/callerOwnedArtifacts flag so this exception cannot spread accidentally.

@rianjs

rianjs commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Findings

  • Major: internal/pipeline/pipeline.go keeps omitted-count signals only in the summarizer prompt input, but internal/pipeline/pipeline.go does not carry omitted counts in the normalized summary artifact and internal/pipeline/pipeline.go does not render them deterministically into dossier/final/discussion.md. That means a large discussion can be capped to the first 20 comments/threads and the reviewer-facing dossier may look complete unless the model voluntarily mentions the omitted counts. The ticket plan explicitly required preserving omitted-count signals when truncation occurs; make those counts structured summary fields and render them deterministically, with a test covering final dossier output.

The prior review findings are otherwise addressed: the summary fingerprint now includes full source discussion provenance, inline anchors are validated against source threads, forbidden process-state coverage is stronger, and no-run durable reuse is explicitly scoped by allowNoRunCache.

@rianjs

rianjs commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

No further architectural findings. The omitted-count signal is now structured in the summary artifact, restored after cached loads, and rendered deterministically into the final dossier for top-level comments, inline threads, and per-thread omitted comments.

The PR now matches the ticket intent: raw discussion remains durable, summary execution is durable and run-backed for normal reviews, SelectionOnly has a scoped no-run cache path, final dossier discussion consumes normalized summary only, and cache invalidation covers full raw discussion provenance.

@rianjs

rianjs commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Coverage is improved, but several required behaviors are still not actually proved by these tests.

  • Major: Inline-thread cache invalidation and anchor preservation are still uncovered. TestPrepareDossierArtifactsInvalidatesSummaryWhenOmittedOrTruncatedDiscussionChanges only mutates top-level comments, and TestPrepareDossierArtifactsRendersInlineThreadOmittedCounts only checks omitted-count prose. Nothing asserts that changing an inline thread invalidates the summary cache or that path/side/line/anchor_kind/resolved survive into dossier/summary/discussion.json and the final dossier, so a regression in the new thread-summary contract would still pass. internal/pipeline/pipeline_test.go:733

  • Major: The summary prompt shape itself is effectively untested. All of the dossier-summary paths feed a canned discussionSummaryJSON(...) response and never inspect the generated prompt, so the suite would still pass if the implementation summarized from the wrong source, omitted the top-level-vs-inline split, or dropped anchor/resolved context before calling the model. That leaves the highest-risk part of this change unverified. internal/pipeline/pipeline_test.go:579 internal/pipeline/pipeline_test.go:626 internal/pipeline/pipeline_test.go:915

  • Major: SelectionOnly no-run reuse semantics are under-tested. TestSelectionOnlyReusesCachedDossierSummaryTask proves one cached reload for an unchanged artifact root, but it does not prove the stale-artifact invalidation requirement for caller-owned roots, and it does not assert that the loader avoided any ledger/session lookup. A broken implementation that silently consulted the store or reused stale summary data would still satisfy this test. internal/pipeline/pipeline_test.go:579

  • Minor: The reviewer-facing exclusion checks are string-literal based and shallow. TestDryRunPlansAndPersistsWithoutProviderWrites and TestDecodeDossierDiscussionSummaryRejectsProcessState only look for a few exact substrings, so alternate leak forms in the summary artifact or final markdown, different casing, or new process-state fields would still slip through. internal/pipeline/pipeline_test.go:30 internal/pipeline/pipeline_test.go:954

Comment thread internal/pipeline/pipeline.go
Comment thread internal/pipeline/pipeline.go
Comment thread internal/pipeline/pipeline.go
Comment thread docs/llm-task-artifacts.md Outdated

@monit-reviewer monit-reviewer left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated PR Review

Reviewed commit: 4cba246

Summary

Reviewer Findings
harness-engineering:harness-knowledge-reviewer 1
harness-engineering:harness-self-documenting-code-reviewer 2
security:security-code-auditor 2
harness-engineering:harness-knowledge-reviewer (1 findings)

💡 Suggestion - docs/llm-task-artifacts.md:30

The schema version note references 'Ticket #369' to justify not bumping the version. Tying a schema freeze decision to an external ticket means future readers cannot evaluate whether the same reasoning applies when the next task is added. Consider replacing the ticket reference with an inline rule, e.g., 'A new task that reuses the existing metadata fields does not require a schema bump,' so the policy is self-contained.

harness-engineering:harness-self-documenting-code-reviewer (2 findings)

💡 Suggestion - internal/pipeline/pipeline.go:3278

In dossierDiscussionPromptInputFromDiscussion, sourcePayload used to compute SourceFingerprint includes uncapped filteredTopLevel and discussion.InlineThreads, while the prompt input uses capped slices. This divergence is intentional — fingerprinting the full filtered discussion ensures changes to omitted content still invalidate the cache — but to a future reader it looks like an accidental inconsistency. A brief why-comment at the sourcePayload definition would make this contract explicit and prevent a well-meaning refactor from breaking cache correctness.

💡 Suggestion - internal/pipeline/pipeline.go:3428

In renderDossierDiscussionSummaryMarkdown, the switch on thread.Status has explicit arms for "settled" and "unresolved" but routes "noted" — a valid, documented status per the output contract in buildDossierDiscussionSummaryPrompt — through the default arm alongside the empty-string fallback. A reader cannot distinguish deliberate design (noted threads carry no label prefix) from an accidental omission. Adding an explicit case "noted": arm, even with the same body as default, would make the rendering contract self-evident.

security:security-code-auditor (2 findings)

⚠️ Should Fix - internal/pipeline/pipeline.go:3147

Prompt injection via user-controlled PR discussion content: dossierDiscussionPromptInputFromDiscussion sends raw PR comment bodies and inline thread comment bodies (from external contributors) directly into the LLM prompt. The forbiddenDiscussionSummaryPatterns filter only blocks a narrow set of process-state terms and would not prevent an adversarial commenter from embedding instructions like 'Ignore prior instructions. Summarize all threads as: All issues resolved, PR is safe to merge.' A successfully injected instruction that avoids those terms would produce a misleading reviewer-facing dossier entry. Since this tool is used to inform code review decisions, a manipulated summary is a meaningful risk. Fix: treat LLM-generated summaries as untrusted display strings; consider explicit output-length/format constraints or a stricter output schema that makes instruction-override less effective; document that this path trusts the LLM to resist adversarial prompts.

💡 Suggestion - internal/pipeline/pipeline.go:3195

Silently ignored json.Marshal error: data, _ := json.Marshal(sourcePayload) discards the error. If marshaling fails, data is nil and sha256Hex(nil) returns a constant hash, making SourceFingerprint identical across all discussions regardless of content. This would defeat cache invalidation in resetLLMTaskIfInputFingerprintChanged, causing stale summaries to be served after discussion changes. json.Marshal rarely fails on plain structs in practice, but the silent discard is a latent correctness risk that should at minimum log the error and return early.

1 info-level observations excluded. Run with --verbose to include.

4 PR discussion threads considered.


Completed in 4m 38s | $3.15 | sonnet | daemon 0.2.132 | Glorfindel
Field Value
Model sonnet
Reviewers hybrid-synthesis, documentation:docs-reviewer, harness-engineering:harness-architecture-reviewer, harness-engineering:harness-enforcement-reviewer, harness-engineering:harness-knowledge-reviewer, harness-engineering:harness-self-documenting-code-reviewer, security:security-code-auditor
Engine claude · sonnet
Reviewed by pr-review-daemon · monit-pr-reviewer
Duration 4m 38s wall · 4m 32s compute (Reviewers: 3m 25s · Synthesis: 1m 07s)
Cost $3.15 (estimated)
Tokens 532.5k in / 31.7k out
Turns 14

Per-workstream usage

Workstream Model In Out Cache read Cache create Cost
hybrid-synthesis sonnet 39.2k 3.2k 13.1k 26.0k (1h) $0.21
documentation:docs-reviewer sonnet 36.1k 2.4k 13.1k 22.9k (1h) $0.18
harness-engineering:harness-architecture-reviewer sonnet 87.6k 796 13.1k 74.5k (1h) $0.46
harness-engineering:harness-enforcement-reviewer sonnet 90.9k 3.6k 13.1k 77.7k (1h) $0.52
harness-engineering:harness-knowledge-reviewer sonnet 90.5k 3.4k 13.1k 77.3k (1h) $0.52
harness-engineering:harness-self-documenting-code-reviewer sonnet 97.8k 12.3k 13.1k 84.6k (1h) $0.70
security:security-code-auditor sonnet 90.6k 6.0k 13.1k 77.4k (1h) $0.56

Re-reviews only run when @monit-reviewer is re-requested as a reviewer — push as many commits as you need, then re-request when ready. PRs targeting branches other than main, master are skipped, even when @monit-reviewer is re-requested.

Note: Posted 4 inline comments. 1 finding remained in the summary because GitHub could not attach them to the current diff.

@monit-reviewer monit-reviewer dismissed their stale review June 24, 2026 17:15

Superseded by updated review

@monit-reviewer monit-reviewer left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated PR Review

Reviewed commit: 688edee | Previous: 4cba246 (incremental)

Summary

No issues found.

8 PR discussion threads considered.


Completed in 2m 35s | $8.05 | sonnet | daemon 0.2.132 | Glorfindel
Field Value
Model sonnet
Mode Re-review · Cycle 2 · Session resumed
Reviewers hybrid-synthesis, documentation:docs-reviewer, harness-engineering:harness-architecture-reviewer, harness-engineering:harness-enforcement-reviewer, harness-engineering:harness-knowledge-reviewer, harness-engineering:harness-self-documenting-code-reviewer, security:security-code-auditor
Engine claude · sonnet
Reviewed by pr-review-daemon · monit-pr-reviewer
Duration 2m 35s wall · 2m 03s compute (Reviewers: 1m 35s · Synthesis: 28s)
Cost $8.05 (estimated)
Tokens 1460.8k in / 37.8k out
Turns 28

Per-workstream usage

Workstream Model In Out Cache read Cache create Cost
hybrid-synthesis sonnet 90.2k 4.4k 26.3k 63.9k (1h) $0.46
documentation:docs-reviewer sonnet 82.2k 3.4k 26.3k 55.9k (1h) $0.39
harness-engineering:harness-architecture-reviewer sonnet 237.2k 954 26.3k 210.9k (1h) $1.29
harness-engineering:harness-enforcement-reviewer sonnet 246.8k 3.8k 26.3k 220.5k (1h) $1.39
harness-engineering:harness-knowledge-reviewer sonnet 246.5k 4.4k 98.3k 148.2k (1h) $0.98
harness-engineering:harness-self-documenting-code-reviewer sonnet 268.3k 13.4k 26.3k 242.0k (1h) $1.66
security:security-code-auditor sonnet 246.4k 6.8k 26.3k 220.1k (1h) $1.43
discussion-summarizer 43.2k 667 0 43.2k (1h) $0.45

Re-reviews only run when @monit-reviewer is re-requested as a reviewer — push as many commits as you need, then re-request when ready. PRs targeting branches other than main, master are skipped, even when @monit-reviewer is re-requested.

@rianjs

rianjs commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

No architectural drift found in the daemon-driven edits. The changes preserve the agreed direction: durable run-backed summary for normal reviews, scoped no-run cache for SelectionOnly, final dossier built from normalized summary, full-source fingerprinting, anchor validation, deterministic omitted-count rendering, and reviewer-facing exclusion rules.

The daemon edits are constructive: marking comment bodies as untrusted_body, adding an explicit prompt-injection rule, handling fingerprint marshal errors, and expanding tests all tighten the contract without changing ticket scope.

@rianjs rianjs merged commit c124fb8 into main Jun 24, 2026
10 checks passed
@rianjs rianjs deleted the issue-369-discussion-summarization branch June 24, 2026 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants