fix: isolate reviewer LLM task failures by rianjs · Pull Request #361 · open-cli-collective/codereview-cli

rianjs · 2026-06-24T01:17:11Z

Summary

Wrap selection, reviewer, and rollup structured LLM calls in durable per-task metadata with validated outputs, raw failed attempts, and provider session IDs.
Isolate reviewer validation failures as diagnostics so sibling reviewers and rollup can continue, while blocking LLM task failures leave runs incomplete for resume.
Resume live planning from completed task artifacts and rerun only failed downstream tasks, including prior provider session IDs when available.
Render isolated reviewer diagnostics in the rollup and force the review outcome to at least comment when a reviewer fails.

Fixes #360

Verification

rtk go test ./...
rtk make lint
rtk make build

monit-reviewer

Automated PR Review

Reviewed commit: 177e7da

Summary

Reviewer	Findings
harness-engineering:harness-knowledge-reviewer	3
harness-engineering:harness-self-documenting-code-reviewer	4

harness-engineering:harness-knowledge-reviewer (3 findings)

⚠️ Should Fix - internal/pipeline/pipeline.go:246

The new LLM task artifact schema (schema_version=1, three-status taxonomy, llm-tasks/ directory layout, fingerprint fields) is a durable on-disk contract that gates resume safety. The invariants that matter most — what changes require a version bump, why metadata is written last as the durable resume boundary, which fields are load-bearing — exist only as inline comments and the PR description. A future contributor who adds a status value or renames a field without bumping schema_version will silently break in-flight run resumption. A short docs/ entry (e.g., docs/llm-task-artifacts.md) would make these invariants versioned and discoverable.

💡 Suggestion - internal/pipeline/pipeline.go:251

The two-tier failure taxonomy (failed_isolated for reviewer validation failures, failed_blocking for selection/rollup failures) is a product behavior decision: isolated failures let sibling reviewers and rollup continue; blocking failures leave runs incomplete for resume. The rationale — reviewer failures are less critical than orchestration failures — is present in the PR description but not in any versioned artifact. This distinction will need to be understood by anyone extending the pipeline (e.g., adding a new phase) or diagnosing unexpected run states.

💡 Suggestion - internal/reviewplan/reviewplan.go:393

The decision to force the review outcome to at least comment (never approve) when any reviewer fails is a product policy choice with user-visible consequences. It is implemented in a single conditional with no prose explanation nearby. The PR description documents it, but the reasoning (approving when a reviewer silently failed could miss real issues) belongs somewhere versioned so it doesn't get quietly reversed in a future refactor.

harness-engineering:harness-self-documenting-code-reviewer (4 findings)

💡 Suggestion - internal/llm/adapter.go:80

The Attempt field in StructuredValidationAttempt holds a string label like "initial" or "retry" — it names which attempt this record represents, not the attempt itself. The field name collides semantically with the struct name (StructuredValidationAttempt.Attempt sounds like the attempt IS the field). Stage or Label would clarify that this is a string identifier distinguishing attempts, not a recursive self-reference.

⚠️ Should Fix - internal/pipeline/pipeline.go:1211

llmTaskSpec.validationFailure reads as a noun — a description of a failure type — rather than as a configurable policy. A reader has to scan runStructuredTask to discover it controls which llmTaskStatus to assign when ErrStructuredOutputInvalidAfterRetry is returned. validationFailureStatus or onValidationFailure would make its role as a field that configures task error handling obvious at the struct level.

⚠️ Should Fix - internal/pipeline/pipeline.go:1567

sanitizeTaskError does not convey what it sanitizes for or against. The body reveals three distinct operations: whitespace normalization, HTML-escaping of codereview comment markers (to prevent injection into rendered markdown), and rune-length truncation. A reader needs to trace the call sites to understand the rendering context. renderSafeTaskError or sanitizeTaskErrorForMarkdown would make the purpose evident without reading the body.

💡 Suggestion - internal/reviewplan/summary.go:57

ReviewerFailureSummary.Name is vague in a struct that could plausibly hold a display name, a reviewer label, or an agent identifier. The mapping site in pipeline.go assigns failure.AgentID to this field, making the mismatch visible. ReviewerID or AgentID would match the domain term used throughout the pipeline and eliminate the need to cross-reference the mapping.

Completed in 4m 28s | $3.01 | sonnet | daemon 0.2.132 | Glorfindel

Field	Value
Model	sonnet
Reviewers	hybrid-synthesis, harness-engineering:harness-architecture-reviewer, harness-engineering:harness-enforcement-reviewer, harness-engineering:harness-knowledge-reviewer, harness-engineering:harness-self-documenting-code-reviewer, security:security-code-auditor
Engine	claude · sonnet
Reviewed by	pr-review-daemon · monit-pr-reviewer
Duration	4m 28s wall · 4m 25s compute (Reviewers: 3m 35s · Synthesis: 50s)
Cost	$3.01 (estimated)
Tokens	483.4k in / 28.1k out
Turns	12

Per-workstream usage

Workstream	Model	In	Out	Cache read	Cache create	Cost
hybrid-synthesis	sonnet	37.3k	2.9k	13.1k	24.2k (1h)	$0.19
harness-engineering:harness-architecture-reviewer	sonnet	85.3k	1.5k	0	85.3k (1h)	$0.53
harness-engineering:harness-enforcement-reviewer	sonnet	87.5k	3.3k	10.6k	76.9k (1h)	$0.51
harness-engineering:harness-knowledge-reviewer	sonnet	87.4k	3.3k	10.6k	76.8k (1h)	$0.51
harness-engineering:harness-self-documenting-code-reviewer	sonnet	97.0k	12.2k	10.6k	86.4k (1h)	$0.70
security:security-code-auditor	sonnet	88.9k	5.0k	10.6k	78.3k (1h)	$0.55

_{Re-reviews only run when @monit-reviewer is re-requested as a reviewer — push as many commits as you need, then re-request when ready. PRs targeting branches other than main, master are skipped, even when @monit-reviewer is re-requested.}

Superseded by updated review

monit-reviewer

Automated PR Review

Reviewed commit: 0c3510d | Previous: 177e7da (incremental)

Approved with 1 non-blocking suggestion below. Address at your discretion.

Summary

Reviewer	Findings
documentation:docs-reviewer	1

documentation:docs-reviewer (1 findings)

💡 Suggestion - docs/llm-task-artifacts.md:12

The directory structure lists 'initial.json' and 'retry.json' as sibling files, but the metadata schema section never explicitly names or defines these files. The 'attempts' field describes 'raw output path when present' but doesn't explain that these paths resolve to 'initial.json' and 'retry.json'. A developer implementing against this spec would have to infer the naming convention. A brief note tying the attempt label to the file name (e.g., '.json') would make the schema self-contained.

7 PR discussion threads considered.

Completed in 1m 31s | $8.15 | sonnet | daemon 0.2.132 | Glorfindel

Field	Value
Model	sonnet
Mode	Re-review · Cycle 2 · Session resumed
Reviewers	hybrid-synthesis, documentation:docs-reviewer, harness-engineering:harness-architecture-reviewer, harness-engineering:harness-enforcement-reviewer, harness-engineering:harness-knowledge-reviewer, harness-engineering:harness-self-documenting-code-reviewer, security:security-code-auditor
Engine	claude · sonnet
Reviewed by	pr-review-daemon · monit-pr-reviewer
Duration	1m 31s wall · 1m 00s compute (Reviewers: 39s · Synthesis: 22s)
Cost	$8.15 (estimated)
Tokens	1382.9k in / 33.0k out
Turns	26

Per-workstream usage

Workstream	Model	In	Out	Cache read	Cache create	Cost
hybrid-synthesis	sonnet	86.3k	4.0k	26.3k	60.0k (1h)	$0.43
documentation:docs-reviewer	sonnet	34.1k	1.7k	13.1k	20.9k (1h)	$0.15
harness-engineering:harness-architecture-reviewer	sonnet	232.7k	1.7k	13.1k	219.5k (1h)	$1.35
harness-engineering:harness-enforcement-reviewer	sonnet	239.3k	3.5k	23.7k	215.6k (1h)	$1.35
harness-engineering:harness-knowledge-reviewer	sonnet	238.8k	3.5k	23.7k	215.1k (1h)	$1.35
harness-engineering:harness-self-documenting-code-reviewer	sonnet	265.2k	12.4k	23.7k	241.4k (1h)	$1.64
security:security-code-auditor	sonnet	240.8k	5.2k	23.7k	217.1k (1h)	$1.39
discussion-summarizer	—	45.7k	1.1k	0	45.7k (1h)	$0.48

_{Re-reviews only run when @monit-reviewer is re-requested as a reviewer — push as many commits as you need, then re-request when ready. PRs targeting branches other than main, master are skipped, even when @monit-reviewer is re-requested.}

rianjs · 2026-06-24T01:35:50Z

Findings

Major - internal/pipeline/pipeline.go:696, internal/pipeline/pipeline.go:1225
Rollup resume can ignore the failed rollup task’s own provider session when --session is active. The rollup spec always passes namedSession.resumeID(), and runStructuredTask only falls back to taskResumeSessionID(meta) when that value is blank. On a live named-session resume, completed selection is loaded first and namedSession.recordSessionID(selectionSession) makes the resume ID non-empty, so the failed rollup retry resumes from the selection/named session instead of the failed_blocking rollup attempt. That violates the “rerun only failed downstream tasks, including prior provider session IDs” contract. Prefer the failed same-task session when retrying failed_blocking metadata, and add a named-session resume test for this path.

Major - internal/llm/adapter.go:135, internal/llm/adapter.go:137, internal/pipeline/pipeline.go:1279
Provider start/wait failures still do not meet the generalized task diagnostic/resume contract. RunStructuredWithSessionResume discards sessionID on runOnceWithSession errors, so a wait failure after a stream starts cannot persist the known provider session ID for metadata or retry. Separately, reviewer tasks only become failed_isolated for ErrStructuredOutputInvalidAfterRetry; task-local provider failures become failed_blocking and cancel sibling reviewers. That is narrower than the discussed invariant that task-local reviewer LLM failures should not damage successful reviewers when the caller context is still valid.

Major - internal/pipeline/pipeline.go:1308, internal/pipeline/pipeline.go:1311
Resume trusts metadata.json’s validated_output_path verbatim. A stale or edited metadata file can redirect resume to any readable path, and the code will decode it as a completed task instead of treating the artifact as inconsistent. The artifact schema says validated-output.json is under the task directory and inconsistent artifacts should fail with rerun guidance; enforce that the path is either the expected path or safely inside the task artifact directory.

rianjs · 2026-06-24T01:43:46Z

Major - internal/pipeline/pipeline.go: task reuse does not validate the adapter/runtime that produced the artifact. baseLLMTaskMetadata records Adapter, but validateLLMTaskMetadata and llmTaskFingerprint only check schema/task/phase/model/effort/prompt/deps. A resume can therefore reuse outputs from a different adapter, and worse, a failed_blocking retry can pass an old provider session ID to a different adapter. Treat adapter identity as load-bearing before reusing output or taskResumeSessionID, and add a mismatch test.

Major - internal/pipeline/pipeline.go: reviewer provider failure isolation is too broad. llmTaskFailureStatus marks any non-context reviewer LLM error as failed_isolated, including auth/quota/systemic provider start errors. That can turn “the LLM backend is unavailable for every reviewer” into a completed comment with diagnostics instead of an incomplete resumable run. The intended boundary is reviewer-local failure; global/provider-capability failures should remain blocking or be explicitly classified with tests.

rianjs · 2026-06-24T01:49:19Z

No blocker or major findings. The latest diff binds task reuse and failed-task retry to the adapter through both metadata validation and fingerprinting, and it fails before invoking the adapter on mismatch. Reviewer LLM failures are now isolated only when there is structured-output failure evidence or a provider session/validation attempt tying the failure to that task; sessionless start failures remain blocking and resumable.

rianjs · 2026-06-24T01:51:47Z

Findings

Major: The new failure-isolation coverage never proves the durable artifacts actually exist on disk. TestDryRunReviewerFailureIsolation / TestDryRunReviewerProviderFailureIsolation in internal/pipeline/pipeline_test.go only assert counts, statuses, and summary text, while TestRunStructuredValidationAttempts in internal/llm/adapter_test.go only inspects in-memory attempts. These tests would still pass if llm-tasks/.../initial.json, retry.json, validated-output.json, or the sanitized failure fields stopped being persisted, which is the core recovery/debuggability fix in this ticket.
Major: The multi-reviewer isolation test is serialized and over-mocked. In internal/pipeline/pipeline_test.go it sets MaxConcurrency: 1, and the fake adapter keys off prompt substrings rather than reviewer identity. That means it does not exercise the actual concurrent reviewer path or prove that one reviewer’s failure cannot cancel or misroute sibling work when the real runner executes in parallel. The ticket explicitly called for agent-keyed/synchronized determinism here, so this leaves the regression surface uncovered.
Minor: The resume test in internal/reviewrun/reviewrun_test.go only proves the planner is invoked when a selection metadata file exists. It does not prove the stored task output/session row is actually reused, nor that fresh selection/reviewer LLM calls are skipped. A broken implementation that ignored llm-tasks and simply replanned the same run could still satisfy this test.

monit-reviewer

Automated PR Review

Reviewed commit: 55f3605 | Previous: 0c3510d (incremental)

Summary

Reviewer	Findings
harness-engineering:harness-self-documenting-code-reviewer	1

harness-engineering:harness-self-documenting-code-reviewer (1 findings)

⚠️ Should Fix - internal/pipeline/pipeline.go:1301

The if condition in llmTaskFailureStatus mixes three semantically distinct predicates into a single 8-clause expression: (1) whether the task is configured to support isolation (spec.llmFailureStatus != ""), (2) whether the error is a context cancellation (ctx.Err() == nil && !errors.Is(err, context.Canceled) && !errors.Is(err, context.DeadlineExceeded)), and (3) whether there is evidence the provider accepted the request (errors.Is(err, llm.ErrStructuredOutputInvalidAfterRetry) || strings.TrimSpace(providerSessionID) != "" || hasValidationAttempt). Each sub-group carries domain meaning that disappears in the combined form. Extracting them — e.g., supportsIsolation, isContextCancellation (or reusing the existing isContextError helper used elsewhere in the file), and hasTaskExecutionEvidence — would make the isolation policy readable without parsing all 8 clauses.

12 PR discussion threads considered.

Completed in 4m 07s | $15.72 | sonnet | daemon 0.2.132 | Glorfindel

Field	Value
Model	sonnet
Mode	Re-review · Cycle 3 · Session resumed
Reviewers	hybrid-synthesis, documentation:docs-reviewer, harness-engineering:harness-architecture-reviewer, harness-engineering:harness-enforcement-reviewer, harness-engineering:harness-knowledge-reviewer, harness-engineering:harness-self-documenting-code-reviewer, security:security-code-auditor
Engine	claude · sonnet
Reviewed by	pr-review-daemon · monit-pr-reviewer
Duration	4m 07s wall · 3m 36s compute (Reviewers: 2m 51s · Synthesis: 45s)
Cost	$15.72 (estimated)
Tokens	2682.3k in / 51.3k out
Turns	40

Per-workstream usage

Workstream	Model	In	Out	Cache read	Cache create	Cost
hybrid-synthesis	sonnet	144.7k	5.3k	39.4k	105.3k (1h)	$0.72
documentation:docs-reviewer	sonnet	77.7k	2.3k	26.3k	51.4k (1h)	$0.35
harness-engineering:harness-architecture-reviewer	sonnet	453.4k	1.9k	26.3k	427.1k (1h)	$2.60
harness-engineering:harness-enforcement-reviewer	sonnet	466.8k	5.7k	36.9k	429.9k (1h)	$2.68
harness-engineering:harness-knowledge-reviewer	sonnet	464.8k	4.7k	36.9k	428.0k (1h)	$2.65
harness-engineering:harness-self-documenting-code-reviewer	sonnet	514.1k	22.0k	36.9k	477.2k (1h)	$3.20
security:security-code-auditor	sonnet	465.7k	8.0k	36.9k	428.8k (1h)	$2.70
discussion-summarizer	—	95.1k	1.4k	18.5k	76.6k (1h)	$0.81

_{Re-reviews only run when @monit-reviewer is re-requested as a reviewer — push as many commits as you need, then re-request when ready. PRs targeting branches other than main, master are skipped, even when @monit-reviewer is re-requested.}

Superseded by updated review

monit-reviewer · 2026-06-24T03:14:10Z

+	err    error
+}
+
+func (e *llmTaskError) Error() string {


🔵 Low (harness-engineering:harness-self-documenting-code-reviewer): llmTaskAttemptMetadata.Attempt holds a label string (e.g., "initial", "retry") but the field name collides semantically with the struct name, making it read as "the attempt itself" rather than "the label identifying this attempt." This is the same ambiguity that was resolved for StructuredValidationAttempt in this PR (renamed Attempt → Label). Using Label here would be consistent with that fix and eliminate the ambiguity.

Reply to this thread when addressed.

monit-reviewer · 2026-06-24T03:14:11Z

+// ReviewerFailureSummary is a reviewer task diagnostic rendered in the rollup.
+type ReviewerFailureSummary struct {
+	AgentID string
+	Error   string


🔵 Low (harness-engineering:harness-self-documenting-code-reviewer): ReviewerFailureSummary.Error is a plain string holding a rendered diagnostic message, but the field name Error evokes Go's error interface and implies a runtime error value. A name like Diagnostic or ErrorMessage would distinguish it from an interface type and make the rendered-string nature clear to readers of the struct.

Reply to this thread when addressed.

monit-reviewer

Automated PR Review

Reviewed commit: dc40084 | Previous: 55f3605 (incremental)

Approved with 3 non-blocking suggestions below. Address at your discretion.

Summary

Reviewer	Findings
harness-engineering:harness-self-documenting-code-reviewer	3

harness-engineering:harness-self-documenting-code-reviewer (3 findings)

💡 Suggestion - internal/pipeline/pipeline.go:266

llmTaskAttemptMetadata.Attempt holds a label string (e.g., "initial", "retry") but the field name collides semantically with the struct name, making it read as "the attempt itself" rather than "the label identifying this attempt." This is the same ambiguity that was resolved for StructuredValidationAttempt in this PR (renamed Attempt → Label). Using Label here would be consistent with that fix and eliminate the ambiguity.

💡 Suggestion - internal/pipeline/pipeline.go:1690

In validateLLMTaskPayloadPath, the path-containment check rel == ".." || strings.HasPrefix(rel, ".."+string(os.PathSeparator)) || filepath.IsAbs(rel) encodes the concept "path escapes the task directory" as raw operator soup. Extracting as escapesTaskDir := ... before the if-statement would make the path-traversal intent legible without needing to parse the three clauses.

💡 Suggestion - internal/reviewplan/summary.go:60

ReviewerFailureSummary.Error is a plain string holding a rendered diagnostic message, but the field name Error evokes Go's error interface and implies a runtime error value. A name like Diagnostic or ErrorMessage would distinguish it from an interface type and make the rendered-string nature clear to readers of the struct.

13 PR discussion threads considered.

Completed in 5m 58s | $19.91 | sonnet | daemon 0.2.132 | Glorfindel

Field	Value
Model	sonnet
Mode	Re-review · Cycle 4 · Session resumed
Reviewers	hybrid-synthesis, documentation:docs-reviewer, harness-engineering:harness-architecture-reviewer, harness-engineering:harness-enforcement-reviewer, harness-engineering:harness-knowledge-reviewer, harness-engineering:harness-self-documenting-code-reviewer, security:security-code-auditor
Engine	claude · sonnet
Reviewed by	pr-review-daemon · monit-pr-reviewer
Duration	5m 58s wall · 5m 30s compute (Reviewers: 3m 45s · Synthesis: 1m 45s)
Cost	$19.91 (estimated)
Tokens	3648.7k in / 44.8k out
Turns	48

Per-workstream usage

Workstream	Model	In	Out	Cache read	Cache create	Cost
hybrid-synthesis	sonnet	213.9k	7.1k	79.7k	134.2k (1h)	$0.94
documentation:docs-reviewer	sonnet	129.3k	2.5k	39.4k	89.9k (1h)	$0.59
harness-engineering:harness-architecture-reviewer	sonnet	747.6k	2.1k	39.4k	708.2k (1h)	$4.29
harness-engineering:harness-enforcement-reviewer	sonnet	770.2k	5.9k	195.3k	574.9k (1h)	$3.60
harness-engineering:harness-knowledge-reviewer	sonnet	765.6k	4.9k	50.0k	715.6k (1h)	$4.38
harness-engineering:harness-self-documenting-code-reviewer	sonnet	110.9k	12.4k	55.8k	55.1k (1h)	$0.53
security:security-code-auditor	sonnet	763.9k	8.2k	50.0k	713.8k (1h)	$4.42
discussion-summarizer	—	147.3k	1.6k	37.0k	110.4k (1h)	$1.16

_{Re-reviews only run when @monit-reviewer is re-requested as a reviewer — push as many commits as you need, then re-request when ready. PRs targeting branches other than main, master are skipped, even when @monit-reviewer is re-requested.}

Note: Posted 2 inline comments. 1 finding remained in the summary because GitHub could not attach them to the current diff.

rianjs · 2026-06-24T03:21:58Z

No architectural drift found in the post-daemon diff. The daemon edits strengthen the concurrency coverage for sibling reviewer survival and raw payload persistence without changing the accepted task/resume semantics.

Isolate reviewer LLM task failures

177e7da

monit-reviewer previously requested changes Jun 24, 2026

View reviewed changes

Document LLM task artifact contract

0c3510d

monit-reviewer previously approved these changes Jun 24, 2026

View reviewed changes

Comment thread docs/llm-task-artifacts.md

rianjs changed the title ~~Isolate reviewer LLM task failures~~ fix: isolate reviewer LLM task failures Jun 24, 2026

docs: clarify LLM task attempt files

6ee858e

rianjs dismissed monit-reviewer’s stale review via 6ee858e June 24, 2026 01:29

fix: harden LLM task resume boundaries

07996f1

fix: bind LLM task artifacts to adapter

e51cb5f

test: assert durable LLM task artifacts

55f3605

monit-reviewer previously requested changes Jun 24, 2026

View reviewed changes

Comment thread internal/pipeline/pipeline.go

refactor: name LLM task failure predicates

dc40084

monit-reviewer reviewed Jun 24, 2026

View reviewed changes

monit-reviewer approved these changes Jun 24, 2026

View reviewed changes

rianjs merged commit fe57b80 into main Jun 24, 2026
10 checks passed

rianjs deleted the bug/360-reviewer-failure-isolation branch June 24, 2026 03:22

Uh oh!

Conversation

rianjs commented Jun 24, 2026

Summary

Verification

Uh oh!

monit-reviewer left a comment

Choose a reason for hiding this comment

Automated PR Review

Summary

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

monit-reviewer left a comment

Choose a reason for hiding this comment

Automated PR Review

Summary

Uh oh!

Uh oh!

rianjs commented Jun 24, 2026

Uh oh!

rianjs commented Jun 24, 2026

Uh oh!

rianjs commented Jun 24, 2026

Uh oh!

rianjs commented Jun 24, 2026

Uh oh!

monit-reviewer left a comment

Choose a reason for hiding this comment

Automated PR Review

Summary

Uh oh!

Uh oh!

monit-reviewer Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

monit-reviewer Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

monit-reviewer left a comment

Choose a reason for hiding this comment

Automated PR Review

Summary

Uh oh!

rianjs commented Jun 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants