Skip to content

fix: isolate reviewer LLM task failures#361

Merged
rianjs merged 7 commits into
mainfrom
bug/360-reviewer-failure-isolation
Jun 24, 2026
Merged

fix: isolate reviewer LLM task failures#361
rianjs merged 7 commits into
mainfrom
bug/360-reviewer-failure-isolation

Conversation

@rianjs

@rianjs rianjs commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Wrap selection, reviewer, and rollup structured LLM calls in durable per-task metadata with validated outputs, raw failed attempts, and provider session IDs.
  • Isolate reviewer validation failures as diagnostics so sibling reviewers and rollup can continue, while blocking LLM task failures leave runs incomplete for resume.
  • Resume live planning from completed task artifacts and rerun only failed downstream tasks, including prior provider session IDs when available.
  • Render isolated reviewer diagnostics in the rollup and force the review outcome to at least comment when a reviewer fails.

Fixes #360

Verification

  • rtk go test ./...
  • rtk make lint
  • rtk make build

@monit-reviewer monit-reviewer left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated PR Review

Reviewed commit: 177e7da

Summary

Reviewer Findings
harness-engineering:harness-knowledge-reviewer 3
harness-engineering:harness-self-documenting-code-reviewer 4
harness-engineering:harness-knowledge-reviewer (3 findings)

⚠️ Should Fix - internal/pipeline/pipeline.go:246

The new LLM task artifact schema (schema_version=1, three-status taxonomy, llm-tasks/ directory layout, fingerprint fields) is a durable on-disk contract that gates resume safety. The invariants that matter most — what changes require a version bump, why metadata is written last as the durable resume boundary, which fields are load-bearing — exist only as inline comments and the PR description. A future contributor who adds a status value or renames a field without bumping schema_version will silently break in-flight run resumption. A short docs/ entry (e.g., docs/llm-task-artifacts.md) would make these invariants versioned and discoverable.

💡 Suggestion - internal/pipeline/pipeline.go:251

The two-tier failure taxonomy (failed_isolated for reviewer validation failures, failed_blocking for selection/rollup failures) is a product behavior decision: isolated failures let sibling reviewers and rollup continue; blocking failures leave runs incomplete for resume. The rationale — reviewer failures are less critical than orchestration failures — is present in the PR description but not in any versioned artifact. This distinction will need to be understood by anyone extending the pipeline (e.g., adding a new phase) or diagnosing unexpected run states.

💡 Suggestion - internal/reviewplan/reviewplan.go:393

The decision to force the review outcome to at least comment (never approve) when any reviewer fails is a product policy choice with user-visible consequences. It is implemented in a single conditional with no prose explanation nearby. The PR description documents it, but the reasoning (approving when a reviewer silently failed could miss real issues) belongs somewhere versioned so it doesn't get quietly reversed in a future refactor.

harness-engineering:harness-self-documenting-code-reviewer (4 findings)

💡 Suggestion - internal/llm/adapter.go:80

The Attempt field in StructuredValidationAttempt holds a string label like "initial" or "retry" — it names which attempt this record represents, not the attempt itself. The field name collides semantically with the struct name (StructuredValidationAttempt.Attempt sounds like the attempt IS the field). Stage or Label would clarify that this is a string identifier distinguishing attempts, not a recursive self-reference.

⚠️ Should Fix - internal/pipeline/pipeline.go:1211

llmTaskSpec.validationFailure reads as a noun — a description of a failure type — rather than as a configurable policy. A reader has to scan runStructuredTask to discover it controls which llmTaskStatus to assign when ErrStructuredOutputInvalidAfterRetry is returned. validationFailureStatus or onValidationFailure would make its role as a field that configures task error handling obvious at the struct level.

⚠️ Should Fix - internal/pipeline/pipeline.go:1567

sanitizeTaskError does not convey what it sanitizes for or against. The body reveals three distinct operations: whitespace normalization, HTML-escaping of codereview comment markers (to prevent injection into rendered markdown), and rune-length truncation. A reader needs to trace the call sites to understand the rendering context. renderSafeTaskError or sanitizeTaskErrorForMarkdown would make the purpose evident without reading the body.

💡 Suggestion - internal/reviewplan/summary.go:57

ReviewerFailureSummary.Name is vague in a struct that could plausibly hold a display name, a reviewer label, or an agent identifier. The mapping site in pipeline.go assigns failure.AgentID to this field, making the mismatch visible. ReviewerID or AgentID would match the domain term used throughout the pipeline and eliminate the need to cross-reference the mapping.


Completed in 4m 28s | $3.01 | sonnet | daemon 0.2.132 | Glorfindel
Field Value
Model sonnet
Reviewers hybrid-synthesis, harness-engineering:harness-architecture-reviewer, harness-engineering:harness-enforcement-reviewer, harness-engineering:harness-knowledge-reviewer, harness-engineering:harness-self-documenting-code-reviewer, security:security-code-auditor
Engine claude · sonnet
Reviewed by pr-review-daemon · monit-pr-reviewer
Duration 4m 28s wall · 4m 25s compute (Reviewers: 3m 35s · Synthesis: 50s)
Cost $3.01 (estimated)
Tokens 483.4k in / 28.1k out
Turns 12

Per-workstream usage

Workstream Model In Out Cache read Cache create Cost
hybrid-synthesis sonnet 37.3k 2.9k 13.1k 24.2k (1h) $0.19
harness-engineering:harness-architecture-reviewer sonnet 85.3k 1.5k 0 85.3k (1h) $0.53
harness-engineering:harness-enforcement-reviewer sonnet 87.5k 3.3k 10.6k 76.9k (1h) $0.51
harness-engineering:harness-knowledge-reviewer sonnet 87.4k 3.3k 10.6k 76.8k (1h) $0.51
harness-engineering:harness-self-documenting-code-reviewer sonnet 97.0k 12.2k 10.6k 86.4k (1h) $0.70
security:security-code-auditor sonnet 88.9k 5.0k 10.6k 78.3k (1h) $0.55

Re-reviews only run when @monit-reviewer is re-requested as a reviewer — push as many commits as you need, then re-request when ready. PRs targeting branches other than main, master are skipped, even when @monit-reviewer is re-requested.

Comment thread internal/pipeline/pipeline.go
Comment thread internal/pipeline/pipeline.go
Comment thread internal/reviewplan/reviewplan.go
Comment thread internal/llm/adapter.go
Comment thread internal/pipeline/pipeline.go
Comment thread internal/pipeline/pipeline.go
Comment thread internal/reviewplan/summary.go
@monit-reviewer monit-reviewer dismissed their stale review June 24, 2026 01:28

Superseded by updated review

monit-reviewer
monit-reviewer previously approved these changes Jun 24, 2026

@monit-reviewer monit-reviewer left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated PR Review

Reviewed commit: 0c3510d | Previous: 177e7da (incremental)

Approved with 1 non-blocking suggestion below. Address at your discretion.

Summary

Reviewer Findings
documentation:docs-reviewer 1
documentation:docs-reviewer (1 findings)

💡 Suggestion - docs/llm-task-artifacts.md:12

The directory structure lists 'initial.json' and 'retry.json' as sibling files, but the metadata schema section never explicitly names or defines these files. The 'attempts' field describes 'raw output path when present' but doesn't explain that these paths resolve to 'initial.json' and 'retry.json'. A developer implementing against this spec would have to infer the naming convention. A brief note tying the attempt label to the file name (e.g., '.json') would make the schema self-contained.

7 PR discussion threads considered.


Completed in 1m 31s | $8.15 | sonnet | daemon 0.2.132 | Glorfindel
Field Value
Model sonnet
Mode Re-review · Cycle 2 · Session resumed
Reviewers hybrid-synthesis, documentation:docs-reviewer, harness-engineering:harness-architecture-reviewer, harness-engineering:harness-enforcement-reviewer, harness-engineering:harness-knowledge-reviewer, harness-engineering:harness-self-documenting-code-reviewer, security:security-code-auditor
Engine claude · sonnet
Reviewed by pr-review-daemon · monit-pr-reviewer
Duration 1m 31s wall · 1m 00s compute (Reviewers: 39s · Synthesis: 22s)
Cost $8.15 (estimated)
Tokens 1382.9k in / 33.0k out
Turns 26

Per-workstream usage

Workstream Model In Out Cache read Cache create Cost
hybrid-synthesis sonnet 86.3k 4.0k 26.3k 60.0k (1h) $0.43
documentation:docs-reviewer sonnet 34.1k 1.7k 13.1k 20.9k (1h) $0.15
harness-engineering:harness-architecture-reviewer sonnet 232.7k 1.7k 13.1k 219.5k (1h) $1.35
harness-engineering:harness-enforcement-reviewer sonnet 239.3k 3.5k 23.7k 215.6k (1h) $1.35
harness-engineering:harness-knowledge-reviewer sonnet 238.8k 3.5k 23.7k 215.1k (1h) $1.35
harness-engineering:harness-self-documenting-code-reviewer sonnet 265.2k 12.4k 23.7k 241.4k (1h) $1.64
security:security-code-auditor sonnet 240.8k 5.2k 23.7k 217.1k (1h) $1.39
discussion-summarizer 45.7k 1.1k 0 45.7k (1h) $0.48

Re-reviews only run when @monit-reviewer is re-requested as a reviewer — push as many commits as you need, then re-request when ready. PRs targeting branches other than main, master are skipped, even when @monit-reviewer is re-requested.

Comment thread docs/llm-task-artifacts.md
@rianjs rianjs changed the title Isolate reviewer LLM task failures fix: isolate reviewer LLM task failures Jun 24, 2026
@rianjs

rianjs commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Findings

Major - internal/pipeline/pipeline.go:696, internal/pipeline/pipeline.go:1225
Rollup resume can ignore the failed rollup task’s own provider session when --session is active. The rollup spec always passes namedSession.resumeID(), and runStructuredTask only falls back to taskResumeSessionID(meta) when that value is blank. On a live named-session resume, completed selection is loaded first and namedSession.recordSessionID(selectionSession) makes the resume ID non-empty, so the failed rollup retry resumes from the selection/named session instead of the failed_blocking rollup attempt. That violates the “rerun only failed downstream tasks, including prior provider session IDs” contract. Prefer the failed same-task session when retrying failed_blocking metadata, and add a named-session resume test for this path.

Major - internal/llm/adapter.go:135, internal/llm/adapter.go:137, internal/pipeline/pipeline.go:1279
Provider start/wait failures still do not meet the generalized task diagnostic/resume contract. RunStructuredWithSessionResume discards sessionID on runOnceWithSession errors, so a wait failure after a stream starts cannot persist the known provider session ID for metadata or retry. Separately, reviewer tasks only become failed_isolated for ErrStructuredOutputInvalidAfterRetry; task-local provider failures become failed_blocking and cancel sibling reviewers. That is narrower than the discussed invariant that task-local reviewer LLM failures should not damage successful reviewers when the caller context is still valid.

Major - internal/pipeline/pipeline.go:1308, internal/pipeline/pipeline.go:1311
Resume trusts metadata.json’s validated_output_path verbatim. A stale or edited metadata file can redirect resume to any readable path, and the code will decode it as a completed task instead of treating the artifact as inconsistent. The artifact schema says validated-output.json is under the task directory and inconsistent artifacts should fail with rerun guidance; enforce that the path is either the expected path or safely inside the task artifact directory.

@rianjs

rianjs commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Major - internal/pipeline/pipeline.go: task reuse does not validate the adapter/runtime that produced the artifact. baseLLMTaskMetadata records Adapter, but validateLLMTaskMetadata and llmTaskFingerprint only check schema/task/phase/model/effort/prompt/deps. A resume can therefore reuse outputs from a different adapter, and worse, a failed_blocking retry can pass an old provider session ID to a different adapter. Treat adapter identity as load-bearing before reusing output or taskResumeSessionID, and add a mismatch test.

Major - internal/pipeline/pipeline.go: reviewer provider failure isolation is too broad. llmTaskFailureStatus marks any non-context reviewer LLM error as failed_isolated, including auth/quota/systemic provider start errors. That can turn “the LLM backend is unavailable for every reviewer” into a completed comment with diagnostics instead of an incomplete resumable run. The intended boundary is reviewer-local failure; global/provider-capability failures should remain blocking or be explicitly classified with tests.

@rianjs

rianjs commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

No blocker or major findings. The latest diff binds task reuse and failed-task retry to the adapter through both metadata validation and fingerprinting, and it fails before invoking the adapter on mismatch. Reviewer LLM failures are now isolated only when there is structured-output failure evidence or a provider session/validation attempt tying the failure to that task; sessionless start failures remain blocking and resumable.

@rianjs

rianjs commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Findings

  1. Major: The new failure-isolation coverage never proves the durable artifacts actually exist on disk. TestDryRunReviewerFailureIsolation / TestDryRunReviewerProviderFailureIsolation in internal/pipeline/pipeline_test.go only assert counts, statuses, and summary text, while TestRunStructuredValidationAttempts in internal/llm/adapter_test.go only inspects in-memory attempts. These tests would still pass if llm-tasks/.../initial.json, retry.json, validated-output.json, or the sanitized failure fields stopped being persisted, which is the core recovery/debuggability fix in this ticket.

  2. Major: The multi-reviewer isolation test is serialized and over-mocked. In internal/pipeline/pipeline_test.go it sets MaxConcurrency: 1, and the fake adapter keys off prompt substrings rather than reviewer identity. That means it does not exercise the actual concurrent reviewer path or prove that one reviewer’s failure cannot cancel or misroute sibling work when the real runner executes in parallel. The ticket explicitly called for agent-keyed/synchronized determinism here, so this leaves the regression surface uncovered.

  3. Minor: The resume test in internal/reviewrun/reviewrun_test.go only proves the planner is invoked when a selection metadata file exists. It does not prove the stored task output/session row is actually reused, nor that fresh selection/reviewer LLM calls are skipped. A broken implementation that ignored llm-tasks and simply replanned the same run could still satisfy this test.

@monit-reviewer monit-reviewer left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated PR Review

Reviewed commit: 55f3605 | Previous: 0c3510d (incremental)

Summary

Reviewer Findings
harness-engineering:harness-self-documenting-code-reviewer 1
harness-engineering:harness-self-documenting-code-reviewer (1 findings)

⚠️ Should Fix - internal/pipeline/pipeline.go:1301

The if condition in llmTaskFailureStatus mixes three semantically distinct predicates into a single 8-clause expression: (1) whether the task is configured to support isolation (spec.llmFailureStatus != ""), (2) whether the error is a context cancellation (ctx.Err() == nil && !errors.Is(err, context.Canceled) && !errors.Is(err, context.DeadlineExceeded)), and (3) whether there is evidence the provider accepted the request (errors.Is(err, llm.ErrStructuredOutputInvalidAfterRetry) || strings.TrimSpace(providerSessionID) != "" || hasValidationAttempt). Each sub-group carries domain meaning that disappears in the combined form. Extracting them — e.g., supportsIsolation, isContextCancellation (or reusing the existing isContextError helper used elsewhere in the file), and hasTaskExecutionEvidence — would make the isolation policy readable without parsing all 8 clauses.

12 PR discussion threads considered.


Completed in 4m 07s | $15.72 | sonnet | daemon 0.2.132 | Glorfindel
Field Value
Model sonnet
Mode Re-review · Cycle 3 · Session resumed
Reviewers hybrid-synthesis, documentation:docs-reviewer, harness-engineering:harness-architecture-reviewer, harness-engineering:harness-enforcement-reviewer, harness-engineering:harness-knowledge-reviewer, harness-engineering:harness-self-documenting-code-reviewer, security:security-code-auditor
Engine claude · sonnet
Reviewed by pr-review-daemon · monit-pr-reviewer
Duration 4m 07s wall · 3m 36s compute (Reviewers: 2m 51s · Synthesis: 45s)
Cost $15.72 (estimated)
Tokens 2682.3k in / 51.3k out
Turns 40

Per-workstream usage

Workstream Model In Out Cache read Cache create Cost
hybrid-synthesis sonnet 144.7k 5.3k 39.4k 105.3k (1h) $0.72
documentation:docs-reviewer sonnet 77.7k 2.3k 26.3k 51.4k (1h) $0.35
harness-engineering:harness-architecture-reviewer sonnet 453.4k 1.9k 26.3k 427.1k (1h) $2.60
harness-engineering:harness-enforcement-reviewer sonnet 466.8k 5.7k 36.9k 429.9k (1h) $2.68
harness-engineering:harness-knowledge-reviewer sonnet 464.8k 4.7k 36.9k 428.0k (1h) $2.65
harness-engineering:harness-self-documenting-code-reviewer sonnet 514.1k 22.0k 36.9k 477.2k (1h) $3.20
security:security-code-auditor sonnet 465.7k 8.0k 36.9k 428.8k (1h) $2.70
discussion-summarizer 95.1k 1.4k 18.5k 76.6k (1h) $0.81

Re-reviews only run when @monit-reviewer is re-requested as a reviewer — push as many commits as you need, then re-request when ready. PRs targeting branches other than main, master are skipped, even when @monit-reviewer is re-requested.

Comment thread internal/pipeline/pipeline.go
@monit-reviewer monit-reviewer dismissed their stale review June 24, 2026 03:14

Superseded by updated review

err error
}

func (e *llmTaskError) Error() string {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Low (harness-engineering:harness-self-documenting-code-reviewer): llmTaskAttemptMetadata.Attempt holds a label string (e.g., "initial", "retry") but the field name collides semantically with the struct name, making it read as "the attempt itself" rather than "the label identifying this attempt." This is the same ambiguity that was resolved for StructuredValidationAttempt in this PR (renamed AttemptLabel). Using Label here would be consistent with that fix and eliminate the ambiguity.

Reply to this thread when addressed.

// ReviewerFailureSummary is a reviewer task diagnostic rendered in the rollup.
type ReviewerFailureSummary struct {
AgentID string
Error string

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Low (harness-engineering:harness-self-documenting-code-reviewer): ReviewerFailureSummary.Error is a plain string holding a rendered diagnostic message, but the field name Error evokes Go's error interface and implies a runtime error value. A name like Diagnostic or ErrorMessage would distinguish it from an interface type and make the rendered-string nature clear to readers of the struct.

Reply to this thread when addressed.

@monit-reviewer monit-reviewer left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated PR Review

Reviewed commit: dc40084 | Previous: 55f3605 (incremental)

Approved with 3 non-blocking suggestions below. Address at your discretion.

Summary

Reviewer Findings
harness-engineering:harness-self-documenting-code-reviewer 3
harness-engineering:harness-self-documenting-code-reviewer (3 findings)

💡 Suggestion - internal/pipeline/pipeline.go:266

llmTaskAttemptMetadata.Attempt holds a label string (e.g., "initial", "retry") but the field name collides semantically with the struct name, making it read as "the attempt itself" rather than "the label identifying this attempt." This is the same ambiguity that was resolved for StructuredValidationAttempt in this PR (renamed AttemptLabel). Using Label here would be consistent with that fix and eliminate the ambiguity.

💡 Suggestion - internal/pipeline/pipeline.go:1690

In validateLLMTaskPayloadPath, the path-containment check rel == ".." || strings.HasPrefix(rel, ".."+string(os.PathSeparator)) || filepath.IsAbs(rel) encodes the concept "path escapes the task directory" as raw operator soup. Extracting as escapesTaskDir := ... before the if-statement would make the path-traversal intent legible without needing to parse the three clauses.

💡 Suggestion - internal/reviewplan/summary.go:60

ReviewerFailureSummary.Error is a plain string holding a rendered diagnostic message, but the field name Error evokes Go's error interface and implies a runtime error value. A name like Diagnostic or ErrorMessage would distinguish it from an interface type and make the rendered-string nature clear to readers of the struct.

13 PR discussion threads considered.


Completed in 5m 58s | $19.91 | sonnet | daemon 0.2.132 | Glorfindel
Field Value
Model sonnet
Mode Re-review · Cycle 4 · Session resumed
Reviewers hybrid-synthesis, documentation:docs-reviewer, harness-engineering:harness-architecture-reviewer, harness-engineering:harness-enforcement-reviewer, harness-engineering:harness-knowledge-reviewer, harness-engineering:harness-self-documenting-code-reviewer, security:security-code-auditor
Engine claude · sonnet
Reviewed by pr-review-daemon · monit-pr-reviewer
Duration 5m 58s wall · 5m 30s compute (Reviewers: 3m 45s · Synthesis: 1m 45s)
Cost $19.91 (estimated)
Tokens 3648.7k in / 44.8k out
Turns 48

Per-workstream usage

Workstream Model In Out Cache read Cache create Cost
hybrid-synthesis sonnet 213.9k 7.1k 79.7k 134.2k (1h) $0.94
documentation:docs-reviewer sonnet 129.3k 2.5k 39.4k 89.9k (1h) $0.59
harness-engineering:harness-architecture-reviewer sonnet 747.6k 2.1k 39.4k 708.2k (1h) $4.29
harness-engineering:harness-enforcement-reviewer sonnet 770.2k 5.9k 195.3k 574.9k (1h) $3.60
harness-engineering:harness-knowledge-reviewer sonnet 765.6k 4.9k 50.0k 715.6k (1h) $4.38
harness-engineering:harness-self-documenting-code-reviewer sonnet 110.9k 12.4k 55.8k 55.1k (1h) $0.53
security:security-code-auditor sonnet 763.9k 8.2k 50.0k 713.8k (1h) $4.42
discussion-summarizer 147.3k 1.6k 37.0k 110.4k (1h) $1.16

Re-reviews only run when @monit-reviewer is re-requested as a reviewer — push as many commits as you need, then re-request when ready. PRs targeting branches other than main, master are skipped, even when @monit-reviewer is re-requested.

Note: Posted 2 inline comments. 1 finding remained in the summary because GitHub could not attach them to the current diff.

@rianjs

rianjs commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

No architectural drift found in the post-daemon diff. The daemon edits strengthen the concurrency coverage for sibling reviewer survival and raw payload persistence without changing the accepted task/resume semantics.

@rianjs rianjs merged commit fe57b80 into main Jun 24, 2026
10 checks passed
@rianjs rianjs deleted the bug/360-reviewer-failure-isolation branch June 24, 2026 03:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cr review aborts whole run and loses diagnostics when one reviewer returns invalid structured output

2 participants