fix: isolate reviewer LLM task failures#361
Conversation
monit-reviewer
left a comment
There was a problem hiding this comment.
Automated PR Review
Reviewed commit: 177e7da
Summary
| Reviewer | Findings |
|---|---|
| harness-engineering:harness-knowledge-reviewer | 3 |
| harness-engineering:harness-self-documenting-code-reviewer | 4 |
harness-engineering:harness-knowledge-reviewer (3 findings)
internal/pipeline/pipeline.go:246
The new LLM task artifact schema (schema_version=1, three-status taxonomy, llm-tasks/ directory layout, fingerprint fields) is a durable on-disk contract that gates resume safety. The invariants that matter most — what changes require a version bump, why metadata is written last as the durable resume boundary, which fields are load-bearing — exist only as inline comments and the PR description. A future contributor who adds a status value or renames a field without bumping schema_version will silently break in-flight run resumption. A short docs/ entry (e.g., docs/llm-task-artifacts.md) would make these invariants versioned and discoverable.
💡 Suggestion - internal/pipeline/pipeline.go:251
The two-tier failure taxonomy (failed_isolated for reviewer validation failures, failed_blocking for selection/rollup failures) is a product behavior decision: isolated failures let sibling reviewers and rollup continue; blocking failures leave runs incomplete for resume. The rationale — reviewer failures are less critical than orchestration failures — is present in the PR description but not in any versioned artifact. This distinction will need to be understood by anyone extending the pipeline (e.g., adding a new phase) or diagnosing unexpected run states.
💡 Suggestion - internal/reviewplan/reviewplan.go:393
The decision to force the review outcome to at least comment (never approve) when any reviewer fails is a product policy choice with user-visible consequences. It is implemented in a single conditional with no prose explanation nearby. The PR description documents it, but the reasoning (approving when a reviewer silently failed could miss real issues) belongs somewhere versioned so it doesn't get quietly reversed in a future refactor.
harness-engineering:harness-self-documenting-code-reviewer (4 findings)
💡 Suggestion - internal/llm/adapter.go:80
The
Attemptfield inStructuredValidationAttemptholds a string label like "initial" or "retry" — it names which attempt this record represents, not the attempt itself. The field name collides semantically with the struct name (StructuredValidationAttempt.Attemptsounds like the attempt IS the field).StageorLabelwould clarify that this is a string identifier distinguishing attempts, not a recursive self-reference.
internal/pipeline/pipeline.go:1211
llmTaskSpec.validationFailurereads as a noun — a description of a failure type — rather than as a configurable policy. A reader has to scanrunStructuredTaskto discover it controls whichllmTaskStatusto assign whenErrStructuredOutputInvalidAfterRetryis returned.validationFailureStatusoronValidationFailurewould make its role as a field that configures task error handling obvious at the struct level.
internal/pipeline/pipeline.go:1567
sanitizeTaskErrordoes not convey what it sanitizes for or against. The body reveals three distinct operations: whitespace normalization, HTML-escaping of codereview comment markers (to prevent injection into rendered markdown), and rune-length truncation. A reader needs to trace the call sites to understand the rendering context.renderSafeTaskErrororsanitizeTaskErrorForMarkdownwould make the purpose evident without reading the body.
💡 Suggestion - internal/reviewplan/summary.go:57
ReviewerFailureSummary.Nameis vague in a struct that could plausibly hold a display name, a reviewer label, or an agent identifier. The mapping site inpipeline.goassignsfailure.AgentIDto this field, making the mismatch visible.ReviewerIDorAgentIDwould match the domain term used throughout the pipeline and eliminate the need to cross-reference the mapping.
Completed in 4m 28s | $3.01 | sonnet | daemon 0.2.132 | Glorfindel
| Field | Value |
|---|---|
| Model | sonnet |
| Reviewers | hybrid-synthesis, harness-engineering:harness-architecture-reviewer, harness-engineering:harness-enforcement-reviewer, harness-engineering:harness-knowledge-reviewer, harness-engineering:harness-self-documenting-code-reviewer, security:security-code-auditor |
| Engine | claude · sonnet |
| Reviewed by | pr-review-daemon · monit-pr-reviewer |
| Duration | 4m 28s wall · 4m 25s compute (Reviewers: 3m 35s · Synthesis: 50s) |
| Cost | $3.01 (estimated) |
| Tokens | 483.4k in / 28.1k out |
| Turns | 12 |
Per-workstream usage
| Workstream | Model | In | Out | Cache read | Cache create | Cost |
|---|---|---|---|---|---|---|
| hybrid-synthesis | sonnet | 37.3k | 2.9k | 13.1k | 24.2k (1h) | $0.19 |
| harness-engineering:harness-architecture-reviewer | sonnet | 85.3k | 1.5k | 0 | 85.3k (1h) | $0.53 |
| harness-engineering:harness-enforcement-reviewer | sonnet | 87.5k | 3.3k | 10.6k | 76.9k (1h) | $0.51 |
| harness-engineering:harness-knowledge-reviewer | sonnet | 87.4k | 3.3k | 10.6k | 76.8k (1h) | $0.51 |
| harness-engineering:harness-self-documenting-code-reviewer | sonnet | 97.0k | 12.2k | 10.6k | 86.4k (1h) | $0.70 |
| security:security-code-auditor | sonnet | 88.9k | 5.0k | 10.6k | 78.3k (1h) | $0.55 |
Re-reviews only run when @monit-reviewer is re-requested as a reviewer — push as many commits as you need, then re-request when ready. PRs targeting branches other than main, master are skipped, even when @monit-reviewer is re-requested.
monit-reviewer
left a comment
There was a problem hiding this comment.
Automated PR Review
Reviewed commit: 0c3510d | Previous: 177e7da (incremental)
Approved with 1 non-blocking suggestion below. Address at your discretion.
Summary
| Reviewer | Findings |
|---|---|
| documentation:docs-reviewer | 1 |
documentation:docs-reviewer (1 findings)
💡 Suggestion - docs/llm-task-artifacts.md:12
The directory structure lists 'initial.json' and 'retry.json' as sibling files, but the metadata schema section never explicitly names or defines these files. The 'attempts' field describes 'raw output path when present' but doesn't explain that these paths resolve to 'initial.json' and 'retry.json'. A developer implementing against this spec would have to infer the naming convention. A brief note tying the attempt label to the file name (e.g., '.json') would make the schema self-contained.
7 PR discussion threads considered.
Completed in 1m 31s | $8.15 | sonnet | daemon 0.2.132 | Glorfindel
| Field | Value |
|---|---|
| Model | sonnet |
| Mode | Re-review · Cycle 2 · Session resumed |
| Reviewers | hybrid-synthesis, documentation:docs-reviewer, harness-engineering:harness-architecture-reviewer, harness-engineering:harness-enforcement-reviewer, harness-engineering:harness-knowledge-reviewer, harness-engineering:harness-self-documenting-code-reviewer, security:security-code-auditor |
| Engine | claude · sonnet |
| Reviewed by | pr-review-daemon · monit-pr-reviewer |
| Duration | 1m 31s wall · 1m 00s compute (Reviewers: 39s · Synthesis: 22s) |
| Cost | $8.15 (estimated) |
| Tokens | 1382.9k in / 33.0k out |
| Turns | 26 |
Per-workstream usage
| Workstream | Model | In | Out | Cache read | Cache create | Cost |
|---|---|---|---|---|---|---|
| hybrid-synthesis | sonnet | 86.3k | 4.0k | 26.3k | 60.0k (1h) | $0.43 |
| documentation:docs-reviewer | sonnet | 34.1k | 1.7k | 13.1k | 20.9k (1h) | $0.15 |
| harness-engineering:harness-architecture-reviewer | sonnet | 232.7k | 1.7k | 13.1k | 219.5k (1h) | $1.35 |
| harness-engineering:harness-enforcement-reviewer | sonnet | 239.3k | 3.5k | 23.7k | 215.6k (1h) | $1.35 |
| harness-engineering:harness-knowledge-reviewer | sonnet | 238.8k | 3.5k | 23.7k | 215.1k (1h) | $1.35 |
| harness-engineering:harness-self-documenting-code-reviewer | sonnet | 265.2k | 12.4k | 23.7k | 241.4k (1h) | $1.64 |
| security:security-code-auditor | sonnet | 240.8k | 5.2k | 23.7k | 217.1k (1h) | $1.39 |
| discussion-summarizer | — | 45.7k | 1.1k | 0 | 45.7k (1h) | $0.48 |
Re-reviews only run when @monit-reviewer is re-requested as a reviewer — push as many commits as you need, then re-request when ready. PRs targeting branches other than main, master are skipped, even when @monit-reviewer is re-requested.
|
Findings Major - Major - Major - |
|
Major - Major - |
|
No blocker or major findings. The latest diff binds task reuse and failed-task retry to the adapter through both metadata validation and fingerprinting, and it fails before invoking the adapter on mismatch. Reviewer LLM failures are now isolated only when there is structured-output failure evidence or a provider session/validation attempt tying the failure to that task; sessionless start failures remain blocking and resumable. |
|
Findings
|
monit-reviewer
left a comment
There was a problem hiding this comment.
Automated PR Review
Reviewed commit: 55f3605 | Previous: 0c3510d (incremental)
Summary
| Reviewer | Findings |
|---|---|
| harness-engineering:harness-self-documenting-code-reviewer | 1 |
harness-engineering:harness-self-documenting-code-reviewer (1 findings)
internal/pipeline/pipeline.go:1301
The
ifcondition inllmTaskFailureStatusmixes three semantically distinct predicates into a single 8-clause expression: (1) whether the task is configured to support isolation (spec.llmFailureStatus != ""), (2) whether the error is a context cancellation (ctx.Err() == nil && !errors.Is(err, context.Canceled) && !errors.Is(err, context.DeadlineExceeded)), and (3) whether there is evidence the provider accepted the request (errors.Is(err, llm.ErrStructuredOutputInvalidAfterRetry) || strings.TrimSpace(providerSessionID) != "" || hasValidationAttempt). Each sub-group carries domain meaning that disappears in the combined form. Extracting them — e.g.,supportsIsolation,isContextCancellation(or reusing the existingisContextErrorhelper used elsewhere in the file), andhasTaskExecutionEvidence— would make the isolation policy readable without parsing all 8 clauses.
12 PR discussion threads considered.
Completed in 4m 07s | $15.72 | sonnet | daemon 0.2.132 | Glorfindel
| Field | Value |
|---|---|
| Model | sonnet |
| Mode | Re-review · Cycle 3 · Session resumed |
| Reviewers | hybrid-synthesis, documentation:docs-reviewer, harness-engineering:harness-architecture-reviewer, harness-engineering:harness-enforcement-reviewer, harness-engineering:harness-knowledge-reviewer, harness-engineering:harness-self-documenting-code-reviewer, security:security-code-auditor |
| Engine | claude · sonnet |
| Reviewed by | pr-review-daemon · monit-pr-reviewer |
| Duration | 4m 07s wall · 3m 36s compute (Reviewers: 2m 51s · Synthesis: 45s) |
| Cost | $15.72 (estimated) |
| Tokens | 2682.3k in / 51.3k out |
| Turns | 40 |
Per-workstream usage
| Workstream | Model | In | Out | Cache read | Cache create | Cost |
|---|---|---|---|---|---|---|
| hybrid-synthesis | sonnet | 144.7k | 5.3k | 39.4k | 105.3k (1h) | $0.72 |
| documentation:docs-reviewer | sonnet | 77.7k | 2.3k | 26.3k | 51.4k (1h) | $0.35 |
| harness-engineering:harness-architecture-reviewer | sonnet | 453.4k | 1.9k | 26.3k | 427.1k (1h) | $2.60 |
| harness-engineering:harness-enforcement-reviewer | sonnet | 466.8k | 5.7k | 36.9k | 429.9k (1h) | $2.68 |
| harness-engineering:harness-knowledge-reviewer | sonnet | 464.8k | 4.7k | 36.9k | 428.0k (1h) | $2.65 |
| harness-engineering:harness-self-documenting-code-reviewer | sonnet | 514.1k | 22.0k | 36.9k | 477.2k (1h) | $3.20 |
| security:security-code-auditor | sonnet | 465.7k | 8.0k | 36.9k | 428.8k (1h) | $2.70 |
| discussion-summarizer | — | 95.1k | 1.4k | 18.5k | 76.6k (1h) | $0.81 |
Re-reviews only run when @monit-reviewer is re-requested as a reviewer — push as many commits as you need, then re-request when ready. PRs targeting branches other than main, master are skipped, even when @monit-reviewer is re-requested.
| err error | ||
| } | ||
|
|
||
| func (e *llmTaskError) Error() string { |
There was a problem hiding this comment.
🔵 Low (harness-engineering:harness-self-documenting-code-reviewer): llmTaskAttemptMetadata.Attempt holds a label string (e.g., "initial", "retry") but the field name collides semantically with the struct name, making it read as "the attempt itself" rather than "the label identifying this attempt." This is the same ambiguity that was resolved for StructuredValidationAttempt in this PR (renamed Attempt → Label). Using Label here would be consistent with that fix and eliminate the ambiguity.
Reply to this thread when addressed.
| // ReviewerFailureSummary is a reviewer task diagnostic rendered in the rollup. | ||
| type ReviewerFailureSummary struct { | ||
| AgentID string | ||
| Error string |
There was a problem hiding this comment.
🔵 Low (harness-engineering:harness-self-documenting-code-reviewer): ReviewerFailureSummary.Error is a plain string holding a rendered diagnostic message, but the field name Error evokes Go's error interface and implies a runtime error value. A name like Diagnostic or ErrorMessage would distinguish it from an interface type and make the rendered-string nature clear to readers of the struct.
Reply to this thread when addressed.
monit-reviewer
left a comment
There was a problem hiding this comment.
Automated PR Review
Reviewed commit: dc40084 | Previous: 55f3605 (incremental)
Approved with 3 non-blocking suggestions below. Address at your discretion.
Summary
| Reviewer | Findings |
|---|---|
| harness-engineering:harness-self-documenting-code-reviewer | 3 |
harness-engineering:harness-self-documenting-code-reviewer (3 findings)
💡 Suggestion - internal/pipeline/pipeline.go:266
llmTaskAttemptMetadata.Attemptholds a label string (e.g., "initial", "retry") but the field name collides semantically with the struct name, making it read as "the attempt itself" rather than "the label identifying this attempt." This is the same ambiguity that was resolved forStructuredValidationAttemptin this PR (renamedAttempt→Label). UsingLabelhere would be consistent with that fix and eliminate the ambiguity.
💡 Suggestion - internal/pipeline/pipeline.go:1690
In
validateLLMTaskPayloadPath, the path-containment checkrel == ".." || strings.HasPrefix(rel, ".."+string(os.PathSeparator)) || filepath.IsAbs(rel)encodes the concept "path escapes the task directory" as raw operator soup. Extracting asescapesTaskDir := ...before the if-statement would make the path-traversal intent legible without needing to parse the three clauses.
💡 Suggestion - internal/reviewplan/summary.go:60
ReviewerFailureSummary.Erroris a plain string holding a rendered diagnostic message, but the field nameErrorevokes Go's error interface and implies a runtime error value. A name likeDiagnosticorErrorMessagewould distinguish it from an interface type and make the rendered-string nature clear to readers of the struct.
13 PR discussion threads considered.
Completed in 5m 58s | $19.91 | sonnet | daemon 0.2.132 | Glorfindel
| Field | Value |
|---|---|
| Model | sonnet |
| Mode | Re-review · Cycle 4 · Session resumed |
| Reviewers | hybrid-synthesis, documentation:docs-reviewer, harness-engineering:harness-architecture-reviewer, harness-engineering:harness-enforcement-reviewer, harness-engineering:harness-knowledge-reviewer, harness-engineering:harness-self-documenting-code-reviewer, security:security-code-auditor |
| Engine | claude · sonnet |
| Reviewed by | pr-review-daemon · monit-pr-reviewer |
| Duration | 5m 58s wall · 5m 30s compute (Reviewers: 3m 45s · Synthesis: 1m 45s) |
| Cost | $19.91 (estimated) |
| Tokens | 3648.7k in / 44.8k out |
| Turns | 48 |
Per-workstream usage
| Workstream | Model | In | Out | Cache read | Cache create | Cost |
|---|---|---|---|---|---|---|
| hybrid-synthesis | sonnet | 213.9k | 7.1k | 79.7k | 134.2k (1h) | $0.94 |
| documentation:docs-reviewer | sonnet | 129.3k | 2.5k | 39.4k | 89.9k (1h) | $0.59 |
| harness-engineering:harness-architecture-reviewer | sonnet | 747.6k | 2.1k | 39.4k | 708.2k (1h) | $4.29 |
| harness-engineering:harness-enforcement-reviewer | sonnet | 770.2k | 5.9k | 195.3k | 574.9k (1h) | $3.60 |
| harness-engineering:harness-knowledge-reviewer | sonnet | 765.6k | 4.9k | 50.0k | 715.6k (1h) | $4.38 |
| harness-engineering:harness-self-documenting-code-reviewer | sonnet | 110.9k | 12.4k | 55.8k | 55.1k (1h) | $0.53 |
| security:security-code-auditor | sonnet | 763.9k | 8.2k | 50.0k | 713.8k (1h) | $4.42 |
| discussion-summarizer | — | 147.3k | 1.6k | 37.0k | 110.4k (1h) | $1.16 |
Re-reviews only run when @monit-reviewer is re-requested as a reviewer — push as many commits as you need, then re-request when ready. PRs targeting branches other than main, master are skipped, even when @monit-reviewer is re-requested.
Note: Posted 2 inline comments. 1 finding remained in the summary because GitHub could not attach them to the current diff.
|
No architectural drift found in the post-daemon diff. The daemon edits strengthen the concurrency coverage for sibling reviewer survival and raw payload persistence without changing the accepted task/resume semantics. |
Summary
Fixes #360
Verification
rtk go test ./...rtk make lintrtk make build