What happened
On PR #816, the review bot approved 3 times while missing 3 blocking correctness bugs that the human reviewer caught on the first pass:
- UTF-8 truncation:
truncate() in postreview.go sliced strings by byte offset, which could split multi-byte Unicode characters and produce invalid UTF-8. The human suggested []rune conversion.
- Overly broad 422 retry:
CreateIssue in github.go retried on any HTTP 422 response, not just label-validation errors. The bot actually praised this retry logic as "pragmatic" — a false negative that actively mischaracterized a bug as a feature.
- Missing label-scope filtering:
ListOpenIssues fetched all open issues with type/chore label instead of further filtering by the marker label, causing unnecessary API calls and potential false-positive dedup matches.
All three bugs are correctness issues in error handling and API contract usage — areas where the review agent consistently underperformed.
What could go better
The review agent's code-review skill (skills/code-review/SKILL.md) evaluates six dimensions but does not include specific guidance for common Go correctness patterns:
- String safety: Operations on
string in Go operate on bytes, not characters. Any truncation, slicing, or indexing should be flagged for Unicode safety review.
- Error-handling specificity: Retry logic that catches broad error categories (e.g., all 422s) without inspecting the error detail is a common anti-pattern. The skill should prompt the agent to check whether error handling is appropriately narrow.
- API contract alignment: When code wraps an external API (like GitHub's), the review should verify that error handling matches the API's documented error semantics, not just that error handling exists.
Confidence: High. The bot's review text explicitly praised the broad retry pattern, showing it evaluated the code but reached the wrong conclusion. This isn't a case of the bot skipping the code — it's a reasoning gap that better prompting could address.
Uncertainty: Adding review guidance could increase false positives on legitimate broad error handling. The guidance should focus on retry/recovery paths specifically, not all error handling.
Proposed change
Add a "Common correctness patterns" subsection to the code-review skill (internal/scaffold/fullsend-repo/skills/code-review/SKILL.md) under the correctness dimension. Include:
- String safety: "When reviewing string truncation, slicing, or indexing in Go, check whether the operation is byte-safe or rune-safe. Flag any
string[:n] pattern where n could land in the middle of a multi-byte UTF-8 character."
- Error-handling specificity: "When reviewing retry or recovery logic, verify that the error condition is specific enough. Retrying on a broad HTTP status code (e.g., all 422s) without inspecting the response body or error details is a correctness risk — it may mask unrelated failures."
- API contract fidelity: "When code wraps an external API, verify that error handling matches the API's documented error semantics. For example, GitHub's 422 response includes an
errors array with field and code entries that should be inspected for specific failure modes."
Validation criteria
Over the next 10 review agent runs on PRs that modify Go error-handling or string-manipulation code in fullsend-ai/fullsend, track whether the bot flags UTF-8 safety or overly broad error matching. At least 2 of those runs should produce medium-or-higher findings for patterns that would have been missed before this change. No significant increase in false-positive rate (human reviewers should not need to dismiss more than 1 additional bot finding per review on average).
Generated by retro agent from #816
What happened
On PR #816, the review bot approved 3 times while missing 3 blocking correctness bugs that the human reviewer caught on the first pass:
truncate()inpostreview.gosliced strings by byte offset, which could split multi-byte Unicode characters and produce invalid UTF-8. The human suggested[]runeconversion.CreateIssueingithub.goretried on any HTTP 422 response, not just label-validation errors. The bot actually praised this retry logic as "pragmatic" — a false negative that actively mischaracterized a bug as a feature.ListOpenIssuesfetched all open issues withtype/chorelabel instead of further filtering by the marker label, causing unnecessary API calls and potential false-positive dedup matches.All three bugs are correctness issues in error handling and API contract usage — areas where the review agent consistently underperformed.
What could go better
The review agent's code-review skill (
skills/code-review/SKILL.md) evaluates six dimensions but does not include specific guidance for common Go correctness patterns:stringin Go operate on bytes, not characters. Any truncation, slicing, or indexing should be flagged for Unicode safety review.Confidence: High. The bot's review text explicitly praised the broad retry pattern, showing it evaluated the code but reached the wrong conclusion. This isn't a case of the bot skipping the code — it's a reasoning gap that better prompting could address.
Uncertainty: Adding review guidance could increase false positives on legitimate broad error handling. The guidance should focus on retry/recovery paths specifically, not all error handling.
Proposed change
Add a "Common correctness patterns" subsection to the code-review skill (
internal/scaffold/fullsend-repo/skills/code-review/SKILL.md) under the correctness dimension. Include:string[:n]pattern wherencould land in the middle of a multi-byte UTF-8 character."errorsarray withfieldandcodeentries that should be inspected for specific failure modes."Validation criteria
Over the next 10 review agent runs on PRs that modify Go error-handling or string-manipulation code in fullsend-ai/fullsend, track whether the bot flags UTF-8 safety or overly broad error matching. At least 2 of those runs should produce medium-or-higher findings for patterns that would have been missed before this change. No significant increase in false-positive rate (human reviewers should not need to dismiss more than 1 additional bot finding per review on average).
Generated by retro agent from #816