Skip to content

ai: contain Kiro ACP stream failures and empty responses#217

Open
anders-heimer wants to merge 5 commits into
sashiko-dev:mainfrom
anders-heimer:kiro-acp-error-guards
Open

ai: contain Kiro ACP stream failures and empty responses#217
anders-heimer wants to merge 5 commits into
sashiko-dev:mainfrom
anders-heimer:kiro-acp-error-guards

Conversation

@anders-heimer
Copy link
Copy Markdown
Contributor

Summary

This PR adds Kiro-specific containment and diagnostics for ACP session/prompt
failures.

The Kiro backend can fail or stall while Sashiko is waiting for the long-lived
ACP prompt response stream. In earlier reviews this showed up as repeated
JSON-RPC -32603 Internal error responses wrapping Kiro/upstream response-stream
failures such as reqwest::Error { kind: Body, source: TimedOut } and
UnexpectedEof. Those errors were retried as generic transient provider errors,
which allowed a single expensive Kiro turn to monopolize a review worker through
many retries.

This series keeps the retry containment scoped to the Kiro ACP backend. Other AI
providers keep the existing generic retry behavior.

Changes

  • Improve Kiro ACP diagnostics around:

    • initialize
    • session/new
    • session/prompt
    • prompt duration
    • chunk count
    • response length
    • ACP JSON-RPC line count
    • max idle gap between valid ACP JSON-RPC lines
    • malformed stdout/stderr previews
    • redacted JSON-RPC error.data
    • provider request IDs when present
  • Classify known Kiro ACP stream/provider failures as transient, including
    response-stream errors, response body errors, stalled-stream/minimum-throughput
    failures, TimedOut, and UnexpectedEof.

  • Preserve rate-limit precedence over generic stream-transient classification, so
    throttling wrapped in a stream-looking error still follows the rate-limit path.

  • Add a side-effect-aware retry gate for ACP prompt turns. If Kiro emits
    tool/action-like updates before an otherwise retryable failure, the error is
    treated as fatal rather than replaying a possibly non-idempotent turn.

  • Add Kiro-only per-turn retry containment:

    • max transient attempts per logical turn
    • max same-error streak per logical turn
    • max wall-clock budget per logical turn
    • process-local Kiro circuit breaker
    • configurable Kiro-specific environment overrides
  • Add a Kiro ACP idle watchdog based on valid ACP JSON-RPC stdout activity,
    rather than arbitrary stdout bytes. This avoids treating verbose/non-ACP Kiro
    diagnostics as protocol progress.

  • Send ACP session/cancel when session/prompt read handling fails before the
    provider returns the error.

  • Treat protocol-level successful Kiro turns with no model content as a distinct
    Kiro empty-response failure. Empty responses are retried only while the
    Kiro-specific empty-response budget allows it, and remain fatal after
    side-effect-looking ACP updates.

Configuration

The Kiro containment settings are configurable through environment variables:

SASHIKO__AI__KIRO_MAX_TRANSIENT_ATTEMPTS_PER_TURN
SASHIKO__AI__KIRO_MAX_SAME_TRANSIENT_STREAK_PER_TURN
SASHIKO__AI__KIRO_MAX_TURN_WALL_CLOCK_SECS
SASHIKO__AI__KIRO_ACP_IDLE_TIMEOUT_SECS
SASHIKO__AI__KIRO_CIRCUIT_FAILURE_WINDOW_SECS
SASHIKO__AI__KIRO_CIRCUIT_FAILURE_THRESHOLD
SASHIKO__AI__KIRO_CIRCUIT_COOLDOWN_SECS
SASHIKO__AI__KIRO_MAX_EMPTY_RESPONSES_PER_TURN

Capture malformed ACP stdout separately from stderr so errors can
report both diagnostic streams without leaking secrets.

Include redacted JSON-RPC error data in Kiro ACP failures to expose
nested provider error details.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>
Kiro ACP wraps response-stream and provider failures in JSON-RPC
-32603 errors with details in error.data.

Classify known transient, rate-limit, provider, and permanent markers
so the retry layer can back off appropriately. Treat kiro-cli timeouts
as transient typed errors.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>
Track whether ACP session updates indicate possible tool, command, or
file mutations before an error is returned.

Retryable-looking failures after such updates are kept fatal so the
caller does not replay potentially non-idempotent work.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>
Bound Kiro ACP stream failures with Kiro-only per-turn retry
budgets, a process-local circuit breaker, and ACP-line idle
telemetry.

Send session/cancel on prompt failures, preserve terminal Kiro
failures across review-bin stdio, and keep retries blocked after
side-effect-looking ACP updates.

Rate-limit markers intentionally take precedence over generic stream
failure markers so throttling receives the slower quota backoff instead
of the short stream retry delay.

Signed-off-by: Anders Heimer <anders.heimer@est.tech>
Treat Kiro ACP session/prompt successes with no model content as
provider errors before downstream JSON validation sees them.

Retry side-effect-free empty responses under a small Kiro-only per-turn
budget, and fail terminal when the budget is exhausted or side effects
were observed.

Signed-off-by: Anders Heimer <anders.heimer@est.tech>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant