Add provider retry foundation #915

Open
simple-agent-manager[bot] wants to merge 2 commits into main from sam/infra-retry-foundation-01kqxg

Conversation

@simple-agent-manager
Contributor

Summary

  • Creates SAM idea 01KQXHKV6A34HJQR4YCACZR734 for the system-wide retry and resiliency plan.
  • Adds a provider retry taxonomy to ProviderError, including retryability, reason, retry-after, and idempotency-risk metadata.
  • Adds providerFetchWithRetry with bounded exponential backoff, jitter, and Retry-After support.
  • Wires configurable Hetzner API retry and placement fallback knobs through API env/provider credential config.
  • Keeps Hetzner server creation out of the broad retry wrapper to avoid duplicate paid VM creation after ambiguous timeouts; create still handles explicit 412 placement/capacity fallback.
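
To illustrate the retry behavior described in the summary, here is a minimal sketch of bounded exponential backoff with jitter and Retry-After support. It is not the PR's `providerFetchWithRetry`; the names `RetryOptions`, `FetchLike`, `computeRetryDelayMs`, `parseRetryAfterMs`, and `fetchWithRetrySketch`, the choice of full jitter, and the set of retryable status codes are all assumptions made for illustration.

```typescript
// Hypothetical option names; the PR's real config knobs may differ.
interface RetryOptions {
  maxAttempts: number; // total tries, including the first
  baseDelayMs: number; // backoff ceiling for the first retry
  maxDelayMs: number;  // upper bound on any single delay
}

// Minimal response/fetch shapes so the sketch runs without DOM typings.
interface MinimalResponse {
  status: number;
  headers: { get(name: string): string | null };
}
type FetchLike = (url: string) => Promise<MinimalResponse>;

// Parse a Retry-After value given either in seconds or as an HTTP date.
function parseRetryAfterMs(header: string | null, nowMs: number): number | undefined {
  if (header === null || header.trim() === "") return undefined;
  const seconds = Number(header);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
  const dateMs = Date.parse(header);
  return Number.isNaN(dateMs) ? undefined : Math.max(0, dateMs - nowMs);
}

// Full jitter: a random delay in [0, min(maxDelayMs, baseDelayMs * 2^attempt)).
// A server-provided Retry-After takes precedence but is still capped.
function computeRetryDelayMs(
  attempt: number, // 0 for the first retry
  opts: RetryOptions,
  retryAfterMs?: number,
  random: () => number = Math.random,
): number {
  if (retryAfterMs !== undefined) return Math.min(retryAfterMs, opts.maxDelayMs);
  const ceiling = Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Assumed set of retryable statuses; the PR's taxonomy may classify differently.
const RETRYABLE_STATUS = new Set([429, 500, 502, 503, 504]);

async function fetchWithRetrySketch(
  url: string,
  opts: RetryOptions,
  fetchImpl: FetchLike,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<MinimalResponse> {
  for (let attempt = 0; ; attempt++) {
    const isLast = attempt >= opts.maxAttempts - 1;
    let retryAfterMs: number | undefined;
    try {
      const res = await fetchImpl(url);
      // Return on success, on a non-retryable status, or when the budget is spent.
      if (!RETRYABLE_STATUS.has(res.status) || isLast) return res;
      retryAfterMs = parseRetryAfterMs(res.headers.get("Retry-After"), Date.now());
    } catch (err) {
      // Network-level failure: rethrow only once the budget is spent.
      if (isLast) throw err;
    }
    await sleep(computeRetryDelayMs(attempt, opts, retryAfterMs));
  }
}
```

A wrapper like this suits idempotent calls only. As the last summary item notes, a non-idempotent request such as server creation should stay outside it, because a retry after an ambiguous timeout could create a second paid VM.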

Verification

  • pnpm --filter @simple-agent-manager/providers test -- --run
  • pnpm --filter @simple-agent-manager/api test -- provider-credentials
  • pnpm --filter @simple-agent-manager/providers typecheck
  • pnpm --filter @simple-agent-manager/providers lint (passes with existing non-null assertion warnings)
  • pnpm --filter @simple-agent-manager/api typecheck
  • git diff --check

Notes

  • This is the first PR-sized implementation slice of the broader retry plan. Follow-up slices should add idempotent VM-create recovery, Cloudflare DNS/API retry wrappers, TaskRunner retry budgets, API-to-VM retry policy, and devcontainer network resilience from the May investigation.

Owner

Implementation follow-up is complete locally but not pushed yet because the workspace GitHub token is currently invalid.

Local commit waiting to push:

  • 60f0f819 feat: extend retry and resiliency coverage

What changed in that local commit:

  • Cloudflare DNS retry/upsert behavior
  • Generic API fetch retry helper
  • API-to-VM retry knobs
  • TaskRunner per-step retry budgets
  • Provider-label VM recovery before createVM
  • Cloud-init provider env threading and apt retry config
  • VM agent devcontainer build timeout plus Hetzner apt mirror injection
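
The "provider-label VM recovery before createVM" item can be pictured roughly as follows. This is a sketch under assumptions, not the code in the local commit: `listVMs` and `createVM` are named in this PR, but their signatures here, the `VmSketch` and `VmProviderSketch` types, and the `createOrRecoverVM` helper are hypothetical.

```typescript
// Hypothetical shapes; the real provider interface may differ.
interface VmSketch {
  id: string;
  labels: Record<string, string>;
}

interface VmProviderSketch {
  listVMs(): Promise<VmSketch[]>;
  createVM(labels: Record<string, string>): Promise<VmSketch>;
}

// True when every wanted label is present on the VM with the same value.
function labelsMatch(vm: VmSketch, wanted: Record<string, string>): boolean {
  return Object.entries(wanted).every(([key, value]) => vm.labels[key] === value);
}

// Look for a VM that an earlier, possibly timed-out attempt already created.
// Only create a new one when no VM carries the expected labels.
async function createOrRecoverVM(
  provider: VmProviderSketch,
  labels: Record<string, string>,
): Promise<{ vm: VmSketch; recovered: boolean }> {
  const existing = (await provider.listVMs()).find((vm) => labelsMatch(vm, labels));
  if (existing !== undefined) return { vm: existing, recovered: true };
  return { vm: await provider.createVM(labels), recovered: false };
}
```

The design point is that the labels act as an idempotency key: if a create call timed out but the VM exists, the next attempt finds it instead of paying for a duplicate. This only holds if the labels are unique per intended VM.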

Verification completed locally:

  • pnpm --filter @simple-agent-manager/api test -- fetch-timeout dns-hostname node-provisioning task-runner-do-pure-functions task-runner-do-state
  • pnpm --filter @simple-agent-manager/cloud-init test -- --run
  • pnpm --filter @simple-agent-manager/providers test -- --run
  • pnpm --filter @simple-agent-manager/providers typecheck
  • pnpm --filter @simple-agent-manager/cloud-init typecheck
  • pnpm --filter @simple-agent-manager/cloud-init build
  • pnpm --filter @simple-agent-manager/api typecheck
  • pnpm --filter @simple-agent-manager/api lint (passes with existing warnings)
  • git diff --check

Blocked verification:

  • go test ./internal/config ./internal/bootstrap could not run because go is not installed in the workspace.
  • gofmt could not run because gofmt is not installed in the workspace.

Push blocker:

  • git push fails with "remote: Invalid username or token. Password authentication is not supported for Git operations."
  • gh auth status reports "The token in GH_TOKEN is invalid."

Add comprehensive retry infrastructure across the platform:

1. Generic fetch retry with exponential backoff, jitter, and Retry-After
   header honoring (fetchWithTimeoutAndRetry in fetch-timeout.ts)
2. Idempotent DNS upsert — search-before-create with duplicate handling
   and PATCH-if-exists for Cloudflare DNS records
3. CF API retry with env-controlled knobs (CF_API_RETRY_*)
4. API-to-VM agent request retry (NODE_AGENT_REQUEST_RETRY_*)
5. TaskRunner DO per-step retry budgets with env overrides
6. isTransientError() extended with explicit retryable metadata support
7. Idempotent provider VM recovery via listVMs label matching before createVM
8. Cloud-init provider awareness (PROVIDER env var) and apt retry config
9. VM agent devcontainer build timeout (DEVCONTAINER_BUILD_TIMEOUT)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
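
Item 2 of the commit message (search-before-create with duplicate handling and PATCH-if-exists) might look roughly like the sketch below. It does not call the Cloudflare API; `DnsClientSketch`, `DnsRecordSketch`, `DuplicateRecordError`, and `upsertDnsRecord` are hypothetical names, and the client is assumed to signal "record already exists" by throwing `DuplicateRecordError`.

```typescript
// Hypothetical record and client shapes standing in for a real DNS API client.
interface DnsRecordSketch {
  id: string;
  name: string;
  type: string;
  content: string;
}
type DesiredRecord = Omit<DnsRecordSketch, "id">;

interface DnsClientSketch {
  find(name: string, type: string): Promise<DnsRecordSketch[]>;
  create(record: DesiredRecord): Promise<DnsRecordSketch>;
  patch(id: string, content: string): Promise<DnsRecordSketch>;
}

// Assumed error type thrown by create() when the record already exists.
class DuplicateRecordError extends Error {}

// Leave a matching record alone; patch it when the content differs.
async function reconcile(
  client: DnsClientSketch,
  existing: DnsRecordSketch,
  desired: DesiredRecord,
): Promise<DnsRecordSketch> {
  if (existing.content === desired.content) return existing;
  return client.patch(existing.id, desired.content);
}

async function upsertDnsRecord(
  client: DnsClientSketch,
  desired: DesiredRecord,
): Promise<DnsRecordSketch> {
  // Search first so a retried call does not create a second record.
  const existing = (await client.find(desired.name, desired.type))[0];
  if (existing !== undefined) return reconcile(client, existing, desired);
  try {
    return await client.create(desired);
  } catch (err) {
    if (!(err instanceof DuplicateRecordError)) throw err;
    // Another writer (or an earlier attempt) won the race: search again.
    const raced = (await client.find(desired.name, desired.type))[0];
    if (raced === undefined) throw err;
    return reconcile(client, raced, desired);
  }
}
```

Because every path ends in "the record exists with the desired content", the whole function can be retried safely, which is what makes it a reasonable fit for a generic retry wrapper.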
@sonarqubecloud

sonarqubecloud Bot commented May 6, 2026

Quality Gate failed

Failed conditions
3 Security Hotspots
3.9% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

