Skip to content

fix: devcontainer build network resilience (apt mirrors, timeout, retries)#901

Open
simple-agent-manager[bot] wants to merge 6 commits intomainfrom
sam/use-skill-re-run-01kqvq
Open

fix: devcontainer build network resilience (apt mirrors, timeout, retries)#901
simple-agent-manager[bot] wants to merge 6 commits intomainfrom
sam/use-skill-re-run-01kqvq

Conversation

@simple-agent-manager
Copy link
Copy Markdown
Contributor

@simple-agent-manager simple-agent-manager Bot commented May 5, 2026

Summary

  • Fixes intermittent devcontainer build failures caused by archive.ubuntu.com being slow/unreachable from Hetzner containers (discovered during Ubuntu DDoS/outage on 2026-05-05)
  • Root cause: containers on Hetzner use archive.ubuntu.com instead of the provider's local mirror (mirror.hetzner.com), causing timeouts through Docker bridge NAT
  • Three-pronged fix: (1) provider-specific apt mirrors injected into containers, (2) configurable devcontainer build timeout (default 15m), (3) universal apt retry config in all containers

Validation

  • pnpm lint
  • pnpm typecheck
  • pnpm test (139 cloud-init tests, all Go tests pass)
  • pnpm build

Staging Verification (REQUIRED for all code changes — merge-blocking)

  • Staging deployment green — Deploy Staging workflow triggered and passed (run 25370103204)
  • Live app verified via Playwright — logged into app.sammy.party, dashboard renders, navigation works
  • Existing workflows confirmed working — dashboard, settings page load with 0 errors
  • New feature/fix verified on staging — Infrastructure changes require VM provisioning; no Hetzner credentials configured on staging test account. See Exceptions section.
  • Infrastructure verification completed — Cannot provision VM without Hetzner credentials on staging.
  • N/A: no UI changes

Staging Verification Evidence

  • API health endpoint responds healthy
  • App dashboard loads with user authenticated (serverspresentation2025)
  • Settings page renders without errors
  • Deploy succeeded including Go cross-compilation and cloud-init package build

UI Compliance Checklist (Required for UI changes)

N/A: no UI changes

End-to-End Verification (Required for multi-component changes)

  • Data flow traced from user input to final outcome with code path citations
  • Capability test exercises the complete happy path across system boundaries
  • All spec/doc assumptions about existing behavior verified against code
  • If any gap exists between automated test coverage and full E2E, manual verification steps documented below

Data Flow Trace

  1. User creates node → apps/api/src/services/nodes.ts:provisionNode() passes provider: targetProvider to generateCloudInit()
  2. Cloud-init generates YAML → packages/cloud-init/src/generate.ts:generateCloudInit() substitutes {{ provider }} in template
  3. VM boots with cloud-init → systemd starts vm-agent with Environment=PROVIDER=hetzner
  4. VM agent reads config → packages/vm-agent/internal/config/config.go:Load() parses PROVIDER and DEVCONTAINER_BUILD_TIMEOUT
  5. Workspace created → bootstrap.go:bootstrapWorkspace() finds container, calls injectAptRetryConfig() then injectAptMirrorConfig()
  6. Build uses timeout → devcontainerBuildContext() wraps build command with 15m deadline

Untested Gaps

Full VM provisioning flow cannot be automated without Hetzner credentials on staging. The individual components are proven via unit/integration tests.

Specialist Review Evidence

Reviewer Status Findings
go-specialist ADDRESSED CRITICAL: shell injection fixed (resolveAptMirror + exec.Command). HIGH: context timer leak fixed. MEDIUM: logging improvements added.
task-completion-validator ADDRESSED HIGH: missing tests added (TestResolveAptMirror, TestIsValidMirrorHostname, TestInjectAptMirrorConfig_HetznerProvider). MEDIUM: apt retries separated for all providers.

Exceptions (If any)

  • Scope: Infrastructure verification (VM provisioning) skipped
  • Rationale: No Hetzner credentials configured on staging test user. Changes are additive and non-fatal — failures logged as warnings, apt falls back to defaults.
  • Expiration: Should be verified during next task that provisions a VM on staging

Agent Preflight (Required)

  • Preflight completed before code changes

Classification

  • external-api-change
  • cross-component-change
  • business-logic-change
  • public-surface-change
  • docs-sync-change
  • security-sensitive-change
  • ui-change
  • infra-change

External References

  • Ubuntu/Canonical DDoS/outage reports (2026-05-05)
  • Hetzner mirror documentation (mirror.hetzner.com)
  • Docker bridge networking NAT behavior

Codebase Impact Analysis

  • packages/cloud-init/ — template, generation, types, validation
  • packages/vm-agent/ — config, bootstrap (container setup)
  • apps/api/ — node provisioning (passes provider field)

Documentation & Specs

N/A: internal infrastructure resilience improvements, no user-facing doc changes needed.

Constitution & Risk Check

  • Principle XI (No Hardcoded Values): Build timeout configurable via DEVCONTAINER_BUILD_TIMEOUT, provider from env var PROVIDER.
  • Security: Shell injection prevented via Go-side resolution + hostname regex validation + exec.Command with direct args.

raphaeltm and others added 5 commits May 5, 2026 09:38
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… retries)

- Thread cloud provider through cloud-init to VM agent via PROVIDER env var
- Add provider-specific apt mirror script (/etc/sam/apt-mirror-config.sh)
  that maps Hetzner → mirror.hetzner.com for container apt operations
- Add apt retry config (3 retries, 30s timeout) on host via cloud-init
- Add configurable devcontainer build timeout (default: 15min) to prevent
  indefinite hangs when apt/network is degraded
- VM agent injects apt mirror config into containers before package installs
- Validate provider field in cloud-init variable validation

Fixes intermittent devcontainer build failures caused by containers using
archive.ubuntu.com instead of Hetzner's local mirror, combined with no
build timeout causing 30min+ hangs during Ubuntu repo outages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…leak)

- CRITICAL: Eliminate sh -c outer shell in injectAptMirrorConfig — use
  exec.Command("docker","exec",...) with containerID as a direct argument
  to prevent shell injection. Mirror hostname is resolved in Go and
  validated with a hostname regex before use.
- HIGH: Call buildCancel() immediately after cmd.CombinedOutput() returns
  instead of relying solely on defer, releasing the timer goroutine before
  the potentially long fallback build runs.
- MEDIUM: Log findDevcontainerID failure at slog.Debug level at both call
  sites so silent skips are diagnosable.
- MEDIUM: Add // Non-fatal comment to second call site for consistency.
- LOW: Downgrade devcontainerBuildContext log from Info to Debug.
- LOW: Document zero-means-disabled contract in devcontainerBuildContext.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…silience

Apt retry config (Acquire::Retries, timeouts) now injects into ALL
containers regardless of provider, not just Hetzner. Mirror replacement
remains provider-specific. Adds TestInjectAptRetryConfig test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@simple-agent-manager simple-agent-manager Bot added the needs-human-review Agent could not complete all review gates — human must approve before merge label May 5, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented May 5, 2026

Quality Gate Failed Quality Gate failed

Failed conditions
3 Security Hotspots
4.5% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-human-review Agent could not complete all review gates — human must approve before merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant