chore(audit): execute 2026-06-04 audit (VPS lifecycle guard, single-source guardrails, client timeouts, self-verifying rename) #119

Merged: agjs merged 11 commits into main from chore/audit-fixes-20260604-0044 on Jun 3, 2026
@agjs agjs commented Jun 3, 2026
Summary

  • Executes all 13 findings from the fourth audit cycle, which re-aimed at never-audited scopes. The OpenTofu bootstrap yielded the high-severity finding: hcloud_server had no lifecycle guards, so any cloud-init/tfvars edit silently replaced the VPS on the next tofu apply, destroying every Docker volume (Postgres, acme.json). The server now sets ignore_changes=[user_data], with a documented -replace rebuild path.
  • Structural fix for guardrail drift: the six compose guardrails moved from workflow YAML into infra/compose/scripts/validate-guardrails.sh — CI runs six thin per-check steps, the local pre-push runs all, and the 2026-06-03 incident class (CI-only env seed, green local push, red CI) is impossible by construction.
  • Bootstrap hardening: SSH CIDRs now required-explicit (no 0.0.0.0/0 default), repo URL validated at plan time, Docker installed from the GPG-verified apt repo (release-key fingerprint pinned) instead of curl | sh.
  • Resilience: explicit time budgets on every external client — Cloudflare email fetch (was unbounded ×3 retries), OAuth fetchJson (same class, found while scoping the rule), Stripe (was implicit 80s), OpenAI/Anthropic (were 600s/10min), Valkey commandTimeout.
  • Self-verifying rename: rename-project.sh is inventory-driven (221 files vs the 34-entry allowlist that missed Prometheus labels, tracer names, compose project names) and fails loudly if any upstream identifier survives — the assertion caught a real gap during its own test run.
  • Hygiene with teeth: 16 dead locale keys purged (the new i18n-locale-keys-used rule found 15 beyond the audit's one); bun caches added to the 7 workflows missing them; AGENT_CONTRACT plugin tables now bidirectionally parity-checked (the UI table documented a plugin that isn't installed); docs no longer teach the retired admin123456/change-me credentials (banned by rule); CONTRIBUTING's nonexistent demo login replaced with the real flow; root gate scripts finally shellchecked in CI.

Test plan

  • bun run check from repo root
  • Stack smoke: full pre-push gate on push — smoke + e2e green; security gates green
  • Scratch-clone end-to-end rename test (clean run, zero survivors, idempotent re-run)
  • tofu validate + stubbed cloud-init template yamllint

App merge bars

| Area | Command | Result |
| --- | --- | --- |
| API | cd apps/api && bun run validate | ✅ 1141 tests (REQUIRE_INTEGRATION_DB=true), coverage 88% |
| UI | cd apps/ui && bun run validate | ✅ check + unit + lint-meta suites green |
| Docs | cd apps/docs && bun run build:ci | ✅ incl. linkcheck |
| Repo drift | bun run check (from repo root) | |

Conventions

  • No any, no blind as, no !
  • New env vars in schema + .env.example (ssh_allowed_ips now required in tfvars; no new app env vars)
  • Tests updated for changed behavior (6 new lint-meta rules all tested; cloudflare abort-signal test; RED proofs via stash for tofu-hardening and client-timeout rules)

Notes for reviewers

  • prevent_destroy deliberately omitted from the VPS lifecycle block: HCL only accepts a literal there, and it would also block intentional tofu destroy — replacement-on-user_data-change is the accidental-loss vector and ignore_changes closes it precisely.
  • The behavioral prod-image-tag check now self-seeds its curated env inside the shared script (ENV_FILE diverted), so it can never again depend on workflow-level env blocks.
  • F010 (ROADMAP port) turned out to target a gitignored local file — fixed locally, nothing committable.
  • Decision log: .audit/execution-summary.json (local, gitignored).

agjs added 11 commits June 4, 2026 00:52
…y lint-meta

apps/api AGENT_CONTRACT omitted two installed plugins (code-flow, comment-hygiene) and claimed 14 of 16; apps/ui omitted comment-hygiene AND documented resource-architecture, which is not installed in the UI at all. New eslint-plugin-contract-parity rule (both apps, tested) enforces both directions: installed→documented and documented→installed.

Audit: F011
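
A minimal sketch of the two-direction parity check described above, assuming the installed plugin names and the documented table rows have already been extracted into plain string lists. The function name, the report shape, and the placeholder plugin `example-plugin` are hypothetical; the real eslint-plugin-contract-parity rule may be structured differently.

```typescript
// Hypothetical report shape, not the real rule's output format.
interface ParityReport {
  undocumented: string[]; // installed but missing from the AGENT_CONTRACT table
  phantom: string[]; // documented in the table but not installed
}

function checkContractParity(
  installed: readonly string[],
  documented: readonly string[],
): ParityReport {
  const installedSet = new Set(installed);
  const documentedSet = new Set(documented);
  return {
    // Direction 1: installed -> documented
    undocumented: installed.filter((name) => !documentedSet.has(name)).sort(),
    // Direction 2: documented -> installed
    phantom: documented.filter((name) => !installedSet.has(name)).sort(),
  };
}

// Abbreviated, illustrative lists modelled on the apps/ui finding:
// comment-hygiene was installed but undocumented, and resource-architecture
// was documented but not installed. "example-plugin" is a placeholder.
const uiParity = checkContractParity(
  ["example-plugin", "comment-hygiene"],
  ["example-plugin", "resource-architecture"],
);
console.log(uiParity);
```

A one-direction check (installed→documented only) would have passed the phantom resource-architecture row, which is why both set differences are computed.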
…y behaviors

error-tracking.mdx still taught admin123456 and observability.mdx taught admin/change-me — both removed from the stack on 2026-06-03 (dev.sh generates random per-install passwords). Users following stale docs fail to log in and may hardcode the old defaults back. Also documents: VALKEY_PASSWORD now server-enforced + prod-required (env-vars.mdx) and register's enumeration-safe identical-200 behavior (auth.mdx). New docs-no-retired-credentials lint-meta rule (both apps, sibling-scanning) bans the retired literals from docs prose forever — RED-verified against a synthetic page.

Audit: F009
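
The retired-literal ban can be pictured as a plain text scan over docs pages. The sketch below is an assumption about the general shape, not the real docs-no-retired-credentials rule: the function name, the violation shape, and the synthetic page are hypothetical, and only the two literals named in this commit are listed.

```typescript
// The two literals this commit names as retired on 2026-06-03.
const RETIRED_LITERALS: readonly string[] = ["admin123456", "change-me"];

// Hypothetical violation shape for illustration.
interface CredentialViolation {
  file: string;
  line: number; // 1-based line number within the page
  literal: string;
}

function findRetiredCredentials(file: string, text: string): CredentialViolation[] {
  const violations: CredentialViolation[] = [];
  text.split("\n").forEach((content, index) => {
    for (const literal of RETIRED_LITERALS) {
      if (content.includes(literal)) {
        violations.push({ file, line: index + 1, literal });
      }
    }
  });
  return violations;
}

// Synthetic page in the spirit of the RED verification described above.
const syntheticPage = "Open the dashboard.\nLog in with password admin123456.\n";
const retiredHits = findRetiredCredentials("synthetic.mdx", syntheticPage);
console.log(retiredHits);
```

A substring match is deliberately blunt: it also flags the literals inside code fences and tables, which is the desired behavior for a rule meant to keep them out of docs entirely.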
billing.currentPlan.paid was the audit's instance; the new defined-to-used lint-meta rule (dynamic t-template prefixes exempt, literal key references anywhere in src count) surfaced 15 more orphans across both locales — all verified zero-reference and removed with parent pruning. Cross-repo i18n-keys plugin covers used-to-defined; this closes the reverse direction. This commit also carries F005: CONTRIBUTING/setup.sh no longer promise a demo@example.com/password123 login that does not exist by default — instructions now point at open dev signup (Mailpit catches the verification email) or explicit SUPERUSER_* seeding.

Audit: F012

Audit: F005
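
A rough sketch of the defined-to-used direction, assuming locale files are nested string maps and that a key counts as used when it appears as a quoted literal anywhere in the scanned sources. Here the dynamic prefixes are passed in by hand; how the real rule discovers t-template prefixes is not shown in this PR, so that part is an assumption, as are all names below.

```typescript
type LocaleTree = { [key: string]: string | LocaleTree };

// Flatten { billing: { currentPlan: { paid: "..." } } } to "billing.currentPlan.paid".
function flattenKeys(tree: LocaleTree, prefix = ""): string[] {
  const keys: string[] = [];
  for (const [name, value] of Object.entries(tree)) {
    const path = prefix ? `${prefix}.${name}` : name;
    if (typeof value === "string") {
      keys.push(path);
    } else {
      keys.push(...flattenKeys(value, path));
    }
  }
  return keys;
}

function findOrphanKeys(
  locale: LocaleTree,
  sources: readonly string[],
  dynamicPrefixes: readonly string[],
): string[] {
  return flattenKeys(locale).filter((key) => {
    // Keys under a dynamically built prefix cannot be proven dead, so they are exempt.
    if (dynamicPrefixes.some((prefix) => key.startsWith(prefix))) return false;
    // A quoted literal reference anywhere in the sources counts as a use.
    return !sources.some((src) => src.includes(`"${key}"`) || src.includes(`'${key}'`));
  });
}

const sampleLocale: LocaleTree = {
  billing: { currentPlan: { paid: "Paid", free: "Free" } },
  status: { active: "Active", paused: "Paused" },
};
const sampleSources = [
  't("billing.currentPlan.free")',
  "t(`status.${state}`)", // dynamic template, covered by the "status." prefix
];
const orphanKeys = findOrphanKeys(sampleLocale, sampleSources, ["status."]);
console.log(orphanKeys); // ["billing.currentPlan.paid"]
```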
…n scripts/** pushes

scripts/ci/* (the repo's primary local defense) and setup.sh were validated by nothing in CI — the shellcheck job's targets covered infra paths only, and the workflow's push trigger skipped scripts/** entirely (PR runs were unaffected since the pull_request trigger has no paths filter). Targets + push paths now cover them; the local pre-push mirror's shellcheck stage extended identically for parity.

Audit: F007
…ted repo URL

ssh_allowed_ips defaulted to 0.0.0.0/0+::/0 on port 22 — world-open admin access on a production template must be an explicit operator choice; the default is gone and tfvars.example requires a value. monorepo_repo gains a real GitHub-URL validation so malformed values fail at plan time instead of inside cloud-init on the booted server. tofu validate green.

Audit: F004
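
The real checks are OpenTofu variable validations, which are not reproduced here. As a language-neutral illustration of the same fail-at-plan-time idea, the sketch below validates both inputs before anything is provisioned. The regular expression, the HTTPS-only assumption, and both function names are hypothetical and may be stricter or looser than the actual validation block.

```typescript
// Assumed shape: https://github.com/<owner>/<repo> with an optional .git suffix.
const GITHUB_REPO_URL = /^https:\/\/github\.com\/[A-Za-z0-9_.-]+\/[A-Za-z0-9_.-]+(?:\.git)?$/;

function isGitHubRepoUrl(value: string): boolean {
  return GITHUB_REPO_URL.test(value);
}

// No default: the operator has to supply the list, even if the choice is world-open.
function requireExplicitCidrs(value: readonly string[] | undefined): readonly string[] {
  if (value === undefined || value.length === 0) {
    throw new Error("ssh_allowed_ips must be set explicitly; there is no default");
  }
  return value;
}

console.log(isGitHubRepoUrl("https://github.com/acme/monorepo.git")); // true
console.log(isGitHubRepoUrl("https://github.com/acme")); // false
console.log(requireExplicitCidrs(["203.0.113.7/32"]));
```

Rejecting a malformed value here costs a failed plan; rejecting it inside cloud-init costs a booted server that never finishes provisioning.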
…ace the server

hcloud_server had no lifecycle block: cloud-init interpolates tfvars, Hetzner replaces the server on any user_data change, and the replacement destroys every Docker volume (Postgres data, acme.json, GlitchTip). ignore_changes=[user_data] makes post-create drift inert (cloud-init only runs at first boot anyway); deliberate rebuilds use tofu apply -replace, documented inline. prevent_destroy deliberately NOT set — HCL only accepts a literal there and it would also block intentional tofu destroy; the accidental-loss vector is replacement, which this closes. New tofu-bootstrap-hardening lint-meta rule (both apps, tested) enforces all three bootstrap invariants — RED-verified 3/3 against the pre-fix tree via stash.

Audit: F001
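
A lint-meta rule of this kind can be approximated as text checks over the bootstrap sources. This commit does not spell out which three invariants the real tofu-bootstrap-hardening rule enforces, so the three below (lifecycle guard, no world-open default, no curl-pipe-shell) are an assumption, as are the input shape and the regular expressions.

```typescript
// Hypothetical input: raw file contents, already read from disk.
interface BootstrapSources {
  serverTf: string;
  variablesTf: string;
  cloudInit: string;
}

function checkBootstrapHardening(src: BootstrapSources): string[] {
  const problems: string[] = [];
  // Assumed invariant 1: user_data drift must not trigger replacement.
  if (!/ignore_changes\s*=\s*\[[^\]]*\buser_data\b[^\]]*\]/.test(src.serverTf)) {
    problems.push("hcloud_server must set ignore_changes = [user_data]");
  }
  // Assumed invariant 2: no world-open CIDR baked into the variable definitions.
  if (/0\.0\.0\.0\/0|::\/0/.test(src.variablesTf)) {
    problems.push("ssh_allowed_ips must not default to a world-open CIDR");
  }
  // Assumed invariant 3: no remote script piped straight into a shell.
  if (/curl[^\n|]*\|\s*(sudo\s+)?(ba)?sh\b/.test(src.cloudInit)) {
    problems.push("cloud-init must not pipe a remote script into a shell");
  }
  return problems;
}

// Synthetic pre-fix tree: all three checks should fire.
const preFixTree: BootstrapSources = {
  serverTf: 'resource "hcloud_server" "vps" {\n  user_data = local.cloud_init\n}',
  variablesTf: 'variable "ssh_allowed_ips" {\n  default = ["0.0.0.0/0", "::/0"]\n}',
  cloudInit: "runcmd:\n  - curl -fsSL https://get.docker.com | sh",
};
console.log(checkBootstrapHardening(preFixTree).length); // 3
```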
get.docker.com piped to sh executed unverified remote code as root at first boot. Cloud-init now adds Docker's apt repository with the release key fingerprint pinned (9DC8…CD88, verified before the repo is trusted) and installs pinned-by-apt packages. Stubbed-template YAML parse + yamllint clean; tofu validate green.

Audit: F013
apps-api-ci.yml was the only workflow restoring ~/.bun/install/cache; seven more (acl-drift, openapi-drift, docs-linkcheck, bundle-diff, ui-release, ui-validate, playwright-e2e) reinstalled cold on every run — 30-90s each, with docs-linkcheck triple-installing. Cache steps replicated with the same SHA-pinned actions/cache, keyed per workflow on the exact bun.lock set it installs, mirroring each step's if-condition. New github-actions-bun-cache lint-meta rule (both apps, tested) pins the convention — RED flagged exactly the seven gaps.

Audit: F006
…fying

The 34-entry SCAN_PATHS allowlist silently missed every file added since it was written — forks kept 'boringstack' in Prometheus labels (metrics/registry.ts), tracer names (withDbSpan/withQueueSpan), compose project names (dev.sh: boringstack-infra/-smoke), and env schema text. The script now inventories every file matching the upstream identifiers (221 files vs 34) minus an explicit exclude list (git/generated trees, bun.lock, CHANGELOG, LICENSE attribution, itself), and fails loudly if any identifier survives the rewrite. The new assertion immediately caught a real gap in testing — the YAML APP_NAME form in two workflows — now covered. End-to-end verified on a scratch clone: clean run, zero survivors, idempotent re-run.

Audit: F008
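
The inventory-then-assert pattern can be sketched over an in-memory file map. The real rename-project.sh is a shell script that walks the repository; everything below (names, the abbreviated exclude list, the exact-case rewrite, the sample files) is a simplified assumption meant only to show why the survivor assertion catches forms the rewrite does not know about.

```typescript
type Files = Record<string, string>;

// Abbreviated stand-in for the script's explicit exclude list.
const EXCLUDED_PATHS: readonly string[] = ["bun.lock", "CHANGELOG.md", "LICENSE"];

// Inventory: every non-excluded file containing the identifier, in any casing.
function inventory(files: Files, identifier: string): string[] {
  const needle = identifier.toLowerCase();
  return Object.entries(files)
    .filter(([path, content]) => !EXCLUDED_PATHS.includes(path) && content.toLowerCase().includes(needle))
    .map(([path]) => path)
    .sort();
}

function renameProject(files: Files, from: string, to: string): Files {
  const result: Files = {};
  for (const [path, content] of Object.entries(files)) {
    // The rewrite only handles the exact-case form.
    result[path] = EXCLUDED_PATHS.includes(path) ? content : content.split(from).join(to);
  }
  // Self-verification: fail loudly if any form of the identifier survives.
  const survivors = inventory(result, from);
  if (survivors.length > 0) {
    throw new Error(`rename incomplete, identifier survives in: ${survivors.join(", ")}`);
  }
  return result;
}

const upstream: Files = {
  "metrics/registry.ts": 'const prefix = "boringstack_";',
  "dev.sh": "COMPOSE_PROJECT_NAME=boringstack-infra",
  LICENSE: "Portions derived from boringstack.",
};
const fork = renameProject(upstream, "boringstack", "acme");
console.log(fork["dev.sh"]); // COMPOSE_PROJECT_NAME=acme-infra
```

Because the inventory is case-insensitive and the rewrite is not, a file containing a hypothetical `BoringStack` spelling would make `renameProject` throw instead of shipping a half-renamed fork. Running it again on `fork` is a no-op, which matches the idempotent re-run in the test plan.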
Cloudflare email fetch had no AbortSignal (unbounded, ×3 retries); the OAuth fetchJson on the callback path was the same class (found while scoping the rule); Stripe relied on the SDK's implicit 80s; OpenAI/Anthropic on 600s/10min defaults; the Valkey app client bounded connects but not commands. All five now carry named-constant budgets (10s providers, 60s AI, 1s valkey commands). New external-client-timeout lint-meta rule (API) bans timeout-less SDK constructors and signal-less fetch in src — RED 3/3 against the pre-fix files via stash; cloudflare test asserts the per-attempt signal. Full suite: 1140 tests green.

Audit: F003
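
A minimal sketch of a per-attempt time budget around a retried fetch, using the budgets named above as constants. The helper names, the injected `fetchImpl` parameter, and the reading of "×3" as three attempts in total are assumptions; the real clients are not shown in this PR.

```typescript
// Named-constant budgets taken from the commit message.
const PROVIDER_TIMEOUT_MS = 10_000;
const AI_TIMEOUT_MS = 60_000;
const VALKEY_COMMAND_TIMEOUT_MS = 1_000;

// Injected so the sketch does not depend on a particular fetch implementation.
type FetchLike<T> = (url: string, init: { signal: AbortSignal }) => Promise<T>;

// Upper bound on wall-clock time once every attempt is individually bounded.
function worstCaseMs(timeoutMs: number, attempts: number): number {
  return timeoutMs * attempts;
}

async function fetchWithBudget<T>(
  fetchImpl: FetchLike<T>,
  url: string,
  timeoutMs: number = PROVIDER_TIMEOUT_MS,
  attempts: number = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt += 1) {
    // A fresh controller per attempt: one slow attempt cannot use up the next one's budget.
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      return await fetchImpl(url, { signal: controller.signal });
    } catch (error) {
      lastError = error;
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}

// Usage (not executed here): fetchWithBudget(fetch, "https://example.invalid/send")
console.log(worstCaseMs(PROVIDER_TIMEOUT_MS, 3)); // 30000
console.log(AI_TIMEOUT_MS, VALKEY_COMMAND_TIMEOUT_MS);
```

Without the signal, a hung connection multiplied by the retry count has no upper bound at all; with it, the worst case is the per-attempt budget times the number of attempts.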
…run the same script

The six guardrail bodies lived in workflow YAML, so the local pre-push could not reuse them — and the duplication already drifted once (2026-06-03: a CI-only env seed missing, green local push, red CI). All six checks now live in infra/compose/scripts/validate-guardrails.sh (healthchecks, digest-pins, credential-fallbacks, valkey-auth, rooted-caps, and the behavioral prod-image-tags test with its curated env self-contained); the CI steps are thin per-check invocations preserving granular annotations, and the local gate gains a guardrails stage running 'all'. Parity now holds by construction. RED-verified through the shared script (synthetic :latest image → exit 1; removed valkey guard → exit 1) and GREEN end-to-end locally.

Audit: F002
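
The single-source design can be modelled as one registry of checks with two entry points: a named check (the thin CI step) and `all` (the local gate). The real implementation is the shell script infra/compose/scripts/validate-guardrails.sh. The TypeScript below is only a model of the dispatch, with two of the six checks reduced to simplified, hypothetical predicates over an assumed compose shape.

```typescript
// Assumed, heavily simplified view of a compose file.
type Compose = Record<string, { image: string; healthcheck?: boolean }>;
type Check = (compose: Compose) => string[]; // returns failure messages

// One registry: CI and the local gate both dispatch into this table.
const CHECKS: Record<string, Check> = {
  healthchecks: (compose) =>
    Object.entries(compose)
      .filter(([, service]) => !service.healthcheck)
      .map(([name]) => `${name}: missing healthcheck`),
  "prod-image-tags": (compose) =>
    Object.entries(compose)
      .filter(([, service]) => service.image.endsWith(":latest"))
      .map(([name, service]) => `${name}: floating tag ${service.image}`),
};

// Returns an exit code: 0 green, 1 guardrail failure, 2 unknown check name.
function runGuardrails(target: string, compose: Compose): number {
  const names = target === "all" ? Object.keys(CHECKS) : [target];
  let failures = 0;
  for (const name of names) {
    const check = CHECKS[name];
    if (!check) {
      console.error(`unknown check: ${name}`);
      return 2;
    }
    for (const message of check(compose)) {
      console.error(`[${name}] ${message}`);
      failures += 1;
    }
  }
  return failures > 0 ? 1 : 0;
}

// Synthetic RED case in the spirit of the verification above.
const syntheticCompose: Compose = {
  api: { image: "registry.example/api:latest", healthcheck: true },
};
console.log(runGuardrails("prod-image-tags", syntheticCompose)); // 1
console.log(runGuardrails("healthchecks", syntheticCompose)); // 0
```

Since both entry points read the same table, a check cannot exist in CI without also existing locally, which is the parity-by-construction property the commit describes.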
@agjs agjs merged commit 56907a6 into main Jun 3, 2026
32 of 33 checks passed
@agjs agjs deleted the chore/audit-fixes-20260604-0044 branch June 3, 2026 23:38