chore(audit): execute 2026-06-04 audit — VPS lifecycle guard, single-source guardrails, client timeouts, self-verifying rename#119
Merged
…y lint-meta apps/api AGENT_CONTRACT omitted two installed plugins (code-flow, comment-hygiene) and claimed 14 of 16; apps/ui omitted comment-hygiene AND documented resource-architecture, which is not installed in the UI at all. New eslint-plugin-contract-parity rule (both apps, tested) enforces both directions: installed→documented and documented→installed. Audit: F011
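The two-direction check described in this commit can be pictured as a pair of set differences. The sketch below is only an illustration, not the actual eslint-plugin-contract-parity rule: `checkParity`, `ParityReport`, and the example plugin lists are hypothetical (the lists are loosely modeled on the apps/ui finding).

```typescript
// Hypothetical sketch: names and input shapes are illustrative, not the real rule.
interface ParityReport {
  undocumented: string[]; // installed, but missing from the contract table
  phantom: string[]; // documented, but not installed
}

function checkParity(installed: string[], documented: string[]): ParityReport {
  const installedSet = new Set(installed);
  const documentedSet = new Set(documented);
  return {
    // installed -> documented direction
    undocumented: installed.filter((name) => !documentedSet.has(name)).sort(),
    // documented -> installed direction
    phantom: documented.filter((name) => !installedSet.has(name)).sort(),
  };
}

// Example data (assumed): one plugin omitted from the table, one documented but absent.
const report = checkParity(
  ["plugin-a", "comment-hygiene"],
  ["plugin-a", "resource-architecture"],
);
console.log(report);
```

Reporting the two directions separately keeps the failure message actionable: an undocumented plugin needs a table row, while a phantom entry needs to be deleted or installed.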
…y behaviors error-tracking.mdx still taught admin123456 and observability.mdx taught admin/change-me — both removed from the stack on 2026-06-03 (dev.sh generates random per-install passwords). Users following stale docs fail to log in and may hardcode the old defaults back. Also documents: VALKEY_PASSWORD now server-enforced + prod-required (env-vars.mdx) and register's enumeration-safe identical-200 behavior (auth.mdx). New docs-no-retired-credentials lint-meta rule (both apps, sibling-scanning) bans the retired literals from docs prose forever — RED-verified against a synthetic page. Audit: F009
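A literal-banning docs check of the kind described here might look roughly like the following. This is a sketch under assumptions: the two literals come from the commit message, but `findRetiredCredentials`, the `Hit` shape, and the line-by-line scan are hypothetical and do not reproduce the real docs-no-retired-credentials rule.

```typescript
// Retired literals are quoted from the commit message; everything else is assumed.
const RETIRED_LITERALS = ["admin123456", "change-me"];

interface Hit {
  line: number; // 1-based line number in the scanned page
  literal: string; // which retired literal was found
}

function findRetiredCredentials(pageText: string): Hit[] {
  const hits: Hit[] = [];
  pageText.split("\n").forEach((content, index) => {
    for (const literal of RETIRED_LITERALS) {
      if (content.includes(literal)) hits.push({ line: index + 1, literal });
    }
  });
  return hits;
}

// Synthetic page, in the spirit of the RED verification mentioned above.
const syntheticPage = ["# Login", "Use admin / change-me to sign in.", "Done."].join("\n");
console.log(findRetiredCredentials(syntheticPage));
```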
billing.currentPlan.paid was the audit's instance; the new defined-to-used lint-meta rule (dynamic t-template prefixes exempt, literal key references anywhere in src count) surfaced 15 more orphans across both locales — all verified zero-reference and removed with parent pruning. Cross-repo i18n-keys plugin covers used-to-defined; this closes the reverse direction. This commit also carries F005: CONTRIBUTING/setup.sh no longer promise a demo@example.com/password123 login that does not exist by default — instructions now point at open dev signup (Mailpit catches the verification email) or explicit SUPERUSER_* seeding. Audit: F012 Audit: F005
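The defined-to-used direction can be sketched as a filter over the locale keys. The code below is a simplified illustration, not the rule itself: `findOrphanKeys`, the explicit `dynamicPrefixes` argument, and the plain substring match are assumptions, and every key other than `billing.currentPlan.paid` is invented example data.

```typescript
// Hypothetical sketch: function name, inputs, and the substring match are assumptions.
function findOrphanKeys(
  definedKeys: string[],
  sourceText: string,
  dynamicPrefixes: string[],
): string[] {
  return definedKeys.filter((key) => {
    // Keys under a dynamically built prefix cannot be proven unused, so exempt them.
    if (dynamicPrefixes.some((prefix) => key.startsWith(prefix))) return false;
    // A literal occurrence anywhere in the source counts as a use.
    return !sourceText.includes(key);
  });
}

// Example source text: one literal key reference and one dynamic t-template.
const source = 't("billing.currentPlan.free"); t(`errors.${code}`);';
const orphans = findOrphanKeys(
  ["billing.currentPlan.paid", "billing.currentPlan.free", "errors.notFound"],
  source,
  ["errors."],
);
console.log(orphans);
```

A substring match errs on the side of keeping keys: a key that is a prefix of another referenced key is treated as used, which trades a few missed orphans for no false removals.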
…n scripts/** pushes scripts/ci/* (the repo's primary local defense) and setup.sh were validated by nothing in CI — the shellcheck job's targets covered infra paths only, and the workflow's push trigger skipped scripts/** entirely (PR runs were unaffected since the pull_request trigger has no paths filter). Targets + push paths now cover them; the local pre-push mirror's shellcheck stage extended identically for parity. Audit: F007
…ted repo URL ssh_allowed_ips defaulted to 0.0.0.0/0+::/0 on port 22 — world-open admin access on a production template must be an explicit operator choice; the default is gone and tfvars.example requires a value. monorepo_repo gains a real GitHub-URL validation so malformed values fail at plan time instead of inside cloud-init on the booted server. tofu validate green. Audit: F004
…ace the server hcloud_server had no lifecycle block: cloud-init interpolates tfvars, Hetzner replaces the server on any user_data change, and the replacement destroys every Docker volume (Postgres data, acme.json, GlitchTip). ignore_changes=[user_data] makes post-create drift inert (cloud-init only runs at first boot anyway); deliberate rebuilds use tofu apply -replace, documented inline. prevent_destroy deliberately NOT set — HCL only accepts a literal there and it would also block intentional tofu destroy; the accidental-loss vector is replacement, which this closes. New tofu-bootstrap-hardening lint-meta rule (both apps, tested) enforces all three bootstrap invariants — RED-verified 3/3 against the pre-fix tree via stash. Audit: F001
get.docker.com piped to sh executed unverified remote code as root at first boot. Cloud-init now adds Docker's apt repository with the release key fingerprint pinned (9DC8…CD88, verified before the repo is trusted) and installs pinned-by-apt packages. Stubbed-template YAML parse + yamllint clean; tofu validate green. Audit: F013
apps-api-ci.yml was the only workflow restoring ~/.bun/install/cache; seven more (acl-drift, openapi-drift, docs-linkcheck, bundle-diff, ui-release, ui-validate, playwright-e2e) reinstalled cold on every run — 30-90s each, with docs-linkcheck triple-installing. Cache steps replicated with the same SHA-pinned actions/cache, keyed per workflow on the exact bun.lock set it installs, mirroring each step's if-condition. New github-actions-bun-cache lint-meta rule (both apps, tested) pins the convention — RED flagged exactly the seven gaps. Audit: F006
…fying The 34-entry SCAN_PATHS allowlist silently missed every file added since it was written — forks kept 'boringstack' in Prometheus labels (metrics/registry.ts), tracer names (withDbSpan/withQueueSpan), compose project names (dev.sh: boringstack-infra/-smoke), and env schema text. The script now inventories every file matching the upstream identifiers (221 files vs 34) minus an explicit exclude list (git/generated trees, bun.lock, CHANGELOG, LICENSE attribution, itself), and fails loudly if any identifier survives the rewrite. The new assertion immediately caught a real gap in testing — the YAML APP_NAME form in two workflows — now covered. End-to-end verified on a scratch clone: clean run, zero survivors, idempotent re-run. Audit: F008
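The inventory-then-assert shape described here can be sketched over an in-memory file map. This is an illustration only: the real script is a shell script operating on the repository tree, and `renameProject`, the rule/detect split, and the example files are hypothetical. The point shown is that detection is deliberately broader than the rewrite rules, so a form the rules do not cover fails loudly instead of slipping through.

```typescript
// Hypothetical sketch over an in-memory file map; names and shapes are assumed.
type Files = Map<string, string>;

function renameProject(
  files: Files,
  rules: Array<[string, string]>, // [from, to] rewrite pairs
  detect: RegExp, // must NOT carry the g flag (test() would become stateful)
  excludes: string[], // path prefixes left untouched
): Files {
  const isExcluded = (path: string): boolean =>
    excludes.some((prefix) => path.startsWith(prefix));

  const result: Files = new Map();
  for (const [path, text] of files) {
    let next = text;
    // Inventory step: any non-excluded file that matches the detector is rewritten.
    if (!isExcluded(path) && detect.test(text)) {
      for (const [from, to] of rules) next = next.split(from).join(to);
    }
    result.set(path, next);
  }

  // Self-verification: no upstream identifier may survive outside the excludes.
  const survivors = [...result]
    .filter(([path, text]) => !isExcluded(path) && detect.test(text))
    .map(([path]) => path);
  if (survivors.length > 0) {
    throw new Error(`upstream identifier survived in: ${survivors.join(", ")}`);
  }
  return result;
}

const files: Files = new Map([
  ["src/metrics/registry.ts", 'const prefix = "boringstack_";'],
  ["CHANGELOG.md", "boringstack 1.0 released"],
]);
const renamed = renameProject(files, [["boringstack", "acme"]], /boringstack/i, ["CHANGELOG"]);
console.log(renamed.get("src/metrics/registry.ts"));
```

Because a second run finds nothing left to match, the function is idempotent on its own output.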
Cloudflare email fetch had no AbortSignal (unbounded, ×3 retries); the OAuth fetchJson on the callback path was the same class (found while scoping the rule); Stripe relied on the SDK's implicit 80s; OpenAI/Anthropic on 600s/10min defaults; the Valkey app client bounded connects but not commands. All five now carry named-constant budgets (10s providers, 60s AI, 1s valkey commands). New external-client-timeout lint-meta rule (API) bans timeout-less SDK constructors and signal-less fetch in src — RED 3/3 against the pre-fix files via stash; cloudflare test asserts the per-attempt signal. Full suite: 1140 tests green. Audit: F003
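The per-attempt signal idea can be sketched as below. This is a minimal illustration under assumptions, not the code from this PR: `fetchWithBudget`, `FetchLike`, and the constant names are hypothetical, the 10s value is the provider budget quoted above, and the attempt count is assumed from the "×3 retries" remark. `AbortSignal.timeout` is the standard API available in recent Node and Bun runtimes.

```typescript
// Hypothetical names; only the 10s provider budget is taken from the commit message.
const PROVIDER_TIMEOUT_MS = 10_000;
const MAX_ATTEMPTS = 3; // assumed attempt count

type FetchLike = (url: string, init: { signal: AbortSignal }) => Promise<Response>;

async function fetchWithBudget(
  url: string,
  fetchImpl: FetchLike = fetch,
  timeoutMs: number = PROVIDER_TIMEOUT_MS,
): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      // A fresh signal per attempt: each retry gets its own full budget,
      // and no attempt can hang longer than timeoutMs.
      return await fetchImpl(url, { signal: AbortSignal.timeout(timeoutMs) });
    } catch (error) {
      lastError = error; // remember the failure and try again
    }
  }
  throw lastError;
}

// Usage (not executed here): const res = await fetchWithBudget("https://api.example.com/send");
```

Injecting `fetchImpl` is a design choice for testability: a test can pass a fake that records each `signal` and assert that every attempt received a distinct one.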
…run the same script The six guardrail bodies lived in workflow YAML, so the local pre-push could not reuse them — and the duplication already drifted once (2026-06-03: a CI-only env seed missing, green local push, red CI). All six checks now live in infra/compose/scripts/validate-guardrails.sh (healthchecks, digest-pins, credential-fallbacks, valkey-auth, rooted-caps, and the behavioral prod-image-tags test with its curated env self-contained); the CI steps are thin per-check invocations preserving granular annotations, and the local gate gains a guardrails stage running 'all'. Parity now holds by construction. RED-verified through the shared script (synthetic :latest image → exit 1; removed valkey guard → exit 1) and GREEN end-to-end locally. Audit: F002
Summary
- `hcloud_server` had no lifecycle guards, so any cloud-init/tfvars edit on `tofu apply` silently replaced the VPS, destroying every Docker volume (Postgres, acme.json). Now `ignore_changes=[user_data]` with a documented `-replace` rebuild path.
- All six guardrail checks now live in `infra/compose/scripts/validate-guardrails.sh`: CI runs six thin per-check steps, the local pre-push runs `all`, and the 2026-06-03 incident class (CI-only env seed, green local push, red CI) is impossible by construction.
- SSH access is now an explicit operator choice (no `0.0.0.0/0` default), repo URL validated at plan time, Docker installed from the GPG-verified apt repo (release-key fingerprint pinned) instead of `curl | sh`.
- Named timeout budgets on the Cloudflare email fetch, OAuth `fetchJson` (same class, found while scoping the rule), Stripe (was implicit 80s), OpenAI/Anthropic (were 600s/10min), Valkey `commandTimeout`.
- `rename-project.sh` is inventory-driven (221 files vs the 34-entry allowlist that missed Prometheus labels, tracer names, compose project names) and fails loudly if any upstream identifier survives; the assertion caught a real gap during its own test run.
- Orphaned i18n keys removed (the new `i18n-locale-keys-used` rule found 15 beyond the audit's one); bun caches added to the 7 workflows missing them; AGENT_CONTRACT plugin tables now bidirectionally parity-checked (the UI table documented a plugin that isn't installed); docs no longer teach the retired `admin123456`/`change-me` credentials (banned by rule); CONTRIBUTING's nonexistent demo login replaced with the real flow; root gate scripts finally shellchecked in CI.

Test plan
- `bun run check` from repo root
- `tofu validate` + stubbed cloud-init template yamllint

App merge bars
- `cd apps/api && bun run validate`
- `cd apps/ui && bun run validate`
- `cd apps/docs && bun run build:ci`
- `bun run check` (from repo root)

Conventions
- No `any`, no blind `as`, no `!`
- `.env.example` (`ssh_allowed_ips` now required in tfvars; no new app env vars)

Notes for reviewers
- `prevent_destroy` deliberately omitted from the VPS lifecycle block: HCL only accepts a literal there, and it would also block intentional `tofu destroy`. Replacement-on-user_data-change is the accidental-loss vector, and `ignore_changes` closes it precisely.
- The prod-image-tags test keeps its curated env inside the shared script (`ENV_FILE` diverted), so it can never again depend on workflow-level env blocks.
- `.audit/execution-summary.json` (local, gitignored).