auth: fail-fast on missing signing key + enlace doctor CLI (closes #11) #12

Merged
thorwhalen merged 4 commits into main from fix/auth-fail-fast-and-doctor
Apr 21, 2026

@thorwhalen
Member

Summary

Fixes #11. Two changes, two commits:

  1. Fail fast when [auth].enabled=true and ENLACE_SIGNING_KEY is missing or malformed. This replaces the silent return at compose.py:_wire_auth_and_stores that caused the production incident described in #11, "Auth silently disabled when ENLACE_SIGNING_KEY is missing (production incident)": the gateway booted clean and systemctl is-active said active, but /auth/* was unmounted and the SPA catch-all returned <!doctype html> for every CSRF/login request. The new behavior raises EnlaceConfigError with a clear remediation message. A loud opt-out, ENLACE_ALLOW_UNSIGNED=1, exists for operators diagnosing a broken box.

  2. New enlace doctor subcommand — a post-deploy smoke tool. Probes a running gateway over plain urllib (no new deps) and reports pass/fail per check. Catches exactly the regression that motivated this PR: when /auth/csrf returns text/html instead of JSON, doctor fails loudly. Also: each app's frontend and API mount, oauth importability, frontend_dir sanity.

Bonus fix: the OAuth ImportError swallow in _wire_auth_and_stores (compose.py:470) now logs an ERROR banner listing the configured providers — a missing authlib install stops being invisible.
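
For illustration, the logging behavior described in the bonus fix could look roughly like the sketch below. The helper name try_import_oauth, the logger name, and the message wording are hypothetical; the actual code in compose.py is not reproduced here.

```python
import importlib
import logging

logger = logging.getLogger("enlace.sketch")  # hypothetical logger name


def try_import_oauth(providers, module_name="authlib"):
    """Return the imported module, or None after logging an ERROR banner.

    `providers` is the list of configured OAuth provider names; naming them
    in the log tells the operator which logins are affected.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Illustrative wording only, not the message enlace actually emits.
        logger.error(
            "OAuth is configured for providers [%s] but %r could not be "
            "imported (%s); OAuth wiring is skipped.",
            ", ".join(sorted(providers)),
            module_name,
            exc,
        )
        return None
```

The point of the banner is that the import failure still does not crash the gateway, but it can no longer go unnoticed in the logs.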

New CLI

enlace doctor --base-url http://127.0.0.1:8010
enlace doctor --base-url ... --envfile /opt/tw_platform/.env   # load env from deploy
enlace doctor --base-url ... --skip-env-checks                 # trust HTTP as the oracle
enlace doctor --base-url ... --json                            # for CI / deploy pipelines
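
A deploy pipeline could wrap these invocations along the following lines. This is a sketch: only the flags shown above and the "exits nonzero on any failure" behavior come from this PR; the helper names doctor_argv and gate_deploy are made up for illustration, and the shape of the --json output is deliberately not assumed.

```python
import subprocess


def doctor_argv(base_url, envfile=None, skip_env_checks=False, as_json=False):
    """Build the argv for `enlace doctor` from the flags listed above."""
    argv = ["enlace", "doctor", "--base-url", base_url]
    if envfile:
        argv += ["--envfile", envfile]
    if skip_env_checks:
        argv.append("--skip-env-checks")
    if as_json:
        argv.append("--json")
    return argv


def gate_deploy(base_url, **options):
    """Run doctor and return True only if it exits with status 0.

    Relies solely on the exit code, since doctor exits nonzero on any
    failed check; the report itself is passed through untouched.
    """
    result = subprocess.run(
        doctor_argv(base_url, **options), capture_output=True, text=True
    )
    print(result.stdout, end="")
    return result.returncode == 0
```

For example, gate_deploy("http://127.0.0.1:8010", skip_env_checks=True) would treat HTTP as the only oracle, matching the third invocation above.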

Tested against live production

After wiring in, ran against apps.thorwhalen.com's gateway (the one that had the original incident): 25 pass, 0 fail. Reproducing the regression by reverting EnvironmentFile= in the systemd unit is covered by the tw_platform PR that consumes this as a post-deploy smoke.

Breaking change?

Technically yes: any deployment that currently has [auth].enabled=true and no ENLACE_SIGNING_KEY set will now refuse to start. But such a deployment is already broken — /auth/* wasn't working, and any protected mount was either unreachable or (worse) unchecked. Failing loudly is strictly safer than the silent-serve-SPA fallback. ENLACE_ALLOW_UNSIGNED=1 is the escape hatch for operators who need to boot the gateway without auth for diagnostics.

Test plan

  • 167 tests pass (pytest enlace/tests tests)
  • 6 new tests for fail-fast (test_auth_failfast.py)
  • 9 new tests for doctor (test_doctor.py)
  • Live run against production gateway: 25 pass / 0 fail with --skip-env-checks; 30 pass / 0 fail with --envfile /opt/tw_platform/.env
  • Manually verified: missing key → EnlaceConfigError with actionable message
  • Manually verified: ENLACE_ALLOW_UNSIGNED=1 restores silent path with loud log line

When [auth].enabled is true but the signing key env var is unset or shorter
than 32 chars, build_backend() now raises EnlaceConfigError instead of
silently skipping auth wiring. The old silent path caused a production
incident where /auth/* requests fell through to the SPA catch-all and
returned index.html (see #11).

Loud opt-out: ENLACE_ALLOW_UNSIGNED=1 restores the prior behavior while
logging an error banner, for operators diagnosing a broken box.
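
The decision described above can be sketched as a small standalone function. This is not the code in build_backend(): the function name check_signing_key, the exception's base class, and the message text are assumptions; only the 32-character threshold, the environment variable names, and the raise-versus-opt-out behavior come from this commit message.

```python
import logging
import os

logger = logging.getLogger("enlace.sketch")  # hypothetical logger name

MIN_KEY_LENGTH = 32  # from the commit message: shorter keys are rejected


class EnlaceConfigError(Exception):
    """Stand-in for the real exception; its actual base class is not shown."""


def check_signing_key(auth_enabled, env=None):
    """Return True if auth should be wired, False if it should be skipped.

    Raises EnlaceConfigError when auth is enabled, the key is unusable,
    and the operator has not opted out.
    """
    env = os.environ if env is None else env
    if not auth_enabled:
        return False
    key = env.get("ENLACE_SIGNING_KEY", "")
    if len(key) >= MIN_KEY_LENGTH:
        return True
    if env.get("ENLACE_ALLOW_UNSIGNED") == "1":
        # Loud opt-out: same outcome as the old silent path, plus an ERROR log.
        logger.error("ENLACE_ALLOW_UNSIGNED=1: booting WITHOUT auth wiring.")
        return False
    raise EnlaceConfigError(
        "[auth].enabled is true but ENLACE_SIGNING_KEY is unset or shorter "
        f"than {MIN_KEY_LENGTH} characters. Set it in the gateway's "
        "environment, or set ENLACE_ALLOW_UNSIGNED=1 to boot without auth "
        "for diagnostics."
    )
```

Passing env explicitly keeps the check testable without mutating os.environ.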

Also: the OAuth ImportError swallow now logs an error listing the
configured providers, so a missing authlib install stops being invisible.

enlace doctor probes a running gateway over plain urllib (no new deps) to catch
silent-degradation failures that static validation can't:

- signing-key env check (only meaningful when run in the gateway's env;
  use --envfile or --skip-env-checks from outside)
- oauth importability
- frontend_dir sanity
- HTTP: /auth/csrf must return JSON with a csrf key (this is the probe
  that would have caught #11 directly — the SPA catch-all
  returns text/html when auth is unwired)
- HTTP: each app's frontend mount exists (200, 401, or 3xx; fails on 404
  and 5xx so 'protected but responding' stays green)
- HTTP: each app's API mount returns non-5xx

Output: pretty text by default, or --json for CI / deploy pipelines.
Exits nonzero on any failure.
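
The pass/fail rules for the three HTTP probes listed above can be sketched as follows. The function names are hypothetical and this is not the doctor implementation. Status codes the list does not mention (for example 403 on a frontend mount) are treated as failures here, which is an assumption.

```python
import json
import urllib.error
import urllib.request


def fetch(url, timeout=5.0):
    """Return (status, body) without raising on 4xx/5xx responses.

    Note: urlopen follows redirects, so a 3xx is normally seen only when
    the redirect cannot be followed. Connection errors still raise URLError.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, err.read()


def csrf_ok(body):
    """/auth/csrf must return a JSON object with a 'csrf' key.

    The SPA catch-all returns HTML, which fails JSON parsing.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and "csrf" in payload


def frontend_ok(status):
    """200, 401, or any 3xx passes, so 'protected but responding' stays green."""
    return status in (200, 401) or 300 <= status < 400


def api_ok(status):
    """An API mount passes on any non-5xx status."""
    return status < 500
```

With these rules, the response seen during the incident in #11 (an HTML page served for /auth/csrf) fails csrf_ok even though the HTTP status itself looks healthy.
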
@thorwhalen thorwhalen merged commit a50b8ff into main Apr 21, 2026
12 checks passed
@thorwhalen thorwhalen deleted the fix/auth-fail-fast-and-doctor branch April 21, 2026 12:47
Development

Successfully merging this pull request may close these issues.

Auth silently disabled when ENLACE_SIGNING_KEY is missing (production incident)
