Skip to content

v2.0.0: cleanup release — architecture, telemetry, tenancy, gates#29

Open
dmichael-fastly wants to merge 1 commit into
mainfrom
refactor/cleanup
Open

v2.0.0: cleanup release — architecture, telemetry, tenancy, gates#29
dmichael-fastly wants to merge 1 commit into
mainfrom
refactor/cleanup

Conversation

@dmichael-fastly

@dmichael-fastly dmichael-fastly commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Summary

Architectural cleanup release. The post-v1.2.0 perf branch closed the worst read-path latency by stacking remediation on top of an architecture that wasn't designed for the workload; this PR pays that down. The branch is squashed into a single commit (6e29655) that's tag-ready as v2.0.0, plus follow-up commits from the user (d237e9f, 298797b) and a ruff lint fix (dd17b90).

See CHANGELOG.md [2.0.0] entry for the full annotated diff — this body is the PR-review companion (status of gates, test surface, what to watch for in review, follow-up items).

What landed

  • Architectural carves: every backend file > 1500 LOC carved into per-concern packages (iceberg/, cron/ jobs, metadata/, tunnel/, share_db/, routers/admin/, core/rollups/); pre-split public surface preserved via package __init__.py re-exports. Top-5 backend now ≤ 1461 LOC; no backend file > 1500.
  • Telemetry on OTel + structlog: replaces the four fragmented custom surfaces. Console exporter ships by default; production backends are deploy-config.
  • RequestContext replaces AnalyticsDeps: tenancy enforced at context construction; routes never parse service_id from a path param.
  • Composite-endpoint hard cutover: dashboard/bundle, security/bundle, network/bundle ship with the frontend swap. Granular endpoints deleted (see Breaking).
  • Frontend large-file splits: ProvisionWizard (3582 → wizard/steps/), app/logs/page.tsx (2136 → _sections/ + _state.ts), and 14 more. No frontend file > 499 LOC.
  • Coverage gates: backend --cov-fail-under 78 → 85 (actual 85.05 %); frontend coverage.thresholds.lines 44 → 58 (actual 61.66 %).
  • mypy: ignore_errors list 36 → 0; per-module strict block grew 8 → 19 modules.
  • Library swaps: aiodns + tenacity + pydantic-settings + cachetools + orjson + rich + typer + nuqs (Next.js URL state) + argon2id (passcode hashing).
  • Trust topology asserted in tests (Caddy + compose + middleware order); sql_validator rejects NUL bytes; ruff T201 (no print() in production code, with per-file ignores for scripts/*, the provision CLI, and tests/*).
  • Followups in subsequent commits: SQL hardening (escape_sql_literal at read_parquet() + glob() sites), accessibility, stable React keys, request timeouts (d237e9f); VCL macro heredoc support (298797b).

Verification

  • make ci is green on the current tip (dd17b90): 3972 pytest, 444 vitest, 9 VCL test all pass; lint + format + mypy + osv + secret-scan all pass.
  • The 4 "failing tests" I flagged in my initial PR body turned out to be pytest-randomly + xdist ordering flakes — every test passes deterministic + in isolation. No code change was needed for them.
  • The one real issue from the squash was a T201 print() violation in tests/test_performance_smoke.py (legitimate benchmark prints); fixed in dd17b90 by adding tests/* to the per-file-ignores list (same treatment as scripts/*).
  • Prod-deployed + smoke-tested via the post-restart pipeline:
    • Admin (SSH-tunneled localhost:3001): 18/18 endpoints green (bootstrap, sync-status, log-extents, every /api/admin/*)
    • Analyst (Fastly URL + /share-login session): 9/9 pages rendered with real data — Dashboard, Performance, Origin, Security, Charts, Insights, Network, Sessions, Query
    • Gating: /admin redirects analyst to /dashboard; /api/admin/* returns 401 for analyst session; /api/sync-status returns 401 (admin-only)
  • 593 files changed in the squash, +82,649 / −37,156 LOC

Breaking changes

  • Composite-endpoint cutover: granular per-card endpoints (/api/dashboard/aggregates, /api/dashboard/raw, /api/dashboard/top_n, etc.) are deleted; callers must use the composite (/api/dashboard/bundle).
  • AnalyticsDepsRequestContext: alias removed.
  • is_cached / _is_cached alias on BaseResponse removed (is_cached is canonical).
  • SSH-to-localhost.run analyst sharing removed; production has always been direct-mode against the Fastly+Caddy public URL.
  • Error contract (from v2.0.1 hardening): raise_internal(logger, exc, code, status) replaces raise HTTPException(detail={"error": str(e)}) at except sites — detail is now {"error": <code>, "error_id": <8-hex>}. Upstream messages are scrubbed; operators correlate via the server log keyed on error_id.

What was NOT done in this PR (deferred)

  • v2.0.0 tag — version is at 2.0.0 in pyproject.toml / frontend/package.json / backend/main.py / uv.lock; the tag itself is not yet pushed (per "do not merge anything yet"). The user has already drafted a v2.0.1 commit message (d237e9f) but the actual version strings still read 2.0.0 — needs a decision: ship as 2.0.0 (with the hardening folded in) or bump to 2.0.1 before tagging.
  • pending-docs/best_practices_audit.md — left untracked in the working tree (your WIP audit doc, not part of this PR).

Test plan

  • make ci green
  • Prod smoke: admin + analyst paths verified
  • Tag decision: v2.0.0 vs v2.0.1 (and whether to bump the version strings)
  • After tag merge: re-smoke prod once more via ~/restart.sh + the [prod-verify-paths] memory pattern

🤖 Generated with Claude Code

Squash of the refactor/cleanup branch — architectural cleanup
release. Tag this commit as v2.0.0.

== Architecture ==

Carved every backend file > 1500 LOC into a per-concern package; all
packages preserve the pre-split public surface via __init__.py
re-exports so import paths stay stable.

* core/iceberg.py (4232)   -> iceberg/{view, catalog, warehouse,
                              manifest, fs, _core, buffer, ddl,
                              snapshot_cache, dedup, ...}. Custom
                              FosFsspecFileIO + CachedFosS3FileSystem
                              subclasses retire 5 of 6 historical
                              s3fs monkeypatches.
* scheduler.py (2843)      -> cron/{scheduler, decorators,
                              jobs/{sync, commit, compaction, optimize,
                              expire, metadata, gap_heal,
                              rollup_compact_daily}}. Phase 6 picks
                              separate-pool isolation based on Phase 1
                              thread-wait telemetry.
* core/metadata_db.py(3168)-> core/metadata/{base, alerts, views,
                              ingest_log, cron_log, asn_cache,
                              usage_log, reconciliation, state};
                              metadata_db.py is a backward-compat shim.
* utils/tunnel.py (1022)   -> tunnel/{manager, session, rate_limiter,
                              state, fingerprint}. SSH-to-localhost.run
                              path DELETED (~400 lines). Direct-mode
                              only.
* core/share_db.py (1312)  -> share_db/{connection, schema, invites,
                              sessions, audit, passcode, tos, settings,
                              validation}. argon2id replaces scrypt for
                              new passcodes; scrypt verify branch stays
                              for transparent rehash-on-login.
* routers/admin.py (1650)  -> admin/{pop_locations, ingest, trees,
                              downloads, sync_status, compaction,
                              health, log_accounting, iceberg,
                              bot_sources} + _helpers / _dir_size /
                              _router. admin_usage sidecar still
                              attaches to the shared router.
* core/rollups.py (2045)   -> rollups/{_common, time_series, sessions,
                              hour_bundles, day_bundles, recompute,
                              wellknown_bots}. 41 re-exports.

Other architecture changes:

* RequestContext replaces AnalyticsDeps; tenancy enforced at context
  construction; routes never parse service_id from a path param.
* Composite endpoints hard cutover: dashboard/bundle + security/bundle
  + network/bundle ship with the frontend swap. Granular per-card
  endpoints deleted. _meta_con parallel path dropped. is_cached /
  _is_cached alias collapsed. AnalyticsDeps = RequestContext shim
  removed.
* Top-5 backend files now <= 1461 LOC; no backend file > 1500.
* Top frontend files all < 500 LOC (ProvisionWizard 3582 ->
  wizard/steps/*; app/logs/page.tsx 2136 -> _sections/* + _state.ts).

== Telemetry, dependencies ==

* OpenTelemetry (api/sdk + fastapi/botocore/aiohttp instrumentors)
  replaces the four fragmented custom telemetry surfaces. Console
  exporter ships by default.
* structlog wires trace_id + span_id into structured log output.
* process_context_scope + _ACTIVE_CONTEXTS mirror KEPT at
  backend/utils/telemetry.py. OTel context propagation uses Python
  ContextVars under the hood and inherits the same cross-thread
  limitation (fsspec iothread, pyiceberg ThreadPoolExecutor) the
  manual mirror was built to solve.
* aiodns + asyncio.gather + bulk-transaction sqlite writes in
  utils/rdns_cache.py replace the serial-blocking
  socket.gethostbyaddr loop.
* tenacity decorator retry for Fastly/NGWAF/SQLite-WAL-busy paths.
* pydantic-settings centralises env-var reads + boot validation.
* cachetools replaces in-process bounded_cache / rdns_cache /
  ngwaf_bot_cache.
* Structured .tf.json generation replaces f-string HCL + the
  _hcl_escape regex; eliminates a custom-HCL injection vector.
* orjson via FastAPI ORJSONResponse for composite payloads.
* rich + typer for the provision CLI.
* httpx everywhere except telemetry_proxy.py (aiohttp stays for
  the proxy server role).
* nuqs as the URL state source on the frontend.

== Trust topology, security hardening ==

* Middleware order asserted at boot AND in tests; one-line INVARIANT
  markers replace the paragraph-long comments in main.py.
* @pytest.mark.security_regression marker + monotonic-count CI gate
  (floor: 24).
* Trust-topology snapshot tests pin Caddy @from_fastly matcher, XFF
  forwarding, /share-login rate-limit, and the backend
  --forwarded-allow-ips=127.0.0.1 flags.
* sql_validator rejects NUL bytes before any cost.
* ruff T201 (print-detection) lint rule enforced in production code.

== Frontend ==

* RSC/CSR boundary in app/_routing.md. Hidden-Plotly + hidden-MapLibre
  + setTimeout warm-up hacks dropped; replaced with modulepreload +
  the styledata-event swap pattern.
* Live Query Monitor: live-first sort, peak-memory column, keyboard
  shortcuts, URL-persisted filters, per-run inline expand for xN
  cron-grouped rows, >=30s stuck-query pulse, copy-SQL.
* Operations Overview cards on the admin landing page surface ingest
  gap + live query activity + slow-query count.

== Quality gates ==

* Backend coverage gate --cov-fail-under 78 -> 85 (actual 85.05 %).
* Frontend coverage gate coverage.thresholds.lines 44 -> 58
  (actual 61.66 %).
* tool.mypy.overrides ignore_errors list: 36 modules -> 0.
* mypy per-module strict block: 19 modules opted in
  (disallow_untyped_defs + disallow_incomplete_defs +
  check_untyped_defs + warn_return_any + warn_unused_ignores).
* Load-harness CI step: scripts/emit_perf_latest.py runs a 100K-row
  synthetic DuckDB workload; scripts/perf_gate.sh fails on >50%
  regression vs tests/perf/baseline.json.

== Operations, portability ==

* VM-agnostic deploy runbooks at docs/deploy/{aws_ec2, azure_vm,
  gce, generic_linux}.md. Storage stays Fastly Object Storage
  (S3-compatible).
* scripts/refresh_fastly_cidrs.py pulls api.fastly.com/public-ip-list
  and rewrites the Caddy @from_fastly block.

== Version ==

pyproject.toml + frontend/package.json + backend/main.py FastAPI app
+ uv.lock all bumped to 2.0.0. CHANGELOG.md gains a [2.0.0] entry.
README.md updated to drop the removed SSH-tunnel sharing mode.
AGENTS.md updated to reflect the post-split admin/ + rollups/
package paths.

== Breaking ==

* Composite-endpoint cutover: granular per-card endpoints are
  deleted. Callers use the composite (/api/dashboard/bundle, etc.).
* AnalyticsDeps alias for RequestContext is removed.
* is_cached / _is_cached alias on BaseResponse is removed.
* SSH-to-localhost.run analyst sharing is removed; production has
  always been direct-mode against the Fastly+Caddy public URL.

== Post-squash follow-ups folded into v2.0.0 ==

The original v2.0.0 squash (6e29655) was followed by 9 additional
commits during the release-stabilization window; all of them are
folded into this single release commit. Highlights:

* SQL hardening, accessibility pass, stable React keys, request
  timeouts (originally tagged v2.0.1 in-flight, then folded into
  v2.0.0 per CHANGELOG decision).
* `raise_internal` upgraded to ``-> NoReturn`` so callers no longer
  need a redundant return; additional exception-leak sites scrubbed
  to use the generic-code+error_id contract.
* Security audit (13 findings reviewed, 12 applied, 1 follow-up
  documented):
  - 003 telemetry middleware moved INSIDE RemoteAccessMiddleware so
    analyst attribution sees a populated session.
  - 004 SQL identifier escaping (``"`` → ``""``) in
    optional_col/create_filtered_temp_table; ``safe_fields`` now
    passed end-to-end into execute_top_n_rollups.
  - 005 tenant-scoped scoring matrix path (matrix_<sid>.json) on
    both write AND read paths; teardown wipes the scoped file.
  - 006 ``Fastly-FF`` auth bypass closed via shield-auth secret
    (X-Edge-CDN-Auth) stamped on bereqs in miss_pass; vcl_recv now
    gates auth/penaltybox/Client-IP on the unspoofable marker
    instead of fastly.ff.visits_this_service.
  - 007 analyst credential rotation now requires the target
    service_id to be in the analyst's allowed set.
  - 008 VCL macro validator rejects ``;`` plus unescaped ``{``/``}``
    while legitimate ``\{...\}`` heredocs (e.g. strftime patterns)
    pass.
  - 009 Compute@Edge scorer expires cookies BEFORE scoring instead
    of at re-issue time, closing the expired-cookie replay vector.
  - 010 url/ua/referer field limits ``int()``-cast before f-string
    interpolation into VCL.
  - 011 ``check-yaml --unsafe`` removed from pre-commit.
  - 012 Python normalizer no longer pre-replaces ``%3F`` → ``?``;
    the encoded char survives so downstream traversal payloads
    don't get hidden behind a truncated prefix.
  - 013 Rust normalizer added percent_decode for parity with Python
    (``/%61dmin`` → ``/admin``, ``/a/%2e%2e/b`` → ``/b``).
  - 014 SQL validator rejects NUL bytes (``\x00``) before regex /
    parser sees them.
  - 015 CSV usage-log export escapes spreadsheet-formula prefixes
    (``=``, ``+``, ``-``, ``@``).
  - 018 ``service_id`` required at all five view/alert
    fetch/toggle/delete entry points; the O(N) cross-tenant scan
    fallback (which the auditor flagged in views.py + which existed
    structurally identically in alerts.py) is gone.
  - 006 follow-up regression coverage: 4 ``security_regression``
    tests in tests/utils/test_fastly_utils.py pin the shield-auth
    invariants.
* Operational:
  - Backend container drops privileges to UID/GID 1000 via
    ``USER app``. The Dockerfile uses ``--create-home`` so DuckDB's
    extension cache (``INSTALL httpfs;``) has a writable HOME.
  - ``caddy/Dockerfile`` re-creates the ``caddy`` user (the
    upstream ``caddy:2-alpine`` image stopped shipping one) AND
    re-applies CAP_NET_BIND_SERVICE via ``setcap`` after the custom
    build replaces the binary.
  - ``~/restart.sh`` on the GCE VM gains a chown step against
    ``/mnt/app-data/{configs,data,cache}`` so the host mount UID
    matches the in-container app user. This script lives on the
    host (not in the repo); existing deployers must mirror it
    manually.

== Breaking (additions to the original v2.0.0 set) ==

* ``service_id`` is now required at ``DELETE /api/views/{id}``,
  ``DELETE /api/alerts/{id}``, and the alerts toggle endpoint.
  Callers that previously relied on the cross-tenant fallback now
  receive ``400 {"error":"service_id_required"}``.
* The backend container runs as a non-root user (UID/GID 1000).
  Existing deployments must ``chown -R 1000:1000 /mnt/app-data/*``
  (or whichever host path is bind-mounted to ``/app/{configs,data,
  cache}``). On the maintained GCE deployment this is now baked
  into ``~/restart.sh``.
* Custom-field VCL macros containing ``;`` or unescaped ``{``/``}``
  are rejected at validation time (audit finding 008). A previously
  misconfigured custom field that linted clean before will fail
  now; the rejection points at the exact characters.

== Migration notes ==

* On upgrade: ``chown -R 1000:1000`` the bind-mounted host
  directories before the first restart, or the backend container
  fails to boot with ``PermissionError [Errno 13]`` on the first
  config read.
* If you use saved-view or alert DELETE endpoints in automation,
  thread the ``service_id`` query param through every call.
* If your fork carries custom-field VCL expressions, re-run the
  validator after upgrade; ``;``/``{``/``}`` need to come out (or
  use Fastly's ``\{``/``\}`` heredoc escape if literal braces are
  required, e.g. inside a ``strftime(\{"format"\}, ...)`` call).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant