v2.0.0: cleanup release — architecture, telemetry, tenancy, gates#29
Open
dmichael-fastly wants to merge 1 commit into
Open
v2.0.0: cleanup release — architecture, telemetry, tenancy, gates#29dmichael-fastly wants to merge 1 commit into
dmichael-fastly wants to merge 1 commit into
Conversation
Squash of the refactor/cleanup branch — architectural cleanup
release. Tag this commit as v2.0.0.
== Architecture ==
Carved every backend file > 1500 LOC into a per-concern package; all
packages preserve the pre-split public surface via __init__.py
re-exports so import paths stay stable.
* core/iceberg.py (4232) -> iceberg/{view, catalog, warehouse,
manifest, fs, _core, buffer, ddl,
snapshot_cache, dedup, ...}. Custom
FosFsspecFileIO + CachedFosS3FileSystem
subclasses retire 5 of 6 historical
s3fs monkeypatches.
* scheduler.py (2843) -> cron/{scheduler, decorators,
jobs/{sync, commit, compaction, optimize,
expire, metadata, gap_heal,
rollup_compact_daily}}. Phase 6 picks
separate-pool isolation based on Phase 1
thread-wait telemetry.
* core/metadata_db.py(3168)-> core/metadata/{base, alerts, views,
ingest_log, cron_log, asn_cache,
usage_log, reconciliation, state};
metadata_db.py is a backward-compat shim.
* utils/tunnel.py (1022) -> tunnel/{manager, session, rate_limiter,
state, fingerprint}. SSH-to-localhost.run
path DELETED (~400 lines). Direct-mode
only.
* core/share_db.py (1312) -> share_db/{connection, schema, invites,
sessions, audit, passcode, tos, settings,
validation}. argon2id replaces scrypt for
new passcodes; scrypt verify branch stays
for transparent rehash-on-login.
* routers/admin.py (1650) -> admin/{pop_locations, ingest, trees,
downloads, sync_status, compaction,
health, log_accounting, iceberg,
bot_sources} + _helpers / _dir_size /
_router. admin_usage sidecar still
attaches to the shared router.
* core/rollups.py (2045) -> rollups/{_common, time_series, sessions,
hour_bundles, day_bundles, recompute,
wellknown_bots}. 41 re-exports.
Other architecture changes:
* RequestContext replaces AnalyticsDeps; tenancy enforced at context
construction; routes never parse service_id from a path param.
* Composite endpoints hard cutover: dashboard/bundle + security/bundle
+ network/bundle ship with the frontend swap. Granular per-card
endpoints deleted. _meta_con parallel path dropped. is_cached /
_is_cached alias collapsed. AnalyticsDeps = RequestContext shim
removed.
* Top-5 backend files now <= 1461 LOC; no backend file > 1500.
* Top frontend files all < 500 LOC (ProvisionWizard 3582 ->
wizard/steps/*; app/logs/page.tsx 2136 -> _sections/* + _state.ts).
== Telemetry, dependencies ==
* OpenTelemetry (api/sdk + fastapi/botocore/aiohttp instrumentors)
replaces the four fragmented custom telemetry surfaces. Console
exporter ships by default.
* structlog wires trace_id + span_id into structured log output.
* process_context_scope + _ACTIVE_CONTEXTS mirror KEPT at
backend/utils/telemetry.py. OTel context propagation uses Python
ContextVars under the hood and inherits the same cross-thread
limitation (fsspec iothread, pyiceberg ThreadPoolExecutor) the
manual mirror was built to solve.
* aiodns + asyncio.gather + bulk-transaction sqlite writes in
utils/rdns_cache.py replace the serial-blocking
socket.gethostbyaddr loop.
* tenacity decorator retry for Fastly/NGWAF/SQLite-WAL-busy paths.
* pydantic-settings centralises env-var reads + boot validation.
* cachetools replaces in-process bounded_cache / rdns_cache /
ngwaf_bot_cache.
* Structured .tf.json generation replaces f-string HCL + the
_hcl_escape regex; eliminates a custom-HCL injection vector.
* orjson via FastAPI ORJSONResponse for composite payloads.
* rich + typer for the provision CLI.
* httpx everywhere except telemetry_proxy.py (aiohttp stays for
the proxy server role).
* nuqs as the URL state source on the frontend.
== Trust topology, security hardening ==
* Middleware order asserted at boot AND in tests; one-line INVARIANT
markers replace the paragraph-long comments in main.py.
* @pytest.mark.security_regression marker + monotonic-count CI gate
(floor: 24).
* Trust-topology snapshot tests pin Caddy @from_fastly matcher, XFF
forwarding, /share-login rate-limit, and the backend
--forwarded-allow-ips=127.0.0.1 flags.
* sql_validator rejects NUL bytes before any cost.
* ruff T201 (print-detection) lint rule enforced in production code.
== Frontend ==
* RSC/CSR boundary in app/_routing.md. Hidden-Plotly + hidden-MapLibre
+ setTimeout warm-up hacks dropped; replaced with modulepreload +
the styledata-event swap pattern.
* Live Query Monitor: live-first sort, peak-memory column, keyboard
shortcuts, URL-persisted filters, per-run inline expand for xN
cron-grouped rows, >=30s stuck-query pulse, copy-SQL.
* Operations Overview cards on the admin landing page surface ingest
gap + live query activity + slow-query count.
== Quality gates ==
* Backend coverage gate --cov-fail-under 78 -> 85 (actual 85.05 %).
* Frontend coverage gate coverage.thresholds.lines 44 -> 58
(actual 61.66 %).
* tool.mypy.overrides ignore_errors list: 36 modules -> 0.
* mypy per-module strict block: 19 modules opted in
(disallow_untyped_defs + disallow_incomplete_defs +
check_untyped_defs + warn_return_any + warn_unused_ignores).
* Load-harness CI step: scripts/emit_perf_latest.py runs a 100K-row
synthetic DuckDB workload; scripts/perf_gate.sh fails on >50%
regression vs tests/perf/baseline.json.
== Operations, portability ==
* VM-agnostic deploy runbooks at docs/deploy/{aws_ec2, azure_vm,
gce, generic_linux}.md. Storage stays Fastly Object Storage
(S3-compatible).
* scripts/refresh_fastly_cidrs.py pulls api.fastly.com/public-ip-list
and rewrites the Caddy @from_fastly block.
== Version ==
pyproject.toml + frontend/package.json + backend/main.py FastAPI app
+ uv.lock all bumped to 2.0.0. CHANGELOG.md gains a [2.0.0] entry.
README.md updated to drop the removed SSH-tunnel sharing mode.
AGENTS.md updated to reflect the post-split admin/ + rollups/
package paths.
== Breaking ==
* Composite-endpoint cutover: granular per-card endpoints are
deleted. Callers use the composite (/api/dashboard/bundle, etc.).
* AnalyticsDeps alias for RequestContext is removed.
* is_cached / _is_cached alias on BaseResponse is removed.
* SSH-to-localhost.run analyst sharing is removed; production has
always been direct-mode against the Fastly+Caddy public URL.
== Post-squash follow-ups folded into v2.0.0 ==
The original v2.0.0 squash (6e29655) was followed by 9 additional
commits during the release-stabilization window; all of them are
folded into this single release commit. Highlights:
* SQL hardening, accessibility pass, stable React keys, request
timeouts (originally tagged v2.0.1 in-flight, then folded into
v2.0.0 per CHANGELOG decision).
* `raise_internal` upgraded to ``-> NoReturn`` so callers no longer
need a redundant return; additional exception-leak sites scrubbed
to use the generic-code+error_id contract.
* Security audit (13 findings reviewed, 12 applied, 1 follow-up
documented):
- 003 telemetry middleware moved INSIDE RemoteAccessMiddleware so
analyst attribution sees a populated session.
- 004 SQL identifier escaping (``"`` → ``""``) in
optional_col/create_filtered_temp_table; ``safe_fields`` now
passed end-to-end into execute_top_n_rollups.
- 005 tenant-scoped scoring matrix path (matrix_<sid>.json) on
both write AND read paths; teardown wipes the scoped file.
- 006 ``Fastly-FF`` auth bypass closed via shield-auth secret
(X-Edge-CDN-Auth) stamped on bereqs in miss_pass; vcl_recv now
gates auth/penaltybox/Client-IP on the unspoofable marker
instead of fastly.ff.visits_this_service.
- 007 analyst credential rotation now requires the target
service_id to be in the analyst's allowed set.
- 008 VCL macro validator rejects ``;`` plus unescaped ``{``/``}``
while legitimate ``\{...\}`` heredocs (e.g. strftime patterns)
pass.
- 009 Compute@Edge scorer expires cookies BEFORE scoring instead
of at re-issue time, closing the expired-cookie replay vector.
- 010 url/ua/referer field limits ``int()``-cast before f-string
interpolation into VCL.
- 011 ``check-yaml --unsafe`` removed from pre-commit.
- 012 Python normalizer no longer pre-replaces ``%3F`` → ``?``;
the encoded char survives so downstream traversal payloads
don't get hidden behind a truncated prefix.
- 013 Rust normalizer added percent_decode for parity with Python
(``/%61dmin`` → ``/admin``, ``/a/%2e%2e/b`` → ``/b``).
- 014 SQL validator rejects NUL bytes (``\x00``) before regex /
parser sees them.
- 015 CSV usage-log export escapes spreadsheet-formula prefixes
(``=``, ``+``, ``-``, ``@``).
- 018 ``service_id`` required at all five view/alert
fetch/toggle/delete entry points; the O(N) cross-tenant scan
fallback (which the auditor flagged in views.py + which existed
structurally identically in alerts.py) is gone.
- 006 follow-up regression coverage: 4 ``security_regression``
tests in tests/utils/test_fastly_utils.py pin the shield-auth
invariants.
* Operational:
- Backend container drops privileges to UID/GID 1000 via
``USER app``. The Dockerfile uses ``--create-home`` so DuckDB's
extension cache (``INSTALL httpfs;``) has a writable HOME.
- ``caddy/Dockerfile`` re-creates the ``caddy`` user (the
upstream ``caddy:2-alpine`` image stopped shipping one) AND
re-applies CAP_NET_BIND_SERVICE via ``setcap`` after the custom
build replaces the binary.
- ``~/restart.sh`` on the GCE VM gains a chown step against
``/mnt/app-data/{configs,data,cache}`` so the host mount UID
matches the in-container app user. This script lives on the
host (not in the repo); existing deployers must mirror it
manually.
== Breaking (additions to the original v2.0.0 set) ==
* ``service_id`` is now required at ``DELETE /api/views/{id}``,
``DELETE /api/alerts/{id}``, and the alerts toggle endpoint.
Callers that previously relied on the cross-tenant fallback now
receive ``400 {"error":"service_id_required"}``.
* The backend container runs as a non-root user (UID/GID 1000).
Existing deployments must ``chown -R 1000:1000 /mnt/app-data/*``
(or whichever host path is bind-mounted to ``/app/{configs,data,
cache}``). On the maintained GCE deployment this is now baked
into ``~/restart.sh``.
* Custom-field VCL macros containing ``;`` or unescaped ``{``/``}``
are rejected at validation time (audit finding 008). A previously
misconfigured custom field that linted clean before will fail
now; the rejection points at the exact characters.
== Migration notes ==
* On upgrade: ``chown -R 1000:1000`` the bind-mounted host
directories before the first restart, or the backend container
fails to boot with ``PermissionError [Errno 13]`` on the first
config read.
* If you use saved-view or alert DELETE endpoints in automation,
thread the ``service_id`` query param through every call.
* If your fork carries custom-field VCL expressions, re-run the
validator after upgrade; ``;``/``{``/``}`` need to come out (or
use Fastly's ``\{``/``\}`` heredoc escape if literal braces are
required, e.g. inside a ``strftime(\{"format"\}, ...)`` call).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
61b53df to
63e7d15
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Architectural cleanup release. The post-
v1.2.0perf branch closed the worst read-path latency by stacking remediation on top of an architecture that wasn't designed for the workload; this PR pays that down. The branch is squashed into a single commit (6e29655) that's tag-ready asv2.0.0, plus follow-up commits from the user (d237e9f,298797b) and a ruff lint fix (dd17b90).See
CHANGELOG.md[2.0.0]entry for the full annotated diff — this body is the PR-review companion (status of gates, test surface, what to watch for in review, follow-up items).What landed
iceberg/,cron/jobs,metadata/,tunnel/,share_db/,routers/admin/,core/rollups/); pre-split public surface preserved via package__init__.pyre-exports. Top-5 backend now ≤ 1461 LOC; no backend file > 1500.RequestContextreplacesAnalyticsDeps: tenancy enforced at context construction; routes never parseservice_idfrom a path param.dashboard/bundle,security/bundle,network/bundleship with the frontend swap. Granular endpoints deleted (see Breaking).ProvisionWizard(3582 → wizard/steps/),app/logs/page.tsx(2136 → _sections/ + _state.ts), and 14 more. No frontend file > 499 LOC.--cov-fail-under78 → 85 (actual 85.05 %); frontendcoverage.thresholds.lines44 → 58 (actual 61.66 %).ignore_errorslist 36 → 0; per-module strict block grew 8 → 19 modules.sql_validatorrejects NUL bytes; ruff T201 (noprint()in production code, with per-file ignores forscripts/*, the provision CLI, andtests/*).escape_sql_literalatread_parquet()+glob()sites), accessibility, stable React keys, request timeouts (d237e9f); VCL macro heredoc support (298797b).Verification
make ciis green on the current tip (dd17b90): 3972 pytest, 444 vitest, 9 VCL test all pass; lint + format + mypy + osv + secret-scan all pass.T201 print()violation intests/test_performance_smoke.py(legitimate benchmark prints); fixed indd17b90by addingtests/*to the per-file-ignores list (same treatment asscripts/*).bootstrap,sync-status,log-extents, every/api/admin/*)/share-loginsession): 9/9 pages rendered with real data — Dashboard, Performance, Origin, Security, Charts, Insights, Network, Sessions, Query/adminredirects analyst to/dashboard;/api/admin/*returns 401 for analyst session;/api/sync-statusreturns 401 (admin-only)Breaking changes
/api/dashboard/aggregates,/api/dashboard/raw,/api/dashboard/top_n, etc.) are deleted; callers must use the composite (/api/dashboard/bundle).AnalyticsDeps→RequestContext: alias removed.is_cached / _is_cachedalias onBaseResponseremoved (is_cachedis canonical).v2.0.1hardening):raise_internal(logger, exc, code, status)replacesraise HTTPException(detail={"error": str(e)})at except sites —detailis now{"error": <code>, "error_id": <8-hex>}. Upstream messages are scrubbed; operators correlate via the server log keyed onerror_id.What was NOT done in this PR (deferred)
v2.0.0tag — version is at2.0.0inpyproject.toml/frontend/package.json/backend/main.py/uv.lock; the tag itself is not yet pushed (per "do not merge anything yet"). The user has already drafted av2.0.1commit message (d237e9f) but the actual version strings still read2.0.0— needs a decision: ship as2.0.0(with the hardening folded in) or bump to2.0.1before tagging.pending-docs/best_practices_audit.md— left untracked in the working tree (your WIP audit doc, not part of this PR).Test plan
make cigreenv2.0.0vsv2.0.1(and whether to bump the version strings)~/restart.sh+ the[prod-verify-paths]memory pattern🤖 Generated with Claude Code