From 63e7d157ce4660afca186170ec4967b8b4016090 Mon Sep 17 00:00:00 2001
From: Drew Michael
Date: Sat, 13 Jun 2026 09:48:25 -0500
Subject: [PATCH 001/112] =?UTF-8?q?v2.0.0:=20cleanup=20release=20=E2=80=94?=
 =?UTF-8?q?=20architecture,=20telemetry,=20tenancy,=20gates?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Squash of the refactor/cleanup branch: architectural cleanup release.
Tag this commit as v2.0.0.

== Architecture ==

Carved every backend file > 1500 LOC into a per-concern package; all
packages preserve the pre-split public surface via __init__.py
re-exports so import paths stay stable.

* core/iceberg.py (4232) -> iceberg/{view, catalog, warehouse,
  manifest, fs, _core, buffer, ddl, snapshot_cache, dedup, ...}.
  Custom FosFsspecFileIO + CachedFosS3FileSystem subclasses retire
  5 of 6 historical s3fs monkeypatches.
* scheduler.py (2843) -> cron/{scheduler, decorators, jobs/{sync,
  commit, compaction, optimize, expire, metadata, gap_heal,
  rollup_compact_daily}}. Phase 6 picks separate-pool isolation based
  on Phase 1 thread-wait telemetry.
* core/metadata_db.py (3168) -> core/metadata/{base, alerts, views,
  ingest_log, cron_log, asn_cache, usage_log, reconciliation, state};
  metadata_db.py is a backward-compat shim.
* utils/tunnel.py (1022) -> tunnel/{manager, session, rate_limiter,
  state, fingerprint}. SSH-to-localhost.run path DELETED (~400 lines).
  Direct-mode only.
* core/share_db.py (1312) -> share_db/{connection, schema, invites,
  sessions, audit, passcode, tos, settings, validation}. argon2id
  replaces scrypt for new passcodes; scrypt verify branch stays for
  transparent rehash-on-login.
* routers/admin.py (1650) -> admin/{pop_locations, ingest, trees,
  downloads, sync_status, compaction, health, log_accounting, iceberg,
  bot_sources} + _helpers / _dir_size / _router. admin_usage sidecar
  still attaches to the shared router.
* core/rollups.py (2045) -> rollups/{_common, time_series, sessions,
  hour_bundles, day_bundles, recompute, wellknown_bots}. 41 re-exports.

Other architecture changes:

* RequestContext replaces AnalyticsDeps; tenancy enforced at context
  construction; routes never parse service_id from a path param.
* Composite endpoints hard cutover: dashboard/bundle + security/bundle
  + network/bundle ship with the frontend swap. Granular per-card
  endpoints deleted. _meta_con parallel path dropped.
  is_cached / _is_cached alias collapsed.
  AnalyticsDeps = RequestContext shim removed.
* Top-5 backend files now <= 1461 LOC; no backend file > 1500.
* Top frontend files all < 500 LOC (ProvisionWizard 3582 ->
  wizard/steps/*; app/logs/page.tsx 2136 -> _sections/* + _state.ts).

== Telemetry, dependencies ==

* OpenTelemetry (api/sdk + fastapi/botocore/aiohttp instrumentors)
  replaces the four fragmented custom telemetry surfaces. Console
  exporter ships by default.
* structlog wires trace_id + span_id into structured log output.
* process_context_scope + _ACTIVE_CONTEXTS mirror KEPT at
  backend/utils/telemetry.py. OTel context propagation uses Python
  ContextVars under the hood and inherits the same cross-thread
  limitation (fsspec iothread, pyiceberg ThreadPoolExecutor) the
  manual mirror was built to solve.
* aiodns + asyncio.gather + bulk-transaction sqlite writes in
  utils/rdns_cache.py replace the serial-blocking
  socket.gethostbyaddr loop.
* tenacity decorator retry for Fastly/NGWAF/SQLite-WAL-busy paths.
* pydantic-settings centralises env-var reads + boot validation.
* cachetools replaces in-process bounded_cache / rdns_cache /
  ngwaf_bot_cache.
* Structured .tf.json generation replaces f-string HCL + the
  _hcl_escape regex; eliminates a custom-HCL injection vector.
* orjson via FastAPI ORJSONResponse for composite payloads.
* rich + typer for the provision CLI.
* httpx everywhere except telemetry_proxy.py (aiohttp stays for the
  proxy server role).
* nuqs as the URL state source on the frontend.

== Trust topology, security hardening ==

* Middleware order asserted at boot AND in tests; one-line INVARIANT
  markers replace the paragraph-long comments in main.py.
* @pytest.mark.security_regression marker + monotonic-count CI gate
  (floor: 24).
* Trust-topology snapshot tests pin Caddy @from_fastly matcher, XFF
  forwarding, /share-login rate-limit, and the backend
  --forwarded-allow-ips=127.0.0.1 flags.
* sql_validator rejects NUL bytes before any cost.
* ruff T201 (print-detection) lint rule enforced in production code.

== Frontend ==

* RSC/CSR boundary in app/_routing.md. Hidden-Plotly +
  hidden-MapLibre + setTimeout warm-up hacks dropped; replaced with
  modulepreload + the styledata-event swap pattern.
* Live Query Monitor: live-first sort, peak-memory column, keyboard
  shortcuts, URL-persisted filters, per-run inline expand for xN
  cron-grouped rows, >=30s stuck-query pulse, copy-SQL.
* Operations Overview cards on the admin landing page surface ingest
  gap + live query activity + slow-query count.

== Quality gates ==

* Backend coverage gate --cov-fail-under 78 -> 85 (actual 85.05 %).
* Frontend coverage gate coverage.thresholds.lines 44 -> 58
  (actual 61.66 %).
* tool.mypy.overrides ignore_errors list: 36 modules -> 0.
* mypy per-module strict block: 19 modules opted in
  (disallow_untyped_defs + disallow_incomplete_defs +
  check_untyped_defs + warn_return_any + warn_unused_ignores).
* Load-harness CI step: scripts/emit_perf_latest.py runs a 100K-row
  synthetic DuckDB workload; scripts/perf_gate.sh fails on >50%
  regression vs tests/perf/baseline.json.

== Operations, portability ==

* VM-agnostic deploy runbooks at docs/deploy/{aws_ec2, azure_vm, gce,
  generic_linux}.md. Storage stays Fastly Object Storage
  (S3-compatible).
* scripts/refresh_fastly_cidrs.py pulls api.fastly.com/public-ip-list
  and rewrites the Caddy @from_fastly block.
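
Illustration only, not part of the change set: the load-harness gate
listed under Quality gates fails the build when a metric regresses by
more than 50% against the baseline. The real gate is the shell script
scripts/perf_gate.sh, which is not reproduced here. The sketch below
restates the rule in Python and assumes that both JSON files map a
metric name to a duration in milliseconds; that schema and the
function names are hypothetical.

```python
import json
from pathlib import Path

# Assumption: both files look like {"metric_name": milliseconds, ...}
# and lower is better. The actual baseline.json schema may differ.
MAX_REGRESSION = 0.50  # fail when a metric is more than 50% slower


def find_regressions(
    baseline: dict[str, float],
    latest: dict[str, float],
    max_regression: float = MAX_REGRESSION,
) -> dict[str, float]:
    """Return {metric: fractional_slowdown} for metrics over the limit."""
    regressions: dict[str, float] = {}
    for name, base_ms in baseline.items():
        if name not in latest or base_ms <= 0:
            continue  # nothing meaningful to compare against
        slowdown = (latest[name] - base_ms) / base_ms
        if slowdown > max_regression:
            regressions[name] = slowdown
    return regressions


def gate(baseline_path: Path, latest_path: Path) -> int:
    """Return a process exit code: 0 when the gate passes, 1 otherwise."""
    baseline = json.loads(baseline_path.read_text())
    latest = json.loads(latest_path.read_text())
    bad = find_regressions(baseline, latest)
    for name, slowdown in sorted(bad.items()):
        print(f"REGRESSION {name}: +{slowdown:.0%} vs baseline")
    return 1 if bad else 0
```

The comparison is strict, so a metric that is exactly 50% slower
still passes; only a slowdown above the limit fails the gate.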
== Version ==

pyproject.toml + frontend/package.json + backend/main.py FastAPI app +
uv.lock all bumped to 2.0.0. CHANGELOG.md gains a [2.0.0] entry.
README.md updated to drop the removed SSH-tunnel sharing mode.
AGENTS.md updated to reflect the post-split admin/ + rollups/ package
paths.

== Breaking ==

* Composite-endpoint cutover: granular per-card endpoints are deleted.
  Callers use the composite (/api/dashboard/bundle, etc.).
* AnalyticsDeps alias for RequestContext is removed.
* is_cached / _is_cached alias on BaseResponse is removed.
* SSH-to-localhost.run analyst sharing is removed; production has
  always been direct-mode against the Fastly+Caddy public URL.

== Post-squash follow-ups folded into v2.0.0 ==

The original v2.0.0 squash (6e29655) was followed by 9 additional
commits during the release-stabilization window; all of them are
folded into this single release commit. Highlights:

* SQL hardening, accessibility pass, stable React keys, request
  timeouts (originally tagged v2.0.1 in-flight, then folded into
  v2.0.0 per CHANGELOG decision).
* `raise_internal` upgraded to ``-> NoReturn`` so callers no longer
  need a redundant return; additional exception-leak sites scrubbed to
  use the generic-code+error_id contract.
* Security audit (13 findings reviewed, 12 applied, 1 follow-up
  documented):
  - 003 telemetry middleware moved INSIDE RemoteAccessMiddleware so
    analyst attribution sees a populated session.
  - 004 SQL identifier escaping (``"`` → ``""``) in
    optional_col/create_filtered_temp_table; ``safe_fields`` now
    passed end-to-end into execute_top_n_rollups.
  - 005 tenant-scoped scoring matrix path (matrix_.json) on both
    write AND read paths; teardown wipes the scoped file.
  - 006 ``Fastly-FF`` auth bypass closed via shield-auth secret
    (X-Edge-CDN-Auth) stamped on bereqs in miss_pass; vcl_recv now
    gates auth/penaltybox/Client-IP on the unspoofable marker instead
    of fastly.ff.visits_this_service.
  - 007 analyst credential rotation now requires the target
    service_id to be in the analyst's allowed set.
  - 008 VCL macro validator rejects ``;`` plus unescaped ``{``/``}``
    while legitimate ``\{...\}`` heredocs (e.g. strftime patterns)
    pass.
  - 009 Compute@Edge scorer expires cookies BEFORE scoring instead of
    at re-issue time, closing the expired-cookie replay vector.
  - 010 url/ua/referer field limits ``int()``-cast before f-string
    interpolation into VCL.
  - 011 ``check-yaml --unsafe`` removed from pre-commit.
  - 012 Python normalizer no longer pre-replaces ``%3F`` → ``?``; the
    encoded char survives so downstream traversal payloads don't get
    hidden behind a truncated prefix.
  - 013 Rust normalizer added percent_decode for parity with Python
    (``/%61dmin`` → ``/admin``, ``/a/%2e%2e/b`` → ``/b``).
  - 014 SQL validator rejects NUL bytes (``\x00``) before regex /
    parser sees them.
  - 015 CSV usage-log export escapes spreadsheet-formula prefixes
    (``=``, ``+``, ``-``, ``@``).
  - 018 ``service_id`` required at all five view/alert
    fetch/toggle/delete entry points; the O(N) cross-tenant scan
    fallback (which the auditor flagged in views.py + which existed
    structurally identically in alerts.py) is gone.
  - 006 follow-up regression coverage: 4 ``security_regression``
    tests in tests/utils/test_fastly_utils.py pin the shield-auth
    invariants.
* Operational:
  - Backend container drops privileges to UID/GID 1000 via
    ``USER app``. The Dockerfile uses ``--create-home`` so DuckDB's
    extension cache (``INSTALL httpfs;``) has a writable HOME.
  - ``caddy/Dockerfile`` re-creates the ``caddy`` user (the upstream
    ``caddy:2-alpine`` image stopped shipping one) AND re-applies
    CAP_NET_BIND_SERVICE via ``setcap`` after the custom build
    replaces the binary.
  - ``~/restart.sh`` on the GCE VM gains a chown step against
    ``/mnt/app-data/{configs,data,cache}`` so the host mount UID
    matches the in-container app user. This script lives on the host
    (not in the repo); existing deployers must mirror it manually.
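
Illustration only: audit findings 004 (identifier escaping) and 015
(CSV formula-prefix escaping) in the list above follow well-known
patterns. The sketch below is not the code from this commit. The
function names are hypothetical, and the leading single quote used to
neutralise a formula cell is an assumption based on common
CSV-injection guidance; the commit message does not say which escape
the export uses.

```python
# Prefixes that spreadsheet applications may interpret as a formula
# (the four listed under finding 015).
FORMULA_PREFIXES = ("=", "+", "-", "@")


def quote_identifier(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quote.

    This is the standard SQL rule: inside a "quoted identifier" a
    literal double quote is written as two double quotes.
    """
    return '"' + name.replace('"', '""') + '"'


def escape_csv_cell(value: str) -> str:
    """Neutralise a cell that a spreadsheet could treat as a formula.

    Assumption: a leading single quote is used as the escape, which
    makes spreadsheet applications render the cell as plain text.
    """
    if value.startswith(FORMULA_PREFIXES):
        return "'" + value
    return value


if __name__ == "__main__":
    print(quote_identifier('weird"col'))
    print(escape_csv_cell("=HYPERLINK(...)"))
```

Both helpers are pure string transforms, so they can be applied at
the last step before the SQL text or the CSV row is emitted.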
== Breaking (additions to the original v2.0.0 set) ==

* ``service_id`` is now required at ``DELETE /api/views/{id}``,
  ``DELETE /api/alerts/{id}``, and the alerts toggle endpoint. Callers
  that previously relied on the cross-tenant fallback now receive
  ``400 {"error":"service_id_required"}``.
* The backend container runs as a non-root user (UID/GID 1000).
  Existing deployments must ``chown -R 1000:1000 /mnt/app-data/*``
  (or whichever host path is bind-mounted to
  ``/app/{configs,data,cache}``). On the maintained GCE deployment
  this is now baked into ``~/restart.sh``.
* Custom-field VCL macros containing ``;`` or unescaped ``{``/``}``
  are rejected at validation time (audit finding 008). A previously
  misconfigured custom field that linted clean before will fail now;
  the rejection points at the exact characters.

== Migration notes ==

* On upgrade: ``chown -R 1000:1000`` the bind-mounted host directories
  before the first restart, or the backend container fails to boot
  with ``PermissionError [Errno 13]`` on the first config read.
* If you use saved-view or alert DELETE endpoints in automation,
  thread the ``service_id`` query param through every call.
* If your fork carries custom-field VCL expressions, re-run the
  validator after upgrade; ``;``/``{``/``}`` need to come out (or use
  Fastly's ``\{``/``\}`` heredoc escape if literal braces are
  required, e.g. inside a ``strftime(\{"format"\}, ...)`` call).
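
Illustration only: the custom-field rule described above (reject
``;`` and unescaped ``{``/``}``, accept ``\{``/``\}``) can be
approximated with a single left-to-right scan. This is a sketch
under stated assumptions, not the validator shipped in
backend/utils/vcl_validator.py; the function name is hypothetical and
it does not treat ``\\`` (an escaped backslash) specially.

```python
def find_forbidden_chars(expr: str) -> list[tuple[int, str]]:
    """Return (index, char) for every ';' or unescaped brace in expr.

    A backslash escapes only a following '{' or '}'. A ';' is always
    reported, even when preceded by a backslash.
    """
    hits: list[tuple[int, str]] = []
    i = 0
    while i < len(expr):
        ch = expr[i]
        # Skip an escaped brace pair such as \{ or \} in one step.
        if ch == "\\" and expr[i + 1 : i + 2] in ("{", "}"):
            i += 2
            continue
        if ch in ";{}":
            hits.append((i, ch))
        i += 1
    return hits


if __name__ == "__main__":
    for candidate in (r'strftime(\{"%Y-%m-%d"\}, now)', "req.url; set x"):
        problems = find_forbidden_chars(candidate)
        verdict = "ok" if not problems else f"rejected at {problems}"
        print(f"{candidate!r}: {verdict}")
```

Returning positions rather than a boolean mirrors the note that the
rejection points at the exact characters.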
Co-Authored-By: Claude Opus 4.7 (1M context) --- .check_router_core_floor | 1 + .env.example | 7 + .github/workflows/ci.yml | 62 +- .github/workflows/cidr-refresh.yml | 53 + .gitignore | 5 + .pre-commit-config.yaml | 47 +- AGENTS.md | 148 +- CHANGELOG.md | 244 + CONTRIBUTING.md | 9 + Caddyfile | 22 +- MONKEYPATCHES.md | 37 +- Makefile | 40 +- README.md | 8 +- backend/Dockerfile | 39 +- backend/config.py | 24 +- backend/core/_duckdb_status.py | 1155 +++++ backend/core/_log_fields_data.py | 1304 +++++ backend/core/data_migrations.py | 165 + backend/core/duckdb.py | 1093 +---- backend/core/duckdb_pool.py | 230 +- backend/core/fastly/utils.py | 60 +- backend/core/field_registry.py | 572 +++ backend/core/iceberg.py | 4232 ----------------- backend/core/iceberg/__init__.py | 173 + backend/core/iceberg/_core.py | 1123 +++++ backend/core/iceberg/buffer.py | 1135 +++++ backend/core/iceberg/fs.py | 506 ++ backend/core/iceberg/manifest.py | 453 ++ backend/core/iceberg/sync.py | 509 ++ backend/core/iceberg/view.py | 1127 +++++ backend/core/ingest.py | 64 +- backend/core/local_compaction.py | 87 +- backend/core/log_fields.py | 1298 +---- backend/core/metadata/__init__.py | 287 ++ backend/core/metadata/alerts.py | 142 + backend/core/metadata/asn_cache.py | 49 + backend/core/metadata/base.py | 615 +++ backend/core/metadata/cron_log.py | 483 ++ backend/core/metadata/ingest_log.py | 855 ++++ backend/core/metadata/reconciliation.py | 345 ++ backend/core/metadata/slow_queries.py | 168 + backend/core/metadata/state.py | 226 + backend/core/metadata/usage_log.py | 688 +++ backend/core/metadata/usage_log_db.py | 407 ++ backend/core/metadata/views.py | 122 + backend/core/metadata_db.py | 3219 +------------ backend/core/query_attribution.py | 245 + backend/core/query_instrumentation.py | 456 ++ backend/core/query_registry.py | 580 +++ backend/core/request_context.py | 185 + backend/core/request_telemetry.py | 323 ++ backend/core/rollups.py | 1036 ---- backend/core/rollups/__init__.py | 145 
+ backend/core/rollups/_common.py | 343 ++ backend/core/rollups/day_bundles.py | 352 ++ backend/core/rollups/hour_bundles.py | 335 ++ backend/core/rollups/recompute.py | 281 ++ backend/core/rollups/sessions.py | 276 ++ backend/core/rollups/time_series.py | 242 + backend/core/rollups/wellknown_bots.py | 314 ++ backend/core/settings.py | 262 + backend/core/share_db.py | 1312 ----- backend/core/share_db/__init__.py | 179 + backend/core/share_db/audit.py | 77 + backend/core/share_db/connection.py | 210 + backend/core/share_db/invites.py | 538 +++ backend/core/share_db/passcode.py | 221 + backend/core/share_db/schema.py | 199 + backend/core/share_db/sessions.py | 73 + backend/core/share_db/settings.py | 42 + backend/core/share_db/tos.py | 33 + backend/core/share_db/validation.py | 183 + backend/core/sqlite_migrations.py | 99 + backend/cron/__init__.py | 9 + backend/cron/decorators.py | 100 + backend/cron/jobs/__init__.py | 9 + backend/cron/jobs/commit.py | 235 + backend/cron/jobs/compaction.py | 196 + backend/cron/jobs/expire.py | 90 + backend/cron/jobs/metadata.py | 751 +++ backend/cron/jobs/optimize.py | 143 + backend/cron/jobs/sync.py | 910 ++++ backend/cron/scheduler.py | 854 ++++ backend/cron_progress.py | 21 +- backend/deps.py | 66 +- backend/main.py | 286 +- backend/models/admin.py | 16 + backend/models/common.py | 28 + backend/models/custom_fields.py | 2 +- backend/models/dashboard.py | 3 + backend/models/lake.py | 8 +- backend/models/performance.py | 1 + backend/models/security.py | 9 + backend/models/services.py | 9 +- backend/provision/cli.py | 199 +- backend/provision/fastly_api.py | 40 +- backend/provision/fos_setup.py | 5 +- backend/provision/orchestrator.py | 10 +- .../provision/session_scoring_orchestrator.py | 2 +- backend/provision/session_scoring_vcl.py | 6 +- backend/provision/utils.py | 72 +- backend/repositories/_base.py | 474 +- backend/repositories/_sql/__init__.py | 19 + backend/repositories/_sql/alerts.py | 102 + 
backend/repositories/_sql/base.py | 227 + backend/repositories/_sql/dashboard.py | 226 + backend/repositories/_sql/insights.py | 717 +++ backend/repositories/_sql/network.py | 279 ++ backend/repositories/_sql/origin.py | 495 ++ backend/repositories/_sql/performance.py | 36 + backend/repositories/_sql/query.py | 95 + backend/repositories/_sql/security.py | 353 ++ backend/repositories/_sql/sessions.py | 164 + backend/repositories/_sql/usage.py | 26 + backend/repositories/alerts.py | 63 +- backend/repositories/dashboard.py | 410 +- backend/repositories/insights/definitions.py | 3 +- backend/repositories/insights/registry.py | 4 +- backend/repositories/insights/repository.py | 8 +- backend/repositories/network.py | 213 +- backend/repositories/origin.py | 312 +- backend/repositories/performance.py | 87 +- backend/repositories/query.py | 42 +- backend/repositories/security.py | 525 +- backend/repositories/session_scoring.py | 186 + backend/repositories/sessions.py | 582 ++- backend/repositories/usage.py | 3 +- backend/repositories/utils/filters.py | 9 +- backend/repositories/views.py | 53 +- backend/routers/admin.py | 1739 ------- backend/routers/admin/__init__.py | 97 + backend/routers/admin/_dir_size.py | 46 + backend/routers/admin/_helpers.py | 169 + backend/routers/admin/_router.py | 13 + backend/routers/admin/bot_sources.py | 32 + backend/routers/admin/compaction.py | 258 + backend/routers/admin/downloads.py | 339 ++ backend/routers/admin/health.py | 121 + backend/routers/admin/iceberg.py | 111 + backend/routers/admin/ingest.py | 80 + backend/routers/admin/log_accounting.py | 348 ++ backend/routers/admin/pop_locations.py | 46 + backend/routers/admin/sync_status.py | 187 + backend/routers/admin/trees.py | 32 + backend/routers/admin_queries.py | 258 + backend/routers/admin_usage.py | 326 ++ backend/routers/alerts.py | 18 +- backend/routers/bootstrap.py | 216 +- backend/routers/dashboard.py | 113 +- backend/routers/insights.py | 8 +- backend/routers/network.py | 18 +- 
backend/routers/origin.py | 58 +- backend/routers/performance.py | 14 +- backend/routers/provision.py | 99 +- backend/routers/query.py | 51 +- backend/routers/security.py | 14 +- backend/routers/services/audit.py | 10 +- backend/routers/services/core.py | 168 +- backend/routers/services/cron.py | 12 +- backend/routers/session_scoring.py | 1377 +----- backend/routers/session_scoring_admin.py | 1230 +++++ backend/routers/sessions.py | 14 +- backend/routers/share_admin.py | 16 +- backend/routers/share_auth.py | 42 +- backend/routers/usage.py | 77 +- backend/routers/views.py | 10 +- backend/scheduler.py | 2908 +---------- backend/scoring/labels.py | 68 +- backend/scoring/normalize.py | 44 +- backend/state_sync.py | 11 +- backend/utils/bot_sources.py | 28 +- backend/utils/iceberg_expr.py | 41 + backend/utils/pop_utils.py | 7 +- backend/utils/rdns_cache.py | 473 +- backend/utils/remote_access.py | 276 +- backend/utils/retry.py | 246 + backend/utils/router_utils.py | 29 + backend/utils/sql_validator.py | 7 +- backend/utils/sqlite_profiler.py | 67 +- backend/utils/structlog_config.py | 102 + backend/utils/telemetry.py | 144 +- backend/utils/telemetry_proxy.py | 45 +- .../utils/telemetry_response_middleware.py | 15 +- backend/utils/terraform_gen.py | 495 +- backend/utils/tunnel.py | 1022 ---- backend/utils/tunnel/__init__.py | 47 + backend/utils/tunnel/fingerprint.py | 31 + backend/utils/tunnel/manager.py | 521 ++ backend/utils/tunnel/rate_limiter.py | 97 + backend/utils/tunnel/session.py | 66 + backend/utils/tunnel/state.py | 102 + backend/utils/vcl_utils.py | 34 +- backend/utils/vcl_validator.py | 3 +- caddy/Dockerfile | 21 + compute/scorer/src/main.rs | 31 +- compute/scorer/src/normalize.rs | 57 +- configs/ssh_known_hosts | 30 - docs/ARCHITECTURE.md | 34 +- docs/adr/01-storage-model.md | 44 + docs/adr/02-request-lifecycle.md | 54 + docs/adr/03-tenancy.md | 54 + docs/adr/04-middleware-order.md | 83 + docs/adr/05-frontend-rendering-boundary.md | 69 + 
docs/adr/06-view-warming.md | 218 + docs/adr/07-feature-budgets.md | 114 + docs/adr/08-observability.md | 120 + docs/adr/09-error-handling.md | 116 + docs/adr/10-schema-evolution.md | 108 + docs/adr/11-secret-rotation.md | 114 + docs/adr/12-api-versioning.md | 129 + docs/adr/13-backup-dr.md | 145 + docs/deploy/README.md | 31 + docs/deploy/aws_ec2.md | 195 + docs/deploy/azure_vm.md | 205 + docs/deploy/gce.md | 193 + docs/deploy/generic_linux.md | 244 + frontend/Dockerfile | 9 +- frontend/__tests__/app/admin.test.tsx | 20 +- frontend/__tests__/app/dashboard.test.tsx | 16 +- frontend/__tests__/app/insights.test.tsx | 6 +- frontend/__tests__/app/query.test.tsx | 28 + .../components/AnalyticsCard.test.tsx | 2 +- .../__tests__/components/AppLayout.test.tsx | 4 + .../__tests__/components/DataTable.test.tsx | 11 +- .../__tests__/components/FilterBar.test.tsx | 6 + .../components/LogSettingsModal.test.tsx | 22 +- .../components/ProvisionWizard.test.tsx | 20 +- .../ProvisionWizard/wizard-api.test.ts | 436 ++ .../wizard-config-helpers.test.ts | 285 ++ .../ProvisionWizard/wizard-deploy.test.ts | 440 ++ .../__tests__/hooks/useFilterUrlSync.test.ts | 103 + .../__tests__/hooks/useFilteredActive.test.ts | 295 ++ .../hooks/useKeyboardShortcuts.test.ts | 118 + .../__tests__/hooks/useReportConfig.test.ts | 45 +- .../__tests__/hooks/useUrlFilterSync.test.ts | 4 +- .../__tests__/hooks/useUrlServiceSync.test.ts | 80 +- .../__tests__/lib/api/custom-fields.test.ts | 281 ++ frontend/__tests__/lib/date.test.ts | 2 +- frontend/__tests__/lib/toast.test.ts | 176 + .../lib/workers/buildTrafficData.test.ts | 64 + .../__tests__/lib/workers/parseJson.test.ts | 71 + frontend/__tests__/middleware.test.ts | 75 + frontend/__tests__/ssr/bootstrap.test.ts | 144 + frontend/__tests__/stores/filterStore.test.ts | 68 +- frontend/app/_routing.md | 68 + .../app/admin/_sections/BotSourcesPanel.tsx | 247 + .../app/admin/_sections/CredentialsDialog.tsx | 168 + .../app/admin/_sections/DiagnosticsPanel.tsx | 78 + 
.../app/admin/_sections/GlobalSettings.tsx | 291 ++ frontend/app/admin/_sections/NgwafDialog.tsx | 191 + .../admin/_sections/OperationsOverview.tsx | 240 + .../app/admin/_sections/ServicesTable.tsx | 138 + .../admin/_sections/ServicesTableColumns.tsx | 332 ++ frontend/app/admin/_sections/SystemStatus.tsx | 102 + frontend/app/admin/page.tsx | 1358 +----- frontend/app/admin/queries/_helpers.ts | 70 + .../admin/queries/_hooks/useFilteredActive.ts | 251 + .../queries/_hooks/useKeyboardShortcuts.ts | 67 + .../queries/_hooks/useQueryMonitorUrlSync.ts | 102 + .../admin/queries/_sections/ActiveTable.tsx | 73 + .../queries/_sections/CompletedTable.tsx | 63 + .../admin/queries/_sections/DbFilterChips.tsx | 37 + .../admin/queries/_sections/FilterChips.tsx | 33 + .../queries/_sections/PollingIndicator.tsx | 30 + .../queries/_sections/RowDetailDialog.tsx | 233 + .../admin/queries/_sections/ShortcutsHelp.tsx | 50 + .../admin/queries/_sections/SummaryStrip.tsx | 62 + .../admin/queries/_sections/queryColumns.tsx | 387 ++ frontend/app/admin/queries/_types.ts | 103 + frontend/app/admin/queries/page.tsx | 536 +++ .../app/admin/usage-log/_sections/Filters.tsx | 98 + .../admin/usage-log/_sections/UsageChart.tsx | 179 + .../admin/usage-log/_sections/UsageTable.tsx | 96 + .../app/admin/usage-log/_sections/shared.ts | 52 + frontend/app/admin/usage-log/page.tsx | 397 +- frontend/app/alerts/_sections/AlertEditor.tsx | 409 ++ .../app/alerts/_sections/AlertPreview.tsx | 155 + frontend/app/alerts/_sections/AlertsList.tsx | 300 ++ frontend/app/alerts/page.tsx | 783 +-- frontend/app/charts/page.tsx | 11 +- frontend/app/dashboard/_sections/CardGrid.tsx | 223 + frontend/app/dashboard/_sections/GeoMap.tsx | 81 + .../app/dashboard/_sections/TrafficChart.tsx | 239 + .../app/dashboard/_sections/categories.ts | 77 + .../app/dashboard/_sections/chartHelpers.ts | 207 + frontend/app/dashboard/_sections/types.ts | 30 + frontend/app/dashboard/page.tsx | 1151 +---- frontend/app/globals.css | 9 +- 
frontend/app/insights/page.tsx | 12 +- frontend/app/layout.tsx | 103 +- frontend/app/logs/_sections/AuditColumns.tsx | 364 ++ frontend/app/logs/_sections/CronColumns.tsx | 349 ++ .../app/logs/_sections/CronExplanations.ts | 13 + .../app/logs/_sections/CronScheduleBox.tsx | 174 + frontend/app/logs/_sections/CronTab.tsx | 201 + .../logs/_sections/FloatingOperationsDock.tsx | 241 + frontend/app/logs/_sections/IngestionTab.tsx | 71 + .../app/logs/_sections/QuickActionsBar.tsx | 150 + frontend/app/logs/_sections/SSEModal.tsx | 76 + frontend/app/logs/_sections/SchemaTab.tsx | 85 + .../app/logs/_sections/ServiceHistoryTab.tsx | 117 + frontend/app/logs/_state.ts | 490 ++ frontend/app/logs/page.tsx | 2106 +------- frontend/app/network/help-content.tsx | 3 +- frontend/app/network/page.tsx | 46 +- frontend/app/origin/_sections/Aggregates.tsx | 61 + .../app/origin/_sections/LatencyHeatmap.tsx | 209 + frontend/app/origin/_sections/Timeseries.tsx | 140 + frontend/app/origin/page.tsx | 503 +- frontend/app/performance/help-content.tsx | 1 - frontend/app/performance/page.tsx | 70 +- frontend/app/query/_sections/ModeToggle.tsx | 32 + frontend/app/query/_sections/QueryToolbar.tsx | 175 + frontend/app/query/_sections/RawSqlMode.tsx | 28 + frontend/app/query/_sections/ResultsTable.tsx | 76 + .../app/query/_sections/StructuredMode.tsx | 29 + frontend/app/query/_sql_builder.ts | 122 + frontend/app/query/page.tsx | 528 +- .../app/security/_sections/BotsSection.tsx | 450 ++ .../_sections/HeaderAnomaliesSection.tsx | 115 + .../app/security/_sections/NetworkSection.tsx | 144 + .../app/security/_sections/securityInfo.tsx | 169 + frontend/app/security/page.tsx | 648 +-- .../sessions/_sections/ScoringControls.tsx | 101 + .../app/sessions/_sections/SessionDetail.tsx | 293 ++ .../app/sessions/_sections/SessionsTable.tsx | 201 + frontend/app/sessions/page.tsx | 562 +-- frontend/app/share-login/acknowledge/page.tsx | 5 +- frontend/app/share-login/page.tsx | 3 +- frontend/app/usage/page.tsx | 10 +- 
frontend/components/AnalyticsCard.tsx | 12 +- frontend/components/AppLayout.tsx | 142 +- frontend/components/ChartIntervalButtons.tsx | 1 + .../CostCalculator/CostCalculator.tsx | 582 +-- frontend/components/CostCalculator/Inputs.tsx | 93 + .../components/CostCalculator/Pricing.tsx | 90 + .../components/CostCalculator/Results.tsx | 69 + frontend/components/CostCalculator/calc.ts | 288 ++ frontend/components/CostCalculator/parts.tsx | 75 + frontend/components/CronLiveLog.tsx | 41 +- .../CronSettingsModal/CronSettingsModal.tsx | 382 +- .../components/CronSettingsModal/Preview.tsx | 30 + .../components/CronSettingsModal/Schedule.tsx | 243 + .../components/CronSettingsModal/Triggers.tsx | 147 + .../components/CronSettingsModal/constants.ts | 43 + .../CustomFields/CustomFieldDrawer.tsx | 70 +- .../CustomFields/CustomFieldsManager.tsx | 8 +- .../Dashboard/FieldSearchDialog.tsx | 2 +- frontend/components/Dashboard/TopTenTable.tsx | 13 +- frontend/components/DataTable/Body.tsx | 64 + .../components/DataTable/ColumnPicker.tsx | 51 + .../DataTable/ColumnVisibilityDropdown.tsx | 6 +- frontend/components/DataTable/DataTable.tsx | 352 +- frontend/components/DataTable/Header.tsx | 94 + frontend/components/DataTable/Toolbar.tsx | 49 + frontend/components/DebugPanel.tsx | 22 +- frontend/components/DeltaIndicator.tsx | 6 +- .../components/FileBrowser/FileBrowser.tsx | 42 +- .../components/FilterBar/AddFilterDialog.tsx | 4 +- frontend/components/FilterBar/FilterBar.tsx | 104 +- .../components/FilterBar/SaveViewDialog.tsx | 4 +- .../components/FilterBar/ViewSelector.tsx | 11 +- frontend/components/FilterPopover.tsx | 8 +- .../IcebergStatus/IcebergCalendar.tsx | 12 +- .../IcebergStatus/IcebergStatus.tsx | 18 +- .../Insights/ImpossibleDistanceModal.tsx | 8 +- frontend/components/Insights/InsightCard.tsx | 11 +- .../components/Insights/InsightHelpModal.tsx | 566 --- .../Insights/InsightHelpModal/index.tsx | 69 + .../InsightHelpModal/sections/cache.tsx | 61 + 
.../InsightHelpModal/sections/errors.tsx | 60 + .../sections/optimization.tsx | 55 + .../InsightHelpModal/sections/performance.tsx | 105 + .../InsightHelpModal/sections/security.tsx | 219 + .../InsightHelpModal/sections/traffic.tsx | 86 + .../Insights/InsightHelpModal/types.ts | 15 + .../components/Insights/InsightItemRow.tsx | 1 + .../InviteAnalystDialog.tsx | 183 +- .../LogSettingsModal/CustomFields.tsx | 16 + .../LogSettingsModal/FieldGroups.tsx | 357 ++ .../LogSettingsModal/LogSettingsModal.tsx | 519 +- .../components/LogSettingsModal/Preview.tsx | 187 + frontend/components/Map/ChoroplethMap.tsx | 86 +- frontend/components/Map/NetworkMap.tsx | 562 --- .../components/Map/NetworkMap/MapLayer.tsx | 355 ++ .../Map/NetworkMap/OverlayLayer.tsx | 69 + .../components/Map/NetworkMap/controls.tsx | 158 + frontend/components/Map/NetworkMap/index.tsx | 124 + frontend/components/Map/ShieldingMap.tsx | 38 +- .../components/PlotlyChart/ChartA11yTable.tsx | 56 + .../components/PlotlyChart/PlotlyChart.tsx | 73 +- .../__tests__/tracesToTable.test.ts | 98 + .../components/PlotlyChart/tracesToTable.ts | 120 + .../ProvisionWizard/JsonImportSection.tsx | 82 + .../ProvisionWizard/ProvisionWizard.tsx | 3597 +------------- .../ProvisionWizard/WizardFooter.tsx | 239 + .../ProvisionWizard/WizardHeader.tsx | 62 + .../ProvisionWizard/steps/AnalyzeStep.tsx | 203 + .../ProvisionWizard/steps/ConfirmStep.tsx | 137 + .../ProvisionWizard/steps/ExecuteStep.tsx | 347 ++ .../ProvisionWizard/steps/FieldsStep.tsx | 119 + .../ProvisionWizard/steps/JoinStep.tsx | 467 ++ .../ProvisionWizard/steps/ModeStep.tsx | 78 + .../ProvisionWizard/steps/NgwafStep.tsx | 125 + .../ProvisionWizard/steps/ServiceStep.tsx | 91 + .../ProvisionWizard/steps/SettingsStep.tsx | 102 + .../ProvisionWizard/steps/StorageStep.tsx | 422 ++ .../ProvisionWizard/steps/TerraformStep.tsx | 157 + .../ProvisionWizard/steps/TokenStep.tsx | 74 + frontend/components/ProvisionWizard/types.ts | 287 ++ .../ProvisionWizard/useWizardState.ts | 
493 ++ .../components/ProvisionWizard/wizard-api.ts | 229 + .../ProvisionWizard/wizard-config-helpers.ts | 134 + .../ProvisionWizard/wizard-deploy.ts | 381 ++ .../ProvisionWizard/wizard-effects.ts | 247 + frontend/components/QueryProvider.tsx | 21 +- frontend/components/ReportLayout.tsx | 24 +- frontend/components/ReportShell.tsx | 16 +- frontend/components/SSEModal/SSEModal.tsx | 70 +- .../components/SSEModal/SSEProgressView.tsx | 28 +- .../ServiceSwitcher/ServiceSwitcher.tsx | 1 + .../SessionScoring/FlagSessionPopover.tsx | 43 +- .../components/SessionScoring/LabelsTab.tsx | 3 +- .../SessionScoring/MatrixVersionsCard.tsx | 1 + .../SessionScoring/RetrainButton.tsx | 1 + .../SessionScoring/ScoringHealthCard.tsx | 20 +- .../SessionScoring/ThresholdSlider/Matrix.tsx | 83 + .../ThresholdSlider/Preview.tsx | 54 + .../SessionScoring/ThresholdSlider/Slider.tsx | 195 + .../index.tsx} | 269 +- .../SyncFromCloudModal/SyncFromCloudModal.tsx | 10 +- .../SyncStatusBadge/SyncStatusBadge.tsx | 48 +- frontend/components/SystemHealthCard.tsx | 90 + .../TeardownDialog/TeardownDialog.tsx | 4 +- .../TimezoneSwitcher/TimezoneSwitcher.tsx | 2 +- .../share-dashboard/CreateInviteDialog.tsx | 1 + .../share-dashboard/InvitationsPanel.tsx | 2 + frontend/components/ui/dialog.tsx | 42 +- frontend/components/ui/dropdown-menu.tsx | 2 +- frontend/components/ui/page-header.tsx | 12 +- frontend/components/ui/select.tsx | 69 +- frontend/components/ui/stat-card.tsx | 1 + frontend/components/ui/switch.tsx | 10 + frontend/components/ui/tooltip.tsx | 2 +- frontend/eslint.config.mjs | 7 + frontend/hooks/useActiveService.ts | 14 + frontend/hooks/useAnalystHeartbeat.ts | 16 +- frontend/hooks/useBootstrap.ts | 48 +- frontend/hooks/useDashboardBundle.ts | 87 + frontend/hooks/useFilterUrlSync.ts | 140 + frontend/hooks/useIsDataReady.ts | 30 +- frontend/hooks/useLogFieldsCatalog.ts | 37 +- frontend/hooks/usePageContext.ts | 42 - frontend/hooks/useReportConfig.ts | 29 +- frontend/hooks/useSSE.ts | 23 +- 
frontend/hooks/useShareStatusBanner.tsx | 34 +- frontend/hooks/useSyncStatus.ts | 78 + frontend/hooks/useTimeRange.ts | 21 + frontend/hooks/useTimeseriesToTraces.ts | 4 +- frontend/hooks/useTimezone.ts | 11 + frontend/hooks/useUrlFilterSync.ts | 10 +- frontend/hooks/useUrlServiceSync.ts | 97 +- frontend/lib/_preload-chunks.json | 2 +- frontend/lib/api.ts | 9 + frontend/lib/api/custom-fields.ts | 4 +- frontend/lib/date.ts | 6 +- frontend/lib/fetchWithTimeout.ts | 35 + frontend/lib/format.ts | 6 +- frontend/lib/ssr/bootstrap.ts | 140 + frontend/lib/table-utils.tsx | 40 +- frontend/lib/toast.ts | 116 + frontend/lib/workers/buildTrafficData.ts | 66 + frontend/lib/workers/chartDataWorker.ts | 24 + frontend/lib/workers/json-worker.ts | 8 + frontend/lib/workers/parseJson.ts | 26 + frontend/openapi.json | 2085 +++++--- frontend/package-lock.json | 185 +- frontend/package.json | 5 +- frontend/proxy.ts | 51 +- frontend/public/fastly.svg | 2 +- frontend/public/geo/dma.geojson | 2 +- frontend/public/globe.svg | 2 +- frontend/scripts/build-preload-manifest.mjs | 26 +- frontend/stores/filterStore.ts | 81 +- frontend/types/api.generated.ts | 2329 +++++---- frontend/types/filters.ts | 20 +- local-docs/library_evaluation.md | 82 + local-docs/performance_load_test_plan.md | 931 ++++ local-docs/rollback_runbook.md | 154 + local-docs/run_backup.sh | 41 + local-docs/surprises.md | 149 + mypy-baseline.txt | 0 pyproject.toml | 151 +- run.sh | 10 +- scripts/backup_service_configs.sh | 115 + scripts/baseline_metrics.sh | 102 + scripts/check_no_router_core_imports.sh | 52 + scripts/check_security_regression_count.sh | 45 + scripts/cleanup_orphan_raw_logs.py | 102 + scripts/dev/restore_dev_from_snapshot.sh | 161 + scripts/dev/snapshot_prod_to_dev.sh | 240 + scripts/dev/sync-from-remote.sh | 2 +- scripts/emit_perf_latest.py | 161 + scripts/loadtest_generator.py | 65 +- scripts/perf_gate.sh | 56 + scripts/refresh_fastly_cidrs.py | 170 + tests/conftest.py | 55 +- 
tests/core/test_buffer_commit_idempotent.py | 198 + tests/core/test_custom_field_fuzz.py | 140 + tests/core/test_data_migrations.py | 208 + tests/core/test_duckdb_helpers.py | 9 +- tests/core/test_duckdb_pool.py | 147 +- .../core/test_fastly_edge_writes_backfill.py | 6 +- tests/core/test_field_registry.py | 404 ++ tests/core/test_iceberg.py | 141 +- tests/core/test_iceberg_helpers.py | 81 +- tests/core/test_iceberg_self_heal.py | 157 + tests/core/test_local_compaction.py | 65 +- tests/core/test_metadata_db_crud.py | 17 +- tests/core/test_metadata_db_migrations.py | 4 +- tests/core/test_metadata_db_schema.py | 47 + tests/core/test_metadata_state.py | 163 + tests/core/test_query_registry.py | 465 ++ tests/core/test_reconciliation.py | 245 + tests/core/test_request_context.py | 258 + tests/core/test_request_telemetry.py | 161 + tests/core/test_rollups_day_bundles.py | 277 ++ tests/core/test_rollups_hour_bundling.py | 88 +- tests/core/test_rollups_recompute.py | 654 +++ tests/core/test_rollups_sessions.py | 432 ++ tests/core/test_rollups_time_series.py | 367 ++ tests/core/test_rollups_wellknown_bots.py | 237 + .../test_rollups_wellknown_bots_writer.py | 339 ++ tests/core/test_scheduler_timing.py | 5 +- tests/core/test_settings.py | 158 + tests/core/test_slow_queries_persist.py | 191 + tests/core/test_vcl_semantics.py | 47 +- tests/cron/test_compaction_jobs.py | 221 + tests/fixtures/fastly_stubs.vcl | 40 + tests/perf/__init__.py | 0 tests/perf/baseline.json | 17 + tests/perf/latest.json | 9 + tests/remote_access/test_middleware.py | 245 +- tests/remote_access/test_share_auth_routes.py | 7 +- tests/remote_access/test_share_db.py | 133 +- tests/remote_access/test_tunnel.py | 78 +- tests/repositories/_sql/__init__.py | 0 tests/repositories/_sql/test_alerts.py | 75 + tests/repositories/_sql/test_base.py | 253 + tests/repositories/_sql/test_dashboard.py | 283 ++ tests/repositories/_sql/test_insights.py | 405 ++ tests/repositories/_sql/test_network.py | 303 ++ 
tests/repositories/_sql/test_origin.py | 347 ++ tests/repositories/_sql/test_performance.py | 32 + tests/repositories/_sql/test_query.py | 92 + tests/repositories/_sql/test_security.py | 233 + tests/repositories/_sql/test_sessions.py | 188 + tests/repositories/_sql/test_usage.py | 22 + tests/repositories/test_alerts.py | 70 +- tests/repositories/test_base.py | 340 ++ tests/repositories/test_dashboard.py | 14 +- .../repositories/test_session_scoring_repo.py | 349 ++ tests/repositories/test_sessions.py | 93 +- tests/repositories/test_time_series_rollup.py | 205 + .../repositories/test_usage_storage_stats.py | 40 +- tests/repositories/test_views.py | 67 +- .../services/test_core_get_endpoints.py | 6 +- tests/routers/services/test_cron_router.py | 56 + tests/routers/test_admin_compaction.py | 347 ++ tests/routers/test_admin_get_endpoints.py | 67 + tests/routers/test_admin_health_snapshot.py | 216 + tests/routers/test_admin_log_accounting.py | 8 +- .../routers/test_admin_mutation_endpoints.py | 36 +- tests/routers/test_admin_queries.py | 252 + tests/routers/test_bootstrap.py | 6 +- tests/routers/test_cross_tenant_scope.py | 6 + tests/routers/test_dashboard_router.py | 194 + tests/routers/test_endpoints.py | 19 +- tests/routers/test_provision.py | 58 +- tests/routers/test_provision_teardown_auth.py | 13 +- tests/routers/test_rbac_audit_fixes.py | 309 ++ tests/routers/test_session_scoring_router.py | 86 +- tests/routers/test_usage_endpoints.py | 60 +- tests/routers/test_usage_router.py | 24 +- tests/scoring/test_normalize.py | 45 +- tests/test_deps.py | 48 +- tests/test_main.py | 35 + tests/test_no_trace_leakage_sweep.py | 4 + tests/test_provision_cli_handlers.py | 6 +- tests/test_proxy_headers_regression.py | 8 +- tests/test_scheduler.py | 187 +- tests/test_trust_topology.py | 171 + tests/utils/polling.py | 36 + tests/utils/test_fastly_utils.py | 89 + tests/utils/test_rdns_async.py | 166 + tests/utils/test_rdns_cache.py | 48 +- tests/utils/test_refresh_fastly_cidrs.py | 
134 + tests/utils/test_retry.py | 210 + tests/utils/test_router_utils.py | 57 + tests/utils/test_sqlite_profiler.py | 96 + tests/utils/test_structlog_config.py | 79 + tests/utils/test_telemetry.py | 20 +- tests/utils/test_telemetry_proxy.py | 8 +- tests/utils/test_telemetry_proxy_phase2.py | 8 +- tests/utils/test_telemetry_proxy_phase3b.py | 6 +- .../test_telemetry_response_middleware.py | 6 +- tests/utils/test_telemetry_unit.py | 202 + tests/utils/test_terraform_gen.py | 205 +- tests/utils/test_tunnel_state.py | 125 + tests/utils/test_usage_logger.py | 34 +- uv.lock | 618 ++- 611 files changed, 83255 insertions(+), 37496 deletions(-) create mode 100644 .check_router_core_floor create mode 100644 .github/workflows/cidr-refresh.yml create mode 100644 backend/core/_duckdb_status.py create mode 100644 backend/core/_log_fields_data.py create mode 100644 backend/core/field_registry.py delete mode 100644 backend/core/iceberg.py create mode 100644 backend/core/iceberg/__init__.py create mode 100644 backend/core/iceberg/_core.py create mode 100644 backend/core/iceberg/buffer.py create mode 100644 backend/core/iceberg/fs.py create mode 100644 backend/core/iceberg/manifest.py create mode 100644 backend/core/iceberg/sync.py create mode 100644 backend/core/iceberg/view.py create mode 100644 backend/core/metadata/__init__.py create mode 100644 backend/core/metadata/alerts.py create mode 100644 backend/core/metadata/asn_cache.py create mode 100644 backend/core/metadata/base.py create mode 100644 backend/core/metadata/cron_log.py create mode 100644 backend/core/metadata/ingest_log.py create mode 100644 backend/core/metadata/reconciliation.py create mode 100644 backend/core/metadata/slow_queries.py create mode 100644 backend/core/metadata/state.py create mode 100644 backend/core/metadata/usage_log.py create mode 100644 backend/core/metadata/usage_log_db.py create mode 100644 backend/core/metadata/views.py create mode 100644 backend/core/query_attribution.py create mode 100644 
backend/core/query_instrumentation.py create mode 100644 backend/core/query_registry.py create mode 100644 backend/core/request_context.py create mode 100644 backend/core/request_telemetry.py delete mode 100644 backend/core/rollups.py create mode 100644 backend/core/rollups/__init__.py create mode 100644 backend/core/rollups/_common.py create mode 100644 backend/core/rollups/day_bundles.py create mode 100644 backend/core/rollups/hour_bundles.py create mode 100644 backend/core/rollups/recompute.py create mode 100644 backend/core/rollups/sessions.py create mode 100644 backend/core/rollups/time_series.py create mode 100644 backend/core/rollups/wellknown_bots.py create mode 100644 backend/core/settings.py delete mode 100644 backend/core/share_db.py create mode 100644 backend/core/share_db/__init__.py create mode 100644 backend/core/share_db/audit.py create mode 100644 backend/core/share_db/connection.py create mode 100644 backend/core/share_db/invites.py create mode 100644 backend/core/share_db/passcode.py create mode 100644 backend/core/share_db/schema.py create mode 100644 backend/core/share_db/sessions.py create mode 100644 backend/core/share_db/settings.py create mode 100644 backend/core/share_db/tos.py create mode 100644 backend/core/share_db/validation.py create mode 100644 backend/cron/__init__.py create mode 100644 backend/cron/decorators.py create mode 100644 backend/cron/jobs/__init__.py create mode 100644 backend/cron/jobs/commit.py create mode 100644 backend/cron/jobs/compaction.py create mode 100644 backend/cron/jobs/expire.py create mode 100644 backend/cron/jobs/metadata.py create mode 100644 backend/cron/jobs/optimize.py create mode 100644 backend/cron/jobs/sync.py create mode 100644 backend/cron/scheduler.py create mode 100644 backend/repositories/_sql/__init__.py create mode 100644 backend/repositories/_sql/alerts.py create mode 100644 backend/repositories/_sql/base.py create mode 100644 backend/repositories/_sql/dashboard.py create mode 100644 
backend/repositories/_sql/insights.py create mode 100644 backend/repositories/_sql/network.py create mode 100644 backend/repositories/_sql/origin.py create mode 100644 backend/repositories/_sql/performance.py create mode 100644 backend/repositories/_sql/query.py create mode 100644 backend/repositories/_sql/security.py create mode 100644 backend/repositories/_sql/sessions.py create mode 100644 backend/repositories/_sql/usage.py create mode 100644 backend/repositories/session_scoring.py delete mode 100644 backend/routers/admin.py create mode 100644 backend/routers/admin/__init__.py create mode 100644 backend/routers/admin/_dir_size.py create mode 100644 backend/routers/admin/_helpers.py create mode 100644 backend/routers/admin/_router.py create mode 100644 backend/routers/admin/bot_sources.py create mode 100644 backend/routers/admin/compaction.py create mode 100644 backend/routers/admin/downloads.py create mode 100644 backend/routers/admin/health.py create mode 100644 backend/routers/admin/iceberg.py create mode 100644 backend/routers/admin/ingest.py create mode 100644 backend/routers/admin/log_accounting.py create mode 100644 backend/routers/admin/pop_locations.py create mode 100644 backend/routers/admin/sync_status.py create mode 100644 backend/routers/admin/trees.py create mode 100644 backend/routers/admin_queries.py create mode 100644 backend/routers/admin_usage.py create mode 100644 backend/routers/session_scoring_admin.py create mode 100644 backend/utils/iceberg_expr.py create mode 100644 backend/utils/retry.py create mode 100644 backend/utils/structlog_config.py delete mode 100644 backend/utils/tunnel.py create mode 100644 backend/utils/tunnel/__init__.py create mode 100644 backend/utils/tunnel/fingerprint.py create mode 100644 backend/utils/tunnel/manager.py create mode 100644 backend/utils/tunnel/rate_limiter.py create mode 100644 backend/utils/tunnel/session.py create mode 100644 backend/utils/tunnel/state.py delete mode 100644 configs/ssh_known_hosts 
create mode 100644 docs/adr/01-storage-model.md create mode 100644 docs/adr/02-request-lifecycle.md create mode 100644 docs/adr/03-tenancy.md create mode 100644 docs/adr/04-middleware-order.md create mode 100644 docs/adr/05-frontend-rendering-boundary.md create mode 100644 docs/adr/06-view-warming.md create mode 100644 docs/adr/07-feature-budgets.md create mode 100644 docs/adr/08-observability.md create mode 100644 docs/adr/09-error-handling.md create mode 100644 docs/adr/10-schema-evolution.md create mode 100644 docs/adr/11-secret-rotation.md create mode 100644 docs/adr/12-api-versioning.md create mode 100644 docs/adr/13-backup-dr.md create mode 100644 docs/deploy/README.md create mode 100644 docs/deploy/aws_ec2.md create mode 100644 docs/deploy/azure_vm.md create mode 100644 docs/deploy/gce.md create mode 100644 docs/deploy/generic_linux.md create mode 100644 frontend/__tests__/components/ProvisionWizard/wizard-api.test.ts create mode 100644 frontend/__tests__/components/ProvisionWizard/wizard-config-helpers.test.ts create mode 100644 frontend/__tests__/components/ProvisionWizard/wizard-deploy.test.ts create mode 100644 frontend/__tests__/hooks/useFilterUrlSync.test.ts create mode 100644 frontend/__tests__/hooks/useFilteredActive.test.ts create mode 100644 frontend/__tests__/hooks/useKeyboardShortcuts.test.ts create mode 100644 frontend/__tests__/lib/api/custom-fields.test.ts create mode 100644 frontend/__tests__/lib/toast.test.ts create mode 100644 frontend/__tests__/lib/workers/buildTrafficData.test.ts create mode 100644 frontend/__tests__/lib/workers/parseJson.test.ts create mode 100644 frontend/__tests__/ssr/bootstrap.test.ts create mode 100644 frontend/app/_routing.md create mode 100644 frontend/app/admin/_sections/BotSourcesPanel.tsx create mode 100644 frontend/app/admin/_sections/CredentialsDialog.tsx create mode 100644 frontend/app/admin/_sections/DiagnosticsPanel.tsx create mode 100644 frontend/app/admin/_sections/GlobalSettings.tsx create mode 100644 
frontend/app/admin/_sections/NgwafDialog.tsx create mode 100644 frontend/app/admin/_sections/OperationsOverview.tsx create mode 100644 frontend/app/admin/_sections/ServicesTable.tsx create mode 100644 frontend/app/admin/_sections/ServicesTableColumns.tsx create mode 100644 frontend/app/admin/_sections/SystemStatus.tsx create mode 100644 frontend/app/admin/queries/_helpers.ts create mode 100644 frontend/app/admin/queries/_hooks/useFilteredActive.ts create mode 100644 frontend/app/admin/queries/_hooks/useKeyboardShortcuts.ts create mode 100644 frontend/app/admin/queries/_hooks/useQueryMonitorUrlSync.ts create mode 100644 frontend/app/admin/queries/_sections/ActiveTable.tsx create mode 100644 frontend/app/admin/queries/_sections/CompletedTable.tsx create mode 100644 frontend/app/admin/queries/_sections/DbFilterChips.tsx create mode 100644 frontend/app/admin/queries/_sections/FilterChips.tsx create mode 100644 frontend/app/admin/queries/_sections/PollingIndicator.tsx create mode 100644 frontend/app/admin/queries/_sections/RowDetailDialog.tsx create mode 100644 frontend/app/admin/queries/_sections/ShortcutsHelp.tsx create mode 100644 frontend/app/admin/queries/_sections/SummaryStrip.tsx create mode 100644 frontend/app/admin/queries/_sections/queryColumns.tsx create mode 100644 frontend/app/admin/queries/_types.ts create mode 100644 frontend/app/admin/queries/page.tsx create mode 100644 frontend/app/admin/usage-log/_sections/Filters.tsx create mode 100644 frontend/app/admin/usage-log/_sections/UsageChart.tsx create mode 100644 frontend/app/admin/usage-log/_sections/UsageTable.tsx create mode 100644 frontend/app/admin/usage-log/_sections/shared.ts create mode 100644 frontend/app/alerts/_sections/AlertEditor.tsx create mode 100644 frontend/app/alerts/_sections/AlertPreview.tsx create mode 100644 frontend/app/alerts/_sections/AlertsList.tsx create mode 100644 frontend/app/dashboard/_sections/CardGrid.tsx create mode 100644 frontend/app/dashboard/_sections/GeoMap.tsx create 
mode 100644 frontend/app/dashboard/_sections/TrafficChart.tsx create mode 100644 frontend/app/dashboard/_sections/categories.ts create mode 100644 frontend/app/dashboard/_sections/chartHelpers.ts create mode 100644 frontend/app/dashboard/_sections/types.ts create mode 100644 frontend/app/logs/_sections/AuditColumns.tsx create mode 100644 frontend/app/logs/_sections/CronColumns.tsx create mode 100644 frontend/app/logs/_sections/CronExplanations.ts create mode 100644 frontend/app/logs/_sections/CronScheduleBox.tsx create mode 100644 frontend/app/logs/_sections/CronTab.tsx create mode 100644 frontend/app/logs/_sections/FloatingOperationsDock.tsx create mode 100644 frontend/app/logs/_sections/IngestionTab.tsx create mode 100644 frontend/app/logs/_sections/QuickActionsBar.tsx create mode 100644 frontend/app/logs/_sections/SSEModal.tsx create mode 100644 frontend/app/logs/_sections/SchemaTab.tsx create mode 100644 frontend/app/logs/_sections/ServiceHistoryTab.tsx create mode 100644 frontend/app/logs/_state.ts create mode 100644 frontend/app/origin/_sections/Aggregates.tsx create mode 100644 frontend/app/origin/_sections/LatencyHeatmap.tsx create mode 100644 frontend/app/origin/_sections/Timeseries.tsx create mode 100644 frontend/app/query/_sections/ModeToggle.tsx create mode 100644 frontend/app/query/_sections/QueryToolbar.tsx create mode 100644 frontend/app/query/_sections/RawSqlMode.tsx create mode 100644 frontend/app/query/_sections/ResultsTable.tsx create mode 100644 frontend/app/query/_sections/StructuredMode.tsx create mode 100644 frontend/app/query/_sql_builder.ts create mode 100644 frontend/app/security/_sections/BotsSection.tsx create mode 100644 frontend/app/security/_sections/HeaderAnomaliesSection.tsx create mode 100644 frontend/app/security/_sections/NetworkSection.tsx create mode 100644 frontend/app/security/_sections/securityInfo.tsx create mode 100644 frontend/app/sessions/_sections/ScoringControls.tsx create mode 100644 
frontend/app/sessions/_sections/SessionDetail.tsx create mode 100644 frontend/app/sessions/_sections/SessionsTable.tsx create mode 100644 frontend/components/CostCalculator/Inputs.tsx create mode 100644 frontend/components/CostCalculator/Pricing.tsx create mode 100644 frontend/components/CostCalculator/Results.tsx create mode 100644 frontend/components/CostCalculator/calc.ts create mode 100644 frontend/components/CostCalculator/parts.tsx create mode 100644 frontend/components/CronSettingsModal/Preview.tsx create mode 100644 frontend/components/CronSettingsModal/Schedule.tsx create mode 100644 frontend/components/CronSettingsModal/Triggers.tsx create mode 100644 frontend/components/CronSettingsModal/constants.ts create mode 100644 frontend/components/DataTable/Body.tsx create mode 100644 frontend/components/DataTable/ColumnPicker.tsx create mode 100644 frontend/components/DataTable/Header.tsx create mode 100644 frontend/components/DataTable/Toolbar.tsx delete mode 100644 frontend/components/Insights/InsightHelpModal.tsx create mode 100644 frontend/components/Insights/InsightHelpModal/index.tsx create mode 100644 frontend/components/Insights/InsightHelpModal/sections/cache.tsx create mode 100644 frontend/components/Insights/InsightHelpModal/sections/errors.tsx create mode 100644 frontend/components/Insights/InsightHelpModal/sections/optimization.tsx create mode 100644 frontend/components/Insights/InsightHelpModal/sections/performance.tsx create mode 100644 frontend/components/Insights/InsightHelpModal/sections/security.tsx create mode 100644 frontend/components/Insights/InsightHelpModal/sections/traffic.tsx create mode 100644 frontend/components/Insights/InsightHelpModal/types.ts create mode 100644 frontend/components/LogSettingsModal/CustomFields.tsx create mode 100644 frontend/components/LogSettingsModal/FieldGroups.tsx create mode 100644 frontend/components/LogSettingsModal/Preview.tsx delete mode 100644 frontend/components/Map/NetworkMap.tsx create mode 100644 
frontend/components/Map/NetworkMap/MapLayer.tsx create mode 100644 frontend/components/Map/NetworkMap/OverlayLayer.tsx create mode 100644 frontend/components/Map/NetworkMap/controls.tsx create mode 100644 frontend/components/Map/NetworkMap/index.tsx create mode 100644 frontend/components/PlotlyChart/ChartA11yTable.tsx create mode 100644 frontend/components/PlotlyChart/__tests__/tracesToTable.test.ts create mode 100644 frontend/components/PlotlyChart/tracesToTable.ts create mode 100644 frontend/components/ProvisionWizard/JsonImportSection.tsx create mode 100644 frontend/components/ProvisionWizard/WizardFooter.tsx create mode 100644 frontend/components/ProvisionWizard/WizardHeader.tsx create mode 100644 frontend/components/ProvisionWizard/steps/AnalyzeStep.tsx create mode 100644 frontend/components/ProvisionWizard/steps/ConfirmStep.tsx create mode 100644 frontend/components/ProvisionWizard/steps/ExecuteStep.tsx create mode 100644 frontend/components/ProvisionWizard/steps/FieldsStep.tsx create mode 100644 frontend/components/ProvisionWizard/steps/JoinStep.tsx create mode 100644 frontend/components/ProvisionWizard/steps/ModeStep.tsx create mode 100644 frontend/components/ProvisionWizard/steps/NgwafStep.tsx create mode 100644 frontend/components/ProvisionWizard/steps/ServiceStep.tsx create mode 100644 frontend/components/ProvisionWizard/steps/SettingsStep.tsx create mode 100644 frontend/components/ProvisionWizard/steps/StorageStep.tsx create mode 100644 frontend/components/ProvisionWizard/steps/TerraformStep.tsx create mode 100644 frontend/components/ProvisionWizard/steps/TokenStep.tsx create mode 100644 frontend/components/ProvisionWizard/types.ts create mode 100644 frontend/components/ProvisionWizard/useWizardState.ts create mode 100644 frontend/components/ProvisionWizard/wizard-api.ts create mode 100644 frontend/components/ProvisionWizard/wizard-config-helpers.ts create mode 100644 frontend/components/ProvisionWizard/wizard-deploy.ts create mode 100644 
frontend/components/ProvisionWizard/wizard-effects.ts create mode 100644 frontend/components/SessionScoring/ThresholdSlider/Matrix.tsx create mode 100644 frontend/components/SessionScoring/ThresholdSlider/Preview.tsx create mode 100644 frontend/components/SessionScoring/ThresholdSlider/Slider.tsx rename frontend/components/SessionScoring/{ThresholdSlider.tsx => ThresholdSlider/index.tsx} (59%) create mode 100644 frontend/hooks/useActiveService.ts create mode 100644 frontend/hooks/useDashboardBundle.ts create mode 100644 frontend/hooks/useFilterUrlSync.ts delete mode 100644 frontend/hooks/usePageContext.ts create mode 100644 frontend/hooks/useSyncStatus.ts create mode 100644 frontend/hooks/useTimeRange.ts create mode 100644 frontend/hooks/useTimezone.ts create mode 100644 frontend/lib/fetchWithTimeout.ts create mode 100644 frontend/lib/ssr/bootstrap.ts create mode 100644 frontend/lib/toast.ts create mode 100644 frontend/lib/workers/buildTrafficData.ts create mode 100644 frontend/lib/workers/chartDataWorker.ts create mode 100644 frontend/lib/workers/json-worker.ts create mode 100644 frontend/lib/workers/parseJson.ts create mode 100644 local-docs/library_evaluation.md create mode 100644 local-docs/performance_load_test_plan.md create mode 100644 local-docs/rollback_runbook.md create mode 100755 local-docs/run_backup.sh create mode 100644 local-docs/surprises.md create mode 100644 mypy-baseline.txt create mode 100755 scripts/backup_service_configs.sh create mode 100755 scripts/baseline_metrics.sh create mode 100755 scripts/check_no_router_core_imports.sh create mode 100755 scripts/check_security_regression_count.sh create mode 100755 scripts/cleanup_orphan_raw_logs.py create mode 100755 scripts/dev/restore_dev_from_snapshot.sh create mode 100755 scripts/dev/snapshot_prod_to_dev.sh create mode 100644 scripts/emit_perf_latest.py create mode 100755 scripts/perf_gate.sh create mode 100644 scripts/refresh_fastly_cidrs.py create mode 100644 
tests/core/test_buffer_commit_idempotent.py create mode 100644 tests/core/test_custom_field_fuzz.py create mode 100644 tests/core/test_data_migrations.py create mode 100644 tests/core/test_field_registry.py create mode 100644 tests/core/test_iceberg_self_heal.py create mode 100644 tests/core/test_metadata_state.py create mode 100644 tests/core/test_query_registry.py create mode 100644 tests/core/test_reconciliation.py create mode 100644 tests/core/test_request_context.py create mode 100644 tests/core/test_request_telemetry.py create mode 100644 tests/core/test_rollups_day_bundles.py create mode 100644 tests/core/test_rollups_recompute.py create mode 100644 tests/core/test_rollups_sessions.py create mode 100644 tests/core/test_rollups_time_series.py create mode 100644 tests/core/test_rollups_wellknown_bots.py create mode 100644 tests/core/test_rollups_wellknown_bots_writer.py create mode 100644 tests/core/test_settings.py create mode 100644 tests/core/test_slow_queries_persist.py create mode 100644 tests/cron/test_compaction_jobs.py create mode 100644 tests/fixtures/fastly_stubs.vcl create mode 100644 tests/perf/__init__.py create mode 100644 tests/perf/baseline.json create mode 100644 tests/perf/latest.json create mode 100644 tests/repositories/_sql/__init__.py create mode 100644 tests/repositories/_sql/test_alerts.py create mode 100644 tests/repositories/_sql/test_base.py create mode 100644 tests/repositories/_sql/test_dashboard.py create mode 100644 tests/repositories/_sql/test_insights.py create mode 100644 tests/repositories/_sql/test_network.py create mode 100644 tests/repositories/_sql/test_origin.py create mode 100644 tests/repositories/_sql/test_performance.py create mode 100644 tests/repositories/_sql/test_query.py create mode 100644 tests/repositories/_sql/test_security.py create mode 100644 tests/repositories/_sql/test_sessions.py create mode 100644 tests/repositories/_sql/test_usage.py create mode 100644 tests/repositories/test_session_scoring_repo.py 
create mode 100644 tests/repositories/test_time_series_rollup.py create mode 100644 tests/routers/test_admin_compaction.py create mode 100644 tests/routers/test_admin_health_snapshot.py create mode 100644 tests/routers/test_admin_queries.py create mode 100644 tests/routers/test_dashboard_router.py create mode 100644 tests/routers/test_rbac_audit_fixes.py create mode 100644 tests/test_trust_topology.py create mode 100644 tests/utils/polling.py create mode 100644 tests/utils/test_rdns_async.py create mode 100644 tests/utils/test_refresh_fastly_cidrs.py create mode 100644 tests/utils/test_retry.py create mode 100644 tests/utils/test_structlog_config.py create mode 100644 tests/utils/test_telemetry_unit.py create mode 100644 tests/utils/test_tunnel_state.py diff --git a/.check_router_core_floor b/.check_router_core_floor new file mode 100644 index 00000000..5bc6609e --- /dev/null +++ b/.check_router_core_floor @@ -0,0 +1 @@ +117 diff --git a/.env.example b/.env.example index 435b605f..9c1afa6d 100644 --- a/.env.example +++ b/.env.example @@ -47,6 +47,13 @@ # backend runs on a different host than the frontend. # NEXT_PUBLIC_API_URL=http://127.0.0.1:8000 +# ── Observability ────────────────────────────────────────────────────────────── +# OpenTelemetry exporter. Default 'none' — no spans/metrics leave the process. +# Set 'console' to dump spans and 60s metric snapshots to stdout (loud; useful +# locally when chasing a perf regression). Don't set 'console' in prod — it +# pollutes log aggregation with ~1 MB/min of JSON. +# OTEL_EXPORTER=console + # ── Docker only ──────────────────────────────────────────────────────────────── # Set automatically by docker-compose; not needed for local dev. # API_PROXY_URL=http://backend:8000 diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 00c0e2ae..981a67bd 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -32,8 +32,15 @@ jobs: - name: Format check (ruff) run: uv run ruff format --check . 
- - name: Type check (mypy) - run: uv run mypy backend/ + - name: Type check (mypy, filtered through mypy-baseline) + # Pre-existing errors accepted via mypy-baseline.txt; the filter + # exits non-zero only on NET-NEW errors. Refresh the baseline after + # a burndown PR with + # uv run mypy backend/ 2>&1 | uv run mypy-baseline sync + # and commit mypy-baseline.txt. Burndown plan + + # bucket scoping live in + # pending-docs/session_2026-06-10_otel_dump_and_log_extents.md. + run: uv run mypy backend/ 2>&1 | uv run mypy-baseline filter - name: Install falco run: | @@ -85,19 +92,50 @@ jobs: env: FALCO_REQUIRED: "1" TERRAFORM_VALIDATE: "1" - # Gate ratcheted as milestones land: + # Gate ratcheted as milestones land (convention: current actual − 2pp): # end Milestone A: 44% (baseline 46%, -2pp buffer) # end Milestone E: 47% (current 49% — keeps the 2pp buffer) # post-Milestone E coverage backfill: 55% (current 59% — 4pp buffer) # confidence-batch (insights+admin+services+dashboard+origin+ # hypothesis+regression+E2E smoke): 78% (current 83% — 5pp buffer) + # post live-query-monitor (2026-06-11): 80% (current 82%) + # post backend coverage waves (reconciliation/compaction/session_scoring + # /data_migrations/tunnel-state/dashboard-router/views/sqlite_profiler): + # 82% (current 83% — 1pp buffer; tight while v2.0 target 85% lands). + # v2.0 final wave (2026-06-12): per-module tests for the post-split + # rollups/ + admin/ packages (rollups/sessions 85, rollups/time_series + # 84, rollups/day_bundles 76, rollups/recompute 96, admin/compaction + # 100, admin/health 100): 85% (current 85% — the v2.0 target hit). # # `-n auto` parallelizes via pytest-xdist (TESTING_PLAN_3 item 21). # Verified safe: per-service SQLite (`{id}.metadata.db`) + per-test # tmp_path give file isolation; autouse `_reset_module_caches` resets # the 8 module-level caches between tests; moto fixtures are per-test. # Local run: 2268 passed in 58s under `-n auto` vs ~3min serial. 
- run: uv run pytest -n auto --cov=backend --cov-report=term --cov-fail-under=78 + run: uv run pytest -n auto --cov=backend --cov-report=term --cov-fail-under=85 + + - name: Security-regression count gate + # v2.0 cleanup Phase 0.8: asserts the + # @pytest.mark.security_regression count never drops below the + # baseline floor (24 — derived from audit-findings/ verified + # fixes). A refactor cannot silently delete coverage of a + # verified fix without surfacing the change. + run: bash scripts/check_security_regression_count.sh + + - name: Emit perf samples (CI-scale synthetic load) + # Produces tests/perf/latest.json from a 100K-row in-memory + # DuckDB dataset (~2 s wall). The gate below compares to + # tests/perf/baseline.json and fails on >regression_pct_threshold% + # over baseline (50 % default; tuned for GH Actions runner + # variance at CI scale). + run: uv run python scripts/emit_perf_latest.py + + - name: Perf gate (load-harness baseline) + # Compares the just-emitted latest.json against baseline.json. + # Production targets (≤2800 / ≤1900 ms) are documented in + # baseline.json's production_targets_comment for traceability + # but enforced by the manual loadtest probe, not this CI gate. 
+ run: bash scripts/perf_gate.sh frontend: name: Frontend (Node) @@ -140,7 +178,17 @@ jobs: run: npx tsc --noEmit - name: Tests (vitest with coverage) - # Gate ratcheted as milestones land: + # Gate ratcheted as milestones land (convention: current actual − 2pp): # end Milestone A: 40% (baseline 42.7%, -2pp buffer) - # end Milestone E: 44% (current 46.55% — keeps the 2pp buffer) - run: npx vitest run --coverage --coverage.thresholds.lines=44 + # end Milestone E: 44% (current 46.55%) + # post live-query-monitor (2026-06-11): 53% (current 55.19%) + # post lib/toast + lib/api/custom-fields + lib/workers/parseJson tests + # (2026-06-12): 55% (current 57.12%) + # post ProvisionWizard/wizard-config-helpers tests + # (2026-06-12): 56% (current 58.42%) + # post ProvisionWizard/wizard-api tests + # (2026-06-12): 57% (current 59.8%) + # post ProvisionWizard/wizard-deploy tests + # (2026-06-12): 58% (current 61.66%) — final v2.0 target hit + # per cleanup_plan §10.14. + run: npx vitest run --coverage --coverage.thresholds.lines=58 diff --git a/.github/workflows/cidr-refresh.yml b/.github/workflows/cidr-refresh.yml new file mode 100644 index 00000000..47909585 --- /dev/null +++ b/.github/workflows/cidr-refresh.yml @@ -0,0 +1,53 @@ +name: Refresh Fastly CIDRs + +# Weekly refresh of the Fastly edge CIDR list in the repo-root Caddyfile. +# The @from_fastly_v4 matcher gates X-Forwarded-For rewriting on Fastly's +# published v4 ranges; a stale list silently classifies traffic from new +# POPs as direct (untrusted) until somebody refreshes it and reloads +# Caddy. The script is well-tested (scripts/refresh_fastly_cidrs.py); +# this workflow just runs it on a cadence and opens a PR if the file +# changed. Off-minute schedule on purpose so the runner pool isn't +# hammered at :00 alongside everybody else's hourly jobs. 
+ +on: + schedule: + - cron: '13 9 * * 1' # Mondays at 09:13 UTC + workflow_dispatch: {} + +permissions: + contents: write + pull-requests: write + +jobs: + refresh: + name: Fetch + open PR on diff + runs-on: forge-amd64-medium + steps: + - uses: actions/checkout@v6 + + - name: Install uv + uses: astral-sh/setup-uv@v7 + with: + enable-cache: true + python-version: "3.13" + + - name: Refresh Caddyfile + # No-op if the published list already matches what's in the + # Caddyfile (script prints "No changes …" and exits 0). Writes + # the updated matcher block otherwise; peter-evans/create-pull- + # request below only opens a PR when the working tree is dirty. + run: uv run python scripts/refresh_fastly_cidrs.py + + - name: Open PR if Caddyfile changed + uses: peter-evans/create-pull-request@v7 + with: + commit-message: 'chore: refresh Fastly edge CIDR list in Caddyfile' + branch: chore/refresh-fastly-cidrs + delete-branch: true + title: 'chore: refresh Fastly edge CIDR list' + body: | + Automated update from `scripts/refresh_fastly_cidrs.py`, triggered by the weekly `cidr-refresh.yml` workflow. + + The `@from_fastly_v4` matcher in [Caddyfile](../blob/main/Caddyfile) gates the `X-Forwarded-For` rewrite on Fastly-published edge ranges. A stale list silently classifies traffic from new POPs as direct (untrusted) until Caddy reloads. + + After merge: run `~/restart.sh caddy` (or equivalent) on the VM to pick up the new ranges. diff --git a/.gitignore b/.gitignore index 33202558..acd23043 100644 --- a/.gitignore +++ b/.gitignore @@ -76,6 +76,11 @@ compute/scorer/pkg/ # split_per_page.py) live here for now; treat the whole tree as throwaway. /scratch/ +# Performance-audit campaign artifacts: HAR captures, per-sample telemetry, +# aggregated p50/p95/p99 summaries, per-page reports + improvement plans. +# Throwaway — regenerable by re-running scratch/perf_audit.mjs. 
+/performance-report/ + # Local-only VS Code config (file-watcher / Pylance excludes for the # regenerating .next + cache trees). Personal to each contributor's editor # setup — not promoted to the repo by default. diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 5a150d76..1da1f29b 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -1,26 +1,44 @@ repos: + # Pinned ruff version must stay reasonably close to the version in + # pyproject.toml (currently ruff>=0.11) — drift triggers pre-existing + # rule changes (UP038, E731 strictness) that the project's actual ruff + # has already retired. Bump together when bumping either side. - repo: https://github.com/astral-sh/ruff-pre-commit - rev: v0.11.0 + rev: v0.15.15 hooks: - id: ruff args: [--fix] - id: ruff-format - - repo: https://github.com/pre-commit/mirrors-mypy - rev: v1.15.0 + # mypy runs via the project's own uv env (matches what CI runs) and is + # piped through mypy-baseline so pre-existing errors stay accepted and + # only NET-NEW errors fail the commit. The baseline lives in + # mypy-baseline.txt at the repo root; refresh it after a burndown PR with + # uv run mypy backend/ 2>&1 | uv run mypy-baseline sync + # and commit the updated file. Burndown plan in + # pending-docs/session_2026-06-10_otel_dump_and_log_extents.md. + - repo: local hooks: - id: mypy - additional_dependencies: - - types-boto3 - - types-pytz - - fastapi - - pydantic + name: mypy (full backend/, filtered through mypy-baseline) + language: system + # Always check the whole backend/ tree, not just changed files — + # per-file mypy only visits a partial import graph, which makes + # mypy-baseline report unrelated baseline entries as "fixed" and + # exit non-zero. Cost: ~10s per commit; benefit: matches CI exactly. 
+ entry: bash -c 'uv run mypy backend/ 2>&1 | uv run mypy-baseline filter' + files: '^backend/.*\.py$' + pass_filenames: false - repo: https://github.com/pre-commit/pre-commit-hooks rev: v5.0.0 hooks: - id: trailing-whitespace - id: end-of-file-fixer + # openapi-typescript emits openapi.json without a trailing newline; + # end-of-file-fixer adds one, then the next regen-openapi run + # strips it. Excluding the generated artifact breaks the cycle. + exclude: '^frontend/openapi\.json$' - id: check-yaml - id: check-json - id: check-merge-conflict @@ -60,3 +78,16 @@ repos: language: system pass_filenames: false entry: bash -c 'cd frontend && npx tsc --noEmit' + + # v2.0 cleanup (Phase 0.12): pre-push gate that the + # @pytest.mark.security_regression count hasn't dropped below + # the Phase 0 floor (24). Catches a refactor that silently + # removes coverage of a verified security fix before push, + # not in CI. `stages: [pre-push]` keeps it off the per-commit + # hot path (the gate takes ~2s to collect 3k+ tests). + - id: security-regression-count + name: Assert security_regression test count >= floor + stages: [pre-push] + language: system + pass_filenames: false + entry: bash scripts/check_security_regression_count.sh diff --git a/AGENTS.md b/AGENTS.md index 7bf0fb01..23247a63 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -59,13 +59,39 @@ User-facing pitch + features list lives in [README.md](README.md). This file doc The DuckDB `logs` view stitches the Iceberg table and the local Parquet buffer so queries always see all data without callers caring which layer holds which row. 
+### Package layout (post v2.0 carve-ups) + +Several historical monoliths were split into cohesive packages with thin re-export shims at the old paths so existing imports keep working: + +| Old path | New package | Shim status | +|---|---|---| +| `backend/core/iceberg.py` | [`backend/core/iceberg/`](backend/core/iceberg/) (`_core.py` + `fs.py`) | package `__init__.py` re-exports the historical public surface; the monkeypatched s3fs methods are now `FosS3FileSystem` / `CachedS3FileSystem` subclasses in `fs.py` | +| `backend/core/metadata_db.py` | [`backend/core/metadata/`](backend/core/metadata/) (`base`, `alerts`, `views`, `ingest_log`, `cron_log`, `asn_cache`, `usage_log`, `reconciliation`, `state`) | thin shim at [`backend/core/metadata_db.py`](backend/core/metadata_db.py) re-exports the full surface plus a `_ShimModule` proxy so `monkeypatch.setattr(metadata_db, "_DATA_DIR", ...)` still flips the live binding inside `metadata.base` | +| `backend/core/share_db.py` | [`backend/core/share_db/`](backend/core/share_db/) (`connection`, `schema`, `invites`, `sessions`, `audit`, `passcode`, `tos`, `settings`, `validation`) | package `__init__.py` re-exports the historical public surface; passcode hashing is argon2id (legacy scrypt verify branch stays for transparent rehash-on-login) | +| `backend/utils/tunnel.py` | [`backend/utils/tunnel/`](backend/utils/tunnel/) (`manager`, `session`, `rate_limiter`, `state`, `fingerprint`) | package `__init__.py` re-exports `get_tunnel_manager`, `AnalystSession`, etc. SSH-to-localhost.run code path (`_TUNNEL_URL_RE`, sleep listener, reconnect logic, `use_tunnel=True` branches) was deleted in v2.0 — only direct-mode (HTTPS public_endpoint) is supported. 
The `use_tunnel=True` kwarg still exists as a back-compat keyword that raises a clear error | +| `backend/scheduler.py` | [`backend/cron/`](backend/cron/) (`scheduler.py`, `decorators.py`, `jobs/{sync,commit,compaction,optimize,expire,metadata}.py`) | thin shim at [`backend/scheduler.py`](backend/scheduler.py) re-exports `get_scheduler`, `Scheduler`, `cron_task`, every `_run_*` job body, and the watchdog constants | +| `backend/routers/session_scoring.py` (was 2442) | [`backend/routers/session_scoring.py`](backend/routers/session_scoring.py) (1327) + [`backend/routers/session_scoring_admin.py`](backend/routers/session_scoring_admin.py) (1193) | sidecar holds retrain + admin-config endpoints (enforce-threshold, exclude-regex, enforce-status-code, matrix-versions, rotate-key, audit, threshold GET/PUT, evaluation/per-reason, dashboard composite); registers on the shared router via import-for-side-effects at the bottom of `session_scoring.py` | +| `backend/routers/admin.py` (was 1650) | [`backend/routers/admin/`](backend/routers/admin/) (`pop_locations`, `ingest`, `trees`, `downloads`, `sync_status`, `compaction`, `health`, `log_accounting`, `iceberg`, `bot_sources`, `_helpers`, `_dir_size`, `_router`) + [`backend/routers/admin_usage.py`](backend/routers/admin_usage.py) (sidecar) | v2.0 carve: 14 sub-modules each < 350 lines. `admin/__init__.py` re-exports the historical public surface (`router`, `compute_sync_status_cached`, `compute_log_accounting`, `LOG_ACCOUNTING_*`, `SustainedLossAlert`, `_QueueFile`, `_stream_from_worker`, `_fetch_file_to_zip`, `_resolve_source`, `_get_dir_size`, `ClientDisconnected`). `admin_usage.py` still attaches its endpoints to the shared `router` via `importlib.import_module` from the package init | +| `backend/core/rollups.py` (was 2045) | [`backend/core/rollups/`](backend/core/rollups/) (`_common`, `time_series`, `sessions`, `hour_bundles`, `day_bundles`, `recompute`, `wellknown_bots`) | v2.0 carve: 8 sub-modules, largest 352 lines. 
`rollups/__init__.py` re-exports 41 symbols so `from backend.core.rollups import X` (or `from backend.core import rollups; rollups.X`) keeps working unchanged. Shared bits — constants, ident validators, path helpers, query builders, `_VIRTUAL_FIELD_BACKING` — live in `_common.py` | +| `backend/core/log_fields.py` (was 1904) | [`backend/core/log_fields.py`](backend/core/log_fields.py) (659) + [`backend/core/_log_fields_data.py`](backend/core/_log_fields_data.py) (1277) | data-only carve: `LOG_FIELD_CATALOG`, `GROUP_INFO`, `GROUP_DEPENDENCIES`, `PRESETS`, `INSIGHT_DEFINITIONS` moved to the sidecar and re-imported. Zero behaviour change | +| `backend/core/duckdb.py` (was 2110) | [`backend/core/duckdb.py`](backend/core/duckdb.py) (1099) + [`backend/core/_duckdb_status.py`](backend/core/_duckdb_status.py) (1119) | `get_sync_status`, `refresh_config_status`, `update_top_values`, `get_ingested_files`, `delete_ingested_files`, `get_schema`, `_clear_schema_cache`, `get_asn_names` / `format_asn_label` / `enrich_asn_labels`, `update_cron_duration`, `log_usage_calls`, `backfill_fastly_edge_writes`, `reconcile_fastly_stats`, `purge_usage_log` move to the sidecar. Re-exported back into `backend.core.duckdb`. Sidecar late-binds shared helpers from the main module via `_db_main` to dodge the circular import | + +Other new modules introduced by the cleanup: + +- [`backend/repositories/_sql/`](backend/repositories/_sql/) — named, parameterized SQL templates extracted out of inline repo strings (one file per repo concern: `dashboard`, `security`, `network`, `origin`, etc.). Repository functions keep their names and signatures; they call into the templates instead of carrying SQL inline. +- [`backend/core/field_registry.py`](backend/core/field_registry.py) — Phase 7 (shipped, including step 13) typed registry that owns per-field declarations (code, display name, type, valid aggregations, valid filter ops, derivations, security-regex hooks). 
All readers migrated (dashboard CTE generator, rollup spec builder, top_n logic, SQL validator, scoring matrix labels, plus 8 step-13 callers: `services/core.py`, `provision/orchestrator.py`, `provision/fastly_api.py`, `provision/cli.py`, `iceberg/_core.py`, `ingest.py`, `models/custom_fields.py`, `state_sync.py`). Same-identity re-exports of every helper + constant preserve `from log_fields import X` callers. +- [`backend/core/request_context.py`](backend/core/request_context.py) — Phase 2 single FastAPI dependency that bundles `service_id`, `source`, `con`, `telemetry`, `analyst_session`, `cached_temps`. Replaces the v1 `AnalyticsDeps` bundle (deleted at the v2.0 cut — Phase 8.1/8.2) and folds `require_service_access` into context construction (there is no path that builds a context without enforcing tenancy). 23 analytics endpoints across 8 routers (dashboard / query / sessions / security / network / origin / performance / insights) now take `ctx: RequestContext = Depends(build_request_context)` directly. +- [`backend/core/request_telemetry.py`](backend/core/request_telemetry.py) — Phase 1 thin wrapper around the OTel tracer that owns section spans, query attribution, call log, cache state, and the `app.thread_wait_ms` custom metric instrumented at `_Pool.acquire`. Lives on `RequestContext`. +- [`backend/core/settings.py`](backend/core/settings.py) — Phase 3.5 `Settings(BaseSettings)` class (pydantic-settings) that owns every env var. Required-in-prod settings are pydantic validators. +- [`backend/core/iceberg/_core.py`](backend/core/iceberg/_core.py) `execute_with_stale_view_retry(con, src, fn)` — self-heal wrapper for code paths that open raw DuckDB connections instead of going through `QueryRunner`. On stale-buffer "No files found" errors, busts `_view_cache` via `clear_source_caches(keep_snapshot_cache=True)` + `update_iceberg_view(force=True)` then retries `fn` once. Used by `rdns_cache` discovery, `rollups` DESCRIBE sites, and `/api/query`. 
Pre-fix prod incidents: ~8h of 100%-failing rdns runs + analyst-visible query errors on the same buffer-deletion race. + ### Personas (where the two onboarding paths live) The README explains the two collaboration modes for end users. Implementation pointers: - **Admin** (`access_level: "read_write"`) — full ingest/management surface. Config: `configs/{logging_service_id}.json`. - **Analyst Path A — independent instance** (durable, JSON-config join). Read-only FOS credentials, runs its own copy of the app. Components: `POST /api/services/{service_id}/generate-viewer-key` → [`api_invite_analyst()`](backend/routers/services/core.py), `GET /api/provision/join` (SSE), [`InviteAnalystDialog`](frontend/components/InviteAnalystDialog/), ProvisionWizard "join" mode. -- **Analyst Path B — live shared instance** (SSH-tunnelled). No FOS credentials, uses admin's running process. See [Live Dashboard Sharing](#live-dashboard-sharing) below for components. +- **Analyst Path B — live shared instance** (direct-mode against an HTTPS public_endpoint; the SSH-tunnel-to-localhost.run option was deleted in v2.0). No FOS credentials, uses admin's running process. See [Live Dashboard Sharing](#live-dashboard-sharing) below for components. **Both paths must keep working.** Don't remove either. Don't introduce a "unified" replacement without keeping the JSON-config flow intact — it's the only option when the admin's instance can't stay running. @@ -154,8 +180,8 @@ lf = cfg.get("log_fields") or {"schema_version": 2, "custom_fields": []} Brief summaries; click through to source for details. -### Scheduler ([backend/scheduler.py](backend/scheduler.py)) -Single `BackgroundScheduler`. `_sync_jobs()` adds/removes per-service jobs on `reload()`. Per-run progress events tracked in [backend/cron_progress.py](backend/cron_progress.py) and streamed via SSE. +### Scheduler ([backend/cron/](backend/cron/)) +Single `BackgroundScheduler` owned by [backend/cron/scheduler.py](backend/cron/scheduler.py). 
`_sync_jobs()` adds/removes per-service jobs on `reload()`. The `@cron_task` decorator (telemetry context + usage-log flush + watchdog hard-cap) lives in [backend/cron/decorators.py](backend/cron/decorators.py). Per-job bodies live under [backend/cron/jobs/](backend/cron/jobs/) (`sync`, `commit`, `compaction`, `optimize`, `expire`, `metadata`). Per-run progress events tracked in [backend/cron_progress.py](backend/cron_progress.py) and streamed via SSE. [backend/scheduler.py](backend/scheduler.py) is a thin compat shim that re-exports the same public symbols. ### NGWAF Bot Detection ([backend/utils/ngwaf.py](backend/utils/ngwaf.py), [backend/utils/ngwaf_bot_cache.py](backend/utils/ngwaf_bot_cache.py)) Syncs VERIFIED-BOT requests from `GET https://api.fastly.com/ngwaf/v1/workspaces/{id}/requests`. JSON:API pagination via `meta.next_cursor`. Shared SQLite cache at `data/ngwaf/ngwaf_bot_cache.db`. Enriches log rows with `waf_req_id` + `waf_sig LIKE '%VERIFIED-BOT%'`. @@ -168,7 +194,7 @@ Both stored in per-service `metadata.db` (SQLite). Alerts are threshold-based wi ### State Sync ([backend/state_sync.py](backend/state_sync.py)) `export_admin_state` writes `audit_logs` + `views` from per-service SQLite, plus `log_format_history` + `custom_fields` from the config JSON, to `{prefix}/iceberg/meta/admin_state.json`. **Alerts are not synced** — each instance maintains its own. Only `read_write` services export. -### FOS Usage Logging ([backend/utils/usage_logger.py](backend/utils/usage_logger.py), [backend/core/metadata_db.py](backend/core/metadata_db.py)) +### FOS Usage Logging ([backend/utils/usage_logger.py](backend/utils/usage_logger.py), [backend/core/metadata/usage_log.py](backend/core/metadata/usage_log.py)) Every FOS Class A/B op and CDN download recorded to per-service `usage_log` SQLite for cost analysis. 
- Global toggle: `data/system/usage_logging.json` - Process-context tagging via `set_process_context()` in [backend/utils/telemetry.py](backend/utils/telemetry.py) — tags entries with `cron:sync:svc1` or `api:GET /api/...` @@ -176,42 +202,66 @@ Every FOS Class A/B op and CDN download recorded to per-service `usage_log` SQLi - Costs computed at query time from rate config — changing rates recomputes history. - Admin endpoints: `GET/PATCH /api/admin/usage-logging`, `GET/DELETE /api/admin/usage-log`, `GET /api/admin/usage-log/export`. Frontend: `/admin/usage-log`. -### Log-Line Accounting ([backend/routers/admin.py](backend/routers/admin.py) `api_log_accounting`) +### Log-Line Accounting ([backend/routers/admin/log_accounting.py](backend/routers/admin/log_accounting.py) `api_log_accounting`) Per-bucket reconciliation between Fastly's `/stats/service/{id}` log-emission counter and our `sum(row_count) FROM ingested_files`. - Field probe order: `log → log_records → log_entries → logging_requests`; first non-zero wins. All-zero logs a warning. - In-flight clamp: current bucket is in totals but excluded from sustained-loss scan (Fastly Stats lags ingest). - Sustained-loss alert: ≥2 consecutive completed buckets with `gap_pct ≥ 0.05`. - Frontend cadence: `staleTime 30s`, `refetchInterval 60s` → ≤1 Fastly Stats call/min per open admin tab. -### Iceberg Pointer + Summary Hash-Throttle ([backend/core/iceberg.py](backend/core/iceberg.py)) +### Iceberg Pointer + Summary Hash-Throttle ([backend/core/iceberg/_core.py](backend/core/iceberg/_core.py)) Every commit writes `metadata_location.txt` (unavoidable) and `table_summary.json` (skippable). The latter is content-hashed against `_table_summary_hash_cache`; identical payloads skip the PUT. Saves one FOS PUT per no-op commit in steady state. Cache is module-scope, process-lifetime. 
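
The hash-throttle described above can be sketched roughly as follows. This is a minimal illustration, not the code in `backend/core/iceberg/_core.py`: the function name `put_summary_if_changed`, the `put` callback, and keying the cache by a table identifier are all assumptions made for the example; only the idea of content-hashing the summary payload against a module-scope cache comes from the text.

```python
import hashlib
import json
from typing import Callable

# Hypothetical stand-in for the module-scope `_table_summary_hash_cache`.
# Keying it by a table identifier is an assumption of this sketch.
_summary_hash_cache: dict[str, str] = {}


def put_summary_if_changed(
    table_key: str,
    summary: dict,
    put: Callable[[bytes], None],
) -> bool:
    """Upload the summary only when its content hash changed.

    Returns True if `put` was called, False if the PUT was skipped.
    """
    # Canonical serialisation so that key order cannot change the hash.
    payload = json.dumps(summary, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    if _summary_hash_cache.get(table_key) == digest:
        return False  # identical payload: skip the object-store PUT
    put(payload)
    # Record the hash only after the upload returned, so a failed PUT is retried.
    _summary_hash_cache[table_key] = digest
    return True
```

Because the cache in this sketch is a plain module-level dict, it is empty after a process restart, so the first commit after a restart always performs the PUT. That matches the "module-scope, process-lifetime" behaviour noted above.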
### DuckDB Connection Pool ([backend/core/duckdb_pool.py](backend/core/duckdb_pool.py)) Per-service LIFO pool replaces per-request `duckdb.connect()` + S3 / iceberg setup + view rebind (~50ms steady-state). Pool size is `DUCKDB_POOL_MAX_SIZE` (default 8). All pool connections open with `read_only=False` — `get_connection` forces this so cron writers and pool readers don't trip DuckDB's "different configuration" error on the same file. Optional per-connection tuning: `DUCKDB_POOL_CONN_MEMORY_LIMIT` (e.g. `256MB`) caps RSS growth under concurrent large scans; `DUCKDB_POOL_CONN_THREADS` reduces context-switching when `pool_size × per_conn_threads` exceeds physical cores. View-binding happens outside the pool lock to avoid deadlocking the FastAPI thread pool when an Iceberg snapshot reload blocks. -### Hourly Top-N Rollups ([backend/core/rollups.py](backend/core/rollups.py), [scripts/backfill_rollups.py](scripts/backfill_rollups.py)) -Precomputes per-hour Top-N aggregates for the dashboard's most-asked fields (ip, country, url, custom fields) and writes them under `/data/rollups/`. Closed hours read from the rollup; the current ("live") hour merges the rollup with a fast scan of the buffer. Plus a per-minute time-series bundle (`rollups/timeseries/...`) used by the dashboard chart to skip the wide Iceberg scan. Skipped buckets fall back to the raw scan path. Generated by `local_compact_{id}` after each compaction pass; the global `optimize_{id}` job rebuilds the day's worth on each run. +**Pool wait observability** — `_Pool.acquire` records every checkout's wall-clock wait time to (a) the OTel `app.thread_wait_ms` histogram tagged `{outcome: reused | created | timeout, waited: true | false, service}` for off-box analysis via `docker logs app-backend-1 | grep app.thread_wait_ms`, AND (b) a bounded in-process ring buffer (~1024 samples per service) consumed by `Pool.stats().wait` (p50/p95/p99/max/mean). 
`GET /api/admin/health-snapshot` exposes the per-service stats; the `SystemHealthCard` on `/admin` renders top-level Pool wait p95 / Pool in-use / idle cards plus an expandable per-service table. ADR-03 escalation rule: p95 > 50ms ⇒ consider separate-process cron isolation; > 200ms flags red. Both paths are non-blocking (try/except around the recorder) so instrumentation can never break a checkout. + +### Hourly Top-N Rollups ([backend/core/rollups/](backend/core/rollups/), [scripts/backfill_rollups.py](scripts/backfill_rollups.py)) +Precomputes per-hour Top-N aggregates for the dashboard's most-asked fields (ip, country, url, custom fields) and writes them under `/rollups/`. Closed hours read from the rollup; the current ("live") hour merges the rollup with a fast scan of the buffer. Plus a per-minute time-series bundle (`rollups/hour_bundled/hour=H/time_series.parquet`) used by the dashboard chart to skip the wide Iceberg scan. Skipped buckets fall back to the raw scan path. Generated by `local_compact_{id}` after each compaction pass; the global `optimize_{id}` job rebuilds the day's worth on each run. + +**Bundle tiers** (cheapest first wins in the reader): +- `rollups/day_bundled/day=D/all_fields.parquet` — one parquet per day, all fields. Reader prefers this for fully-in-window closed days. +- `rollups/hour_bundled/hour=H/all_fields.parquet` — one parquet per hour, all fields. Reader uses for partial-day boundary hours + any day without a day-bundle. +- `rollups/hour/field=F/hour=H/*.parquet` — per-(field, hour). Original source of truth; the bundle writers read from here. +- `rollups/day/field=F/day=D/*.parquet` — per-(field, day). Source for the day-bundler. + +**Virtual fields** (`waf_sig_ind`, `edge_score_reason_ind` — see `_VIRTUAL_FIELD_BACKING` in `rollups/_common.py`) are CSV-unnested at WRITE time so the dashboard reader serves them through the standard rollup path instead of paying a 30-day unnest-during-query each request. 
Wired in `_run_per_field_copy` (rollups/recompute.py) via `_build_virtual_field_copy_query` (rollups/_common.py). Adding a new virtual field requires (a) appending to `_VIRTUAL_FIELD_BACKING`, (b) ensuring its `backing` column is on the schema, (c) a one-shot rebundle migration so existing hour/day bundles pick it up (see next point). + +**Stale-bundle hazard.** `bundle_hours` / `bundle_days` use mtime to skip up-to-date bundles, and the cron only re-bundles HOURS THAT JUST RECEIVED DATA. Closed historical hours never get re-touched. If you add a new field to the rollup writer (real or virtual), the per-(field, hour) parquets land but the bundled `all_fields.parquet` for closed hours stays without them — the dashboard's bundled-rollup reader returns 0 rows for the new field and the runtime fallback fires. Fix: add a data migration that deletes the closed bundles and runs `backfill_*_bundles` (canonical pattern: `_rollups_virtual_field_rebundle` in [backend/core/data_migrations.py](backend/core/data_migrations.py)). + +**Live-hour batch must filter virtual fields out** before `execute_top_n_batch` (in `_base.py`'s `execute_top_n_rollups`): the SQL projects `field_name AS value` and virtual names aren't real columns on the live temp table. Passing them through raises a BinderException that fails the whole UNION ALL and silently drops the live-hour merge for real fields too. See `live_fields = [f for f in fields if f in actual_cols]` at the merge site. + +**`live_temp` narrow projection** ([backend/repositories/dashboard.py](backend/repositories/dashboard.py)): only `conn_requests` + `timestamp` on the `chart_metric == "requests"` path. The runtime CSV-unnest fallback for virtual fields (`_exploded_top_n`) queries the BASE table via stashed `orig_table_name` / `orig_where_clause` / `orig_params`, not the temp, so the temp doesn't need to carry `waf_sig` / `edge_score_reason`. 
Map_data is derived from `all_top_res` instead of a separate query on the temp, so `country` isn't needed either. If you add a new consumer that reads from the temp, add its columns to `narrow_col_set` AND verify the chart_metric branches. + +**`get_top_bots` rollup-served UAs** ([backend/repositories/security.py](backend/repositories/security.py)): on the unfiltered path (`not filters`), top UAs come from `execute_top_n_rollups(["ua"], ..., limit=50000)` instead of scanning the iceberg view for the `ua` column. The NGWAF JOIN still needs the raw temp because `waf_req_id` is high-cardinality and not rollup-served — but the temp is single-column (`waf_req_id` only) when the rollup path serves UAs. Filtered requests fall back to the original combined `(ua, waf_req_id)` temp. ### Response Telemetry Middleware ([backend/utils/telemetry_response_middleware.py](backend/utils/telemetry_response_middleware.py)) Backstop for endpoints that return a plain `dict` instead of going through `BaseResponse.with_telemetry`. Inspects JSON object responses, injects `_debug_queries` / `_debug_calls` / `_is_cached` from the contextvar collectors if missing. **Must be added INNER to `CompressMiddleware`** (i.e. `add_middleware(TelemetryResponseBodyMiddleware)` BEFORE `add_middleware(CompressMiddleware)`) so it sees the raw JSON, not br/zstd/gzip-encoded bytes. Skips streaming responses, non-dict bodies, and already-instrumented responses. Gated on `DEBUG_RESPONSES`; failure modes are silent + non-blocking. +### Live Query Monitor ([backend/core/query_registry.py](backend/core/query_registry.py), [backend/routers/admin_queries.py](backend/routers/admin_queries.py), [frontend/app/admin/queries/](frontend/app/admin/queries/)) +Real-time view of every executing DuckDB + SQLite query — attribution (analyst / admin / cron / system), caller `file:line`, pool slot, duration ticking up live, kind-aware Kill button that calls `con.interrupt()`. 
Page at `/admin/queries`, admin-only via `RemoteAccessMiddleware`. Polling at 300 ms; the Active panel promotes "completed in the last 10 s" rows as faded entries with an outcome badge so typical-traffic (p50 ≈ 0.2 ms, max ≈ 29 ms) queries are visible. Notable Slow Queries panel filters the completed-history ring buffer by threshold (100ms / 500ms / 1s / 2s / 5s), sorted slowest first. + +Instrumentation lives at two seams: SQLite `InstrumentedCursor` ([backend/utils/sqlite_profiler.py](backend/utils/sqlite_profiler.py)) registers/deregisters around `execute*`; DuckDB `InstrumentedDuckDBConnection` + `_InstrumentedResult` ([backend/core/query_instrumentation.py](backend/core/query_instrumentation.py)) wraps the connection returned from `checkout_connection` so deregistration happens at terminal-fetch time (fetchdf, arrow, etc.) rather than at `execute()` — DuckDB's execute returns in ~ms while fetch can run for seconds. Per-query overhead measured ~21 µs (~0.3% of dashboard bundle wall time). Cancel path is safe under pool reuse: a stamped `_conn_to_query[id(con)]` is verified under lock before `interrupt()` so a stale UI click never cancels a different query that's checked out the same physical connection later. + +Audit log fires on every successful cancel (`audit_log` in [backend/utils/structlog_config.py](backend/utils/structlog_config.py)) with the actor + full target attribution. OTel histograms: `app.active_queries.count`, `app.query_duration_ms`, `app.queries_cancelled_total`. Kill switches: `QUERY_MONITOR_ENABLED=0` hides the endpoints (404), `QUERY_REGISTRY_DISABLED=1` bypasses the hot path entirely for zero overhead. Design + post-spec polish history in [pending-docs/design_live_query_monitoring.md](pending-docs/design_live_query_monitoring.md). + ### CDN-Fronted Log Delivery FOS reads are fronted by a Fastly CDN VCL service (`cdn_service_id`, `cdn_url`, `cdn_secret`). 
The CDN validates a shared-secret query param to gate access; rate-limited to blunt brute-force. Separate from the logging service ID. ### Live Dashboard Sharing -Components for the live-shared-instance remote-analyst feature (Path B). Three sharing modes are exposed to the admin: +Components for the live-shared-instance remote-analyst feature (Path B). Two direct-mode sharing modes are exposed to the admin (the SSH-reverse-tunnel via localhost.run was deleted in v2.0): -1. **SSH reverse tunnel** via localhost.run (default, easiest) -2. **Admin-provided hostname** (e.g. `https://logs.example.com`) — no third-party relay -3. **Admin-provided IP** (e.g. `https://203.0.113.42:8443`) — no relay, no DNS +1. **Admin-provided hostname** (e.g. `https://logs.example.com`) +2. **Admin-provided IP** (e.g. `https://203.0.113.42:8443`) -Modes 2 and 3 share a single backend code path: `ShareStartPayload.use_tunnel=False` + `public_endpoint=`. The mode selector in the UI is presentational — the backend only cares whether `use_tunnel` is set and (when false) that `public_endpoint` starts with `https://` (cookies need `secure=true`). +Both share a single backend code path: `ShareStartPayload.use_tunnel=False` + `public_endpoint=`. The mode selector in the UI is presentational — the backend only cares that `public_endpoint` starts with `https://` (cookies need `secure=true`). `use_tunnel=True` still exists as a back-compat keyword and now raises a clear error. Components: -- [backend/utils/tunnel.py](backend/utils/tunnel.py) — `TunnelManager` owns `ssh -R 80:localhost:8000 nokey@localhost.run` in tunnel mode, parses assigned `https://*.lhrun.dev` hostname, tracks `TunnelState`. In direct mode (hostname / IP), no subprocess is spawned — the admin-supplied `public_endpoint` is stored and `public_url()` returns it verbatim. Process singleton via `get_tunnel_manager()`; `reset_for_tests()` for pytest. 
+- [backend/utils/tunnel/](backend/utils/tunnel/) — package split: `manager.py` owns the `TunnelManager` singleton (direct-mode lifecycle, sever-all panic), `session.py` holds `AnalystSession`, `rate_limiter.py` is the sliding-window `_LoginRateLimiter`, `state.py` persists `tunnel_state.json`, `fingerprint.py` computes the session fingerprint hash. Process singleton via `get_tunnel_manager()`; `reset_for_tests()` for pytest. - [backend/utils/remote_access.py](backend/utils/remote_access.py) — `RemoteAccessMiddleware` does DNS-rebinding gate (Host/Origin allow-lists, including `testclient`/`testserver` for pytest), blocks admin paths on remote requests, applies response hardening (CSP, X-Frame-Options DENY, no-store, no-referrer). `_StaticAssetLimiter` rate-limits static assets to blunt scrapes. -- [backend/core/share_db.py](backend/core/share_db.py) — singleton SQLite at `data/system/remote_share.db`: `remote_invites`, `invite_services`, `remote_sessions`, `remote_share_audit_logs`, `share_settings`, `remote_invite_claim_tokens`, `share_tos_versions`. WAL mode, numbered migrations, bcrypt passcodes, per-IP/per-email lockout. +- [backend/core/share_db/](backend/core/share_db/) — package split: `connection.py` (pool + corruption self-heal with quarantine), `schema.py` (own MIGRATIONS dict + `apply_pending` + `PRAGMA user_version`), `invites.py`, `sessions.py`, `audit.py`, `passcode.py` (argon2id current default; scrypt verify branch stays for transparent rehash-on-login upgrade), `tos.py`, `settings.py`, `validation.py`. Singleton SQLite at `data/system/remote_share.db`: `remote_invites`, `invite_services`, `remote_sessions`, `remote_share_audit_logs`, `share_settings`, `remote_invite_claim_tokens`, `share_tos_versions`. WAL mode, per-IP/per-email lockout. - [backend/routers/share_auth.py](backend/routers/share_auth.py) (`/api/share/*`) — analyst-facing: `login`, `logout`, `acknowledge`, `heartbeat`, `claim/{token}`. 
Tagged so middleware lets them through the tunnel. - [backend/routers/share_admin.py](backend/routers/share_admin.py) (`/api/admin/share/*`, **blocked over tunnel**) — admin-facing: tunnel lifecycle, invite CRUD, session evict, panic/sever-all, backup export/import, GDPR erase, settings. - Frontend: [ShareDashboardDialog](frontend/components/ShareDashboardDialog/), [/share-login](frontend/app/share-login/) (TOS-gated), [useAnalystHeartbeat](frontend/hooks/useAnalystHeartbeat.ts), [useShareStatusBanner](frontend/hooks/useShareStatusBanner.tsx). Watermark mounts in `AppLayout` when `bootstrap.settings.is_remote_analyst === true`. @@ -263,6 +313,20 @@ A global middleware in [frontend/lib/api.ts](frontend/lib/api.ts) checks `respon **Streaming/binary endpoints** (SSE, blobs) use raw `fetch()` — leave a comment so future readers don't "fix" it. +### Server-side bootstrap pre-fetch ([frontend/lib/ssr/bootstrap.ts](frontend/lib/ssr/bootstrap.ts), [frontend/app/layout.tsx](frontend/app/layout.tsx)) + +The root layout SSR-fetches `/api/bootstrap`, dehydrates it into the React Query cache (via a new `HydrationBoundary` in `QueryProvider`), and ships the JSON inline in the first HTML paint. `useBootstrap` and every hook that reads `bootstrap.*` via `queryClient.getQueryData(['bootstrap'])` find the data already cached on first render — no client-side bootstrap RTT, no `'No service selected'` flash, share banner in the initial paint. + +Adding a new SSR pre-fetch (e.g., for a per-page endpoint): + +1. **Use `node:http.request`, NOT `fetch()`.** Node's `fetch()` always overrides the `Host` header from the URL. The backend's `_remote_host_allowed` gate rejects remote-classified requests whose Host isn't the public endpoint — so without preserved Host, the SSR fetch returns 400 host_not_allowed and silently falls through to the client. +2. **Trust topology is `X-Remote-Analyst: 1`, not `X-Proxied-By-Caddy`.** The SSR runtime hits the backend over loopback. 
`is_request_remote` ([backend/utils/remote_access.py](backend/utils/remote_access.py)) classifies based on `request.client.host` first, so a forwarded Caddy marker is IGNORED. `X-Remote-Analyst: 1` is the loopback-honored primitive (gated on `tunnel_manager.is_sharing_active()`). Forward it ONLY when the inbound request carries `X-Proxied-By-Caddy` — otherwise the admin SSH-tunnel path is mis-classified as analyst and 400'd. (See history: the 2026-06-11 SSR-leak incident reverted in `f3d8dd7` / `546c279` was the previous-attempt version that forwarded `X-Proxied-By-Caddy` directly. Backend ignored it, returned admin payload, dehydration leaked admin fields into public HTML.) +3. **Always wrap in try/catch + bounded timeout, return `null` on any failure.** SSR errors must NEVER propagate into a broken page — the layout falls back to client fetch when the helper returns null. 5s is generous for prod cron contention; never block SSR longer. +4. **`force-dynamic` is REQUIRED** in any layout/page that does a per-request SSR fetch via `cookies()` / `headers()` from an imported helper. Next.js's static-analysis pass only detects direct `cookies()` calls in the component file itself — calls from an imported module won't flip the route to dynamic. Without `export const dynamic = "force-dynamic"` the layout gets SSG'd at build time (when the backend isn't reachable) and the dehydrated state is permanently empty. +5. **Adversarial test required:** before deploying, hit the prod public URL anonymous AND the admin tunnel and verify the dehydrated state shape. Anonymous public must contain only the `needs_login` stub (NO `sharing_active`, NO `ngwaf_workspace_id`, NO `sync_status`). Admin must contain the full payload. 
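A minimal sketch of rules 2 and 3 from the checklist above, written as pure helpers so they can be unit-tested without a backend. The names `buildUpstreamHeaders` and `orNull` are hypothetical and are not taken from `frontend/lib/ssr/bootstrap.ts`; only the header names (`Host`, `X-Proxied-By-Caddy`, `X-Remote-Analyst`) and the 5s budget come from the notes above. The resulting header object is meant to be passed to `node:http.request`, which sends `Host` as given.

```typescript
// Hypothetical helpers illustrating the SSR pre-fetch rules; not the real bootstrap.ts code.
type HeaderBag = Record<string, string | string[] | undefined>;

// Return the first value of a possibly multi-valued inbound header.
function firstValue(v: string | string[] | undefined): string | undefined {
  return Array.isArray(v) ? v[0] : v;
}

// Rule 1 + rule 2: keep the inbound Host verbatim, and translate the Caddy
// marker into the loopback-honored X-Remote-Analyst header. The Caddy marker
// itself is NOT forwarded, because the backend ignores it on loopback.
function buildUpstreamHeaders(inbound: HeaderBag): Record<string, string> {
  const out: Record<string, string> = {};
  const host = firstValue(inbound["host"]);
  if (host !== undefined) out["Host"] = host;
  if (firstValue(inbound["x-proxied-by-caddy"]) !== undefined) {
    out["X-Remote-Analyst"] = "1";
  }
  return out;
}

// Rule 3: resolve to null on any error or once the time budget is spent.
// Assumption: the caller treats null as "fall back to the client fetch".
// This only stops waiting; it does not abort the underlying request.
async function orNull<T>(work: () => Promise<T>, timeoutMs = 5000): Promise<T | null> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  try {
    const timeout = new Promise<null>((resolve) => {
      timer = setTimeout(() => resolve(null), timeoutMs);
    });
    return await Promise.race([work(), timeout]);
  } catch {
    return null;
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Keeping the header decision in one pure function makes the adversarial check in rule 5 cheaper to back with a unit test: a request without the Caddy marker (the admin tunnel path) must never gain `X-Remote-Analyst`.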
+ +The `serviceStore` Zustand slice hydrates from the SSR-cached bootstrap in `useBootstrap`'s post-mount `useEffect` — for the one-render window before that effect fires, use [`useEffectiveServiceId`](frontend/hooks/useIsDataReady.ts) which falls back to `bootstrap.active_service_id` from the React Query cache. Direct reads of `useServiceStore(s => s.activeServiceId)` flash "No service selected" on first paint. + ### Canonical patterns (May 2026 DRY refactor — use these in new code) 1. **`response_model=` on every router handler.** Without it the OpenAPI emits `Record`. Routes using `Depends(get_source)` should also lift `service_id: str` into the signature so it appears as a path parameter. @@ -270,11 +334,12 @@ A global middleware in [frontend/lib/api.ts](frontend/lib/api.ts) checks `respon 3. **`ReportLayout`** for analytics pages — bundles `usePageContext + useReportConfig + useFilterPayload + useUrlFilterSync + useServiceQuery + ChartIntervalButtons + ReportShell`. Fall back to `ReportShell` only for multi-query or non-standard chrome pages. 4. **`HelpDialog`** from [components/ui/help-dialog.tsx](frontend/components/ui/help-dialog.tsx) — don't compose `Dialog + DialogHeader + DialogTitle` by hand for help content. 5. **`useBaseMap`** for any MapLibre setup. Don't duplicate the world-layer + theming inline. -6. **`metadata_db.record_audit(service_id, event_type=..., details=...)`** — direct. The `duckdb.log_audit_event` shim and `repositories/audit.py` pass-through were removed. +6. **`metadata.record_audit(service_id, event_type=..., details=...)`** — direct (or via the `metadata_db` shim; both resolve to the same `metadata.audit` impl). The `duckdb.log_audit_event` shim and `repositories/audit.py` pass-through were removed. 7. **`date_utils.parse_iso_utc` / `iso_z` / `iso_z_now`** — don't hand-roll `datetime.fromisoformat(s.replace("Z", "+00:00"))`. -8. 
**`@cron_task` decorator** in [backend/scheduler.py](backend/scheduler.py) — handles `start_call_tracking`, `set_process_context`, `flush_usage_log` finally-block. +8. **`@cron_task` decorator** in [backend/cron/decorators.py](backend/cron/decorators.py) — handles `start_call_tracking`, `set_process_context`, `flush_usage_log` finally-block, watchdog hard-cap. Re-exported from [backend/scheduler.py](backend/scheduler.py) for compat. 9. **`empty_schema_response(runner)`** in [_base.py](backend/repositories/_base.py) — return this when a repo function hits a service with no logs. 10. **`origin_latency_us_expr(actual_cols)`** in `_base.py` — don't hand-roll the `COALESCE("ottfb", "ttfb" * 1000000.0)` fragment. +11. **`useEffectiveServiceId`** in [hooks/useIsDataReady.ts](frontend/hooks/useIsDataReady.ts) — read this instead of `useServiceStore(s => s.activeServiceId)` whenever the answer matters on FIRST PAINT (gating views, building cache keys, "no service selected" branches). It falls back to `bootstrap.active_service_id` from the SSR-hydrated React Query cache so the page doesn't flash empty before the persisted Zustand store catches up. ### Next.js navigation + loading conventions (READ BEFORE TOUCHING FRONTEND) @@ -375,7 +440,7 @@ re-renders triggered by store subscriptions. The trace shows which. 
- `backend/utils/audit_helpers.py` (referenced the long-removed DuckDB `_ingested_files` table) - `backend/repositories/audit.py` (was a 27-line pass-through) - `scripts/validate_logs.py` / `.sh` (depended on removed bits) -- `backend/core/duckdb.log_audit_event` shim (call `metadata_db.record_audit` directly; test patches must target `backend.core.metadata_db.record_audit`) +- `backend/core/duckdb.log_audit_event` shim (call `metadata.record_audit` directly; test patches must target `backend.core.metadata.audit.record_audit` — or `backend.core.metadata_db.record_audit` via the shim, which the `_ShimModule` proxy mirrors onto the live binding) - `QueryRunner.safe_select` / `safe_select_list` (use `actual_cols` directly) ## Testing @@ -457,10 +522,10 @@ A job fired after the config was deleted. The next `reload()` evicts the stale j The RHS of `~` or `!~` must be a literal. No variables, no concatenation. Use `regsub()` / `regsuball()` for dynamic logic. ### 15. Operational metadata lives in per-service SQLite, not DuckDB -Alerts, views, audit, cron history, ingested-file dedup, ASN names, source registration, usage telemetry → `data/services/{id}.metadata.db` (WAL). Read/write via [backend/core/metadata_db.py](backend/core/metadata_db.py) — never via DuckDB. JOINs against log data: ATTACH the SQLite read-only as `meta` via `attach_metadata_db()`, or pre-fetch and inline as a parameterised IN list (see `dashboard.py` ASN search). +Alerts, views, audit, cron history, ingested-file dedup, ASN names, source registration, usage telemetry → `data/services/{id}.metadata.db` (WAL). Read/write via [backend/core/metadata/](backend/core/metadata/) (or the [backend/core/metadata_db.py](backend/core/metadata_db.py) shim for old import paths) — never via DuckDB. JOINs against log data: ATTACH the SQLite read-only as `meta` via `attach_metadata_db()`, or pre-fetch and inline as a parameterised IN list (see `dashboard.py` ASN search). 
New write paths use the `@sync_db_retry` (tenacity-backed) decorator to handle SQLite `OperationalError` busy/locked under WAL contention. ### 16. Monkeypatches → catalog in [MONKEYPATCHES.md](MONKEYPATCHES.md) -We patch six s3fs methods + one PyIceberg `SqlCatalog.load_table` at import time for telemetry-proxy routing, immutable-bytes caching, and table-object reuse. Every patch is documented in MONKEYPATCHES.md with site, motivating incident, and cleanup path. Update that file in the same commit when you add/modify/remove a patch. +Historically we patched six s3fs methods + one PyIceberg `SqlCatalog.load_table` at import time. Phase 4 of the v2.0 carve-up replaced the s3fs patches with `FosS3FileSystem` / `CachedS3FileSystem` subclasses in [backend/core/iceberg/fs.py](backend/core/iceberg/fs.py) registered as a pyiceberg `FileIO`. Whatever remains is documented in MONKEYPATCHES.md with site, motivating incident, and cleanup path. Update that file in the same commit when you add/modify/remove a patch. ### 17. MSW + openapi-fetch ordering — `server.listen()` must run at module load `openapi-fetch` captures `globalThis.fetch` at `createClient` time. [frontend/lib/api.ts](frontend/lib/api.ts) creates its client at module load, so MSW's `server.listen()` MUST execute at the top of [frontend/vitest.setup.ts](frontend/vitest.setup.ts) — **not inside `beforeAll`**. If listen runs after lib/api.ts is imported, the captured fetch is the unpatched original and every test silently bypasses MSW. Symptom: handlers never fire, requests hit real loopback. Don't move that call into a hook. @@ -475,11 +540,25 @@ Our [frontend/vitest.config.ts](frontend/vitest.config.ts) sets `globals: false` The tunnel exposes the same FastAPI app to the public internet. Middleware classifies by `Host` and blocks remote requests from admin paths — including `/api/admin/share/*`. When you add an endpoint analysts must reach, register under `/api/share/*` or update `_is_blocked_path()`. 
Don't remove the `testclient`/`testserver` allow-list entries — they're what let pytest hit admin routes. ### 21. `sync_data` orphan-cleanup vs local-compaction outputs -Local compaction writes merged rollups to three places: `/data/daily/`, `/data/weekly/`, and `/data/timestamp_hour=*/compacted_*.parquet`. None of these are tracked by the iceberg snapshot, so they are NOT in `cloud_files`/`active_paths`. The orphan-cleanup loop in [backend/core/iceberg.py](backend/core/iceberg.py) `sync_data()` walks the cache and deletes anything not in `active_paths`; without explicit allow-rules it nukes every compacted output, and the [`local_compacted_files` registry](backend/core/metadata_db.py) then blocks re-download of the source files — silently dropping rows from the view (production: 1.65M → 302K on 2026-05-31, then 1.66M → 1.62M on 2026-06-01 from the per-partition `compacted_*` variant). The fix is two-pronged: orphan-cleanup restricts its walk to `timestamp_hour=*` dirs AND skips `compacted_*.parquet` filenames. **If you add a new local-only output pattern, add it to both the dir skip and the file skip.** Integration coverage in [tests/core/test_local_compaction.py](tests/core/test_local_compaction.py)::`test_compaction_outputs_survive_iceberg_sync_orphan_cleanup` exercises the round-trip with real `compact_local_partitions` + real `sync_data`. +Local compaction writes merged rollups to three places: `/data/daily/`, `/data/weekly/`, and `/data/timestamp_hour=*/compacted_*.parquet`. None of these are tracked by the iceberg snapshot, so they are NOT in `cloud_files`/`active_paths`. 
The orphan-cleanup loop in [backend/core/iceberg/_core.py](backend/core/iceberg/_core.py) `sync_data()` walks the cache and deletes anything not in `active_paths`; without explicit allow-rules it nukes every compacted output, and the [`local_compacted_files` registry](backend/core/metadata/ingest_log.py) then blocks re-download of the source files — silently dropping rows from the view (production: 1.65M → 302K on 2026-05-31, then 1.66M → 1.62M on 2026-06-01 from the per-partition `compacted_*` variant). The fix is two-pronged: orphan-cleanup restricts its walk to `timestamp_hour=*` dirs AND skips `compacted_*.parquet` filenames. **If you add a new local-only output pattern, add it to both the dir skip and the file skip.** Integration coverage in [tests/core/test_local_compaction.py](tests/core/test_local_compaction.py)::`test_compaction_outputs_survive_iceberg_sync_orphan_cleanup` exercises the round-trip with real `compact_local_partitions` + real `sync_data`. ### 22. `unattended-upgrades` can OOM a memory-tight VM A 16 GB Linux VM running backend + frontend + caddy holds a steady-state working set in the 10-13 GB range. The Debian/Ubuntu nightly `apt-daily-upgrade.timer` forks a transient 1-2 GB downloader on top of that, which can trip an OOM kill that wedges the kernel (sshd dies; needs a VM reset). The mitigation is to `systemctl mask apt-daily.timer apt-daily-upgrade.timer unattended-upgrades.service` on the host and re-assert it on every restart so a re-image / apt-reinstall can't silently re-enable them. Trade-off: no automatic security patching — patch manually on a planned maintenance window with the backend container stopped. **If you provision a VM with more RAM, you may safely re-enable upgrades.** +### 23. SSR upstream fetch must use `node:http`, not `fetch()` +Node's `fetch()` always rewrites the `Host` header from the URL — there's no way to override it. 
The backend's `_remote_host_allowed` gate ([backend/utils/remote_access.py](backend/utils/remote_access.py)) rejects remote-classified requests whose `Host` isn't the public endpoint. SSR helpers like [frontend/lib/ssr/bootstrap.ts](frontend/lib/ssr/bootstrap.ts) use `node:http.request` which preserves arbitrary headers verbatim. If you write a new SSR helper, do NOT reach for `fetch()` — copy the `rawRequest` pattern. The 2026-06-11 SSR-leak incident (reverts `f3d8dd7` / `546c279`) was the first version using `fetch()`; the `Host` got rewritten to `127.0.0.1:8000`, the backend classified as admin-from-loopback, and the full admin bootstrap dehydrated into anonymous public HTML. + +### 24. Rollup writers must rebundle bundles after adding a field +`bundle_hours` / `bundle_days` use mtime to skip up-to-date bundles. The cron only re-bundles HOURS THAT JUST RECEIVED DATA. Closed historical hours never re-touch. So a new field added to the rollup writer (real or virtual) lands as a per-(field, hour) parquet but the bundled `all_fields.parquet` for closed hours stays without it — the dashboard's bundled-rollup reader returns 0 rows for the new field and the runtime fallback fires (defeats the perf win). Fix: ship a one-shot data migration that deletes the closed `all_fields.parquet` files and runs `backfill_*_bundles` so they get rewritten with the new field. Canonical pattern: `_rollups_virtual_field_rebundle` in [backend/core/data_migrations.py](backend/core/data_migrations.py). + +### 25. Virtual fields blow up the live-hour batch if not filtered out +`execute_top_n_rollups` in [_base.py](backend/repositories/_base.py) needs the active-hour merge to include real fields' new rows. The live-hour SQL projects `field_name AS value` and BinderExceptions on any name that's not a column on the live temp. 
Virtual fields like `waf_sig_ind` don't exist as real columns — passing them through silently kills the whole UNION ALL (the outer `except Exception: pass` swallows it) and drops the live-hour merge for REAL fields too. Always filter to `actual_cols` before the batch: +```python +live_fields = [f for f in fields if f in actual_cols] +if live_fields: + live_res, _ = self.execute_top_n_batch(live_fields, tmp_name, ...) +``` + ## AI Agent Directives These apply to every change, regardless of scope. @@ -526,6 +605,33 @@ These apply to every change, regardless of scope. 17. All new endpoints get at least one test in `tests/routers/`. 18. Regenerate OpenAPI types after the endpoint lands: `cd frontend && npm run gen:types`. +### Architectural choices to preserve + +The 2026-06 retrospective surfaced several structural decisions the audit specifically validated. Don't rewrite these in a future reimagining: + +- **ADR-driven architecture with decisions captured AFTER the lesson lands.** This is the velocity strategy, not a debt. Continue the cadence — write the ADR after a phase ships, not before. +- **[MONKEYPATCHES.md](MONKEYPATCHES.md) as a living inventory** with root-cause attribution per patch (incident date, why upstream can't fix, removal criteria). +- **Property-based testing** (Hypothesis) for filter/query roundtrips. Catches drift without hand-written matrices. +- **RequestContext** making tenancy structurally impossible to bypass — can't construct without `_enforce_service_access`. +- **Modular package carves with re-export shims** for backward compat during refactor (the `metadata_db.py` / `scheduler.py` pattern). +- **Named exception classes + explicit retry policies** (vs. generic `except Exception`). +- **Three-tier docs scheme** (pending-docs / local-docs / docs) — intentional and works for a public-repo solo project. 
+- **MVP-then-iterate cadence with phase-based cleanup.** Don't propose "spike before shipping" rewrites — solo bandwidth and information-unavailability at v1.0 time make iterate-then-cleanup the right trade-off. + +### Anti-patterns explicitly rejected + +If a refactor proposal matches one of these, push back. Each was investigated and rejected during the 2026-06 audit; the rationale is preserved here so future-you / future-agent doesn't relitigate: + +- **Generic "schema codegen" infrastructure** for FilterSpec — `openapi-typescript` already handles the 80% case; codegen can't express the procedural collision-handling logic that's the actual duplication. +- **Premature `usePagination` / `PaginationConfig` context** when there are only 2 paginated endpoints with genuinely different sort semantics. +- **Centralized `RoleProvider` context** — role is 2 orthogonal flags (`analyst_session` × `is_remote_analyst`), not a hierarchy; an enum would have locked in a false model when SHARE-INVITED was added. +- **Multi-language scoring codegen** (Python ↔ Rust) — parity is enforced cheaply by fixture tests; codegen adds versioned-schema overhead and constrains schema evolution. +- **Pre-formatted server-side response values** — `TopTenTable` needs raw values for click handlers and map ops; pre-formatting forces double payload and locks display format into the API contract. +- **Cache-coherence "state machine" abstractions** — the bottleneck is DuckDB view rebuild time, not cache layer policy; a state machine wouldn't have prevented the 2026-06-09 transient-empty-result incident. +- **Unified `QueryExecutor`** for retry — stale-view and compaction-race are different error classes with different recovery costs; collapsing them creates a leaky abstraction. +- **Tentacle-parameter threading** through repository signatures (e.g., passing `RequestContext.cached_temps` to every repo function) — couples request scope to data layer. 
+- **Custom `FsspecFileIO` subclass to "fix" the s3fs monkeypatches** — investigated 2026-05-21 and rejected; pyiceberg instantiates `S3FileSystem` directly inside its `_s3()` builder, bypassing the FileIO layer entirely. Wait for upstream `supply-your-own-FileSystem-class` hook (tracked in [MONKEYPATCHES.md](MONKEYPATCHES.md)). + ## Keeping This File Current Update this file in the same commit that introduces: diff --git a/CHANGELOG.md b/CHANGELOG.md index 3309df7c..65f878a7 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -5,6 +5,250 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog 1.1.0](https://keepachangelog.com/en/1.1.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). +## [2.0.0] - 2026-06-12 + +Architecture cleanup release. The post-`v1.2.0` perf branch closed the +worst read-path latency by stacking remediation on top of an +architecture that wasn't designed for the workload; this release pays +that down. The largest backend files were carved into per-concern +packages, telemetry moved to OpenTelemetry + structlog, tenancy got a +typed `RequestContext` boundary, frontend hydration warm-up hacks were +replaced with policy, and the test + type gates ratcheted to a level +that catches regressions on the way in. Composite endpoints land as a +hard cutover — frontend + backend ship together, granular endpoints +deleted. + +### Architecture + +- **`backend/core/iceberg.py` (4,232 LOC)** → `iceberg/` package + (`view`, `catalog`, `warehouse`, `manifest`, `fs`, `_core`, + `buffer`, `ddl`, `snapshot_cache`, `dedup`, …). Custom + `FosFsspecFileIO(FsspecFileIO)` + `CachedFosS3FileSystem(S3FileSystem)` + subclasses replace 5 of the 6 historical `s3fs` monkeypatches; + only the `ThreadPoolExecutor.submit` ContextVar wrapper remains + (see [MONKEYPATCHES.md](MONKEYPATCHES.md)). 
+- **`backend/scheduler.py` (2,843 LOC)** → `backend/cron/` package + with `scheduler`, `decorators`, and per-job modules under + `cron/jobs/` (`sync`, `commit`, `compaction`, `optimize`, `expire`, + `metadata`, `gap_heal`, `rollup_compact_daily`). The scheduler + picks the **separate-pool** isolation strategy based on Phase 1 + thread-wait telemetry; the deferred-view-cache-invalidation hack + is gone. +- **`backend/core/metadata_db.py` (3,168 LOC)** → `backend/core/metadata/` + package with concern-partitioned mixins (`base`, `alerts`, `views`, + `ingest_log`, `cron_log`, `asn_cache`, `usage_log`, `reconciliation`, + `state`). `metadata_db.py` becomes a thin backward-compatible shim. +- **`backend/utils/tunnel.py` (1,022 LOC)** → `backend/utils/tunnel/` + package (`manager`, `session`, `rate_limiter`, `state`, + `fingerprint`). The SSH-to-localhost.run path is **deleted entirely** + (~400 lines): no more SSH subprocess + sleep-listener + reconnect + state machine. Direct-mode only; production has always used direct. +- **`backend/core/share_db.py` (1,312 LOC)** → `backend/core/share_db/` + package (`connection`, `schema`, `invites`, `sessions`, `audit`, + `passcode`, `tos`, `settings`). `argon2-cffi` replaces `scrypt` for + passcode hashing. +- **`backend/routers/admin.py` (1,650 LOC)** → `backend/routers/admin/` + package (14 sub-modules: `pop_locations`, `ingest`, `trees`, + `downloads`, `sync_status`, `compaction`, `health`, + `log_accounting`, `iceberg`, `bot_sources` + shared + `_helpers` / `_dir_size` / `_router`). +- **`backend/core/rollups.py` (2,045 LOC)** → `backend/core/rollups/` + package (8 sub-modules: `_common`, `time_series`, `sessions`, + `hour_bundles`, `day_bundles`, `recompute`, `wellknown_bots`). +- **`RequestContext` replaces `AnalyticsDeps`** ([`backend/core/request_context.py`](backend/core/request_context.py)). + Tenancy is enforced at context construction; routes never parse a + `service_id` from a path param. 
The security-load-bearing private + `read_only` attribute is now structurally unexposable as a query + param. +- **Composite endpoints + hard cutover** — `dashboard/bundle`, + `security/bundle`, `network/bundle` ship together with the frontend + swap. Granular per-card endpoints deleted, `_meta_con` parallel path + dropped, `is_cached/_is_cached` alias collapsed, + `AnalyticsDeps = RequestContext` shim removed. Top-5 backend files + now ≤ 1,461 LOC; no backend file > 1,500. + +### Telemetry, observability + +- **OpenTelemetry** (`opentelemetry-api/sdk` + + `fastapi`/`botocore`/`aiohttp` instrumentors) replaces the four + fragmented custom telemetry surfaces. Console exporter ships by + default; backends (Jaeger / Tempo / Honeycomb / …) are a + deploy-config decision, not part of this release. +- **`structlog`** wires `trace_id` + `span_id` into structured log + output via a custom processor. +- **`process_context_scope` + `_ACTIVE_CONTEXTS` mirror kept** at + [`backend/utils/telemetry.py`](backend/utils/telemetry.py). OTel context + propagation uses Python ContextVars under the hood, which inherit + the cross-thread limitation (fsspec iothread, pyiceberg + ThreadPoolExecutor) the manual mirror was built to solve; removing + the mirror would re-introduce the ~80%-NULL telemetry bucket + observed on 2026-05-20. Docstring + plan entry document the + reasoning. +- **`RequestTelemetry`** thin wrapper owns section spans, query + attribution, call log, and the custom `app.thread_wait_ms` metric + that fed the Phase 6 separate-pool decision. + +### Reliability, perf + +- **`aiodns` + `asyncio.gather` + bulk-transaction sqlite writes** in + [`backend/utils/rdns_cache.py`](backend/utils/rdns_cache.py) replace the + serial-blocking `socket.gethostbyaddr` loop that wedged the sync + worker for minutes on bulk lookups. +- **`tenacity`** decorator-based retry replaces ad-hoc try/except loops + for Fastly API + NGWAF + SQLite WAL-busy paths; centralised policy + on `Settings`. 
+- **`pydantic-settings`** centralises env-var reads + boot validation + (the "TRUSTED_PROXY_IPS required in prod" gate is now a pydantic + validator). +- **`cachetools`** replaces `bounded_cache` / `rdns_cache` / + `ngwaf_bot_cache` in-process LRU/TTL implementations. +- **Structured `.tf.json`** generation replaces f-string HCL + + `_hcl_escape` regex (`backend/utils/terraform_gen.py`), eliminating + the custom-HCL escaping injection vector. +- **`orjson` via FastAPI `ORJSONResponse`** for ~5–10× faster JSON + serialisation on composite endpoint payloads. +- **`rich` + `typer`** for the provision CLI; `httpx` everywhere + except `telemetry_proxy.py` (which stays on `aiohttp` for the proxy + server role). +- **`nuqs`** as the URL state source on the frontend, replacing the + custom Zustand/Effect sync hooks that produced hydration desync on + refresh. +- **`session_scoring._cached`** clears `_inflight` on the cache-hit + path too, not only on producer-path teardown — concurrent callers + on a hot cache key no longer leak the inflight registration when + the producer finishes before they wake up. +- **`iceberg/buffer.tombstone_buffer_files`** logs + skips on + marker-write failure (the immediate-`os.remove` fallback re-opened + the in-flight-query race the tombstone grace window exists to + close). Pair regression test pins the contract. +- **`DROP TABLE IF EXISTS` identifier quoting** at 11 temp-table + cleanup sites so the drop tolerates reserved keywords / hyphenated + service slugs that would otherwise raise. + +### Trust topology, middleware + +- **Middleware order asserted at boot AND in tests** — the + multi-paragraph prose comments in `main.py` were replaced with + one-line `# INVARIANT` markers + a boot-time crash if + `app.user_middleware` doesn't match the declared tuple. Snapshot + tests cover Caddy + docker-compose middleware order too. +- **`@pytest.mark.security_regression` marker + monotonic-count CI + gate** (floor: 24, from `audit-findings/`). 
Every test covering a + verified security fix carries the mark; a refactor cannot silently + drop coverage of a known fix. +- **Trust-topology snapshot tests** pin Caddy `@from_fastly` matcher, + XFF forwarding, `/share-login` rate-limit, and the backend + `--forwarded-allow-ips=127.0.0.1` flags. +- **`raise_internal(logger, exc, code, status)`** replaces + `raise HTTPException(detail={"error": str(e)})` at every backend + except site that previously echoed the original exception message + to the client. Detail is now `{"error": <code>, "error_id": <8-hex>}`; + the full exception lands in the server log with the same + `error_id` so operators triage without the upstream body / token + fragments leaking on the wire. +- **`escape_sql_literal`** applied at every `read_parquet()` / + `glob()` site that interpolates a computed path. Closes the + injection surface a partially-validated path could open through + DuckDB's `read_parquet()` glob expansion. +- **Caddy container drops privileges** — `caddy/Dockerfile` adds + `USER caddy` (the base image ships the user). Caddy is the only + externally-facing socket and binds nothing below port 1024, so + there's no reason to keep `root` in the runtime. + +### Frontend + +- **RSC/CSR boundary** documented in `app/_routing.md`. The + hidden-Plotly + hidden-MapLibre + `setTimeout` warm-up hacks are + dropped; replaced with `modulepreload` + the styledata-event swap + pattern. +- **16 frontend files > 500 LOC split.** `ProvisionWizard.tsx` + (3,582 LOC) → `wizard/steps/*` + `state.ts` + `api.ts`; + `app/logs/page.tsx` (2,136 LOC) → `_sections/*` + `_state.ts`. + `app/admin`, `app/dashboard`, `app/alerts`, `app/security`, etc. + all post-split < 500. **No frontend file > 499 LOC.** +- **Live Query Monitor** — live-first sort, peak-memory column, + keyboard shortcuts, URL-persisted filters, per-run inline expand + for ×N cron-grouped rows, ≥ 30 s stuck-query pulse, copy-SQL, + sound notification removed. 
+- **Operations Overview cards** on the admin landing page surface + ingest gap + live query activity + slow-query count so the things + operators actually care about don't live three clicks deep. + Tone-coded (default → attention → warning → critical) so a + sustained_loss event jumps out. +- **Stable React keys on dynamic lists** — `DebugPanel`, `CronLiveLog`, + the network metro leaderboard, the query toolbar, and the + custom-field drawer now key off a stable identity instead of array + index. `useSSE` attaches a monotonic `_id` to each line so + append-only feeds (cron progress, query streams) keep stable keys + across re-renders. +- **Accessibility pass** — `FieldGroups` and `FileBrowser` disclosure + widgets are real `