diff --git a/.check_router_core_floor b/.check_router_core_floor new file mode 100644 index 00000000..5bc6609e --- /dev/null +++ b/.check_router_core_floor @@ -0,0 +1 @@ +117 diff --git a/.env.example b/.env.example index 435b605f..9c1afa6d 100644 --- a/.env.example +++ b/.env.example @@ -47,6 +47,13 @@ # backend runs on a different host than the frontend. # NEXT_PUBLIC_API_URL=http://127.0.0.1:8000 +# ── Observability ────────────────────────────────────────────────────────────── +# OpenTelemetry exporter. Default 'none' — no spans/metrics leave the process. +# Set 'console' to dump spans and 60s metric snapshots to stdout (loud; useful +# locally when chasing a perf regression). Don't set 'console' in prod — it +# pollutes log aggregation with ~1 MB/min of JSON. +# OTEL_EXPORTER=console + # ── Docker only ──────────────────────────────────────────────────────────────── # Set automatically by docker-compose; not needed for local dev. # API_PROXY_URL=http://backend:8000 diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 00c0e2ae..981a67bd 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -32,8 +32,15 @@ jobs: - name: Format check (ruff) run: uv run ruff format --check . - - name: Type check (mypy) - run: uv run mypy backend/ + - name: Type check (mypy, filtered through mypy-baseline) + # Pre-existing errors accepted via mypy-baseline.txt; the filter + # exits non-zero only on NET-NEW errors. Refresh the baseline after + # a burndown PR with + # uv run mypy backend/ 2>&1 | uv run mypy-baseline sync + # and commit mypy-baseline.txt. Burndown plan + + # bucket scoping live in + # pending-docs/session_2026-06-10_otel_dump_and_log_extents.md. 
+ run: uv run mypy backend/ 2>&1 | uv run mypy-baseline filter - name: Install falco run: | @@ -85,19 +92,50 @@ jobs: env: FALCO_REQUIRED: "1" TERRAFORM_VALIDATE: "1" - # Gate ratcheted as milestones land: + # Gate ratcheted as milestones land (convention: current actual − 2pp): # end Milestone A: 44% (baseline 46%, -2pp buffer) # end Milestone E: 47% (current 49% — keeps the 2pp buffer) # post-Milestone E coverage backfill: 55% (current 59% — 4pp buffer) # confidence-batch (insights+admin+services+dashboard+origin+ # hypothesis+regression+E2E smoke): 78% (current 83% — 5pp buffer) + # post live-query-monitor (2026-06-11): 80% (current 82%) + # post backend coverage waves (reconciliation/compaction/session_scoring + # /data_migrations/tunnel-state/dashboard-router/views/sqlite_profiler): + # 82% (current 83% — 1pp buffer; tight while v2.0 target 85% lands). + # v2.0 final wave (2026-06-12): per-module tests for the post-split + # rollups/ + admin/ packages (rollups/sessions 85, rollups/time_series + # 84, rollups/day_bundles 76, rollups/recompute 96, admin/compaction + # 100, admin/health 100): 85% (current 85% — the v2.0 target hit). # # `-n auto` parallelizes via pytest-xdist (TESTING_PLAN_3 item 21). # Verified safe: per-service SQLite (`{id}.metadata.db`) + per-test # tmp_path give file isolation; autouse `_reset_module_caches` resets # the 8 module-level caches between tests; moto fixtures are per-test. # Local run: 2268 passed in 58s under `-n auto` vs ~3min serial. - run: uv run pytest -n auto --cov=backend --cov-report=term --cov-fail-under=78 + run: uv run pytest -n auto --cov=backend --cov-report=term --cov-fail-under=85 + + - name: Security-regression count gate + # v2.0 cleanup Phase 0.8: asserts the + # @pytest.mark.security_regression count never drops below the + # baseline floor (24 — derived from audit-findings/ verified + # fixes). A refactor cannot silently delete coverage of a + # verified fix without surfacing the change. 
+ run: bash scripts/check_security_regression_count.sh + + - name: Emit perf samples (CI-scale synthetic load) + # Produces tests/perf/latest.json from a 100K-row in-memory + # DuckDB dataset (~2 s wall). The gate below compares to + # tests/perf/baseline.json and fails on >regression_pct_threshold% + # over baseline (50 % default; tuned for GH Actions runner + # variance at CI scale). + run: uv run python scripts/emit_perf_latest.py + + - name: Perf gate (load-harness baseline) + # Compares the just-emitted latest.json against baseline.json. + # Production targets (≤2800 / ≤1900 ms) are documented in + # baseline.json's production_targets_comment for traceability + # but enforced by the manual loadtest probe, not this CI gate. + run: bash scripts/perf_gate.sh frontend: name: Frontend (Node) @@ -140,7 +178,17 @@ jobs: run: npx tsc --noEmit - name: Tests (vitest with coverage) - # Gate ratcheted as milestones land: + # Gate ratcheted as milestones land (convention: current actual − 2pp): # end Milestone A: 40% (baseline 42.7%, -2pp buffer) - # end Milestone E: 44% (current 46.55% — keeps the 2pp buffer) - run: npx vitest run --coverage --coverage.thresholds.lines=44 + # end Milestone E: 44% (current 46.55%) + # post live-query-monitor (2026-06-11): 53% (current 55.19%) + # post lib/toast + lib/api/custom-fields + lib/workers/parseJson tests + # (2026-06-12): 55% (current 57.12%) + # post ProvisionWizard/wizard-config-helpers tests + # (2026-06-12): 56% (current 58.42%) + # post ProvisionWizard/wizard-api tests + # (2026-06-12): 57% (current 59.8%) + # post ProvisionWizard/wizard-deploy tests + # (2026-06-12): 58% (current 61.66%) — final v2.0 target hit + # per cleanup_plan §10.14. 
+ run: npx vitest run --coverage --coverage.thresholds.lines=58 diff --git a/.github/workflows/cidr-refresh.yml b/.github/workflows/cidr-refresh.yml new file mode 100644 index 00000000..47909585 --- /dev/null +++ b/.github/workflows/cidr-refresh.yml @@ -0,0 +1,53 @@ +name: Refresh Fastly CIDRs + +# Weekly refresh of the Fastly edge CIDR list in the repo-root Caddyfile. +# The @from_fastly_v4 matcher gates X-Forwarded-For rewriting on Fastly's +# published v4 ranges; a stale list silently classifies traffic from new +# POPs as direct (untrusted) until somebody refreshes it and reloads +# Caddy. The script is well-tested (scripts/refresh_fastly_cidrs.py); +# this workflow just runs it on a cadence and opens a PR if the file +# changed. Off-minute schedule on purpose so the runner pool isn't +# hammered at :00 alongside everybody else's hourly jobs. + +on: + schedule: + - cron: '13 9 * * 1' # Mondays at 09:13 UTC + workflow_dispatch: {} + +permissions: + contents: write + pull-requests: write + +jobs: + refresh: + name: Fetch + open PR on diff + runs-on: forge-amd64-medium + steps: + - uses: actions/checkout@v6 + + - name: Install uv + uses: astral-sh/setup-uv@v7 + with: + enable-cache: true + python-version: "3.13" + + - name: Refresh Caddyfile + # No-op if the published list already matches what's in the + # Caddyfile (script prints "No changes …" and exits 0). Writes + # the updated matcher block otherwise; peter-evans/create-pull- + # request below only opens a PR when the working tree is dirty. + run: uv run python scripts/refresh_fastly_cidrs.py + + - name: Open PR if Caddyfile changed + uses: peter-evans/create-pull-request@v7 + with: + commit-message: 'chore: refresh Fastly edge CIDR list in Caddyfile' + branch: chore/refresh-fastly-cidrs + delete-branch: true + title: 'chore: refresh Fastly edge CIDR list' + body: | + Automated update from `scripts/refresh_fastly_cidrs.py`, triggered by the weekly `cidr-refresh.yml` workflow. 
+ + The `@from_fastly_v4` matcher in [Caddyfile](../blob/main/Caddyfile) gates the `X-Forwarded-For` rewrite on Fastly-published edge ranges. A stale list silently classifies traffic from new POPs as direct (untrusted) until Caddy reloads. + + After merge: run `~/restart.sh caddy` (or equivalent) on the VM to pick up the new ranges. diff --git a/.gitignore b/.gitignore index 33202558..b8f28756 100644 --- a/.gitignore +++ b/.gitignore @@ -61,6 +61,12 @@ tests/fixtures/scoring/ # scripts/scoring/train.py against a fresh trace extract. compute/scorer/matrix.json +# Per-tenant matrices pulled from FOS on every backend startup +# (see backend/main.py:_ensure_scoring_matrix). matrix.default.json +# stays tracked — that's the in-repo fallback the scoring endpoint +# uses when neither the shared matrix.json nor a tenant matrix exists. +compute/scorer/matrix_*.json + # Rust build artifacts. compute/scorer/target/ compute/scorer/bin/ @@ -76,6 +82,11 @@ compute/scorer/pkg/ # split_per_page.py) live here for now; treat the whole tree as throwaway. /scratch/ +# Performance-audit campaign artifacts: HAR captures, per-sample telemetry, +# aggregated p50/p95/p99 summaries, per-page reports + improvement plans. +# Throwaway — regenerable by re-running scratch/perf_audit.mjs. +/performance-report/ + # Local-only VS Code config (file-watcher / Pylance excludes for the # regenerating .next + cache trees). Personal to each contributor's editor # setup — not promoted to the repo by default. diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 5a150d76..1da1f29b 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -1,26 +1,44 @@ repos: + # Pinned ruff version must stay reasonably close to the version in + # pyproject.toml (currently ruff>=0.11) — drift triggers pre-existing + # rule changes (UP038, E731 strictness) that the project's actual ruff + # has already retired. Bump together when bumping either side. 
- repo: https://github.com/astral-sh/ruff-pre-commit - rev: v0.11.0 + rev: v0.15.15 hooks: - id: ruff args: [--fix] - id: ruff-format - - repo: https://github.com/pre-commit/mirrors-mypy - rev: v1.15.0 + # mypy runs via the project's own uv env (matches what CI runs) and is + # piped through mypy-baseline so pre-existing errors stay accepted and + # only NET-NEW errors fail the commit. The baseline lives in + # mypy-baseline.txt at the repo root; refresh it after a burndown PR with + # uv run mypy backend/ 2>&1 | uv run mypy-baseline sync + # and commit the updated file. Burndown plan in + # pending-docs/session_2026-06-10_otel_dump_and_log_extents.md. + - repo: local hooks: - id: mypy - additional_dependencies: - - types-boto3 - - types-pytz - - fastapi - - pydantic + name: mypy (full backend/, filtered through mypy-baseline) + language: system + # Always check the whole backend/ tree, not just changed files — + # per-file mypy only visits a partial import graph, which makes + # mypy-baseline report unrelated baseline entries as "fixed" and + # exit non-zero. Cost: ~10s per commit; benefit: matches CI exactly. + entry: bash -c 'uv run mypy backend/ 2>&1 | uv run mypy-baseline filter' + files: '^backend/.*\.py$' + pass_filenames: false - repo: https://github.com/pre-commit/pre-commit-hooks rev: v5.0.0 hooks: - id: trailing-whitespace - id: end-of-file-fixer + # openapi-typescript emits openapi.json without a trailing newline; + # end-of-file-fixer adds one, then the next regen-openapi run + # strips it. Excluding the generated artifact breaks the cycle. + exclude: '^frontend/openapi\.json$' - id: check-yaml - id: check-json - id: check-merge-conflict @@ -60,3 +78,16 @@ repos: language: system pass_filenames: false entry: bash -c 'cd frontend && npx tsc --noEmit' + + # v2.0 cleanup (Phase 0.12): pre-push gate that the + # @pytest.mark.security_regression count hasn't dropped below + # the Phase 0 floor (24). 
Catches a refactor that silently + # removes coverage of a verified security fix before push, + # not in CI. `stages: [pre-push]` keeps it off the per-commit + # hot path (the gate takes ~2s to collect 3k+ tests). + - id: security-regression-count + name: Assert security_regression test count >= floor + stages: [pre-push] + language: system + pass_filenames: false + entry: bash scripts/check_security_regression_count.sh diff --git a/AGENTS.md b/AGENTS.md index 7bf0fb01..23247a63 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -59,13 +59,39 @@ User-facing pitch + features list lives in [README.md](README.md). This file doc The DuckDB `logs` view stitches the Iceberg table and the local Parquet buffer so queries always see all data without callers caring which layer holds which row. +### Package layout (post v2.0 carve-ups) + +Several historical monoliths were split into cohesive packages with thin re-export shims at the old paths so existing imports keep working: + +| Old path | New package | Shim status | +|---|---|---| +| `backend/core/iceberg.py` | [`backend/core/iceberg/`](backend/core/iceberg/) (`_core.py` + `fs.py`) | package `__init__.py` re-exports the historical public surface; the monkeypatched s3fs methods are now `FosS3FileSystem` / `CachedS3FileSystem` subclasses in `fs.py` | +| `backend/core/metadata_db.py` | [`backend/core/metadata/`](backend/core/metadata/) (`base`, `alerts`, `views`, `ingest_log`, `cron_log`, `asn_cache`, `usage_log`, `reconciliation`, `state`) | thin shim at [`backend/core/metadata_db.py`](backend/core/metadata_db.py) re-exports the full surface plus a `_ShimModule` proxy so `monkeypatch.setattr(metadata_db, "_DATA_DIR", ...)` still flips the live binding inside `metadata.base` | +| `backend/core/share_db.py` | [`backend/core/share_db/`](backend/core/share_db/) (`connection`, `schema`, `invites`, `sessions`, `audit`, `passcode`, `tos`, `settings`, `validation`) | package `__init__.py` re-exports the historical public surface; passcode 
hashing is argon2id (legacy scrypt verify branch stays for transparent rehash-on-login) | +| `backend/utils/tunnel.py` | [`backend/utils/tunnel/`](backend/utils/tunnel/) (`manager`, `session`, `rate_limiter`, `state`, `fingerprint`) | package `__init__.py` re-exports `get_tunnel_manager`, `AnalystSession`, etc. SSH-to-localhost.run code path (`_TUNNEL_URL_RE`, sleep listener, reconnect logic, `use_tunnel=True` branches) was deleted in v2.0 — only direct-mode (HTTPS public_endpoint) is supported. The `use_tunnel=True` kwarg still exists as a back-compat keyword that raises a clear error | +| `backend/scheduler.py` | [`backend/cron/`](backend/cron/) (`scheduler.py`, `decorators.py`, `jobs/{sync,commit,compaction,optimize,expire,metadata}.py`) | thin shim at [`backend/scheduler.py`](backend/scheduler.py) re-exports `get_scheduler`, `Scheduler`, `cron_task`, every `_run_*` job body, and the watchdog constants | +| `backend/routers/session_scoring.py` (was 2442) | [`backend/routers/session_scoring.py`](backend/routers/session_scoring.py) (1327) + [`backend/routers/session_scoring_admin.py`](backend/routers/session_scoring_admin.py) (1193) | sidecar holds retrain + admin-config endpoints (enforce-threshold, exclude-regex, enforce-status-code, matrix-versions, rotate-key, audit, threshold GET/PUT, evaluation/per-reason, dashboard composite); registers on the shared router via import-for-side-effects at the bottom of `session_scoring.py` | +| `backend/routers/admin.py` (was 1650) | [`backend/routers/admin/`](backend/routers/admin/) (`pop_locations`, `ingest`, `trees`, `downloads`, `sync_status`, `compaction`, `health`, `log_accounting`, `iceberg`, `bot_sources`, `_helpers`, `_dir_size`, `_router`) + [`backend/routers/admin_usage.py`](backend/routers/admin_usage.py) (sidecar) | v2.0 carve: 14 sub-modules each < 350 lines. 
`admin/__init__.py` re-exports the historical public surface (`router`, `compute_sync_status_cached`, `compute_log_accounting`, `LOG_ACCOUNTING_*`, `SustainedLossAlert`, `_QueueFile`, `_stream_from_worker`, `_fetch_file_to_zip`, `_resolve_source`, `_get_dir_size`, `ClientDisconnected`). `admin_usage.py` still attaches its endpoints to the shared `router` via `importlib.import_module` from the package init | +| `backend/core/rollups.py` (was 2045) | [`backend/core/rollups/`](backend/core/rollups/) (`_common`, `time_series`, `sessions`, `hour_bundles`, `day_bundles`, `recompute`, `wellknown_bots`) | v2.0 carve: 8 sub-modules, largest 352 lines. `rollups/__init__.py` re-exports 41 symbols so `from backend.core.rollups import X` (or `from backend.core import rollups; rollups.X`) keeps working unchanged. Shared bits — constants, ident validators, path helpers, query builders, `_VIRTUAL_FIELD_BACKING` — live in `_common.py` | +| `backend/core/log_fields.py` (was 1904) | [`backend/core/log_fields.py`](backend/core/log_fields.py) (659) + [`backend/core/_log_fields_data.py`](backend/core/_log_fields_data.py) (1277) | data-only carve: `LOG_FIELD_CATALOG`, `GROUP_INFO`, `GROUP_DEPENDENCIES`, `PRESETS`, `INSIGHT_DEFINITIONS` moved to the sidecar and re-imported. Zero behaviour change | +| `backend/core/duckdb.py` (was 2110) | [`backend/core/duckdb.py`](backend/core/duckdb.py) (1099) + [`backend/core/_duckdb_status.py`](backend/core/_duckdb_status.py) (1119) | `get_sync_status`, `refresh_config_status`, `update_top_values`, `get_ingested_files`, `delete_ingested_files`, `get_schema`, `_clear_schema_cache`, `get_asn_names` / `format_asn_label` / `enrich_asn_labels`, `update_cron_duration`, `log_usage_calls`, `backfill_fastly_edge_writes`, `reconcile_fastly_stats`, `purge_usage_log` move to the sidecar. Re-exported back into `backend.core.duckdb`. 
Sidecar late-binds shared helpers from the main module via `_db_main` to dodge the circular import | + +Other new modules introduced by the cleanup: + +- [`backend/repositories/_sql/`](backend/repositories/_sql/) — named, parameterized SQL templates extracted out of inline repo strings (one file per repo concern: `dashboard`, `security`, `network`, `origin`, etc.). Repository functions keep their names and signatures; they call into the templates instead of carrying SQL inline. +- [`backend/core/field_registry.py`](backend/core/field_registry.py) — Phase 7 (shipped, including step 13) typed registry that owns per-field declarations (code, display name, type, valid aggregations, valid filter ops, derivations, security-regex hooks). All readers migrated (dashboard CTE generator, rollup spec builder, top_n logic, SQL validator, scoring matrix labels, plus 8 step-13 callers: `services/core.py`, `provision/orchestrator.py`, `provision/fastly_api.py`, `provision/cli.py`, `iceberg/_core.py`, `ingest.py`, `models/custom_fields.py`, `state_sync.py`). Same-identity re-exports of every helper + constant preserve `from log_fields import X` callers. +- [`backend/core/request_context.py`](backend/core/request_context.py) — Phase 2 single FastAPI dependency that bundles `service_id`, `source`, `con`, `telemetry`, `analyst_session`, `cached_temps`. Replaces the v1 `AnalyticsDeps` bundle (deleted at the v2.0 cut — Phase 8.1/8.2) and folds `require_service_access` into context construction (there is no path that builds a context without enforcing tenancy). 23 analytics endpoints across 8 routers (dashboard / query / sessions / security / network / origin / performance / insights) now take `ctx: RequestContext = Depends(build_request_context)` directly. 
+- [`backend/core/request_telemetry.py`](backend/core/request_telemetry.py) — Phase 1 thin wrapper around the OTel tracer that owns section spans, query attribution, call log, cache state, and the `app.thread_wait_ms` custom metric instrumented at `_Pool.acquire`. Lives on `RequestContext`. +- [`backend/core/settings.py`](backend/core/settings.py) — Phase 3.5 `Settings(BaseSettings)` class (pydantic-settings) that owns every env var. Required-in-prod settings are pydantic validators. +- [`backend/core/iceberg/_core.py`](backend/core/iceberg/_core.py) `execute_with_stale_view_retry(con, src, fn)` — self-heal wrapper for code paths that open raw DuckDB connections instead of going through `QueryRunner`. On stale-buffer "No files found" errors, busts `_view_cache` via `clear_source_caches(keep_snapshot_cache=True)` + `update_iceberg_view(force=True)` then retries `fn` once. Used by `rdns_cache` discovery, `rollups` DESCRIBE sites, and `/api/query`. Pre-fix prod incidents: ~8h of 100%-failing rdns runs + analyst-visible query errors on the same buffer-deletion race. + ### Personas (where the two onboarding paths live) The README explains the two collaboration modes for end users. Implementation pointers: - **Admin** (`access_level: "read_write"`) — full ingest/management surface. Config: `configs/{logging_service_id}.json`. - **Analyst Path A — independent instance** (durable, JSON-config join). Read-only FOS credentials, runs its own copy of the app. Components: `POST /api/services/{service_id}/generate-viewer-key` → [`api_invite_analyst()`](backend/routers/services/core.py), `GET /api/provision/join` (SSE), [`InviteAnalystDialog`](frontend/components/InviteAnalystDialog/), ProvisionWizard "join" mode. -- **Analyst Path B — live shared instance** (SSH-tunnelled). No FOS credentials, uses admin's running process. See [Live Dashboard Sharing](#live-dashboard-sharing) below for components. 
+- **Analyst Path B — live shared instance** (direct-mode against an HTTPS public_endpoint; the SSH-tunnel-to-localhost.run option was deleted in v2.0). No FOS credentials, uses admin's running process. See [Live Dashboard Sharing](#live-dashboard-sharing) below for components. **Both paths must keep working.** Don't remove either. Don't introduce a "unified" replacement without keeping the JSON-config flow intact — it's the only option when the admin's instance can't stay running. @@ -154,8 +180,8 @@ lf = cfg.get("log_fields") or {"schema_version": 2, "custom_fields": []} Brief summaries; click through to source for details. -### Scheduler ([backend/scheduler.py](backend/scheduler.py)) -Single `BackgroundScheduler`. `_sync_jobs()` adds/removes per-service jobs on `reload()`. Per-run progress events tracked in [backend/cron_progress.py](backend/cron_progress.py) and streamed via SSE. +### Scheduler ([backend/cron/](backend/cron/)) +Single `BackgroundScheduler` owned by [backend/cron/scheduler.py](backend/cron/scheduler.py). `_sync_jobs()` adds/removes per-service jobs on `reload()`. The `@cron_task` decorator (telemetry context + usage-log flush + watchdog hard-cap) lives in [backend/cron/decorators.py](backend/cron/decorators.py). Per-job bodies live under [backend/cron/jobs/](backend/cron/jobs/) (`sync`, `commit`, `compaction`, `optimize`, `expire`, `metadata`). Per-run progress events tracked in [backend/cron_progress.py](backend/cron_progress.py) and streamed via SSE. [backend/scheduler.py](backend/scheduler.py) is a thin compat shim that re-exports the same public symbols. ### NGWAF Bot Detection ([backend/utils/ngwaf.py](backend/utils/ngwaf.py), [backend/utils/ngwaf_bot_cache.py](backend/utils/ngwaf_bot_cache.py)) Syncs VERIFIED-BOT requests from `GET https://api.fastly.com/ngwaf/v1/workspaces/{id}/requests`. JSON:API pagination via `meta.next_cursor`. Shared SQLite cache at `data/ngwaf/ngwaf_bot_cache.db`. 
Enriches log rows with `waf_req_id` + `waf_sig LIKE '%VERIFIED-BOT%'`. @@ -168,7 +194,7 @@ Both stored in per-service `metadata.db` (SQLite). Alerts are threshold-based wi ### State Sync ([backend/state_sync.py](backend/state_sync.py)) `export_admin_state` writes `audit_logs` + `views` from per-service SQLite, plus `log_format_history` + `custom_fields` from the config JSON, to `{prefix}/iceberg/meta/admin_state.json`. **Alerts are not synced** — each instance maintains its own. Only `read_write` services export. -### FOS Usage Logging ([backend/utils/usage_logger.py](backend/utils/usage_logger.py), [backend/core/metadata_db.py](backend/core/metadata_db.py)) +### FOS Usage Logging ([backend/utils/usage_logger.py](backend/utils/usage_logger.py), [backend/core/metadata/usage_log.py](backend/core/metadata/usage_log.py)) Every FOS Class A/B op and CDN download recorded to per-service `usage_log` SQLite for cost analysis. - Global toggle: `data/system/usage_logging.json` - Process-context tagging via `set_process_context()` in [backend/utils/telemetry.py](backend/utils/telemetry.py) — tags entries with `cron:sync:svc1` or `api:GET /api/...` @@ -176,42 +202,66 @@ Every FOS Class A/B op and CDN download recorded to per-service `usage_log` SQLi - Costs computed at query time from rate config — changing rates recomputes history. - Admin endpoints: `GET/PATCH /api/admin/usage-logging`, `GET/DELETE /api/admin/usage-log`, `GET /api/admin/usage-log/export`. Frontend: `/admin/usage-log`. -### Log-Line Accounting ([backend/routers/admin.py](backend/routers/admin.py) `api_log_accounting`) +### Log-Line Accounting ([backend/routers/admin/log_accounting.py](backend/routers/admin/log_accounting.py) `api_log_accounting`) Per-bucket reconciliation between Fastly's `/stats/service/{id}` log-emission counter and our `sum(row_count) FROM ingested_files`. - Field probe order: `log → log_records → log_entries → logging_requests`; first non-zero wins. All-zero logs a warning. 
- In-flight clamp: current bucket is in totals but excluded from sustained-loss scan (Fastly Stats lags ingest). - Sustained-loss alert: ≥2 consecutive completed buckets with `gap_pct ≥ 0.05`. - Frontend cadence: `staleTime 30s`, `refetchInterval 60s` → ≤1 Fastly Stats call/min per open admin tab. -### Iceberg Pointer + Summary Hash-Throttle ([backend/core/iceberg.py](backend/core/iceberg.py)) +### Iceberg Pointer + Summary Hash-Throttle ([backend/core/iceberg/_core.py](backend/core/iceberg/_core.py)) Every commit writes `metadata_location.txt` (unavoidable) and `table_summary.json` (skippable). The latter is content-hashed against `_table_summary_hash_cache`; identical payloads skip the PUT. Saves one FOS PUT per no-op commit in steady state. Cache is module-scope, process-lifetime. ### DuckDB Connection Pool ([backend/core/duckdb_pool.py](backend/core/duckdb_pool.py)) Per-service LIFO pool replaces per-request `duckdb.connect()` + S3 / iceberg setup + view rebind (~50ms steady-state). Pool size is `DUCKDB_POOL_MAX_SIZE` (default 8). All pool connections open with `read_only=False` — `get_connection` forces this so cron writers and pool readers don't trip DuckDB's "different configuration" error on the same file. Optional per-connection tuning: `DUCKDB_POOL_CONN_MEMORY_LIMIT` (e.g. `256MB`) caps RSS growth under concurrent large scans; `DUCKDB_POOL_CONN_THREADS` reduces context-switching when `pool_size × per_conn_threads` exceeds physical cores. View-binding happens outside the pool lock to avoid deadlocking the FastAPI thread pool when an Iceberg snapshot reload blocks. -### Hourly Top-N Rollups ([backend/core/rollups.py](backend/core/rollups.py), [scripts/backfill_rollups.py](scripts/backfill_rollups.py)) -Precomputes per-hour Top-N aggregates for the dashboard's most-asked fields (ip, country, url, custom fields) and writes them under `/data/rollups/`. 
Closed hours read from the rollup; the current ("live") hour merges the rollup with a fast scan of the buffer. Plus a per-minute time-series bundle (`rollups/timeseries/...`) used by the dashboard chart to skip the wide Iceberg scan. Skipped buckets fall back to the raw scan path. Generated by `local_compact_{id}` after each compaction pass; the global `optimize_{id}` job rebuilds the day's worth on each run. +**Pool wait observability** — `_Pool.acquire` records every checkout's wall-clock wait time to (a) the OTel `app.thread_wait_ms` histogram tagged `{outcome: reused | created | timeout, waited: true | false, service}` for off-box analysis via `docker logs app-backend-1 | grep app.thread_wait_ms`, AND (b) a bounded in-process ring buffer (~1024 samples per service) consumed by `Pool.stats().wait` (p50/p95/p99/max/mean). `GET /api/admin/health-snapshot` exposes the per-service stats; the `SystemHealthCard` on `/admin` renders top-level Pool wait p95 / Pool in-use / idle cards plus an expandable per-service table. ADR-03 escalation rule: p95 > 50ms ⇒ consider separate-process cron isolation; > 200ms flags red. Both paths are non-blocking (try/except around the recorder) so instrumentation can never break a checkout. + +### Hourly Top-N Rollups ([backend/core/rollups/](backend/core/rollups/), [scripts/backfill_rollups.py](scripts/backfill_rollups.py)) +Precomputes per-hour Top-N aggregates for the dashboard's most-asked fields (ip, country, url, custom fields) and writes them under `/rollups/`. Closed hours read from the rollup; the current ("live") hour merges the rollup with a fast scan of the buffer. Plus a per-minute time-series bundle (`rollups/hour_bundled/hour=H/time_series.parquet`) used by the dashboard chart to skip the wide Iceberg scan. Skipped buckets fall back to the raw scan path. Generated by `local_compact_{id}` after each compaction pass; the global `optimize_{id}` job rebuilds the day's worth on each run. 
+ +**Bundle tiers** (cheapest first wins in the reader): +- `rollups/day_bundled/day=D/all_fields.parquet` — one parquet per day, all fields. Reader prefers this for fully-in-window closed days. +- `rollups/hour_bundled/hour=H/all_fields.parquet` — one parquet per hour, all fields. Reader uses for partial-day boundary hours + any day without a day-bundle. +- `rollups/hour/field=F/hour=H/*.parquet` — per-(field, hour). Original source of truth; the bundle writers read from here. +- `rollups/day/field=F/day=D/*.parquet` — per-(field, day). Source for the day-bundler. + +**Virtual fields** (`waf_sig_ind`, `edge_score_reason_ind` — see `_VIRTUAL_FIELD_BACKING` in `rollups/_common.py`) are CSV-unnested at WRITE time so the dashboard reader serves them through the standard rollup path instead of paying a 30-day unnest-during-query each request. Wired in `_run_per_field_copy` (rollups/recompute.py) via `_build_virtual_field_copy_query` (rollups/_common.py). Adding a new virtual field requires (a) appending to `_VIRTUAL_FIELD_BACKING`, (b) ensuring its `backing` column is on the schema, (c) a one-shot rebundle migration so existing hour/day bundles pick it up (see next point). + +**Stale-bundle hazard.** `bundle_hours` / `bundle_days` use mtime to skip up-to-date bundles, and the cron only re-bundles HOURS THAT JUST RECEIVED DATA. Closed historical hours never get re-touched. If you add a new field to the rollup writer (real or virtual), the per-(field, hour) parquets land but the bundled `all_fields.parquet` for closed hours stays without them — the dashboard's bundled-rollup reader returns 0 rows for the new field and the runtime fallback fires. Fix: add a data migration that deletes the closed bundles and runs `backfill_*_bundles` (canonical pattern: `_rollups_virtual_field_rebundle` in [backend/core/data_migrations.py](backend/core/data_migrations.py)). 
+ +**Live-hour batch must filter virtual fields out** before `execute_top_n_batch` (in `_base.py`'s `execute_top_n_rollups`): the SQL projects `field_name AS value` and virtual names aren't real columns on the live temp table. Passing them through raises a BinderException that fails the whole UNION ALL and silently drops the live-hour merge for real fields too. See `live_fields = [f for f in fields if f in actual_cols]` at the merge site. + +**`live_temp` narrow projection** ([backend/repositories/dashboard.py](backend/repositories/dashboard.py)): only `conn_requests` + `timestamp` on the `chart_metric == "requests"` path. The runtime CSV-unnest fallback for virtual fields (`_exploded_top_n`) queries the BASE table via stashed `orig_table_name` / `orig_where_clause` / `orig_params`, not the temp, so the temp doesn't need to carry `waf_sig` / `edge_score_reason`. Map_data is derived from `all_top_res` instead of a separate query on the temp, so `country` isn't needed either. If you add a new consumer that reads from the temp, add its columns to `narrow_col_set` AND verify the chart_metric branches. + +**`get_top_bots` rollup-served UAs** ([backend/repositories/security.py](backend/repositories/security.py)): on the unfiltered path (`not filters`), top UAs come from `execute_top_n_rollups(["ua"], ..., limit=50000)` instead of scanning the iceberg view for the `ua` column. The NGWAF JOIN still needs the raw temp because `waf_req_id` is high-cardinality and not rollup-served, but the temp is single-column (`waf_req_id` only) when the rollup path serves UAs. Filtered requests fall back to the original combined `(ua, waf_req_id)` temp. ### Response Telemetry Middleware ([backend/utils/telemetry_response_middleware.py](backend/utils/telemetry_response_middleware.py)) Backstop for endpoints that return a plain `dict` instead of going through `BaseResponse.with_telemetry`.
Inspects JSON object responses, injects `_debug_queries` / `_debug_calls` / `_is_cached` from the contextvar collectors if missing. **Must be added INNER to `CompressMiddleware`** (i.e. `add_middleware(TelemetryResponseBodyMiddleware)` BEFORE `add_middleware(CompressMiddleware)`) so it sees the raw JSON, not br/zstd/gzip-encoded bytes. Skips streaming responses, non-dict bodies, and already-instrumented responses. Gated on `DEBUG_RESPONSES`; failure modes are silent + non-blocking. +### Live Query Monitor ([backend/core/query_registry.py](backend/core/query_registry.py), [backend/routers/admin_queries.py](backend/routers/admin_queries.py), [frontend/app/admin/queries/](frontend/app/admin/queries/)) +Real-time view of every executing DuckDB + SQLite query — attribution (analyst / admin / cron / system), caller `file:line`, pool slot, duration ticking up live, kind-aware Kill button that calls `con.interrupt()`. Page at `/admin/queries`, admin-only via `RemoteAccessMiddleware`. Polling at 300 ms; the Active panel promotes "completed in the last 10 s" rows as faded entries with an outcome badge so typical-traffic (p50 ≈ 0.2 ms, max ≈ 29 ms) queries are visible. Notable Slow Queries panel filters the completed-history ring buffer by threshold (100ms / 500ms / 1s / 2s / 5s), sorted slowest first. + +Instrumentation lives at two seams: SQLite `InstrumentedCursor` ([backend/utils/sqlite_profiler.py](backend/utils/sqlite_profiler.py)) registers/deregisters around `execute*`; DuckDB `InstrumentedDuckDBConnection` + `_InstrumentedResult` ([backend/core/query_instrumentation.py](backend/core/query_instrumentation.py)) wraps the connection returned from `checkout_connection` so deregistration happens at terminal-fetch time (fetchdf, arrow, etc.) rather than at `execute()` — DuckDB's execute returns in ~ms while fetch can run for seconds. Per-query overhead measured ~21 µs (~0.3% of dashboard bundle wall time). 
Cancel path is safe under pool reuse: a stamped `_conn_to_query[id(con)]` is verified under lock before `interrupt()` so a stale UI click never cancels a different query that's checked out the same physical connection later. + +Audit log fires on every successful cancel (`audit_log` in [backend/utils/structlog_config.py](backend/utils/structlog_config.py)) with the actor + full target attribution. OTel histograms: `app.active_queries.count`, `app.query_duration_ms`, `app.queries_cancelled_total`. Kill switches: `QUERY_MONITOR_ENABLED=0` hides the endpoints (404), `QUERY_REGISTRY_DISABLED=1` bypasses the hot path entirely for zero overhead. Design + post-spec polish history in [pending-docs/design_live_query_monitoring.md](pending-docs/design_live_query_monitoring.md). + ### CDN-Fronted Log Delivery FOS reads are fronted by a Fastly CDN VCL service (`cdn_service_id`, `cdn_url`, `cdn_secret`). The CDN validates a shared-secret query param to gate access; rate-limited to blunt brute-force. Separate from the logging service ID. ### Live Dashboard Sharing -Components for the live-shared-instance remote-analyst feature (Path B). Three sharing modes are exposed to the admin: +Components for the live-shared-instance remote-analyst feature (Path B). Two direct-mode sharing modes are exposed to the admin (the SSH-reverse-tunnel via localhost.run was deleted in v2.0): -1. **SSH reverse tunnel** via localhost.run (default, easiest) -2. **Admin-provided hostname** (e.g. `https://logs.example.com`) — no third-party relay -3. **Admin-provided IP** (e.g. `https://203.0.113.42:8443`) — no relay, no DNS +1. **Admin-provided hostname** (e.g. `https://logs.example.com`) +2. **Admin-provided IP** (e.g. `https://203.0.113.42:8443`) -Modes 2 and 3 share a single backend code path: `ShareStartPayload.use_tunnel=False` + `public_endpoint=`. 
The mode selector in the UI is presentational — the backend only cares whether `use_tunnel` is set and (when false) that `public_endpoint` starts with `https://` (cookies need `secure=true`). +Both share a single backend code path: `ShareStartPayload.use_tunnel=False` + `public_endpoint=`. The mode selector in the UI is presentational — the backend only cares that `public_endpoint` starts with `https://` (cookies need `secure=true`). `use_tunnel=True` still exists as a back-compat keyword and now raises a clear error. Components: -- [backend/utils/tunnel.py](backend/utils/tunnel.py) — `TunnelManager` owns `ssh -R 80:localhost:8000 nokey@localhost.run` in tunnel mode, parses assigned `https://*.lhrun.dev` hostname, tracks `TunnelState`. In direct mode (hostname / IP), no subprocess is spawned — the admin-supplied `public_endpoint` is stored and `public_url()` returns it verbatim. Process singleton via `get_tunnel_manager()`; `reset_for_tests()` for pytest. +- [backend/utils/tunnel/](backend/utils/tunnel/) — package split: `manager.py` owns the `TunnelManager` singleton (direct-mode lifecycle, sever-all panic), `session.py` holds `AnalystSession`, `rate_limiter.py` is the sliding-window `_LoginRateLimiter`, `state.py` persists `tunnel_state.json`, `fingerprint.py` computes the session fingerprint hash. Process singleton via `get_tunnel_manager()`; `reset_for_tests()` for pytest. - [backend/utils/remote_access.py](backend/utils/remote_access.py) — `RemoteAccessMiddleware` does DNS-rebinding gate (Host/Origin allow-lists, including `testclient`/`testserver` for pytest), blocks admin paths on remote requests, applies response hardening (CSP, X-Frame-Options DENY, no-store, no-referrer). `_StaticAssetLimiter` rate-limits static assets to blunt scrapes. 
-- [backend/core/share_db.py](backend/core/share_db.py) — singleton SQLite at `data/system/remote_share.db`: `remote_invites`, `invite_services`, `remote_sessions`, `remote_share_audit_logs`, `share_settings`, `remote_invite_claim_tokens`, `share_tos_versions`. WAL mode, numbered migrations, bcrypt passcodes, per-IP/per-email lockout. +- [backend/core/share_db/](backend/core/share_db/) — package split: `connection.py` (pool + corruption self-heal with quarantine), `schema.py` (own MIGRATIONS dict + `apply_pending` + `PRAGMA user_version`), `invites.py`, `sessions.py`, `audit.py`, `passcode.py` (argon2id current default; scrypt verify branch stays for transparent rehash-on-login upgrade), `tos.py`, `settings.py`, `validation.py`. Singleton SQLite at `data/system/remote_share.db`: `remote_invites`, `invite_services`, `remote_sessions`, `remote_share_audit_logs`, `share_settings`, `remote_invite_claim_tokens`, `share_tos_versions`. WAL mode, per-IP/per-email lockout. - [backend/routers/share_auth.py](backend/routers/share_auth.py) (`/api/share/*`) — analyst-facing: `login`, `logout`, `acknowledge`, `heartbeat`, `claim/{token}`. Tagged so middleware lets them through the tunnel. - [backend/routers/share_admin.py](backend/routers/share_admin.py) (`/api/admin/share/*`, **blocked over tunnel**) — admin-facing: tunnel lifecycle, invite CRUD, session evict, panic/sever-all, backup export/import, GDPR erase, settings. - Frontend: [ShareDashboardDialog](frontend/components/ShareDashboardDialog/), [/share-login](frontend/app/share-login/) (TOS-gated), [useAnalystHeartbeat](frontend/hooks/useAnalystHeartbeat.ts), [useShareStatusBanner](frontend/hooks/useShareStatusBanner.tsx). Watermark mounts in `AppLayout` when `bootstrap.settings.is_remote_analyst === true`. @@ -263,6 +313,20 @@ A global middleware in [frontend/lib/api.ts](frontend/lib/api.ts) checks `respon **Streaming/binary endpoints** (SSE, blobs) use raw `fetch()` — leave a comment so future readers don't "fix" it. 
+### Server-side bootstrap pre-fetch ([frontend/lib/ssr/bootstrap.ts](frontend/lib/ssr/bootstrap.ts), [frontend/app/layout.tsx](frontend/app/layout.tsx)) + +The root layout SSR-fetches `/api/bootstrap`, dehydrates it into the React Query cache (via a new `HydrationBoundary` in `QueryProvider`), and ships the JSON inline in the first HTML paint. `useBootstrap` and every hook that reads `bootstrap.*` via `queryClient.getQueryData(['bootstrap'])` find the data already cached on first render — no client-side bootstrap RTT, no `'No service selected'` flash, share banner in the initial paint. + +Adding a new SSR pre-fetch (e.g., for a per-page endpoint): + +1. **Use `node:http.request`, NOT `fetch()`.** Node's `fetch()` always overrides the `Host` header from the URL. The backend's `_remote_host_allowed` gate rejects remote-classified requests whose Host isn't the public endpoint — so without preserved Host, the SSR fetch returns 400 host_not_allowed and silently falls through to the client. +2. **Trust topology is `X-Remote-Analyst: 1`, not `X-Proxied-By-Caddy`.** The SSR runtime hits the backend over loopback. `is_request_remote` ([backend/utils/remote_access.py](backend/utils/remote_access.py)) classifies based on `request.client.host` first, so a forwarded Caddy marker is IGNORED. `X-Remote-Analyst: 1` is the loopback-honored primitive (gated on `tunnel_manager.is_sharing_active()`). Forward it ONLY when the inbound request carries `X-Proxied-By-Caddy` — otherwise the admin SSH-tunnel path is mis-classified as analyst and 400'd. (See history: the 2026-06-11 SSR-leak incident reverted in `f3d8dd7` / `546c279` was the previous-attempt version that forwarded `X-Proxied-By-Caddy` directly. Backend ignored it, returned admin payload, dehydration leaked admin fields into public HTML.) +3. 
**Always wrap in try/catch + bounded timeout, return `null` on any failure.** SSR errors must NEVER propagate into a broken page — the layout falls back to client fetch when the helper returns null. 5s is generous for prod cron contention; never block SSR longer. +4. **`force-dynamic` is REQUIRED** in any layout/page that does a per-request SSR fetch via `cookies()` / `headers()` from an imported helper. Next.js's static-analysis pass only detects direct `cookies()` calls in the component file itself — calls from an imported module won't flip the route to dynamic. Without `export const dynamic = "force-dynamic"` the layout gets SSG'd at build time (when the backend isn't reachable) and the dehydrated state is permanently empty. +5. **Adversarial test required:** before deploying, hit the prod public URL anonymous AND the admin tunnel and verify the dehydrated state shape. Anonymous public must contain only the `needs_login` stub (NO `sharing_active`, NO `ngwaf_workspace_id`, NO `sync_status`). Admin must contain the full payload. + +The `serviceStore` Zustand slice hydrates from the SSR-cached bootstrap in `useBootstrap`'s post-mount `useEffect` — for the one-render window before that effect fires, use [`useEffectiveServiceId`](frontend/hooks/useIsDataReady.ts) which falls back to `bootstrap.active_service_id` from the React Query cache. Direct reads of `useServiceStore(s => s.activeServiceId)` flash "No service selected" on first paint. + ### Canonical patterns (May 2026 DRY refactor — use these in new code) 1. **`response_model=` on every router handler.** Without it the OpenAPI emits `Record`. Routes using `Depends(get_source)` should also lift `service_id: str` into the signature so it appears as a path parameter. @@ -270,11 +334,12 @@ A global middleware in [frontend/lib/api.ts](frontend/lib/api.ts) checks `respon 3. 
**`ReportLayout`** for analytics pages — bundles `usePageContext + useReportConfig + useFilterPayload + useUrlFilterSync + useServiceQuery + ChartIntervalButtons + ReportShell`. Fall back to `ReportShell` only for multi-query or non-standard chrome pages. 4. **`HelpDialog`** from [components/ui/help-dialog.tsx](frontend/components/ui/help-dialog.tsx) — don't compose `Dialog + DialogHeader + DialogTitle` by hand for help content. 5. **`useBaseMap`** for any MapLibre setup. Don't duplicate the world-layer + theming inline. -6. **`metadata_db.record_audit(service_id, event_type=..., details=...)`** — direct. The `duckdb.log_audit_event` shim and `repositories/audit.py` pass-through were removed. +6. **`metadata.record_audit(service_id, event_type=..., details=...)`** — direct (or via the `metadata_db` shim; both resolve to the same `metadata.audit` impl). The `duckdb.log_audit_event` shim and `repositories/audit.py` pass-through were removed. 7. **`date_utils.parse_iso_utc` / `iso_z` / `iso_z_now`** — don't hand-roll `datetime.fromisoformat(s.replace("Z", "+00:00"))`. -8. **`@cron_task` decorator** in [backend/scheduler.py](backend/scheduler.py) — handles `start_call_tracking`, `set_process_context`, `flush_usage_log` finally-block. +8. **`@cron_task` decorator** in [backend/cron/decorators.py](backend/cron/decorators.py) — handles `start_call_tracking`, `set_process_context`, `flush_usage_log` finally-block, watchdog hard-cap. Re-exported from [backend/scheduler.py](backend/scheduler.py) for compat. 9. **`empty_schema_response(runner)`** in [_base.py](backend/repositories/_base.py) — return this when a repo function hits a service with no logs. 10. **`origin_latency_us_expr(actual_cols)`** in `_base.py` — don't hand-roll the `COALESCE("ottfb", "ttfb" * 1000000.0)` fragment. +11. 
**`useEffectiveServiceId`** in [hooks/useIsDataReady.ts](frontend/hooks/useIsDataReady.ts) — read this instead of `useServiceStore(s => s.activeServiceId)` whenever the answer matters on FIRST PAINT (gating views, building cache keys, "no service selected" branches). It falls back to `bootstrap.active_service_id` from the SSR-hydrated React Query cache so the page doesn't flash empty before the persisted Zustand store catches up. ### Next.js navigation + loading conventions (READ BEFORE TOUCHING FRONTEND) @@ -375,7 +440,7 @@ re-renders triggered by store subscriptions. The trace shows which. - `backend/utils/audit_helpers.py` (referenced the long-removed DuckDB `_ingested_files` table) - `backend/repositories/audit.py` (was a 27-line pass-through) - `scripts/validate_logs.py` / `.sh` (depended on removed bits) -- `backend/core/duckdb.log_audit_event` shim (call `metadata_db.record_audit` directly; test patches must target `backend.core.metadata_db.record_audit`) +- `backend/core/duckdb.log_audit_event` shim (call `metadata.record_audit` directly; test patches must target `backend.core.metadata.audit.record_audit` — or `backend.core.metadata_db.record_audit` via the shim, which the `_ShimModule` proxy mirrors onto the live binding) - `QueryRunner.safe_select` / `safe_select_list` (use `actual_cols` directly) ## Testing @@ -457,10 +522,10 @@ A job fired after the config was deleted. The next `reload()` evicts the stale j The RHS of `~` or `!~` must be a literal. No variables, no concatenation. Use `regsub()` / `regsuball()` for dynamic logic. ### 15. Operational metadata lives in per-service SQLite, not DuckDB -Alerts, views, audit, cron history, ingested-file dedup, ASN names, source registration, usage telemetry → `data/services/{id}.metadata.db` (WAL). Read/write via [backend/core/metadata_db.py](backend/core/metadata_db.py) — never via DuckDB. 
JOINs against log data: ATTACH the SQLite read-only as `meta` via `attach_metadata_db()`, or pre-fetch and inline as a parameterised IN list (see `dashboard.py` ASN search). +Alerts, views, audit, cron history, ingested-file dedup, ASN names, source registration, usage telemetry → `data/services/{id}.metadata.db` (WAL). Read/write via [backend/core/metadata/](backend/core/metadata/) (or the [backend/core/metadata_db.py](backend/core/metadata_db.py) shim for old import paths) — never via DuckDB. JOINs against log data: ATTACH the SQLite read-only as `meta` via `attach_metadata_db()`, or pre-fetch and inline as a parameterised IN list (see `dashboard.py` ASN search). New write paths use the `@sync_db_retry` (tenacity-backed) decorator to handle SQLite `OperationalError` busy/locked under WAL contention. ### 16. Monkeypatches → catalog in [MONKEYPATCHES.md](MONKEYPATCHES.md) -We patch six s3fs methods + one PyIceberg `SqlCatalog.load_table` at import time for telemetry-proxy routing, immutable-bytes caching, and table-object reuse. Every patch is documented in MONKEYPATCHES.md with site, motivating incident, and cleanup path. Update that file in the same commit when you add/modify/remove a patch. +Historically we patched six s3fs methods + one PyIceberg `SqlCatalog.load_table` at import time. Phase 4 of the v2.0 carve-up replaced the s3fs patches with `FosS3FileSystem` / `CachedS3FileSystem` subclasses in [backend/core/iceberg/fs.py](backend/core/iceberg/fs.py) registered as a pyiceberg `FileIO`. Whatever remains is documented in MONKEYPATCHES.md with site, motivating incident, and cleanup path. Update that file in the same commit when you add/modify/remove a patch. ### 17. MSW + openapi-fetch ordering — `server.listen()` must run at module load `openapi-fetch` captures `globalThis.fetch` at `createClient` time. 
[frontend/lib/api.ts](frontend/lib/api.ts) creates its client at module load, so MSW's `server.listen()` MUST execute at the top of [frontend/vitest.setup.ts](frontend/vitest.setup.ts) — **not inside `beforeAll`**. If listen runs after lib/api.ts is imported, the captured fetch is the unpatched original and every test silently bypasses MSW. Symptom: handlers never fire, requests hit real loopback. Don't move that call into a hook. @@ -475,11 +540,25 @@ Our [frontend/vitest.config.ts](frontend/vitest.config.ts) sets `globals: false` The tunnel exposes the same FastAPI app to the public internet. Middleware classifies by `Host` and blocks remote requests from admin paths — including `/api/admin/share/*`. When you add an endpoint analysts must reach, register under `/api/share/*` or update `_is_blocked_path()`. Don't remove the `testclient`/`testserver` allow-list entries — they're what let pytest hit admin routes. ### 21. `sync_data` orphan-cleanup vs local-compaction outputs -Local compaction writes merged rollups to three places: `/data/daily/`, `/data/weekly/`, and `/data/timestamp_hour=*/compacted_*.parquet`. None of these are tracked by the iceberg snapshot, so they are NOT in `cloud_files`/`active_paths`. The orphan-cleanup loop in [backend/core/iceberg.py](backend/core/iceberg.py) `sync_data()` walks the cache and deletes anything not in `active_paths`; without explicit allow-rules it nukes every compacted output, and the [`local_compacted_files` registry](backend/core/metadata_db.py) then blocks re-download of the source files — silently dropping rows from the view (production: 1.65M → 302K on 2026-05-31, then 1.66M → 1.62M on 2026-06-01 from the per-partition `compacted_*` variant). The fix is two-pronged: orphan-cleanup restricts its walk to `timestamp_hour=*` dirs AND skips `compacted_*.parquet` filenames. 
**If you add a new local-only output pattern, add it to both the dir skip and the file skip.** Integration coverage in [tests/core/test_local_compaction.py](tests/core/test_local_compaction.py)::`test_compaction_outputs_survive_iceberg_sync_orphan_cleanup` exercises the round-trip with real `compact_local_partitions` + real `sync_data`. +Local compaction writes merged rollups to three places: `/data/daily/`, `/data/weekly/`, and `/data/timestamp_hour=*/compacted_*.parquet`. None of these are tracked by the iceberg snapshot, so they are NOT in `cloud_files`/`active_paths`. The orphan-cleanup loop in [backend/core/iceberg/_core.py](backend/core/iceberg/_core.py) `sync_data()` walks the cache and deletes anything not in `active_paths`; without explicit allow-rules it nukes every compacted output, and the [`local_compacted_files` registry](backend/core/metadata/ingest_log.py) then blocks re-download of the source files — silently dropping rows from the view (production: 1.65M → 302K on 2026-05-31, then 1.66M → 1.62M on 2026-06-01 from the per-partition `compacted_*` variant). The fix is two-pronged: orphan-cleanup restricts its walk to `timestamp_hour=*` dirs AND skips `compacted_*.parquet` filenames. **If you add a new local-only output pattern, add it to both the dir skip and the file skip.** Integration coverage in [tests/core/test_local_compaction.py](tests/core/test_local_compaction.py)::`test_compaction_outputs_survive_iceberg_sync_orphan_cleanup` exercises the round-trip with real `compact_local_partitions` + real `sync_data`. ### 22. `unattended-upgrades` can OOM a memory-tight VM A 16 GB Linux VM running backend + frontend + caddy holds a steady-state working set in the 10-13 GB range. The Debian/Ubuntu nightly `apt-daily-upgrade.timer` forks a transient 1-2 GB downloader on top of that, which can trip an OOM kill that wedges the kernel (sshd dies; needs a VM reset). 
The mitigation is to `systemctl mask apt-daily.timer apt-daily-upgrade.timer unattended-upgrades.service` on the host and re-assert it on every restart so a re-image / apt-reinstall can't silently re-enable them. Trade-off: no automatic security patching — patch manually on a planned maintenance window with the backend container stopped. **If you provision a VM with more RAM, you may safely re-enable upgrades.** +### 23. SSR upstream fetch must use `node:http`, not `fetch()` +Node's `fetch()` always rewrites the `Host` header from the URL — there's no way to override it. The backend's `_remote_host_allowed` gate ([backend/utils/remote_access.py](backend/utils/remote_access.py)) rejects remote-classified requests whose `Host` isn't the public endpoint. SSR helpers like [frontend/lib/ssr/bootstrap.ts](frontend/lib/ssr/bootstrap.ts) use `node:http.request` which preserves arbitrary headers verbatim. If you write a new SSR helper, do NOT reach for `fetch()` — copy the `rawRequest` pattern. The 2026-06-11 SSR-leak incident (reverts `f3d8dd7` / `546c279`) was the first version using `fetch()`; the `Host` got rewritten to `127.0.0.1:8000`, the backend classified as admin-from-loopback, and the full admin bootstrap dehydrated into anonymous public HTML. + +### 24. Rollup writers must rebundle bundles after adding a field +`bundle_hours` / `bundle_days` use mtime to skip up-to-date bundles. The cron only re-bundles HOURS THAT JUST RECEIVED DATA. Closed historical hours never re-touch. So a new field added to the rollup writer (real or virtual) lands as a per-(field, hour) parquet but the bundled `all_fields.parquet` for closed hours stays without it — the dashboard's bundled-rollup reader returns 0 rows for the new field and the runtime fallback fires (defeats the perf win). Fix: ship a one-shot data migration that deletes the closed `all_fields.parquet` files and runs `backfill_*_bundles` so they get rewritten with the new field. 
Canonical pattern: `_rollups_virtual_field_rebundle` in [backend/core/data_migrations.py](backend/core/data_migrations.py). + +### 25. Virtual fields blow up the live-hour batch if not filtered out +`execute_top_n_rollups` in [_base.py](backend/repositories/_base.py) needs the active-hour merge to include real fields' new rows. The live-hour SQL projects `field_name AS value` and BinderExceptions on any name that's not a column on the live temp. Virtual fields like `waf_sig_ind` don't exist as real columns — passing them through silently kills the whole UNION ALL (the outer `except Exception: pass` swallows it) and drops the live-hour merge for REAL fields too. Always filter to `actual_cols` before the batch: +```python +live_fields = [f for f in fields if f in actual_cols] +if live_fields: + live_res, _ = self.execute_top_n_batch(live_fields, tmp_name, ...) +``` + ## AI Agent Directives These apply to every change, regardless of scope. @@ -526,6 +605,33 @@ These apply to every change, regardless of scope. 17. All new endpoints get at least one test in `tests/routers/`. 18. Regenerate OpenAPI types after the endpoint lands: `cd frontend && npm run gen:types`. +### Architectural choices to preserve + +The 2026-06 retrospective surfaced several structural decisions the audit specifically validated. Don't rewrite these in a future reimagining: + +- **ADR-driven architecture with decisions captured AFTER the lesson lands.** This is the velocity strategy, not a debt. Continue the cadence — write the ADR after a phase ships, not before. +- **[MONKEYPATCHES.md](MONKEYPATCHES.md) as a living inventory** with root-cause attribution per patch (incident date, why upstream can't fix, removal criteria). +- **Property-based testing** (Hypothesis) for filter/query roundtrips. Catches drift without hand-written matrices. +- **RequestContext** making tenancy structurally impossible to bypass — can't construct without `_enforce_service_access`. 
+- **Modular package carves with re-export shims** for backward compat during refactor (the `metadata_db.py` / `scheduler.py` pattern). +- **Named exception classes + explicit retry policies** (vs. generic `except Exception`). +- **Three-tier docs scheme** (pending-docs / local-docs / docs) — intentional and works for a public-repo solo project. +- **MVP-then-iterate cadence with phase-based cleanup.** Don't propose "spike before shipping" rewrites — solo bandwidth and information-unavailability at v1.0 time make iterate-then-cleanup the right trade-off. + +### Anti-patterns explicitly rejected + +If a refactor proposal matches one of these, push back. Each was investigated and rejected during the 2026-06 audit; the rationale is preserved here so future-you / future-agent doesn't relitigate: + +- **Generic "schema codegen" infrastructure** for FilterSpec — `openapi-typescript` already handles the 80% case; codegen can't express the procedural collision-handling logic that's the actual duplication. +- **Premature `usePagination` / `PaginationConfig` context** when there are only 2 paginated endpoints with genuinely different sort semantics. +- **Centralized `RoleProvider` context** — role is 2 orthogonal flags (`analyst_session` × `is_remote_analyst`), not a hierarchy; an enum would have locked in a false model when SHARE-INVITED was added. +- **Multi-language scoring codegen** (Python ↔ Rust) — parity is enforced cheaply by fixture tests; codegen adds versioned-schema overhead and constrains schema evolution. +- **Pre-formatted server-side response values** — `TopTenTable` needs raw values for click handlers and map ops; pre-formatting forces double payload and locks display format into the API contract. +- **Cache-coherence "state machine" abstractions** — the bottleneck is DuckDB view rebuild time, not cache layer policy; a state machine wouldn't have prevented the 2026-06-09 transient-empty-result incident. 
+- **Unified `QueryExecutor`** for retry — stale-view and compaction-race are different error classes with different recovery costs; collapsing them creates a leaky abstraction. +- **Tentacle-parameter threading** through repository signatures (e.g., passing `RequestContext.cached_temps` to every repo function) — couples request scope to data layer. +- **Custom `FsspecFileIO` subclass to "fix" the s3fs monkeypatches** — investigated 2026-05-21 and rejected; pyiceberg instantiates `S3FileSystem` directly inside its `_s3()` builder, bypassing the FileIO layer entirely. Wait for upstream `supply-your-own-FileSystem-class` hook (tracked in [MONKEYPATCHES.md](MONKEYPATCHES.md)). + ## Keeping This File Current Update this file in the same commit that introduces: diff --git a/CHANGELOG.md b/CHANGELOG.md index 3309df7c..deca261d 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -5,6 +5,352 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog 1.1.0](https://keepachangelog.com/en/1.1.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). +## [Unreleased] + +### Cleanup + +Post-2.0.0 cleanup sweep applying an in-tree audit's recommendations. +The pattern across the work was the same on every front: kill the dual +maintenance that survived the package carve-up. + +- **Three SQLite pools collapse into one.** `metadata.base`, + `metadata.usage_log_db`, and `share_db.connection` all owned + identical thread-local pool machinery (same module globals, same + PRAGMAs, same init lock). They now share `ThreadLocalPool` in + `backend/core/sqlite_pool.py`. share_db queries flow through + `InstrumentedConnection` for the first time — they now appear in + the Live Query Monitor under `service=__global_share__`. 
+- **Origin summary's per-query templates collapse into one path.** + `TEMP_SUMMARY_ROLLUP` + `TEMP_SUMMARY_BY_EDGE` are gone; the live + and TEMP-table paths both use `SUMMARY_GROUPING_SETS` through a + shared `_shape_summary` helper that reads rows by column name + (`cursor.description` dict access) instead of positional indices. +- **Cron job tails consolidated.** Five `finally:` blocks ending in + the same `if run_id: update_cron_duration ... except: pass` + boilerplate route through `finalize_cron_duration`. The 16+ + `load_config / 404` preambles funnel through `load_service_config`. + Three `start_cron_run → spawn-thread → 503` triples collapse into + one `start_or_resume_cron`. Per-hour bundle walks + (`collect_hourly_bundle_paths`) and the two cross-package migration + runners (`run_pending_migrations`) get the same treatment. +- **Mixins + helpers for the small repeated shapes.** + `LogExtentsMixin` (`earliest_log_at` + `latest_log_at`), + `OkResponse` (`ok: bool = True`), `_atomic_write_json`, + `_get_cfg_field`, `client_ip`, `shim_attr`, plus iceberg + `_iceberg_root_prefix` + `_metadata_pointer_candidates`. +- **`fetch_service_name` now routes through the shared `fastly()` + client** instead of an inline urllib body. Adds a `timeout` keyword + to `fastly()` (default 30 s preserves the existing behavior of the + ~50 other call sites) and the name-fetch call site pins + `timeout=10` + `max_retries=1` so the cold-path tail caps at ~21 s + vs the client default of ~127 s. Caller is behind a 300 s name + cache so steady-state cost is unchanged. +- **`_run_falco_lint` absorbs the falco subprocess plumbing** shared + by `vcl_utils.lint_log_format` (logging-endpoint VCL check) and + `vcl_validator.lint_vcl` (scoring-snippet VCL check). Each caller + keeps its own falco-not-available handling, timeout budget, and + output parser — the helper only owns the tempfile lifecycle, + `subprocess.run` invocation, and tempfile-path redaction. 
The two + use cases stay distinct on purpose (logging is best-effort, scoring + is a security boundary). + +### Fixed + +- `start_proxy_server` race that surfaced as + "proxy server is not running" when N reader threads called + `get_connection` simultaneously on a cold process. Concurrent + first-callers now serialise the thread-start decision and wait + on `_READY` outside the lock so every caller reads `_PORT` after + the server has bound. +- `get_metadata_storage_stats` + `cleanup_metadata` silently + ignored the `usage_log` table on every fresh service after + the v2.0 per-service-file split — the helpers still read + `metadata.db`. Routed through `usage_log_db` so admin storage + stats and the retention cleanup job actually see the rows. +- `sync.py` cron tail used to emit a misleading + "View refresh + warm: Xms" status event even on failure (the + success log sat outside the try/except). The shared + `refresh_view_and_warm_pool` puts the success log inside the + try/except so failure means no event. +- `start_cron_run` non-sync task types fell back to + `cron_compact.log_retention_days` via a buggy ternary; the + promoted `_TASK_TO_CRON_KEY` mapping plus a default 7-day + fallback gets the correct retention applied per task. +- `query_instrumentation._safe_weakref` silently no-op'd the + memory probe when wrapping non-weakref-able cursors; promoted + the registry-version's strong-ref-closure fallback so the probe + always tracks. +- `local_compaction` hour-tier tests were flaky on any clock more + than 30 days past the hardcoded sample dates — the fixture now + pins both `_DAILY_TIER_AGE_DAYS` and `_WEEKLY_TIER_AGE_DAYS` so + neither tier sweeps the test partitions out from under the + assertions. + +### Removed + +- `backend/utils/retry.py`, `backend/utils/cdn.py`, + `backend/core/settings.py` (Path-B removal of three migration + scaffolds that never adopted in tree). 
`pydantic-settings` + dropped from `pyproject.toml` + `uv.lock` (was the sole + consumer). +- Legacy `usage_log` DDL + 3 triggers + 4 indexes in + `metadata.base._SCHEMA` (the table moved to its own per-service + file pre-2.0). `migrate_from_metadata_db` and + `_migration_003_rebuild_usage_log_hourly_summary` deleted. +- Scrypt passcode verify path + `PASSCODE_DEFAULT_ALGO_KEY` + + `_migration_003_passcode_algo_marker` (cutover happened + pre-2.0; fresh installs have no scrypt rows). +- `TunnelState.use_tunnel` + `tunnel_url` + the + `share_admin` response keys that exposed them (always + False/None since v2.0 deleted the SSH path). +- Per-checkin `_cleanup_temp_tables` sweep in `duckdb_pool` — + the "safety net" was unreachable because the failure path + discards the connection before the sweep can run. + +## [2.0.0] - 2026-06-12 + +Architecture cleanup release. The post-`v1.2.0` perf branch closed the +worst read-path latency by stacking remediation on top of an +architecture that wasn't designed for the workload; this release pays +that down. The largest backend files were carved into per-concern +packages, telemetry moved to OpenTelemetry + structlog, tenancy got a +typed `RequestContext` boundary, frontend hydration warm-up hacks were +replaced with policy, and the test + type gates ratcheted to a level +that catches regressions on the way in. Composite endpoints land as a +hard cutover — frontend + backend ship together, granular endpoints +deleted. + +### Architecture + +- **`backend/core/iceberg.py` (4,232 LOC)** → `iceberg/` package + (`view`, `catalog`, `warehouse`, `manifest`, `fs`, `_core`, + `buffer`, `ddl`, `snapshot_cache`, `dedup`, …). Custom + `FosFsspecFileIO(FsspecFileIO)` + `CachedFosS3FileSystem(S3FileSystem)` + subclasses replace 5 of the 6 historical `s3fs` monkeypatches; + only the `ThreadPoolExecutor.submit` ContextVar wrapper remains + (see [MONKEYPATCHES.md](MONKEYPATCHES.md)). 
+- **`backend/scheduler.py` (2,843 LOC)** → `backend/cron/` package + with `scheduler`, `decorators`, and per-job modules under + `cron/jobs/` (`sync`, `commit`, `compaction`, `optimize`, `expire`, + `metadata`, `gap_heal`, `rollup_compact_daily`). The scheduler + picks the **separate-pool** isolation strategy based on Phase 1 + thread-wait telemetry; the deferred-view-cache-invalidation hack + is gone. +- **`backend/core/metadata_db.py` (3,168 LOC)** → `backend/core/metadata/` + package with concern-partitioned mixins (`base`, `alerts`, `views`, + `ingest_log`, `cron_log`, `asn_cache`, `usage_log`, `reconciliation`, + `state`). `metadata_db.py` becomes a thin backward-compatible shim. +- **`backend/utils/tunnel.py` (1,022 LOC)** → `backend/utils/tunnel/` + package (`manager`, `session`, `rate_limiter`, `state`, + `fingerprint`). The SSH-to-localhost.run path is **deleted entirely** + (~400 lines): no more SSH subprocess + sleep-listener + reconnect + state machine. Direct-mode only; production has always used direct. +- **`backend/core/share_db.py` (1,312 LOC)** → `backend/core/share_db/` + package (`connection`, `schema`, `invites`, `sessions`, `audit`, + `passcode`, `tos`, `settings`). `argon2-cffi` replaces `scrypt` for + passcode hashing. +- **`backend/routers/admin.py` (1,650 LOC)** → `backend/routers/admin/` + package (14 sub-modules: `pop_locations`, `ingest`, `trees`, + `downloads`, `sync_status`, `compaction`, `health`, + `log_accounting`, `iceberg`, `bot_sources` + shared + `_helpers` / `_dir_size` / `_router`). +- **`backend/core/rollups.py` (2,045 LOC)** → `backend/core/rollups/` + package (8 sub-modules: `_common`, `time_series`, `sessions`, + `hour_bundles`, `day_bundles`, `recompute`, `wellknown_bots`). +- **`RequestContext` replaces `AnalyticsDeps`** ([`backend/core/request_context.py`](backend/core/request_context.py)). + Tenancy is enforced at context construction; routes never parse a + `service_id` from a path param. 
The security-load-bearing private + `read_only` attribute is now structurally unexposable as a query + param. +- **Composite endpoints + hard cutover** — `dashboard/bundle`, + `security/bundle`, `network/bundle` ship together with the frontend + swap. Granular per-card endpoints deleted, `_meta_con` parallel path + dropped, `is_cached/_is_cached` alias collapsed, + `AnalyticsDeps = RequestContext` shim removed. Top-5 backend files + now ≤ 1,461 LOC; no backend file > 1,500. + +### Telemetry, observability + +- **OpenTelemetry** (`opentelemetry-api/sdk` + + `fastapi`/`botocore`/`aiohttp` instrumentors) replaces the four + fragmented custom telemetry surfaces. Console exporter ships by + default; backends (Jaeger / Tempo / Honeycomb / …) are a + deploy-config decision, not part of this release. +- **`structlog`** wires `trace_id` + `span_id` into structured log + output via a custom processor. +- **`process_context_scope` + `_ACTIVE_CONTEXTS` mirror kept** at + [`backend/utils/telemetry.py`](backend/utils/telemetry.py). OTel context + propagation uses Python ContextVars under the hood, which inherit + the cross-thread limitation (fsspec iothread, pyiceberg + ThreadPoolExecutor) the manual mirror was built to solve; removing + the mirror would re-introduce the ~80%-NULL telemetry bucket + observed on 2026-05-20. Docstring + plan entry document the + reasoning. +- **`RequestTelemetry`** thin wrapper owns section spans, query + attribution, call log, and the custom `app.thread_wait_ms` metric + that fed the Phase 6 separate-pool decision. + +### Reliability, perf + +- **`aiodns` + `asyncio.gather` + bulk-transaction sqlite writes** in + [`backend/utils/rdns_cache.py`](backend/utils/rdns_cache.py) replace the + serial-blocking `socket.gethostbyaddr` loop that wedged the sync + worker for minutes on bulk lookups. +- **`tenacity`** decorator-based retry replaces ad-hoc try/except loops + for Fastly API + NGWAF + SQLite WAL-busy paths; centralised policy + on `Settings`. 
+- **`pydantic-settings`** centralises env-var reads + boot validation + (the "TRUSTED_PROXY_IPS required in prod" gate is now a pydantic + validator). +- **`cachetools`** replaces `bounded_cache` / `rdns_cache` / + `ngwaf_bot_cache` in-process LRU/TTL implementations. +- **Structured `.tf.json`** generation replaces f-string HCL + + `_hcl_escape` regex (`backend/utils/terraform_gen.py`), eliminating + the custom-HCL escaping injection vector. +- **`orjson` via FastAPI `ORJSONResponse`** for ~5–10× faster JSON + serialisation on composite endpoint payloads. +- **`rich` + `typer`** for the provision CLI; `httpx` everywhere + except `telemetry_proxy.py` (which stays on `aiohttp` for the proxy + server role). +- **`nuqs`** as the URL state source on the frontend, replacing the + custom Zustand/Effect sync hooks that produced hydration desync on + refresh. +- **`session_scoring._cached`** clears `_inflight` on the cache-hit + path too, not only on producer-path teardown — concurrent callers + on a hot cache key no longer leak the inflight registration when + the producer finishes before they wake up. +- **`iceberg/buffer.tombstone_buffer_files`** logs + skips on + marker-write failure (the immediate-`os.remove` fallback re-opened + the in-flight-query race the tombstone grace window exists to + close). Pair regression test pins the contract. +- **`DROP TABLE IF EXISTS` identifier quoting** at 11 temp-table + cleanup sites so the drop tolerates reserved keywords / hyphenated + service slugs that would otherwise raise. + +### Trust topology, middleware + +- **Middleware order asserted at boot AND in tests** — the + multi-paragraph prose comments in `main.py` were replaced with + one-line `# INVARIANT` markers + a boot-time crash if + `app.user_middleware` doesn't match the declared tuple. Snapshot + tests cover Caddy + docker-compose middleware order too. +- **`@pytest.mark.security_regression` marker + monotonic-count CI + gate** (floor: 24, from `audit-findings/`). 
Every test covering a + verified security fix carries the mark; a refactor cannot silently + drop coverage of a known fix. +- **Trust-topology snapshot tests** pin Caddy `@from_fastly` matcher, + XFF forwarding, `/share-login` rate-limit, and the backend + `--forwarded-allow-ips=127.0.0.1` flags. +- **`raise_internal(logger, exc, code, status)`** replaces + `raise HTTPException(detail={"error": str(e)})` at every backend + except site that previously echoed the original exception message + to the client. Detail is now `{"error": <code>, "error_id": <8-hex>}`; + the full exception lands in the server log with the same + `error_id` so operators triage without the upstream body / token + fragments leaking on the wire. +- **`escape_sql_literal`** applied at every `read_parquet()` / + `glob()` site that interpolates a computed path. Closes the + injection surface a partially-validated path could open through + DuckDB's `read_parquet()` glob expansion. +- **Caddy container drops privileges** — `caddy/Dockerfile` adds + `USER caddy` (the base image ships the user). Caddy is the only + externally-facing socket and binds nothing below port 1024, so + there's no reason to keep `root` in the runtime. + +### Frontend + +- **RSC/CSR boundary** documented in `app/_routing.md`. The + hidden-Plotly + hidden-MapLibre + `setTimeout` warm-up hacks are + dropped; replaced with `modulepreload` + the styledata-event swap + pattern. +- **16 frontend files > 500 LOC split.** `ProvisionWizard.tsx` + (3,582 LOC) → `wizard/steps/*` + `state.ts` + `api.ts`; + `app/logs/page.tsx` (2,136 LOC) → `_sections/*` + `_state.ts`. + `app/admin`, `app/dashboard`, `app/alerts`, `app/security`, etc. + all post-split < 500. **No frontend file > 499 LOC.** +- **Live Query Monitor** — live-first sort, peak-memory column, + keyboard shortcuts, URL-persisted filters, per-run inline expand + for ×N cron-grouped rows, ≥ 30 s stuck-query pulse, copy-SQL, + sound notification removed.
+- **Operations Overview cards** on the admin landing page surface + ingest gap + live query activity + slow-query count so the things + operators actually care about don't live three clicks deep. + Tone-coded (default → attention → warning → critical) so a + sustained_loss event jumps out. +- **Stable React keys on dynamic lists** — `DebugPanel`, `CronLiveLog`, + the network metro leaderboard, the query toolbar, and the + custom-field drawer now key off a stable identity instead of array + index. `useSSE` attaches a monotonic `_id` to each line so + append-only feeds (cron progress, query streams) keep stable keys + across re-renders. +- **Accessibility pass** — `FieldGroups` and `FileBrowser` disclosure + widgets are real `