feat(scripts): discovery_snapshot.py — daily Discovery-Tracking cron (P3.1)#56

Merged
MoltyCel merged 1 commit into main from feat/discovery-snapshot-cron on May 21, 2026

@MoltyCel
Owner

Summary

Discovery-Tracking P3.1: CRON. Self-contained daily script scripts/discovery_snapshot.py that writes one snapshot per day into the discovery_snapshots table (migration from PR #55).

Per SPEC docs/specs/2026-05-21_discovery-tracking-baseline-SPEC.md §3.5 + §5.2.

What the script does (5 sources)

| Source | Method | Auth |
|---|---|---|
| self_probes | GET 4 discovery surfaces (sitemap URL count, llms.txt MoltGuard block, /guard/openapi.json path count, /extendedAgentCard MoltGuard extensions) | none (TrustScout DID for extendedAgentCard) |
| bot_hits | parse /var/log/nginx/access.log* (last 7d), bot UA × endpoint class | moltstack is in the adm group → no sudo needed |
| github | repo + traffic API, 6 MoltyCel repos | GH_TOKEN from ~/.moltrust_secrets; graceful pat-not-configured if absent |
| gsc | manual-pending (V0 §9.1) | |
| errors | non-fatal failures collected → source_run_status ok/partial/failed | |
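
The error-collection row can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `run_sources` and the rule "no errors is ok, all sources failing is failed, anything else is partial" are assumptions, since the PR only names the three status values.

```python
from typing import Any, Callable, Dict, List, Tuple


def run_sources(
    sources: Dict[str, Callable[[], Any]]
) -> Tuple[Dict[str, Any], List[Dict[str, str]], str]:
    """Run every source, collect non-fatal failures, derive a run status.

    Hypothetical helper: the status rule below is an assumption.
    """
    payload: Dict[str, Any] = {}
    errors: List[Dict[str, str]] = []
    for name, fetch in sources.items():
        try:
            payload[name] = fetch()
        except Exception as exc:  # non-fatal: record it and keep going
            errors.append({"source": name, "error": str(exc)})

    if not errors:
        status = "ok"
    elif len(errors) == len(sources):
        status = "failed"
    else:
        status = "partial"
    payload["errors"] = errors
    return payload, errors, status
```

One failing source therefore does not abort the run; the snapshot is still written, and the status tells the reader how complete it is.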

Idempotency

INSERT … ON CONFLICT (snapshot_at) DO UPDATE: a second run on the same day updates the existing row and never creates a second one. The DB literal is dollar-quoted ($disco$), so it is injection-safe without escaping.
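
A sketch of how such an upsert statement might be assembled. The column names `payload` and `source_run_status` and the helper `build_upsert_sql` are assumptions for illustration; only the table name, the conflict target and the `$disco$` tag come from this PR. A dollar-quoted literal ends at the next occurrence of its tag, so the sketch rejects payloads that contain the tag.

```python
import json
from datetime import date

DOLLAR_TAG = "$disco$"
ALLOWED_STATUS = {"ok", "partial", "failed"}


def build_upsert_sql(snapshot_at: str, payload: dict, status: str) -> str:
    """Build an idempotent upsert with a dollar-quoted JSON literal.

    Hypothetical helper; column names are assumed, not taken from the migration.
    """
    day = date.fromisoformat(snapshot_at)  # raises ValueError on bad input
    if status not in ALLOWED_STATUS:
        raise ValueError(f"unknown status: {status!r}")
    body = json.dumps(payload, sort_keys=True)
    if DOLLAR_TAG in body:
        # The literal would terminate early, so refuse instead of escaping.
        raise ValueError("payload contains the dollar-quote tag")
    return (
        "INSERT INTO discovery_snapshots (snapshot_at, payload, source_run_status)\n"
        f"VALUES ('{day.isoformat()}', {DOLLAR_TAG}{body}{DOLLAR_TAG}::jsonb, '{status}')\n"
        "ON CONFLICT (snapshot_at) DO UPDATE\n"
        "SET payload = EXCLUDED.payload,\n"
        "    source_run_status = EXCLUDED.source_run_status;"
    )
```

Running the statement twice for the same `snapshot_at` leaves a single row, which is what makes the throwaway-date test below safe to repeat.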

Privacy (§3.7)

The nginx parser aggregates only User-Agent × endpoint class. No IPs go into the payload. Using moltstack's adm group membership instead of sudo keeps privileges minimal.
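
The aggregation can be illustrated with a small sketch, assuming nginx's default combined log format. The bot tokens, the endpoint classes and the function names are hypothetical; the 7-day window and rotated-file handling are left out for brevity.

```python
import re
from collections import Counter
from typing import Iterable, Optional, Tuple

# Hypothetical bot tokens and endpoint classes, for illustration only.
BOT_TOKENS = ("GPTBot", "ClaudeBot", "Googlebot")
ENDPOINT_CLASSES = (("/sitemap", "sitemap"), ("/llms.txt", "llms"), ("/guard/", "guard"))

# Captures only the request path and the User-Agent; the client IP at the
# start of the line is never captured.
LINE_RE = re.compile(
    r'"[A-Z]+ (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def classify_bot(user_agent: str) -> Optional[str]:
    """Return the matching bot token, or None for non-bot traffic."""
    lowered = user_agent.lower()
    for token in BOT_TOKENS:
        if token.lower() in lowered:
            return token
    return None


def classify_endpoint(path: str) -> str:
    """Map a request path to a coarse endpoint class."""
    for prefix, label in ENDPOINT_CLASSES:
        if path.startswith(prefix):
            return label
    return "other"


def aggregate_bot_hits(lines: Iterable[str]) -> Counter:
    """Count hits per (bot, endpoint class); nothing else is retained."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue
        bot = classify_bot(match.group("ua"))
        if bot is not None:
            counts[(bot, classify_endpoint(match.group("path")))] += 1
    return counts
```

Because the regex never captures the address field, an IP cannot reach the payload even by accident.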

Alerts

Telegram alert on partial/failed (TELEGRAM_BOT_TOKEN/CHAT_ID from secrets).
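
A minimal sketch of such an alert, using the Telegram Bot API `sendMessage` method via the standard library. The function names and the message wording are assumptions; the sketch takes the token and chat ID as plain arguments because the exact environment variable name for the chat ID is not spelled out here.

```python
import urllib.parse
import urllib.request
from typing import List, Optional


def build_alert_request(
    token: str, chat_id: str, snapshot_at: str, status: str, errors: List[str]
) -> Optional[urllib.request.Request]:
    """Return a sendMessage request for partial/failed runs, None for ok.

    Hypothetical helper; the message text is illustrative.
    """
    if status == "ok":
        return None
    text = f"discovery_snapshot {snapshot_at}: {status} ({len(errors)} error(s))"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    return urllib.request.Request(url, data=data, method="POST")


def send_alert(request: urllib.request.Request) -> int:
    """Send the alert and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Separating request construction from sending keeps the alert logic testable without network access.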

Flags

  • --dry-run: assemble + print, no DB write
  • --date YYYY-MM-DD: snapshot_at override (backfill + throwaway test)
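
The two flags could be wired up with argparse along these lines. This is a sketch under assumptions: `parse_args` and `resolve_snapshot_date` are hypothetical names, and defaulting to the current UTC date is inferred from the 00:30 UTC schedule rather than stated in the PR.

```python
import argparse
from datetime import date, datetime, timezone
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the two documented flags."""
    parser = argparse.ArgumentParser(description="Daily discovery snapshot")
    parser.add_argument(
        "--dry-run", action="store_true",
        help="assemble and print the snapshot, skip the DB write",
    )
    parser.add_argument(
        "--date", type=date.fromisoformat, default=None, metavar="YYYY-MM-DD",
        help="override snapshot_at (backfill or throwaway test)",
    )
    return parser.parse_args(argv)


def resolve_snapshot_date(args: argparse.Namespace) -> date:
    """Use the override if given, else today's date in UTC (assumed default)."""
    return args.date or datetime.now(timezone.utc).date()
```

With this shape, `--date 2099-12-31` yields a throwaway snapshot date while a bare invocation uses the current day.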

Test run (verified 2026-05-21)

Against throwaway date 2099-12-31 (full path including the DB upsert):

4/4 probes · 16 bots / 1664 hits · 6/6 GitHub repos · upsert ok (row id=2, status=ok, 3950 bytes)
→ DELETE throwaway row → baseline 2026-05-21 UNTOUCHED, 1 row total

This is exactly the "test run, then delete; no second snapshot today" requirement from the sprint brief.

Crontab entry (server-side, NOT repo-managed)

Per CLAUDE.md §Geltungsbereich, cron is server infrastructure and is not managed in the repo. The entry is added manually after merge, with an audit entry:

30 0 * * * set -a && source /home/moltstack/.moltrust_secrets && set +a \
  && cd /home/moltstack/moltstack \
  && /home/moltstack/moltstack/venv/bin/python scripts/discovery_snapshot.py \
  >> logs/discovery_snapshot.log 2>&1

Runs daily at 00:30 UTC (avoids the 03:00 backup window). First real cron run: 2026-05-22 (the 2026-05-21 baseline stays frozen).

Pre-commit diff (§8)

 scripts/discovery_snapshot.py | 341 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 341 insertions(+)

Exactly 1 new file, following the scripts/ convention (like endpoint_probe.py and daily_stats.sh); nothing out of scope.

§2.3 Cross-Review

Skipped: read-only tracking, no auth or credential path changed. The script reads GH_TOKEN (read-only API) and nginx logs (adm group) and writes aggregated metrics. Re-evaluate if GSC OAuth is added later (P4).

Branch hygiene (§11.4)

Branched from a fresh origin/main (2298618, 0 behind); worktree ~/moltrust-api-J.

Test plan

  • Merge via merge commit
  • Deploy: git pull on the server (the script ships via the repo)
  • Add the crontab entry manually (00:30 UTC) plus audit note
  • First cron run 2026-05-22 00:30 UTC → snapshot_at=2026-05-22, status=ok expected
  • P3.2 (dashboard) follows as a separate PR

Discovery-Tracking P3.1 per SPEC
docs/specs/2026-05-21_discovery-tracking-baseline-SPEC.md §3.5 + §5.2.

Self-contained daily cron script. Captures 5 sources into the
discovery_snapshots table (migration in PR #55):
- self_probes : GET 4 Discovery surfaces (sitemap.xml URL-count,
  llms.txt MoltGuard-block, /guard/openapi.json path-count,
  /extendedAgentCard MoltGuard-extensions)
- bot_hits    : parse /var/log/nginx/access.log* (last 7d), bot-UA ×
  endpoint-class. moltstack is in `adm` group → cron reads logs
  without sudo. Privacy §3.7: no IPs persisted, only UA-counts.
- github      : GH_TOKEN-authenticated repo + traffic API, 6 MoltyCel
  repos. Graceful "pat-not-configured" if GH_TOKEN absent.
- gsc         : manual-pending (V0 per §9.1).
- errors      : non-fatal failures collected; source_run_status
  ok/partial/failed computed accordingly.

Idempotency: UPSERT ON CONFLICT (snapshot_at) DO UPDATE; repeated
same-day runs refresh the row, never create a 2nd. DB literal is
dollar-quoted ($disco$), injection-safe without escaping.

Alerts: Telegram on partial/failed status (TELEGRAM_BOT_TOKEN/CHAT_ID
from ~/.moltrust_secrets).

Flags:
- --dry-run        assemble + print, no DB write
- --date YYYY-MM-DD  override snapshot_at (backfill / throwaway test)

Test-Run verified 2026-05-21 against throwaway date 2099-12-31:
4/4 probes, 16 bots / 1664 hits, 6/6 GitHub repos, upsert ok,
throwaway row deleted, baseline 2026-05-21 untouched.

Crontab entry (server-side, NOT repo-managed per CLAUDE.md §Geltungsbereich
— applied manually post-merge with audit note):
  30 0 * * * set -a && source /home/moltstack/.moltrust_secrets && set +a \
    && cd /home/moltstack/moltstack \
    && /home/moltstack/moltstack/venv/bin/python scripts/discovery_snapshot.py \
    >> logs/discovery_snapshot.log 2>&1
@MoltyCel MoltyCel merged commit c89c04d into main May 21, 2026
10 checks passed
@MoltyCel MoltyCel deleted the feat/discovery-snapshot-cron branch May 21, 2026 08:54