infra: Index rebuild resilience — self-healing from log files #48

@lis186

Description

Problem

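index.ndjson is a single point of failure: restoreFromLogs only reads the index. If the index is deleted, all history disappears from the dashboard until it is manually recovered, even though the underlying log files still exist. Design goal: the index is a derived cache of the log files and can self-heal.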

Prior work

  • Design decisions locked (4): scope = ccxray-native logs only, trigger = startup drift check + manual CLI, on-drift = incremental orphan backfill, legacy = delta chain anchor as synthetic session key
  • Handoff: reason/260602-index-rebuild-resilience/handoff.md (authoritative)
  • Existing: scripts/rebuild-missing-index.js — partial implementation exists (see Existing Implementation below)
  • buildIndexLine() in server/entry.js already supports reconstruction

Scope

  • Evaluate existing rebuild script coverage
  • Implement startup drift detection
  • ccxray rebuild-index CLI command
  • Incremental orphan backfill

Before / After

BEFORE:
  index.ndjson deleted
       │
       ▼
  restoreFromLogs()
       │
       ▼
  reads index.ndjson → file missing or empty
       │
       ▼
  returns [] → dashboard shows nothing
       │
       ▼
  DATA LOSS — all history gone until manually recovered


AFTER:
  index.ndjson deleted or incomplete
       │
       ▼
  startup drift check
       │
       ├─ index missing → full rebuild from _req/_res log files
       │
       └─ index exists but entry count != file count
              │
              ▼
         incremental orphan backfill
              │
              ▼
  dashboard shows full history — self-healed

Architecture

Startup
  ├─ index.ndjson exists?
  │   ├─ YES → quick drift check (entry count vs file count)
  │   │         ├─ match → normal restore
  │   │         └─ mismatch → incremental orphan backfill
  │   └─ NO → full rebuild from log files
  │
  └─ CLI: ccxray rebuild-index
       └─ same pipeline, forced full rebuild
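
The startup decision tree above could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: checkDrift, its return shape, and the assumption that every request log ends in _req.json inside one flat directory are all hypothetical. Note that a count comparison is only a cheap heuristic; it would not notice an index with the right number of lines but the wrong entries.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper (not in the codebase): classify the startup state by
// comparing non-empty index lines against the number of *_req.json files.
function checkDrift(logsDir, indexPath) {
  const reqCount = fs
    .readdirSync(logsDir)
    .filter((name) => name.endsWith('_req.json')).length;

  let raw;
  try {
    raw = fs.readFileSync(indexPath, 'utf8');
  } catch (err) {
    if (err.code !== 'ENOENT') throw err;
    // Index missing entirely: rebuild everything from the log files.
    return { action: 'full-rebuild', indexed: 0, reqCount };
  }

  const indexed = raw.split('\n').filter((line) => line.trim() !== '').length;
  // Counts match: normal restore. Otherwise: incremental orphan backfill.
  const action = indexed === reqCount ? 'restore' : 'backfill';
  return { action, indexed, reqCount };
}
```

The forced CLI path would skip this check and go straight to the full rebuild branch.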

Rebuild logic:
  for each _req.json in LOGS_DIR:
    parsedBody = JSON.parse(read(_req.json))
    resData = JSON.parse(read(_res.json))
    entry = buildEntryFields(ctx)    ← existing function in wire-parsers
    line = buildIndexLine(entry)     ← existing function in entry.js
    append to index.ndjson

Key insight: buildIndexLine in server/entry.js already exists and is the canonical projection. Rebuild = replay the same pipeline that live entries use.
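
A rough sketch of the replay loop, under stated assumptions: the real signatures of buildEntryFields and buildIndexLine are not shown in this issue, so both are injected as parameters, and the { id, parsedBody, resData } context shape plus the `<id>_req.json` / `<id>_res.json` naming are guesses made for illustration. rebuildIndexLines itself is a hypothetical name.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical replay loop. The two builder functions are passed in because
// their real signatures live in wire-parsers and server/entry.js.
function rebuildIndexLines(logsDir, { buildEntryFields, buildIndexLine }) {
  const suffix = '_req.json';
  const reqFiles = fs
    .readdirSync(logsDir)
    .filter((name) => name.endsWith(suffix))
    .sort();

  const lines = [];
  for (const reqFile of reqFiles) {
    const id = reqFile.slice(0, -suffix.length);

    let parsedBody;
    try {
      parsedBody = JSON.parse(fs.readFileSync(path.join(logsDir, reqFile), 'utf8'));
    } catch {
      continue; // unreadable or corrupt request log: nothing to replay
    }

    let resData = null;
    try {
      resData = JSON.parse(fs.readFileSync(path.join(logsDir, `${id}_res.json`), 'utf8'));
    } catch {
      // response log missing or partial: keep resData as null
    }

    // Assumed context shape; the real ctx may carry more fields.
    const entry = buildEntryFields({ id, parsedBody, resData });
    lines.push(buildIndexLine(entry));
  }
  return lines;
}
```

Returning the lines instead of appending inside the loop keeps the write step separate, so the caller can decide how to persist them.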

Design decisions (from reason/260602-index-rebuild-resilience/handoff.md):

  1. Scope: ccxray-native logs only (no cross-referencing ~/.claude or ~/.codex)
  2. Trigger: startup drift check + manual CLI
  3. On-drift: incremental orphan backfill (not full rebuild)
  4. Legacy: delta chain anchor as synthetic session key
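
For decision 3, finding orphans amounts to a set difference between the log files on disk and the ids already in the index. The sketch below assumes each index line is a JSON object with an `id` field equal to the file name prefix before _req.json; that mapping is an assumption, and findOrphanIds is a hypothetical name.

```javascript
// Hypothetical helper: return the ids of request logs that have no index entry.
// Assumes index entries carry an `id` matching the `<id>_req.json` file prefix.
function findOrphanIds(fileNames, indexLines) {
  const suffix = '_req.json';
  const indexed = new Set();

  for (const line of indexLines) {
    if (line.trim() === '') continue;
    try {
      const entry = JSON.parse(line);
      if (entry && entry.id != null) indexed.add(String(entry.id));
    } catch {
      // corrupt index line: ignore it, so its log file counts as an orphan
    }
  }

  return fileNames
    .filter((name) => name.endsWith(suffix))
    .map((name) => name.slice(0, -suffix.length))
    .filter((id) => !indexed.has(id))
    .sort();
}
```

Only the returned ids would need to go through the rebuild pipeline, which is what keeps startup fast in the normal case.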

Value

For users

  • No more permanent data loss from accidental index deletion
  • Self-healing on restart — "just works"
  • CLI escape hatch for manual recovery

For developers

  • buildIndexLine already exists — rebuild is replay, not new logic
  • Incremental backfill means startup stays fast for normal case
  • Foundation for future index format migrations

Side Effects

  • Full rebuild may be slow for large log directories (10K+ files)
  • Rebuilt index may differ slightly from original (field ordering, rounding)
  • Delta chain entries need special handling (prevId resolution)
  • Session metadata (cwd, title) derived from _req.json may be incomplete vs original

Existing Implementation

scripts/rebuild-missing-index.js exists on main (144 lines). It is a partial, one-shot recovery script — not yet integrated into startup or CLI.

What it does:

  • Reads index.ndjson, scans LOGS_DIR for _req.json files not in the index
  • Follows prevId delta chains (up to depth 50) to find an indexed ancestor
  • Inherits sessionId, cwd, agent, isSubagent from the ancestor entry
  • Parses _res.json when available (extracts usage, stopReason from SSE events)
  • Builds index entries manually (not via buildIndexLine) with hardcoded field layout
  • Writes index-patch.ndjson (dry-run by default, --apply to write)
  • User must manually cat index-patch.ndjson >> index.ndjson and restart hub
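
The ancestor lookup described in the list above might look something like this. It is an illustrative sketch, not the script's actual code: the Map-based inputs, the function names, and the cycle guard are assumptions; only the depth limit of 50 and the four inherited fields come from the description.

```javascript
const MAX_CHAIN_DEPTH = 50;
const INHERITED_FIELDS = ['sessionId', 'cwd', 'agent', 'isSubagent'];

// prevIdOf: Map of orphan id -> prevId; indexById: Map of id -> index entry.
// Walks the prevId chain until an indexed ancestor is found or the limit is hit.
function findIndexedAncestor(id, prevIdOf, indexById, maxDepth = MAX_CHAIN_DEPTH) {
  const seen = new Set([id]);
  let current = id;
  for (let depth = 0; depth < maxDepth; depth++) {
    const prev = prevIdOf.get(current);
    if (prev == null || seen.has(prev)) return null; // chain ends or loops
    if (indexById.has(prev)) return indexById.get(prev);
    seen.add(prev);
    current = prev;
  }
  return null; // depth limit reached without an indexed ancestor
}

// Copy only the session-level fields that an orphan inherits from its ancestor.
function inheritFromAncestor(ancestor) {
  const inherited = {};
  for (const key of INHERITED_FIELDS) {
    if (key in ancestor) inherited[key] = ancestor[key];
  }
  return inherited;
}
```

An orphan whose chain returns null has no indexed ancestor; per the gaps listed below, the current script skips such entries.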

Gaps vs target design:

  1. No startup integration — must be run manually; no drift detection
  2. No CLI command — standalone script, not ccxray rebuild-index
  3. Only recovers delta-chain entries — entries without prevId (session anchors, standalone turns) are skipped entirely
  4. Does not use buildIndexLine — builds index lines with its own field construction, risking drift from the canonical projection
  5. Requires index to already exist: fs.readFileSync(indexPath) crashes if index is completely missing (the "full rebuild from nothing" case)
  6. No incremental backfill — always scans all files, no "only fill orphans" mode
  7. Missing fields — does not compute toolCount, toolCalls, title, thinkingDuration, maxContext, cost, receivedAt, coreHash, hasCredential, thinkingStripped
  8. Manual append step — not atomic; concurrent hub writes during cat >> could corrupt the index
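
Gaps 5 and 8 could be addressed together with a tolerant read plus a write-then-rename step, sketched below. All three function names are hypothetical. The sketch only ensures a reader never sees a half-written index; it does not by itself protect against the hub appending between the read and the rename, which would still need a lock or a single writer.

```javascript
const fs = require('fs');

// Read index lines, treating a missing file as an empty index (gap 5).
function readIndexLines(indexPath) {
  try {
    return fs
      .readFileSync(indexPath, 'utf8')
      .split('\n')
      .filter((line) => line.trim() !== '');
  } catch (err) {
    if (err.code === 'ENOENT') return [];
    throw err;
  }
}

// Write the full index to a temp file in the same directory, then rename it
// over the target so the swap happens in one step (gap 8).
function writeIndexAtomically(indexPath, lines) {
  const tmpPath = `${indexPath}.tmp-${process.pid}`;
  fs.writeFileSync(tmpPath, lines.length > 0 ? lines.join('\n') + '\n' : '');
  fs.renameSync(tmpPath, indexPath);
}

// Merge existing lines with backfilled ones and persist the result.
function backfillIndex(indexPath, orphanLines) {
  const merged = readIndexLines(indexPath).concat(orphanLines);
  writeIndexAtomically(indexPath, merged);
  return merged.length;
}
```

Keeping the temp file next to index.ndjson matters because a rename is only atomic within one filesystem.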

Metadata

Labels: infra (Infrastructure/tooling)