fix(serve-child): remove NDJSON->SQLite indexer (bug-28a9d7a7)#63
fix(serve-child): remove NDJSON->SQLite indexer (bug-28a9d7a7)#63shakestzd wants to merge 1 commit into
Conversation
The serve-child process used to spawn an NDJSON->SQLite indexer goroutine and expose /api/indexer/status for observability. Each indexer instance opened its own otelreceiver.NewWriter (sql.DB with MaxOpenConns=1) against the same WAL SQLite file the API handlers read from. Two writer-pool sql.DB handles serialise on the same write lock, generating SQLITE_BUSY storms whenever the otel-collect process and serve-child both wrote at once. The per-session otel-collect process is now the sole writer to otel_signals; serve-child is read-only with respect to that table. If NDJSON->SQLite replay is needed for pre-existing sessions, it belongs inside otel-collect so only one writer DB handle exists across the entire project. Adds TestServeChild_NoIndexerStatusEndpoint as a regression guard: the test fails fast if either the endpoint or the indexer is resurrected without addressing the SQLITE_BUSY root cause. Originally bundled into the registry-hardening PR #62; split out per review feedback so each behavioral change is reviewable in isolation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6f9c288440
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // NOTE(bug-28a9d7a7 Part B): The NDJSON→SQLite indexer and its dedicated | ||
| // otelreceiver.NewWriter have been removed. Opening a second sql.DB writer | ||
| // (MaxOpenConns=1) against the same WAL SQLite file contended with the | ||
| // read-pool handle used by the API handlers, causing SQLITE_BUSY errors. | ||
| // The per-session otel-collect process is now the sole writer to otel_signals. |
There was a problem hiding this comment.
Restore NDJSON→SQLite ingestion before removing indexer
This change removes the only runtime path that replays events.ndjson into otel_signals: runServeChild no longer starts indexer.Start, while the per-session collector still writes to NDJSON (cmd/htmlgraph/otel_collect_handler.go uses ndjson.Sink), and telemetry APIs/materialization read from SQLite (cmd/htmlgraph/api_otel.go, internal/otel/materialize/materialize.go). In practice, OTLP traces/logs/metrics produced after this commit are not indexed into otel_signals, so OTel dashboards and rollups will silently miss session telemetry.
Useful? React with 👍 / 👎.
| // TODO(bug-28a9d7a7): If NDJSON→SQLite replay is needed for pre-existing | ||
| // sessions, run the indexer inside the otel-collect process so only one | ||
| // writer DB handle exists across the entire project. |
There was a problem hiding this comment.
Preserve retention progress checks after indexer removal
Removing the serve-child indexer also removes the only writer of .index-offset checkpoints (written in internal/otel/indexer), but retention still gates archival on that file (internal/otel/retention/retention.go:indexerCaughtUp). With no checkpoint producer, archiveSession now returns early for every session that has events.ndjson, so retention cleanup/archival stops running for those sessions and stale session directories accumulate indefinitely.
Useful? React with 👍 / 👎.
Summary
Splits the indexer-removal hunk out of PR #62 (registry hardening) into its own reviewable PR per that review's feedback.
serve-childused to spawn an NDJSON→SQLite indexer goroutine and expose/api/indexer/statusfor observability. The indexer opened its ownotelreceiver.NewWriter(sql.DBwithMaxOpenConns=1) against the same WAL SQLite file the API handlers read from. Two writer-poolsql.DBhandles against one WAL file serialise on the same write lock — whenever the per-sessionotel-collectprocess andserve-childboth wrote at once, this generatedSQLITE_BUSYstorms (bug-28a9d7a7).The per-session
otel-collectprocess is now the sole writer tootel_signals;serve-childis read-only with respect to that table.If NDJSON→SQLite replay is needed for pre-existing sessions, it belongs inside
otel-collectso only one writer DB handle exists across the entire project.Test plan
go build ./... && go vet ./... && go test ./...— all passTestServeChild_NoIndexerStatusEndpointregression guard added — asserts the serve-child mux does not register/api/indexer/status. The test fails fast if either the endpoint or the indexer goroutine is resurrected without first addressing the WAL-writer-contention root cause.Notes for reviewers
serve_child.godiff is the inverse of the hunk that landed in fix(registry): prevent test-tempdir pollution + auto-prune stale entries #62 ea8b613 — the registry-hardening PR has been re-pushed with this hunk reverted, so this PR is the only place the behavior change appears now./api/indexer/statusreturns 404 going forward. No known consumers.🤖 Generated with Claude Code