Skip to content

fix(serve-child): remove NDJSON->SQLite indexer (bug-28a9d7a7)#63

Open
shakestzd wants to merge 1 commit into
mainfrom
fix-serve-child-remove-indexer
Open

fix(serve-child): remove NDJSON->SQLite indexer (bug-28a9d7a7)#63
shakestzd wants to merge 1 commit into
mainfrom
fix-serve-child-remove-indexer

Conversation

@shakestzd
Copy link
Copy Markdown
Owner

Summary

Splits the indexer-removal hunk out of PR #62 (registry hardening) into its own reviewable PR per that review's feedback.

serve-child used to spawn an NDJSON→SQLite indexer goroutine and expose /api/indexer/status for observability. The indexer opened its own otelreceiver.NewWriter (sql.DB with MaxOpenConns=1) against the same WAL SQLite file the API handlers read from. Two writer-pool sql.DB handles against one WAL file serialise on the same write lock — whenever the per-session otel-collect process and serve-child both wrote at once, this generated SQLITE_BUSY storms (bug-28a9d7a7).

The per-session otel-collect process is now the sole writer to otel_signals; serve-child is read-only with respect to that table.

If NDJSON→SQLite replay is needed for pre-existing sessions, it belongs inside otel-collect so only one writer DB handle exists across the entire project.

Test plan

  • go build ./... && go vet ./... && go test ./... — all pass
  • TestServeChild_NoIndexerStatusEndpoint regression guard added — asserts the serve-child mux does not register /api/indexer/status. The test fails fast if either the endpoint or the indexer goroutine is resurrected without first addressing the WAL-writer-contention root cause.

Notes for reviewers

🤖 Generated with Claude Code

The serve-child process used to spawn an NDJSON->SQLite indexer
goroutine and expose /api/indexer/status for observability. Each
indexer instance opened its own otelreceiver.NewWriter (sql.DB with
MaxOpenConns=1) against the same WAL SQLite file the API handlers
read from. Two writer-pool sql.DB handles serialise on the same
write lock, generating SQLITE_BUSY storms whenever the otel-collect
process and serve-child both wrote at once.

The per-session otel-collect process is now the sole writer to
otel_signals; serve-child is read-only with respect to that table.
If NDJSON->SQLite replay is needed for pre-existing sessions, it
belongs inside otel-collect so only one writer DB handle exists
across the entire project.

Adds TestServeChild_NoIndexerStatusEndpoint as a regression guard:
the test fails fast if either the endpoint or the indexer is
resurrected without addressing the SQLITE_BUSY root cause.

Originally bundled into the registry-hardening PR #62; split out per
review feedback so each behavioral change is reviewable in isolation.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6f9c288440

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +66 to +70
// NOTE(bug-28a9d7a7 Part B): The NDJSON→SQLite indexer and its dedicated
// otelreceiver.NewWriter have been removed. Opening a second sql.DB writer
// (MaxOpenConns=1) against the same WAL SQLite file contended with the
// read-pool handle used by the API handlers, causing SQLITE_BUSY errors.
// The per-session otel-collect process is now the sole writer to otel_signals.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restore NDJSON→SQLite ingestion before removing indexer

This change removes the only runtime path that replays events.ndjson into otel_signals: runServeChild no longer starts indexer.Start, while the per-session collector still writes to NDJSON (cmd/htmlgraph/otel_collect_handler.go uses ndjson.Sink), and telemetry APIs/materialization read from SQLite (cmd/htmlgraph/api_otel.go, internal/otel/materialize/materialize.go). In practice, OTLP traces/logs/metrics produced after this commit are not indexed into otel_signals, so OTel dashboards and rollups will silently miss session telemetry.

Useful? React with 👍 / 👎.

Comment on lines +72 to +74
// TODO(bug-28a9d7a7): If NDJSON→SQLite replay is needed for pre-existing
// sessions, run the indexer inside the otel-collect process so only one
// writer DB handle exists across the entire project.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve retention progress checks after indexer removal

Removing the serve-child indexer also removes the only writer of .index-offset checkpoints (written in internal/otel/indexer), but retention still gates archival on that file (internal/otel/retention/retention.go:indexerCaughtUp). With no checkpoint producer, archiveSession now returns early for every session that has events.ndjson, so retention cleanup/archival stops running for those sessions and stale session directories accumulate indefinitely.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant