
[BUG] Auto-updater server restart SIGTERMs in-flight cidx index subprocesses causing refresh failures #373

@jsbattig

Description

When the auto-updater restarts the CIDX server via systemctl restart cidx-server, child processes spawned by subprocess.run() inside RefreshScheduler._index_source() receive SIGTERM, because systemd's default KillMode=control-group signals every process in the unit's cgroup. The cidx index --fts subprocess is killed mid-indexing, and subprocess.run() raises CalledProcessError reporting signal 15 (SIGTERM).

Reproduction Steps

  1. Start CIDX server with multiple Langfuse repos or golden repos registered
  2. Wait for refresh jobs to trigger cidx index --fts (via _index_source())
  3. Trigger a server restart (auto-updater deploy, or manual systemctl restart cidx-server)
  4. Observe logs showing SIGTERM on cidx index --fts

Expected Behavior

The drain mechanism should wait for in-flight cidx index subprocesses to complete before restarting, OR the subprocess should be gracefully handled so that the job is marked as interrupted rather than failed.
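One way to give a drain step visibility into child processes is to register each subprocess while it runs. The sketch below is illustrative only: InFlightSubprocesses and its methods are hypothetical names, not part of the CIDX codebase, and a short Python child stands in for cidx index --fts.

```python
import subprocess
import sys
import threading
import time


class InFlightSubprocesses:
    """Hypothetical registry of running child processes that a drain step can poll."""

    def __init__(self):
        self._lock = threading.Lock()
        self._procs = set()

    def run(self, cmd):
        """Run cmd to completion, keeping it registered while it is alive."""
        proc = subprocess.Popen(cmd)
        with self._lock:
            self._procs.add(proc)
        try:
            return proc.wait()
        finally:
            with self._lock:
                self._procs.discard(proc)

    def wait_for_drain(self, timeout, poll_interval=0.05):
        """Return True once no child is registered, False if timeout expires first."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            with self._lock:
                if not self._procs:
                    return True
            time.sleep(poll_interval)
        with self._lock:
            return not self._procs


tracker = InFlightSubprocesses()
# Stand-in for an indexing job running on a worker thread.
worker = threading.Thread(
    target=tracker.run,
    args=([sys.executable, "-c", "import time; time.sleep(0.3)"],),
)
worker.start()
drained = tracker.wait_for_drain(timeout=10)
worker.join()
print("drained:", drained)
```

A real drain would also need to decide what to do when the timeout expires, which this sketch leaves open.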

Actual Behavior

ERROR: Indexing (semantic+FTS) on source failed for langfuse_Claude_Code_unknown-global: CalledProcessError:
Command '['cidx', 'index', '--fts']' died with <Signals.SIGTERM: 15>.

Multiple repos fail simultaneously at the exact same timestamp, confirming a server-wide event (restart) rather than per-repo issues.

Root Cause

  1. _index_source() (refresh_scheduler.py:1122) spawns cidx index --fts via subprocess.run() inheriting the parent's process group
  2. The auto-updater's restart_server() (deployment_executor.py:1361) enters maintenance mode and waits for drain (300s max)
  3. The drain mechanism (_wait_for_drain) only checks BackgroundJobManager's in-memory job status -- it has NO visibility into the child subprocess.run() process
  4. When drain timeout expires (or drain succeeds but subprocess is still running), systemctl restart sends SIGTERM to the entire cgroup, killing both Python server AND child processes
  5. Even if drain "succeeds" (BackgroundJob shows RUNNING), the thread running _execute_refresh is blocked on subprocess.run() and cannot respond to shutdown signals
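The failure mode above can be reproduced outside the server. The following POSIX-only sketch uses a sleeping Python child as a stand-in for cidx index --fts, sends it SIGTERM the way a cgroup kill would, and builds the same CalledProcessError that subprocess.run(check=True) raises. The helper name simulate_restart_kill is made up for illustration.

```python
import signal
import subprocess
import sys

# Stand-in for `cidx index --fts`: any long-running child will do.
LONG_RUNNING_CMD = [sys.executable, "-c", "import time; time.sleep(30)"]


def simulate_restart_kill(cmd):
    """Start cmd, SIGTERM it, and return the error check=True would raise."""
    proc = subprocess.Popen(cmd)
    proc.send_signal(signal.SIGTERM)  # what the child sees during the restart
    returncode = proc.wait(timeout=10)
    # On POSIX a signal death is reported as a negative return code.
    return subprocess.CalledProcessError(returncode, cmd)


err = simulate_restart_kill(LONG_RUNNING_CMD)
print(err.returncode)
print(err)  # message has the same "died with" form as the log line
```

The negative return code (minus the signal number) is what distinguishes a signal death from an ordinary non-zero exit.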

Affected Files

  • src/code_indexer/global_repos/refresh_scheduler.py:1122 - _index_source() subprocess.run()
  • src/code_indexer/server/auto_update/deployment_executor.py:1361 - restart_server() drain logic

Proposed Fix Options

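Option A (Recommended): Run cidx index in its own session and process group (start_new_session=True in subprocess.run) so the restart does not take it down with the server. The parent catches SIGTERM and waits for the subprocess to finish naturally. Caveat: start_new_session changes the session and process group, not the cgroup, and KillMode=control-group signals every process remaining in the unit's cgroup, so this option likely also needs a unit-level change such as KillMode=mixed or KillMode=process.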
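A minimal POSIX-only sketch of the process-group part of Option A follows. spawn_detached is a hypothetical helper and the sleeping child stands in for cidx index --fts. It demonstrates only that the child ends up leading its own process group; whether systemd spares the child still depends on the unit's KillMode.

```python
import os
import subprocess
import sys


def spawn_detached(cmd):
    """Start cmd in a new session, and therefore a new process group."""
    return subprocess.Popen(cmd, start_new_session=True)


proc = spawn_detached([sys.executable, "-c", "import time; time.sleep(30)"])
try:
    child_pgid = os.getpgid(proc.pid)
    parent_pgid = os.getpgrp()
    # After setsid() the child leads its own group, distinct from the parent's.
    in_own_group = child_pgid == proc.pid and child_pgid != parent_pgid
finally:
    # Clean up the stand-in child; a real server would wait for it instead.
    proc.terminate()
    proc.wait(timeout=10)

print("child in its own process group:", in_own_group)
```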

Option B: Improve drain awareness -- have _execute_refresh check a shutdown flag before spawning subprocesses, and skip new indexing if shutdown is in progress.
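Option B could look roughly like the sketch below. The shutdown_requested event and the index_unless_shutting_down helper are assumptions made for illustration; the real flag would presumably be set when the server enters maintenance mode.

```python
import subprocess
import sys
import threading

# Hypothetical flag, set when the server begins draining for a restart.
shutdown_requested = threading.Event()


def index_unless_shutting_down(cmd):
    """Run cmd unless shutdown has begun. Return True if it ran, False if skipped."""
    if shutdown_requested.is_set():
        return False
    subprocess.run(cmd, check=True)
    return True


# A trivial child stands in for `cidx index --fts`.
noop_cmd = [sys.executable, "-c", "pass"]
ran_before_shutdown = index_unless_shutting_down(noop_cmd)
shutdown_requested.set()
ran_after_shutdown = index_unless_shutting_down(noop_cmd)
print(ran_before_shutdown, ran_after_shutdown)
```

This check only prevents new subprocesses from starting; a subprocess already running when the flag is set is unaffected.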

Option C: Catch CalledProcessError with SIGTERM specifically in _index_source() and treat it as a retriable interruption rather than a hard failure.
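A sketch of Option C, assuming a hypothetical IndexingInterrupted exception that the caller maps to an "interrupted" job state. It relies on the POSIX convention that subprocess reports death by signal as a negative return code. The child here terminates itself with SIGTERM to stand in for a killed cidx index --fts.

```python
import signal
import subprocess
import sys


class IndexingInterrupted(Exception):
    """Hypothetical marker: indexing was cut short by SIGTERM and can be retried."""


def run_index(cmd):
    """Run cmd; turn a SIGTERM death into IndexingInterrupted, re-raise other failures."""
    try:
        subprocess.run(cmd, check=True)
    except subprocess.CalledProcessError as exc:
        if exc.returncode == -signal.SIGTERM:
            raise IndexingInterrupted(f"{cmd!r} was terminated by SIGTERM") from exc
        raise


# Child that sends SIGTERM to itself, simulating the restart kill.
SELF_TERMINATING_CMD = [
    sys.executable,
    "-c",
    "import os, signal; os.kill(os.getpid(), signal.SIGTERM)",
]

try:
    run_index(SELF_TERMINATING_CMD)
    outcome = "completed"
except IndexingInterrupted:
    outcome = "interrupted"
print(outcome)
```

Ordinary non-zero exits still surface as CalledProcessError, so genuine indexing failures are not masked.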

Impact

Evidence (Staging Logs 2026-03-07)

03:16:32 ERROR - cidx index --fts died with SIGTERM for langfuse_Claude_Code_unknown-global
03:16:32 ERROR - cidx index --fts died with SIGTERM for langfuse_Claude_Code_seba.battig-global
[Two repos fail at EXACT same second = server-wide event]

04:37:51 ERROR - cidx index --fts died with SIGTERM [repeated pattern]
04:52:06 ERROR - cidx index --fts died with SIGTERM [repeated pattern]
