
[process_guardian] Failed to kill child process via SIGKILL — pipeline stuck in ERROR state, container force-restarts (2026-03-24 18:06 UTC) #735

@livepeer-tessa


Summary

The ProcessGuardian detected a pipeline entering ERROR state and attempted to stop and restart the child process. Both SIGTERM and SIGKILL failed — the child process remained alive. This caused the monitor loop itself to error out, putting the pipeline into a persistent ERROR state. The container manager (ai.nightnode.net) then force-restarted the container.

Severity: High — requires container restart to recover, disrupting active streams.

Error Sequence

18:06:26 ERROR  process_guardian.py:314:_monitor_loop
  Pipeline is in ERROR state. Stopping streamer and restarting process. prev_restart_count=0

18:06:26 INFO   process.py:71:stop  — Terminating pipeline process
18:06:26 INFO   process.py:526:_handle — Received signal: 15 (SIGTERM). Initiating graceful shutdown.
18:06:26 INFO   process.py:266:_run_pipeline_loops — PipelineProcess: _run_pipeline_loops finished.

18:06:29 ERROR  process.py:90:stop — Failed to terminate process, killing
  (SIGTERM timed out → escalate to SIGKILL)

18:06:31 ERROR  process.py:93:stop — Failed to kill process
  self_pid=131 child_pid=169 is_alive=True

18:06:31 ERROR  process_guardian.py:334:_monitor_loop
  Failed to stop streamer and restart process. Moving to ERROR state
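
The escalation visible in these logs (SIGTERM at 18:06:26, SIGKILL at ~18:06:29, RuntimeError at 18:06:31) plausibly follows a pattern like the one below. This is a sketch for orientation, not the actual process.py code; the timeouts and the `stop` signature are assumptions:

```python
import multiprocessing as mp

def stop(process: mp.Process,
         term_timeout: float = 3.0,
         kill_timeout: float = 2.0) -> None:
    """Sketch of a SIGTERM -> SIGKILL escalation matching the log sequence."""
    process.terminate()            # sends SIGTERM (signal 15) -- 18:06:26
    process.join(term_timeout)
    if process.is_alive():
        # "Failed to terminate process, killing" -- 18:06:29
        process.kill()             # sends SIGKILL (signal 9)
        process.join(kill_timeout)
    if process.is_alive():
        # "Failed to kill process" -- the state this issue is about
        raise RuntimeError("Failed to kill process")
```

Note that `is_alive=True` after SIGKILL is the anomalous part: under normal conditions SIGKILL cannot be caught or ignored, so the child should be gone (or a reapable zombie) within the join timeout.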

Stack Trace

Traceback (most recent call last):
  ...
  raise RuntimeError("Failed to kill process")
RuntimeError: Failed to kill process

(from logging.exception in process_guardian.py:334:_monitor_loop)

Container Outcome

18:06:32 ERROR  Container returned ERROR state, restarting immediately
  container=live-video-to-video_streamdiffusion-sdxl_8900  status=ERROR

18:06:32 ERROR  Container health check failed too many times
  container=live-video-to-video_streamdiffusion-sdxl_8900

Context

  • Stream: str_EWpq9mvkx9yNfbeQ
  • Gateway request: a69a78f6
  • Manifest: ada561f0
  • Pipeline: streamdiffusion-sdxl
  • Node: ai.nightnode.net (container live-video-to-video_streamdiffusion-sdxl_8900)
  • Time: 2026-03-24 ~18:05–18:06 UTC
  • Preceding condition: DEGRADED_INPUT for >60s (no input frames), transitioned to ERROR when restart was attempted

Preceding Context

Before the crash, the stream had been in DEGRADED_INPUT state for ~90 seconds:

18:05:26 WARNING last_value_cache.py:42:get — Timed out waiting for value (timeout=30.0s)
18:05:56 INFO   process_guardian.py:189 — Shutting down streamer. Flagging DEGRADED_INPUT state during shutdown: time_since_last_input=60.1s

Root Cause Hypothesis

The child process (pid=169) acknowledged SIGTERM (_run_pipeline_loops finished), but process.py:stop() subsequently reports it still alive when attempting SIGKILL. Possible causes:

  1. The process entered uninterruptible sleep (D-state) after handling SIGTERM — Linux queues SIGKILL for a D-state process, but the process cannot act on it until the blocking kernel operation (e.g. stuck GPU or filesystem I/O) completes.
  2. PID reuse race — unlikely given the tight timing.
  3. The child spawned a grandchild in a different process group; signaling only child_pid leaves the grandchild alive, and the child may block indefinitely during shutdown waiting on it.
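
Hypotheses 1 and 3 can be distinguished at the moment of kill failure by reading /proc/<pid>/status (Linux procfs; the `State` field shows `D` for uninterruptible sleep, `Z` for zombie). The helper name below is hypothetical:

```python
def proc_status(pid: int) -> dict:
    """Parse /proc/<pid>/status into a dict of field -> value (Linux only).

    Intended for logging at the point where SIGKILL appears to fail:
    State: D ... -> hypothesis 1 (uninterruptible sleep)
    State: Z ... -> child exited but was never reaped
    """
    fields = {}
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
    except FileNotFoundError:
        fields["State"] = "gone (no /proc entry)"
    return fields
```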

Suggested Investigation

  • Add os.killpg(os.getpgid(child_pid), signal.SIGKILL) to kill the entire process group.
  • Log /proc/{child_pid}/status state at the point of kill failure.
  • Add a timeout check for process group kill as a final fallback before marking ERROR.
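
The first and third suggestions could be combined into a fallback like the sketch below. `kill_process_tree` is a hypothetical name; it assumes the child is started in its own process group (e.g. `start_new_session=True`), otherwise `os.killpg` would signal the guardian itself:

```python
import os
import signal
import time

def kill_process_tree(child_pid: int, timeout: float = 3.0) -> bool:
    """Kill the child's entire process group, then verify the child is gone.

    Returns True if the child exited within `timeout`, False otherwise
    (a False here suggests D-state; log /proc/<pid>/status at that point).
    Assumes child_pid leads its own process group -- do NOT call this if the
    child shares the guardian's group.
    """
    try:
        pgid = os.getpgid(child_pid)
        os.killpg(pgid, signal.SIGKILL)   # reaches grandchildren too (hypothesis 3)
    except ProcessLookupError:
        return True  # already gone
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Reap if it is our direct child: a zombie still answers os.kill(pid, 0).
        try:
            reaped_pid, _ = os.waitpid(child_pid, os.WNOHANG)
            if reaped_pid == child_pid:
                return True
        except ChildProcessError:
            pass  # not our child, or already reaped elsewhere
        try:
            os.kill(child_pid, 0)          # probe: does the PID still exist?
        except ProcessLookupError:
            return True
        time.sleep(0.1)
    return False  # still alive after SIGKILL -- likely D-state (hypothesis 1)
```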

Grafana Reference

fal.ai Logs Dashboard — 2026-03-24, 18:00–18:10 UTC
Filter: stream_id=str_EWpq9mvkx9yNfbeQ

Labels: bug