Summary
The ProcessGuardian detected a pipeline entering ERROR state and attempted to stop and restart the child process. Both SIGTERM and SIGKILL failed — the child process remained alive. This caused the monitor loop itself to error out, putting the pipeline into a persistent ERROR state. The container manager (ai.nightnode.net) then force-restarted the container.
Severity: High — requires container restart to recover, disrupting active streams.
Error Sequence
18:06:26 ERROR process_guardian.py:314:_monitor_loop
Pipeline is in ERROR state. Stopping streamer and restarting process. prev_restart_count=0
18:06:26 INFO process.py:71:stop — Terminating pipeline process
18:06:26 INFO process.py:526:_handle — Received signal: 15 (SIGTERM). Initiating graceful shutdown.
18:06:26 INFO process.py:266:_run_pipeline_loops — PipelineProcess: _run_pipeline_loops finished.
18:06:29 ERROR process.py:90:stop — Failed to terminate process, killing
(SIGTERM timed out → escalate to SIGKILL)
18:06:31 ERROR process.py:93:stop — Failed to kill process
self_pid=131 child_pid=169 is_alive=True
18:06:31 ERROR process_guardian.py:334:_monitor_loop
Failed to stop streamer and restart process. Moving to ERROR state
Stack Trace
Traceback (most recent call last):
...
raise RuntimeError("Failed to kill process")
RuntimeError: Failed to kill process
(from logging.exception in process_guardian.py:334:_monitor_loop)
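The logs imply a terminate-then-kill escalation inside process.py:stop(): SIGTERM, wait, SIGKILL, wait, then raise. A minimal sketch of that pattern (function name and timeout values are assumptions, not taken from the actual source):

```python
import multiprocessing as mp
import time


def stop_with_escalation(proc: mp.Process,
                         term_timeout: float = 3.0,
                         kill_timeout: float = 2.0) -> None:
    """Terminate-then-kill escalation matching the logged sequence."""
    proc.terminate()              # SIGTERM: "Initiating graceful shutdown"
    proc.join(term_timeout)
    if proc.is_alive():
        # 18:06:29 "Failed to terminate process, killing"
        proc.kill()               # escalate to SIGKILL
        proc.join(kill_timeout)
    if proc.is_alive():
        # 18:06:31 "Failed to kill process" -- the failure this issue reports
        raise RuntimeError("Failed to kill process")


if __name__ == "__main__":
    p = mp.Process(target=time.sleep, args=(60,))
    p.start()
    stop_with_escalation(p)
```

In the incident, the final `is_alive()` check still returned True after SIGKILL, which is the anomaly the sections below try to explain.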
Container Outcome
18:06:32 ERROR Container returned ERROR state, restarting immediately
container=live-video-to-video_streamdiffusion-sdxl_8900 status=ERROR
18:06:32 ERROR Container health check failed too many times
container=live-video-to-video_streamdiffusion-sdxl_8900
Context
- Stream: str_EWpq9mvkx9yNfbeQ
- Gateway request: a69a78f6
- Manifest: ada561f0
- Pipeline: streamdiffusion-sdxl
- Node: ai.nightnode.net (container live-video-to-video_streamdiffusion-sdxl_8900)
- Time: 2026-03-24 ~18:05–18:06 UTC
- Preceding condition: DEGRADED_INPUT for >60s (no input frames), transitioned to ERROR when restart was attempted
Preceding Context
Before the crash, the stream had been in DEGRADED_INPUT state for ~90 seconds:
18:05:26 WARNING last_value_cache.py:42:get — Timed out waiting for value (timeout=30.0s)
18:05:56 INFO process_guardian.py:189 — Shutting down streamer. Flagging DEGRADED_INPUT state during shutdown: time_since_last_input=60.1s
Root Cause Hypothesis
The child process (pid=169) acknowledged SIGTERM (_run_pipeline_loops finished), but process.py:stop() subsequently reports it still alive when attempting SIGKILL. Possible causes:
- The process entered an uninterruptible sleep (D-state) after handling SIGTERM — Linux queues SIGKILL for a D-state process but cannot act on it until the blocking wait completes.
- PID reuse race — unlikely given the tight timing.
- The child spawned a grandchild that is not in the same process group, so signaling only the tracked child PID leaves the grandchild running and the process tree alive.
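The grandchild hypothesis is straightforward to demonstrate: a subprocess started with `start_new_session=True` lands in its own session and process group, so a group-directed signal aimed at the parent's group never reaches it. A hypothetical illustration (Linux-only, stand-in `sleep` child):

```python
import os
import signal
import subprocess

# Start a stand-in "grandchild" detached into its own session (and thus its
# own process group), the way a double-forked or daemonized worker would be.
child = subprocess.Popen(["sleep", "30"], start_new_session=True)

parent_pgid = os.getpgid(os.getpid())
child_pgid = os.getpgid(child.pid)

# The groups differ, so a group-kill aimed at the parent's group
# (os.killpg(parent_pgid, ...)) would never be delivered to this process.
assert child_pgid != parent_pgid

# Cleanup: signal the child's own group instead.
os.killpg(child_pgid, signal.SIGKILL)
child.wait()
```

This is why killing by the tracked PID alone can leave parts of the process tree alive.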
Suggested Investigation
- Add os.killpg(os.getpgid(child_pid), signal.SIGKILL) to kill the entire process group.
- Log the /proc/{child_pid}/status state at the point of kill failure.
- Add a timeout check for the process-group kill as a final fallback before marking ERROR.
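The first two suggestions could be combined into a single last-resort helper, sketched here under the assumption of a Linux /proc filesystem (function names are hypothetical, not from the codebase):

```python
import os
import signal


def proc_state(pid: int) -> str:
    """Return the kernel state letter for a PID ('R', 'S', 'D', 'Z', ...).

    Reads /proc/{pid}/status; returns '?' if the process is already gone.
    """
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("State:"):
                    return line.split()[1]
    except FileNotFoundError:
        pass
    return "?"


def kill_group_last_resort(child_pid: int) -> None:
    """Log the kernel state, then SIGKILL the child's whole process group."""
    state = proc_state(child_pid)
    # A 'D' (uninterruptible sleep) here would explain a SIGKILL that appears
    # to fail: the kernel queues it but cannot act until the wait completes.
    print(f"kill-failure diagnostics: child_pid={child_pid} state={state}")
    try:
        os.killpg(os.getpgid(child_pid), signal.SIGKILL)
    except ProcessLookupError:
        pass  # process or group already gone
```

Logging the state letter at the moment of failure would distinguish the D-state hypothesis from the grandchild hypothesis on the next occurrence.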
Grafana Reference
fal.ai Logs Dashboard — 2026-03-24, 18:00–18:10 UTC
Filter: stream_id=str_EWpq9mvkx9yNfbeQ