
[process_guardian] Failed to kill child process via SIGKILL — pipeline stuck in ERROR state, container force-restarts (2026-03-24 18:06 UTC) #735

@livepeer-tessa


Summary

The ProcessGuardian detected a pipeline entering ERROR state and attempted to stop and restart the child process. Both SIGTERM and SIGKILL failed — the child process remained alive. This caused the monitor loop itself to error out, putting the pipeline into a persistent ERROR state. The container manager (ai.nightnode.net) then force-restarted the container.

Severity: High — requires container restart to recover, disrupting active streams.

Error Sequence

18:06:26 ERROR  process_guardian.py:314:_monitor_loop
  Pipeline is in ERROR state. Stopping streamer and restarting process. prev_restart_count=0

18:06:26 INFO   process.py:71:stop  — Terminating pipeline process
18:06:26 INFO   process.py:526:_handle — Received signal: 15 (SIGTERM). Initiating graceful shutdown.
18:06:26 INFO   process.py:266:_run_pipeline_loops — PipelineProcess: _run_pipeline_loops finished.

18:06:29 ERROR  process.py:90:stop — Failed to terminate process, killing
  (SIGTERM timed out → escalate to SIGKILL)

18:06:31 ERROR  process.py:93:stop — Failed to kill process
  self_pid=131 child_pid=169 is_alive=True

18:06:31 ERROR  process_guardian.py:334:_monitor_loop
  Failed to stop streamer and restart process. Moving to ERROR state
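
The escalation visible in these logs (SIGTERM at 18:06:26, SIGKILL at ~18:06:29, RuntimeError at 18:06:31) plausibly follows a pattern like the one below. This is a sketch for orientation, not the actual process.py code; the timeouts and the `stop` signature are assumptions:

```python
import multiprocessing as mp

def stop(process: mp.Process,
         term_timeout: float = 3.0,
         kill_timeout: float = 2.0) -> None:
    """Sketch of a SIGTERM -> SIGKILL escalation matching the log sequence."""
    process.terminate()            # sends SIGTERM (signal 15) -- 18:06:26
    process.join(term_timeout)
    if process.is_alive():
        # "Failed to terminate process, killing" -- 18:06:29
        process.kill()             # sends SIGKILL (signal 9)
        process.join(kill_timeout)
    if process.is_alive():
        # "Failed to kill process" -- the state this issue is about
        raise RuntimeError("Failed to kill process")
```

Note that `is_alive=True` after SIGKILL is the anomalous part: under normal conditions SIGKILL cannot be caught or ignored, so the child should be gone (or a reapable zombie) within the join timeout.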

Stack Trace

Traceback (most recent call last):
  ...
  raise RuntimeError("Failed to kill process")
RuntimeError: Failed to kill process

(from logging.exception in process_guardian.py:334:_monitor_loop)

Container Outcome

18:06:32 ERROR  Container returned ERROR state, restarting immediately
  container=live-video-to-video_streamdiffusion-sdxl_8900  status=ERROR

18:06:32 ERROR  Container health check failed too many times
  container=live-video-to-video_streamdiffusion-sdxl_8900

Context

  • Stream: str_EWpq9mvkx9yNfbeQ
  • Gateway request: a69a78f6
  • Manifest: ada561f0
  • Pipeline: streamdiffusion-sdxl
  • Node: ai.nightnode.net (container live-video-to-video_streamdiffusion-sdxl_8900)
  • Time: 2026-03-24 ~18:05–18:06 UTC
  • Preceding condition: DEGRADED_INPUT for >60s (no input frames), transitioned to ERROR when restart was attempted

Preceding Context

Before the crash, the stream had been in DEGRADED_INPUT state for ~90 seconds:

18:05:26 WARNING last_value_cache.py:42:get — Timed out waiting for value (timeout=30.0s)
18:05:56 INFO   process_guardian.py:189 — Shutting down streamer. Flagging DEGRADED_INPUT state during shutdown: time_since_last_input=60.1s

Root Cause Hypothesis

The child process (pid=169) acknowledged SIGTERM (_run_pipeline_loops finished), but process.py:stop() subsequently reports it still alive when attempting SIGKILL. Possible causes:

  1. The process entered uninterruptible sleep (D-state) after handling SIGTERM — Linux queues SIGKILL for a D-state process, but the process cannot act on it until the blocking kernel operation (e.g. stuck GPU or filesystem I/O) completes.
  2. PID reuse race — unlikely given the tight timing.
  3. The child spawned a grandchild in a different process group; signaling only child_pid leaves the grandchild alive, and the child may block indefinitely during shutdown waiting on it.
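
Hypotheses 1 and 3 can be distinguished at the moment of kill failure by reading /proc/<pid>/status (Linux procfs; the `State` field shows `D` for uninterruptible sleep, `Z` for zombie). The helper name below is hypothetical:

```python
def proc_status(pid: int) -> dict:
    """Parse /proc/<pid>/status into a dict of field -> value (Linux only).

    Intended for logging at the point where SIGKILL appears to fail:
    State: D ... -> hypothesis 1 (uninterruptible sleep)
    State: Z ... -> child exited but was never reaped
    """
    fields = {}
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
    except FileNotFoundError:
        fields["State"] = "gone (no /proc entry)"
    return fields
```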

Suggested Investigation

  • Add os.killpg(os.getpgid(child_pid), signal.SIGKILL) to kill the entire process group.
  • Log /proc/{child_pid}/status state at the point of kill failure.
  • Add a timeout check for process group kill as a final fallback before marking ERROR.
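
The first and third suggestions could be combined into a fallback like the sketch below. `kill_process_tree` is a hypothetical name; it assumes the child is started in its own process group (e.g. `start_new_session=True`), otherwise `os.killpg` would signal the guardian itself:

```python
import os
import signal
import time

def kill_process_tree(child_pid: int, timeout: float = 3.0) -> bool:
    """Kill the child's entire process group, then verify the child is gone.

    Returns True if the child exited within `timeout`, False otherwise
    (a False here suggests D-state; log /proc/<pid>/status at that point).
    Assumes child_pid leads its own process group -- do NOT call this if the
    child shares the guardian's group.
    """
    try:
        pgid = os.getpgid(child_pid)
        os.killpg(pgid, signal.SIGKILL)   # reaches grandchildren too (hypothesis 3)
    except ProcessLookupError:
        return True  # already gone
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Reap if it is our direct child: a zombie still answers os.kill(pid, 0).
        try:
            reaped_pid, _ = os.waitpid(child_pid, os.WNOHANG)
            if reaped_pid == child_pid:
                return True
        except ChildProcessError:
            pass  # not our child, or already reaped elsewhere
        try:
            os.kill(child_pid, 0)          # probe: does the PID still exist?
        except ProcessLookupError:
            return True
        time.sleep(0.1)
    return False  # still alive after SIGKILL -- likely D-state (hypothesis 1)
```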

Grafana Reference

fal.ai Logs Dashboard — 2026-03-24, 18:00–18:10 UTC
Filter: stream_id=str_EWpq9mvkx9yNfbeQ

Labels: bug