fix(instances): reconcile orphaned runtime processes#235
Conversation
Adopt qemu/firecracker runtimes left behind after API restarts and terminate aged unowned runtimes so host process state can converge. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Firetiger deploy monitoring skipped This PR didn't match the auto-monitor filter configured on your GitHub connection:
Reason: Changes are in packages/api/lib/instances (runtime process reconciliation), not in the monitored API endpoints (packages/api/cmd/api/) or Temporal workflows (packages/api/lib/temporal) paths. To monitor this PR anyway, reply with |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 11739f4. Configure here.
| continue | ||
| } | ||
| name := strings.TrimSpace(string(comm)) | ||
| if name != "qemu-system-x86" && name != "firecracker" { |
There was a problem hiding this comment.
QEMU comm name exceeds TASK_COMM_LEN, never matches
High Severity
The filter checks name != "qemu-system-x86" (16 characters), but Linux's TASK_COMM_LEN is 16 bytes including the null terminator, so /proc/PID/comm truncates process names to 15 characters. The actual QEMU binary is qemu-system-x86_64 (per qemuBinaryName() in process.go), which the kernel truncates to "qemu-system-x8" in comm. Since "qemu-system-x86" is 16 chars and can never appear in /proc/PID/comm, the reconciler silently skips all orphaned QEMU processes — it can neither adopt nor terminate them.
Reviewed by Cursor Bugbot for commit 11739f4. Configure here.


Summary
hypeman-apirestarts by writing their PID back into persisted instance metadata.Test plan
go test ./lib/instances -run '^$'go test ./lib/instances; local macOS run fails on unrelated VZ/e2fsprogs/image-readiness tests.GOOS=linux go test ./lib/instances -run TestInstanceIDFromRuntimeCmdline; cannot execute Linux test binary on macOS.Made with Cursor
Note
Medium Risk
Adds a background reconciler that scans
/procand can send SIGTERM/SIGKILL to qemu/firecracker processes; incorrect identification of orphans or instance IDs could terminate or mis-associate a live VM.Overview
Starts a new runtime orphan GC alongside existing instance reconcilers during API startup (
cmd/api/main.go).On Linux, the instance manager now periodically scans for
qemu/firecrackerprocesses whose PPID is1, extracts an instance ID from their cmdline under the guests dir, and either adopts them by persistingHypervisorPIDinto instance metadata or terminates them if no metadata exists and the process is older than 5 minutes (lib/instances/runtime_orphan*.go). Non-Linux builds stub this behavior, and a small Linux unit test validates instance ID extraction from cmdlines.Reviewed by Cursor Bugbot for commit 11739f4. Bugbot is set up for automated code reviews on this repo. Configure here.