Skip to content

fix(instances): reconcile orphaned runtime processes#235

Open
hiroTamada wants to merge 1 commit into
mainfrom
fix/orphan-runtime-reconciler
Open

fix(instances): reconcile orphaned runtime processes#235
hiroTamada wants to merge 1 commit into
mainfrom
fix/orphan-runtime-reconciler

Conversation

@hiroTamada
Copy link
Copy Markdown
Contributor

@hiroTamada hiroTamada commented May 16, 2026

Summary

  • Start a runtime orphan reconciler alongside the existing TAP reconciler.
  • Adopt qemu/firecracker processes that survived hypeman-api restarts by writing their PID back into persisted instance metadata.
  • Terminate aged PID 1 runtime processes that no longer have instance metadata.

Test plan

  • go test ./lib/instances -run '^$'
  • Attempted go test ./lib/instances; local macOS run fails on unrelated VZ/e2fsprogs/image-readiness tests.
  • Attempted Linux-targeted GOOS=linux go test ./lib/instances -run TestInstanceIDFromRuntimeCmdline; cannot execute Linux test binary on macOS.

Made with Cursor


Note

Medium Risk
Adds a background reconciler that scans /proc and can send SIGTERM/SIGKILL to qemu/firecracker processes; incorrect identification of orphans or instance IDs could terminate or mis-associate a live VM.

Overview
Starts a new runtime orphan GC alongside existing instance reconcilers during API startup (cmd/api/main.go).

On Linux, the instance manager now periodically scans for qemu/firecracker processes whose PPID is 1, extracts an instance ID from their cmdline under the guests dir, and either adopts them by persisting HypervisorPID into instance metadata or terminates them if no metadata exists and the process is older than 5 minutes (lib/instances/runtime_orphan*.go). Non-Linux builds stub this behavior, and a small Linux unit test validates instance ID extraction from cmdlines.

Reviewed by Cursor Bugbot for commit 11739f4. Bugbot is set up for automated code reviews on this repo. Configure here.

Adopt qemu/firecracker runtimes left behind after API restarts and terminate aged unowned runtimes so host process state can converge.

Co-authored-by: Cursor <cursoragent@cursor.com>
@firetiger-agent
Copy link
Copy Markdown

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

Any PR that changes the kernel API. Monitor changes to API endpoints (packages/api/cmd/api/) and Temporal workflows (packages/api/lib/temporal) in the kernel repo

Reason: Changes are in packages/api/lib/instances (runtime process reconciliation), not in the monitored API endpoints (packages/api/cmd/api/) or Temporal workflows (packages/api/lib/temporal) paths.

To monitor this PR anyway, reply with @firetiger monitor this.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 11739f4. Configure here.

continue
}
name := strings.TrimSpace(string(comm))
if name != "qemu-system-x86" && name != "firecracker" {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QEMU comm name exceeds TASK_COMM_LEN, never matches

High Severity

The filter checks name != "qemu-system-x86" (16 characters), but Linux's TASK_COMM_LEN is 16 bytes including the null terminator, so /proc/PID/comm truncates process names to 15 characters. The actual QEMU binary is qemu-system-x86_64 (per qemuBinaryName() in process.go), which the kernel truncates to "qemu-system-x8" in comm. Since "qemu-system-x86" is 16 chars and can never appear in /proc/PID/comm, the reconciler silently skips all orphaned QEMU processes — it can neither adopt nor terminate them.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 11739f4. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant