
Conversation

@goyalpalak18

Description

I fixed a resource leak where network interfaces (TAP devices) and Traffic Control (TC) rules were being left behind if a VMM crashed or was killed externally.

The issue was a logic flaw in how we handled the Kill() command. Previously, the code tried to stop the VMM process first. If that failed—for example, if the process was already dead (returning ESRCH)—the function would return an error immediately and skip the network.Cleanup() step entirely.

This was causing serious issues in crash-loop scenarios (like an OOMing unikernel). Since the cleanup never ran, tap0_urunc devices and TC redirect filters kept piling up in the network namespace, eventually causing network blackholes and exhausting node resources.

Changes

  • pkg/unikontainers/unikontainers.go:

    • I decoupled the VMM stop operation from the network cleanup.
    • The code now guarantees that network.Cleanup("tap0_urunc") runs even if vmm.Stop() returns an error. If the stop fails, I log a warning and still proceed to clean up the network resources before returning (see the first sketch after this list).
  • pkg/unikontainers/hypervisors/utils.go:

    • I updated the killProcess function to handle syscall.ESRCH gracefully.
    • If we try to kill a process that is already dead, I now treat that as a success (returning nil) instead of an error. This makes the stop operation idempotent: if the process is already gone, the job is done (see the second sketch below).
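
Below is a minimal sketch of the decoupled kill path. The interface and helper names here (vmmStopper, netCleaner, kill) are illustrative stand-ins for the actual types in pkg/unikontainers, and the logger is a placeholder for whatever the package uses; the point is only the control flow: cleanup always runs.

```go
package unikontainers

import "log"

// Illustrative stand-ins for the real VMM and network types.
type vmmStopper interface{ Stop() error }
type netCleaner interface{ Cleanup(iface string) error }

// kill sketches the decoupled flow: a failed VMM stop no longer
// short-circuits the network cleanup.
func kill(v vmmStopper, n netCleaner) error {
	if err := v.Stop(); err != nil {
		// The VMM may already be dead (crash, external kill -9);
		// log a warning and keep going instead of returning early.
		log.Printf("warning: VMM stop failed: %v", err)
	}
	// Always remove the TAP device and its TC rules.
	return n.Cleanup("tap0_urunc")
}
```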
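
And a sketch of the idempotent behavior for killProcess in pkg/unikontainers/hypervisors/utils.go. The signature is assumed for illustration; the key detail is mapping syscall.ESRCH to success.

```go
package hypervisors

import (
	"errors"
	"fmt"
	"syscall"
)

// killProcess sends SIGKILL to the VMM process. A PID that no longer
// exists yields ESRCH, which we treat as success: the process is
// already gone, so the stop is effectively complete.
func killProcess(pid int) error {
	err := syscall.Kill(pid, syscall.SIGKILL)
	if errors.Is(err, syscall.ESRCH) {
		return nil
	}
	if err != nil {
		return fmt.Errorf("failed to kill process %d: %w", pid, err)
	}
	return nil
}
```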

Testing

I verified this using a "zombie VMM" reproduction scenario:

  1. Reproduction:

    • I started a container, then manually killed the VMM process using kill -9.
    • I ran urunc kill <container_id>.
    • Before: The command failed with "no such process", and ip link show confirmed the TAP device was still present.
    • After: The command succeeded (silently handling the dead process), and I confirmed the TAP device was correctly removed.
  2. Crash Loop:

    • I simulated a Kubernetes crash loop and verified that after 10+ restarts, tc filter show showed zero orphaned rules.

Impact

  • Stability: This prevents network degradation in long-running clusters where pods might crash and restart frequently.
  • Resource Management: We now ensure strict cleanup of kernel network objects regardless of whether the application crashed or stopped gracefully.
  • Correctness: The kill command now accurately reflects that the resources are gone, rather than erroring out just because the process was already missing.

Signed-off-by: goyalpalak18 <goyalpalak1806@gmail.com>
@netlify

netlify bot commented Jan 25, 2026

Deploy Preview for urunc ready!

🔨 Latest commit: ce7e5c0
🔍 Latest deploy log: https://app.netlify.com/projects/urunc/deploys/6977ab42f0c0d2000809c844
😎 Deploy Preview: https://deploy-preview-402--urunc.netlify.app

@cmainas
Contributor

cmainas commented Jan 26, 2026

Hello @goyalpalak18,

Thank you for this contribution. Please create an issue before opening a PR. May I ask which network namespace you set during your test?

goyalpalak18 force-pushed the fix/network-cleanup-on-vmm-crash branch from 20cd951 to ce7e5c0 on January 26, 2026 at 17:58
@goyalpalak18
Author

@cmainas Thanks! I have created the tracking issue as requested: #408

To answer your question: I tested this using the standard CNI setup, so the TAP device was located inside the pod's network namespace.

(Note: I also force-pushed the branch just now to remove some unrelated commits that accidentally got mixed in. The PR is clean now.)

