fix: verify VMM startup before reporting container as running #397
base: main
Conversation
Hi @cmainas, this fixes the critical race condition causing silent zombie containers by switching to a fork+exec model with a liveness check. Please take a look when you have a moment.
Hello @goyalpalak18,
thank you for this contribution. This is indeed a well-known race, also documented in the comments inside the codebase. However, I have some comments:
- I was not able to reproduce your Testing description. K8s was correctly identifying the container as failed.
- This proposal is a significant change in urunc's execution model, where the monitor process is the "init" (PID 1) of the host container. It is not easy to make such a design change that way. There are ways to resolve the race issue without changing the design. The suggested change of pre-building the monitor command before sending the started message is one of them.
- As you can see from the tests, the changes do not work.
Add BuildExecCmd() method to VMM interface to validate command construction before sending StartSuccess message. This prevents reporting a container as "Running" when the VMM command cannot be built (e.g., due to invalid arguments or failed config writes). The fix preserves the existing PID 1 execution model where the reexec process uses syscall.Exec to become the VMM.
Force-pushed from d40f5ac to f1ee61b
@cmainas Thanks for the review. I have updated the PR based on your feedback:
- Reverted Execution Model: I undid the PID 1 changes, so the original process hierarchy is preserved.
- Implemented Suggestion: I added BuildExecCmd to validate the command before sending the StartSuccess message (sketched below). This resolves the race condition without changing the design.
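For context, a minimal sketch of what such an interface extension could look like, assuming BuildExecCmd returns the argv slice that is later passed to `syscall.Exec`. The package layout, placeholder types, and exact signatures here are assumptions for illustration, not the actual urunc code:

```go
package types

// ExecArgs and Unikernel are placeholders standing in for urunc's real types.
type ExecArgs struct {
	ContainerID string
	Command     []string
}

type Unikernel interface{}

// VMM gains BuildExecCmd so that assembling the monitor command (and any
// config writes that requires) can fail *before* StartSuccess is reported.
type VMM interface {
	// BuildExecCmd builds the monitor argv without executing it.
	BuildExecCmd(args ExecArgs, ukernel Unikernel) ([]string, error)
	// Execve replaces the current process with the monitor, preserving the
	// PID 1 execution model.
	Execve(args ExecArgs, ukernel Unikernel) error
}
```

The point of the split is that every failure mode of command construction is hit while urunc can still report an error, leaving the exec itself as the only step left after the runtime is notified.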
cmainas left a comment:
Thank you @goyalpalak18 for the changes, but why are we building the exec command twice? Also, if we have the build command, then we can simply exec from unikontainers.Exec rather than inside each monitor.
```go
func (fc *Firecracker) Execve(args types.ExecArgs, ukernel types.Unikernel) error {
	exArgs, err := fc.BuildExecCmd(args, ukernel)
	// ...
```
Why are we building the exec command twice?
@cmainas Thanks for catching that! I've refactored the logic to simplify the flow as you suggested.
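A sketch of the simplified flow suggested in the review: build the command once in unikontainers and exec it from a single place, so the hypervisor backends no longer each perform their own exec. CommandBuilder, execMonitor, and sendStartSuccess are illustrative names (the real method also takes the exec arguments and unikernel, as in the excerpt above), not urunc's actual API:

```go
package unikontainers

import (
	"os"
	"syscall"
)

// CommandBuilder is a stand-in for the relevant part of the VMM interface.
type CommandBuilder interface {
	BuildExecCmd() ([]string, error)
}

// execMonitor builds the monitor argv exactly once and execs it here.
func execMonitor(vmm CommandBuilder, sendStartSuccess func() error) error {
	// Command construction is the step that can fail (bad arguments, failed
	// config writes), so it runs before any success signal is sent.
	argv, err := vmm.BuildExecCmd()
	if err != nil {
		return err
	}
	// Only now report the container as started.
	if err := sendStartSuccess(); err != nil {
		return err
	}
	// PID 1 model preserved: this process becomes the VMM; on success,
	// syscall.Exec never returns.
	return syscall.Exec(argv[0], argv, os.Environ())
}
```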
Description
I’ve fixed a critical race condition in our container lifecycle where we were telling the runtime (CRI) that the container had started successfully before we actually knew if the VMM was running.
Previously, we were using `syscall.Exec()` to replace the `urunc` process with the VMM. The problem with `exec` is that it doesn't return on success, so we were forced to send the `StartSuccess` signal optimistically before making the call. This meant that if the VMM failed to launch (missing binary, bad config, etc.), we ended up with a "zombie" state: Kubernetes thought the container was `Running`, but the process was actually dead.

To fix this, I refactored the startup logic to use a `fork+exec` pattern. My new flow looks like this (a sketch follows the list):
- Start the VMM as a child process with `exec.Command().Start()`.
- Verify it is alive with a `kill -0` liveness check.
- Write the verified PID to `state.json`.
- Send `StartSuccess` only once I've confirmed the VMM is up and running.
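A minimal sketch of that flow, with assumed function and helper names (the actual changes live in `pkg/unikontainers/unikontainers.go` and differ in detail):

```go
package unikontainers

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// startAndVerifyVMM starts the monitor as a child process and confirms it is
// still alive before the caller persists the PID and reports success.
func startAndVerifyVMM(vmmPath string, vmmArgs []string) (int, error) {
	// fork+exec: keep urunc alive instead of replacing it via syscall.Exec.
	cmd := exec.Command(vmmPath, vmmArgs...)
	if err := cmd.Start(); err != nil {
		return 0, fmt.Errorf("failed to start VMM %q: %w", vmmPath, err)
	}
	pid := cmd.Process.Pid

	// Liveness check, the "kill -0" idiom: signal 0 delivers nothing but
	// errors out if the process is gone. A short grace period lets an
	// immediately crashing VMM die before we look.
	time.Sleep(100 * time.Millisecond)
	if err := syscall.Kill(pid, 0); err != nil {
		return 0, fmt.Errorf("VMM exited right after start: %w", err)
	}
	return pid, nil
}
```

The caller would then write the returned PID to `state.json` and send `StartSuccess` only after both steps succeed.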
Changes
- `pkg/unikontainers/unikontainers.go`: Switched from `syscall.Exec` to `exec.Command().Start()`, added a liveness check (`kill -0`) to catch immediate crashes, and write the verified PID to `state.json` before exiting.
- `pkg/unikontainers/types/types.go`: Extended the `VMM` interface with a `BuildExecCmd` method. This lets us build the command arguments separately from executing them.
- Hypervisor backends (`qemu.go`, `firecracker.go`, `hvt.go`, `spt.go`): Implemented `BuildExecCmd` for all our supported hypervisors. In `firecracker.go`, I also fixed a swallowed `json.Marshal` error I noticed (see the sketch after this list).
- `pkg/unikontainers/hypervisors/hedge.go`: Implemented `BuildExecCmd` to keep the interface satisfied.
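On the swallowed `json.Marshal` error, the fix pattern is simply to propagate the error instead of discarding it. The function, parameter, and path names below are placeholders for illustration, not the actual `firecracker.go` code:

```go
package hypervisors

import (
	"encoding/json"
	"fmt"
	"os"
)

// writeMonitorConfig shows the error-handling shape: both the marshal and the
// file write can fail, and either failure must reach the caller so that
// command construction (and therefore container start) fails loudly.
func writeMonitorConfig(cfg any, path string) error {
	data, err := json.Marshal(cfg)
	if err != nil {
		return fmt.Errorf("marshal monitor config: %w", err)
	}
	if err := os.WriteFile(path, data, 0o600); err != nil {
		return fmt.Errorf("write monitor config to %s: %w", path, err)
	}
	return nil
}
```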
Testing
I ran a few scenarios to confirm this behaves as expected:
- Ran `urunc` with a bad path to the QEMU binary. Before this fix, the pod would falsely report as `Running` with no logs and a bogus PID.
- With this fix, the failure surfaces to Kubernetes (`CreateContainerError`/`CrashLoopBackOff`), which is exactly what we want.

Impact
The PID recorded in `state.json` is finally trustworthy.