-
Notifications
You must be signed in to change notification settings - Fork 49
fix: detect Execve failures after StartSuccess signal #400
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: detect Execve failures after StartSuccess signal #400
Conversation
Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
✅ Deploy Preview for urunc canceled.
|
cmainas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @Sanchit2662 ,
thank you for this contribution. However, i am not really sure I understand how your proposal works. How does a process catches execve failures from another process through a socket?
| return fmt.Errorf("received unexpected message: %s", msg) | ||
| } | ||
|
|
||
| // Wait briefly for potential error after success (catches Execve failures) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How exactly does this work? Which process will write in case of execve failures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reexec process is responsible for writing the error.
When syscall.Exec() fails and returns, the reexec process sends StartErr over the socket before exiting:
err = unikontainer.Exec(metrics)
if err != nil {
unikontainer.SendMessage(unikontainers.StartErr)
}The parent process (urunc start) does not detect the failure directly; it only observes the error via the socket message sent by the reexec process.
Hello @cmainas , The reexec process sends a StartSuccess message before calling syscall.Exec(). If execve succeeds, the process image is replaced, no further Go code executes, and the socket is closed by the kernel. If execve fails, syscall.Exec() returns an error to the reexec process, which then sends a StartErr message over the socket. On the parent side, urunc start uses AwaitMessageWithErrorCheck, which first receives StartSuccess and then waits briefly (with a timeout) for a possible StartErr. If StartErr is received during this window, it is treated as an execve failure. |
|
Hello @Sanchit2662
I do not think this is correct. First of all the socket has been opened by
This is also incorrect. |
Summary
This PR fixes a silent container state corruption issue in
uruncwhere a container could be marked as running even though the VMM process never started.The root cause was an ordering issue in the IPC protocol:
StartSuccesswas sent tourunc startbeforevmm.Execve()was executed. IfExecve()failed afterward, the failure could no longer be reported, leaving the container in a ghost running state with no underlying process.This change introduces a small, bounded error-checking window to ensure fast
Execve()failures are correctly surfaced without redesigning the IPC protocol.Fix
Instead of immediately proceeding after receiving
StartSuccess,urunc startnow performs a short error-check window.A new helper waits briefly after success to catch a fast-following
StartErr, which reliably indicates anExecve()failure.Key addition:
How it works
urunc startreceivesStartSuccessStartErrStartErrarrives → start fails and reports the errorThis preserves the existing process model while closing the false-success window.
Impact
Before
After
Execve()failures are reliably reported