Potential race condition between hook execution and container lifecycle

Recently we've observed the following failure:
```
Error: failed to start container extract-content: /usr/bin/crun --root /run/crun --systemd-cgroup start extract-content 
failed: error executing hook /opt/oci-hook-swap.sh (exit code: 1)
```
The pods failed to run. On further investigation, the logs showed that `crun update` failed because the container ID did not exist.
This suggests that the container finished its run before the hook got to run the update command.
Because the hook runs in the host namespace at the post-start phase, it is detached from the container lifecycle.

A possible solution would be to check whether the container's PID still exists before running the update, but the nature of the hook means
we would always be exposed to a time-of-check to time-of-use (TOCTOU) issue: the container can still exit between the check and the update.
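
To make the window concrete, here is a minimal sketch of the PID check, written in Python for illustration rather than taken from the actual hook script. The names `pid_alive` and `run_update` and the `update_cmd` argument are hypothetical. The sketch assumes the hook can obtain the container's PID, for example from the container state that an OCI hook receives on stdin.

```python
import os
import subprocess
import sys


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists at the moment of the call."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def run_update(pid: int, update_cmd: list[str]) -> str:
    """Run update_cmd (hypothetical, e.g. the hook's update call) for a container.

    Returns "skipped" if the container was already gone, "updated" on success,
    and "gone" if the command failed and the container no longer exists.
    Raises RuntimeError if the command failed while the container is still alive.
    """
    if not pid_alive(pid):
        return "skipped"

    # TOCTOU window: the container may exit right here, after the check
    # above and before (or during) the command below.
    result = subprocess.run(update_cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return "updated"

    # Re-check after the failure so that a short-lived container does not
    # turn into a hook error.
    if not pid_alive(pid):
        return "gone"
    raise RuntimeError(f"update failed: {result.stderr.strip()}")
```

The second check narrows the problem but does not close it: the container can still exit between the failed command and the re-check for an unrelated reason, and the PID could in principle be reused. This only makes the hook tolerant of the race; it does not remove it.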

Next action items:
* Confirm with CRI-O/crun the assumption that the hook is detached from the container lifecycle
* Create a clear reproduction of the issue
* Look into transitioning to NRI, since this kind of TOCTOU issue isn't possible there by design.

Potential race condition between hook execution and container lifecycle #105
