Recently we've observed the following failure:
Error: failed to start container extract-content: /usr/bin/crun --root /run/crun --systemd-cgroup start extract-content
failed: error executing hook /opt/oci-hook-swap.sh (exit code: 1)
The pods failed to run. On further investigation, I saw in the logs that crun update failed because the container ID did not exist.
This led to the conclusion that the container finished its run before the hook got to run the update command.
Because the hook runs in the host namespace at the poststart phase, it is detached from the container lifecycle.
A possible solution would be to check whether the container's PID still exists before updating, but the nature of the hook means such a check
will always be exposed to a time-of-check to time-of-use (TOCTOU) race: the container can exit between the check and the update.
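One way to narrow the race, sketched below under assumptions, is to invert the order: attempt the update first and check the PID only after a failure, to decide whether that failure is benign. The function names pid_alive and tolerant_update are hypothetical and not part of the existing hook. Treating "container already exited" as success is itself an assumption that would need to be agreed on. PID reuse can still make an exited container look alive, so this remains a heuristic rather than a fix.

```shell
#!/bin/sh
# Sketch only: attempt the update, then classify a failure afterwards.
# pid_alive and tolerant_update are hypothetical helpers, not existing hook code.

# Returns 0 if a process with the given PID currently exists
# (and the caller is allowed to signal it).
pid_alive() {
    kill -0 "$1" 2>/dev/null
}

# Usage: tolerant_update PID CMD [ARGS...]
# Runs CMD. If CMD fails and the container process is already gone,
# the failure is treated as benign and 0 is returned.
tolerant_update() {
    pid="$1"
    shift
    if "$@"; then
        return 0
    fi
    if ! pid_alive "$pid"; then
        echo "container process $pid already exited; skipping update" >&2
        return 0
    fi
    echo "update failed while container process $pid is still running" >&2
    return 1
}
```

In the hook, the PID would be taken from the container state that the runtime passes on stdin, and CMD would be the hook's existing crun update invocation. A short-lived container then no longer turns the hook's exit code into a pod start failure, while a failed update against a live container is still reported.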
Next action items:
- Confirm the detached-lifecycle assumption with CRI-O/crun
- Create a clear reproduction of the issue
- Look into the transition to NRI, since this kind of TOCTOU issue isn't possible there by design.