Potential race condition between hook execution and container lifecycle #105

@enp0s3

Recently we've observed the following failure:

Error: failed to start container extract-content: /usr/bin/crun --root /run/crun --systemd-cgroup start extract-content 
failed: error executing hook /opt/oci-hook-swap.sh (exit code: 1)

The pods failed to run. On further investigation, I saw in the logs that crun update failed because the container ID did not exist.
This led to the conclusion that the container finished its run before the hook got to run the update command.
Because the hook runs in the host namespace at the poststart phase, it is detached from the container lifecycle.

A possible solution is to check whether the container's PID still exists, but the nature of the hook means this will always
be subject to a time-of-check to time-of-use (TOCTOU) race.
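
To make the PID-check idea concrete, here is a minimal sketch of how the hook could guard its update call. It is not the current contents of /opt/oci-hook-swap.sh; the function names container_alive and update_or_skip are hypothetical, and the sketch assumes the hook already knows the container's PID and runs as root on the host. It only narrows the race window: the container can still exit between the check and the update, which is why the failure path re-checks the PID before deciding whether to fail the hook.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a process with the given PID exists.
# Assumes the caller may signal that PID (the hook runs as root).
container_alive() {
    kill -0 "$1" 2>/dev/null
}

# Hypothetical wrapper around the hook's update step.
# $1   = container PID
# rest = the update command to run (whatever the hook already invokes)
update_or_skip() {
    pid="$1"
    shift

    # Time of check: skip quietly if the container has already exited.
    if ! container_alive "$pid"; then
        echo "container exited before update, skipping" >&2
        return 0
    fi

    # Time of use: the container may still exit right here.
    if "$@"; then
        return 0
    fi

    # The update failed. If the container is gone by now, treat the
    # failure as benign instead of failing the hook.
    if ! container_alive "$pid"; then
        echo "container exited during update, skipping" >&2
        return 0
    fi

    # The container is still running, so this is a real failure.
    return 1
}
```

In the hook, the command passed to update_or_skip would be the existing crun update invocation. Note that a PID can be reused by an unrelated process, so even the re-check is best effort.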

Next action items:

  • Confirm the detached assumption with CRI-O/crun
  • Create a clear reproduction of the issue
  • Look into the transition to NRI, since this kind of TOCTOU issue isn't possible there by design.
