Open
Labels
documentation, good first issue, help wanted
Description
Currently, an execution error in an NRI plugin appears to be handled at three layers: the NRI plugin, the NRI adaptation, and the CRI runtime.
- The NRI plugin itself decides whether to return the error to the NRI adaptation. Only returned errors can be handled by the other layers.
- The NRI adaptation catches the error returned by a plugin and checks whether it is fatal. A fatal error causes the plugin to be closed, but it neither aborts the calls to the other plugins nor propagates the error. A non-fatal error aborts the calls to the remaining plugins and is propagated to the CRI runtime (e.g. containerd).
- The CRI runtime catches the error and sometimes fails the CRI request and sometimes does not. In containerd v1.7.0, errors from hook events including RunPodSandbox, CreateContainer, StartContainer, and UpdateContainer are handled in CRI, while errors from the other hook events are ignored by containerd.
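To make the three layers concrete, here is a minimal Go sketch of the flow as I understand it. It is a simplified model, not the actual NRI or containerd code: the `plugin` type, `errFatal`, `errDeviceBusy`, `dispatch`, `criRequest`, and the `handledEvents` set are all hypothetical names, and the set only contains the four events named above.

```go
package main

import (
	"errors"
	"fmt"
)

// errFatal is a hypothetical marker for an error the adaptation treats as fatal.
var errFatal = errors.New("fatal plugin error")

// errDeviceBusy is a hypothetical non-fatal error returned by a plugin.
var errDeviceBusy = errors.New("device busy")

// plugin is a hypothetical stand-in for a registered NRI plugin.
type plugin struct {
	name   string
	closed bool
	handle func(event string) error
}

// dispatch models the adaptation layer as described above: a fatal error
// closes the plugin and the loop continues, while a non-fatal error aborts
// the remaining plugins and is returned to the runtime.
func dispatch(plugins []*plugin, event string) error {
	for _, p := range plugins {
		if p.closed {
			continue
		}
		err := p.handle(event)
		switch {
		case err == nil:
			// Nothing to do, move on to the next plugin.
		case errors.Is(err, errFatal):
			p.closed = true
		default:
			return fmt.Errorf("plugin %s: %w", p.name, err)
		}
	}
	return nil
}

// handledEvents is an assumption of this sketch: only the events named
// above are treated as handled by the runtime.
var handledEvents = map[string]bool{
	"RunPodSandbox":   true,
	"CreateContainer": true,
	"StartContainer":  true,
	"UpdateContainer": true,
}

// criRequest models the runtime layer: the error fails the request only
// for handled events and is dropped for all others.
func criRequest(plugins []*plugin, event string) error {
	err := dispatch(plugins, event)
	if err != nil && !handledEvents[event] {
		return nil
	}
	return err
}

func main() {
	p1Ran := false
	p0 := &plugin{name: "P0", handle: func(string) error { return errDeviceBusy }}
	p1 := &plugin{name: "P1", handle: func(string) error { p1Ran = true; return nil }}
	plugins := []*plugin{p0, p1}

	for _, event := range []string{"CreateContainer", "RemoveContainer"} {
		err := criRequest(plugins, event)
		fmt.Printf("%s: err=%v, P1 ran=%v\n", event, err, p1Ran)
	}
}
```

In this model, P1 never runs for either event, and the RemoveContainer failure is invisible to the caller of `criRequest`.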
In our use cases, we may adjust cgroup resources or manage devices for containers to enhance Pod QoS and scheduling. NRI is a way to do this work synchronously within the pod/container lifecycle. However, the error handling described above limits the implementation of our NRI plugin:
- The CRI runtime does not handle errors from some hook events, so the plugin has to remember the failure and retry later by itself, e.g. when a plugin wants to remove/uninstall devices in the RemoveContainer event.
- There is no configurable failure policy, so plugins have to cooperate carefully to avoid an accidental failure that aborts the execution of other plugins. For example, when a plugin P0 returns an error, the plugin P1 that runs after P0 is always skipped.
Therefore, we hope NRI can provide an error-handling policy that resolves these issues:
- It should be configurable, in the NRI adaptation or in the plugin, whether an execution error of a plugin can be ignored. For example, provide an option field `failureIgnoredStages` in the NRI adaptation or stub that defines the stages whose failures can be ignored; otherwise, the adaptation and the runtime should handle the error and decide whether to fail the CRI request.
- It should be visible to the CRI runtime (perhaps as future work), so that the CRI runtime can either handle the error in the CRI request or ignore it.
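A rough Go sketch of how such a policy could behave is below. None of this exists in NRI today: the `Plugin` type, the representation of `failureIgnoredStages` as a set of event names, the `Dispatch` function, and `errUninstall` are assumptions made only to illustrate the proposal.

```go
package main

import (
	"errors"
	"fmt"
)

// Plugin is a hypothetical stand-in for an NRI plugin carrying the
// proposed option.
type Plugin struct {
	Name string
	// FailureIgnoredStages mirrors the proposed failureIgnoredStages field.
	// Modelling it as a set of event names is an assumption of this sketch.
	FailureIgnoredStages map[string]bool
	Handle               func(stage string) error
}

// errUninstall is a hypothetical plugin failure used in the example.
var errUninstall = errors.New("device uninstall failed")

// Dispatch calls the plugins in order. A failure in a stage the plugin
// marked as ignorable is collected and the next plugin still runs. Any
// other failure stops the loop and is returned, so the runtime can decide
// whether to fail the CRI request.
func Dispatch(plugins []Plugin, stage string) (ignored []error, err error) {
	for _, p := range plugins {
		e := p.Handle(stage)
		if e == nil {
			continue
		}
		wrapped := fmt.Errorf("plugin %s at %s: %w", p.Name, stage, e)
		if p.FailureIgnoredStages[stage] {
			ignored = append(ignored, wrapped)
			continue
		}
		return ignored, wrapped
	}
	return ignored, nil
}

func main() {
	p1Runs := 0
	p0 := Plugin{
		Name:                 "P0",
		FailureIgnoredStages: map[string]bool{"RemoveContainer": true},
		Handle:               func(string) error { return errUninstall },
	}
	p1 := Plugin{
		Name:   "P1",
		Handle: func(string) error { p1Runs++; return nil },
	}

	for _, stage := range []string{"RemoveContainer", "CreateContainer"} {
		ignored, err := Dispatch([]Plugin{p0, p1}, stage)
		fmt.Printf("%s: ignored=%d err=%v P1 runs=%d\n", stage, len(ignored), err, p1Runs)
	}
}
```

Returning the ignored errors separately, rather than dropping them, is one possible way to keep them visible to the CRI runtime, which is the second expectation above.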