Issue 1: Allocate id results in workloads failing to restart after reboot#2
Issue 1: Allocate id results in workloads failing to restart after reboot#2goshlanguage wants to merge 4 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR refactors the device plugin's Allocate function to fix an issue where workloads fail to restart after a node reboot. The key improvement is constructing device-specific paths based on the allocated device IDs rather than using a hardcoded generic path.
Key Changes:
- Refactored
Allocate()to iterate through container requests and construct device-specific paths using device IDs - Modified
ListAndWatch()to block on context cancellation instead of polling with a sleep loop - Added ImagePullSecrets configuration and hugepages support to example workload and daemonset
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/plugin/device_plugin.go | Refactored Allocate to construct device-specific paths from IDs; changed ListAndWatch to block on context instead of polling |
| example-workload.yaml | Added imagePullSecrets and hugepages configuration (contains YAML structure issue) |
| device-plugin-daemonset.yaml | Added imagePullSecrets for private registry access |
| README.md | Updated deployment instructions with registry authentication steps; removed roadmap section |
| .github/workflows/build.yaml | Removed branch restriction to trigger builds on all pushes |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Permissions: "rw", | ||
| }, | ||
| } | ||
| klog.Infof("Received Allocate request for %v", req.ContainerRequests) |
There was a problem hiding this comment.
Trailing whitespace after the closing brace. This should be removed for code cleanliness.
| for _, req := range req.ContainerRequests { | ||
| var devices []*pluginapi.DeviceSpec | ||
|
|
||
| for _, id := range req.DevicesIds { |
There was a problem hiding this comment.
Trailing whitespace after the closing brace. This should be removed for code cleanliness.
| for _, id := range req.DevicesIds { | |
| for _, id := range req.DevicesIds { |
| klog.Infof("Received Allocate request for %v", req.ContainerRequests) | ||
|
|
||
| response := pluginapi.AllocateResponse{} | ||
|
|
||
| for _, req := range req.ContainerRequests { | ||
| var devices []*pluginapi.DeviceSpec | ||
|
|
||
| for _, id := range req.DevicesIds { |
There was a problem hiding this comment.
Trailing whitespace after the closing brace. This should be removed for code cleanliness.
| klog.Infof("Received Allocate request for %v", req.ContainerRequests) | |
| response := pluginapi.AllocateResponse{} | |
| for _, req := range req.ContainerRequests { | |
| var devices []*pluginapi.DeviceSpec | |
| for _, id := range req.DevicesIds { | |
| klog.Infof("Received Allocate request for %v", req.ContainerRequests) | |
| response := pluginapi.AllocateResponse{} | |
| for _, req := range req.ContainerRequests { | |
| var devices []*pluginapi.DeviceSpec | |
| for _, id := range req.DevicesIds { |
| } | ||
| time.Sleep(5 * time.Second) | ||
|
|
||
| // TODO: ideally we would use a channel (dp.updateCh) here |
There was a problem hiding this comment.
Trailing whitespace after the comment. This should be removed for code cleanliness.
| // TODO: ideally we would use a channel (dp.updateCh) here | |
| // TODO: ideally we would use a channel (dp.updateCh) here |
| for { | ||
| klog.Info("ListAndWatch: sending device list") | ||
| klog.Info("ListAndWatch: sending initial device list") | ||
| if err := stream.Send(&pluginapi.ListAndWatchResponse{Devices: dp.devices}); err != nil { | ||
| return err | ||
| } | ||
| time.Sleep(5 * time.Second) | ||
|
|
||
| // TODO: ideally we would use a channel (dp.updateCh) here | ||
| // to trigger sends only when hardware health changes. | ||
| // For now, we just block to keep the stream open. | ||
| <-dp.ctx.Done() | ||
| return nil | ||
| } |
There was a problem hiding this comment.
The loop in ListAndWatch will only send the device list once and then block indefinitely. After the context is done, the function returns and exits the loop. The log message "sending initial device list" will only be logged once per ListAndWatch call. If the intention is to keep sending updates, the blocking call to dp.ctx.Done() should be outside the loop, or the loop structure needs to be reconsidered. The current implementation will send the list once, then wait for context cancellation before returning.
example-workload.yaml
Outdated
| volumeMounts: | ||
| - mountPath: /hugepages | ||
| name: hugepage |
There was a problem hiding this comment.
The volumeMounts field is incorrectly placed at the spec level. It should be nested under the container definition (inside the containers array item), not at the spec level. This will cause the Pod to fail to create with a validation error.
| volumeMounts: | |
| - mountPath: /hugepages | |
| name: hugepage | |
| volumeMounts: | |
| - mountPath: /hugepages | |
| name: hugepage |
| for _, req := range req.ContainerRequests { | ||
| var devices []*pluginapi.DeviceSpec | ||
|
|
||
| for _, id := range req.DevicesIds { |
There was a problem hiding this comment.
The variable name 'req' shadows the function parameter 'req' (*pluginapi.AllocateRequest). Using the same name for the loop variable creates confusion and could lead to bugs if the outer 'req' needs to be accessed inside the loop. Consider renaming the loop variable to something like 'containerReq' or 'creq'.
| for _, req := range req.ContainerRequests { | |
| var devices []*pluginapi.DeviceSpec | |
| for _, id := range req.DevicesIds { | |
| for _, containerReq := range req.ContainerRequests { | |
| var devices []*pluginapi.DeviceSpec | |
| for _, id := range containerReq.DevicesIds { |
…hugepage requirement, and try not blindly removing device plugin socket
refactor Allocate call to fix workloads surviving reboot