Skip to content

Issue 1: Allocate id results in workloads failing to restart after reboot#2

Draft
goshlanguage wants to merge 4 commits intomainfrom
issue-1
Draft

Issue 1: Allocate id results in workloads failing to restart after reboot#2
goshlanguage wants to merge 4 commits intomainfrom
issue-1

Conversation

@goshlanguage
Copy link
Copy Markdown
Owner

refactor Allocate call to fix workloads surviving reboot

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the device plugin's Allocate function to fix an issue where workloads fail to restart after a node reboot. The key improvement is constructing device-specific paths based on the allocated device IDs rather than using a hardcoded generic path.

Key Changes:

  • Refactored Allocate() to iterate through container requests and construct device-specific paths using device IDs
  • Modified ListAndWatch() to block on context cancellation instead of polling with a sleep loop
  • Added ImagePullSecrets configuration and hugepages support to example workload and daemonset

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
internal/plugin/device_plugin.go Refactored Allocate to construct device-specific paths from IDs; changed ListAndWatch to block on context instead of polling
example-workload.yaml Added imagePullSecrets and hugepages configuration (contains YAML structure issue)
device-plugin-daemonset.yaml Added imagePullSecrets for private registry access
README.md Updated deployment instructions with registry authentication steps; removed roadmap section
.github/workflows/build.yaml Removed branch restriction to trigger builds on all pushes

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Permissions: "rw",
},
}
klog.Infof("Received Allocate request for %v", req.ContainerRequests)
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace after the closing brace. This should be removed for code cleanliness.

Copilot uses AI. Check for mistakes.
for _, req := range req.ContainerRequests {
var devices []*pluginapi.DeviceSpec

for _, id := range req.DevicesIds {
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace after the closing brace. This should be removed for code cleanliness.

Suggested change
for _, id := range req.DevicesIds {
for _, id := range req.DevicesIds {

Copilot uses AI. Check for mistakes.
Comment on lines +98 to +105
klog.Infof("Received Allocate request for %v", req.ContainerRequests)

response := pluginapi.AllocateResponse{}

for _, req := range req.ContainerRequests {
var devices []*pluginapi.DeviceSpec

for _, id := range req.DevicesIds {
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace after the closing brace. This should be removed for code cleanliness.

Suggested change
klog.Infof("Received Allocate request for %v", req.ContainerRequests)
response := pluginapi.AllocateResponse{}
for _, req := range req.ContainerRequests {
var devices []*pluginapi.DeviceSpec
for _, id := range req.DevicesIds {
klog.Infof("Received Allocate request for %v", req.ContainerRequests)
response := pluginapi.AllocateResponse{}
for _, req := range req.ContainerRequests {
var devices []*pluginapi.DeviceSpec
for _, id := range req.DevicesIds {

Copilot uses AI. Check for mistakes.
}
time.Sleep(5 * time.Second)

// TODO: ideally we would use a channel (dp.updateCh) here
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace after the comment. This should be removed for code cleanliness.

Suggested change
// TODO: ideally we would use a channel (dp.updateCh) here
// TODO: ideally we would use a channel (dp.updateCh) here

Copilot uses AI. Check for mistakes.
Comment on lines 71 to 82
for {
klog.Info("ListAndWatch: sending device list")
klog.Info("ListAndWatch: sending initial device list")
if err := stream.Send(&pluginapi.ListAndWatchResponse{Devices: dp.devices}); err != nil {
return err
}
time.Sleep(5 * time.Second)

// TODO: ideally we would use a channel (dp.updateCh) here
// to trigger sends only when hardware health changes.
// For now, we just block to keep the stream open.
<-dp.ctx.Done()
return nil
}
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop in ListAndWatch will only send the device list once and then block indefinitely. After the context is done, the function returns and exits the loop. The log message "sending initial device list" will only be logged once per ListAndWatch call. If the intention is to keep sending updates, the blocking call to dp.ctx.Done() should be outside the loop, or the loop structure needs to be reconsidered. The current implementation will send the list once, then wait for context cancellation before returning.

Copilot uses AI. Check for mistakes.
Comment on lines +17 to +19
volumeMounts:
- mountPath: /hugepages
name: hugepage
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The volumeMounts field is incorrectly placed at the spec level. It should be nested under the container definition (inside the containers array item), not at the spec level. This will cause the Pod to fail to create with a validation error.

Suggested change
volumeMounts:
- mountPath: /hugepages
name: hugepage
volumeMounts:
- mountPath: /hugepages
name: hugepage

Copilot uses AI. Check for mistakes.
Comment on lines +102 to +105
for _, req := range req.ContainerRequests {
var devices []*pluginapi.DeviceSpec

for _, id := range req.DevicesIds {
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name 'req' shadows the function parameter 'req' (*pluginapi.AllocateRequest). Using the same name for the loop variable creates confusion and could lead to bugs if the outer 'req' needs to be accessed inside the loop. Consider renaming the loop variable to something like 'containerReq' or 'creq'.

Suggested change
for _, req := range req.ContainerRequests {
var devices []*pluginapi.DeviceSpec
for _, id := range req.DevicesIds {
for _, containerReq := range req.ContainerRequests {
var devices []*pluginapi.DeviceSpec
for _, id := range containerReq.DevicesIds {

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants