Skip to content

Fix device plugin grpc properly stop and restart after kubelet restart#84

Merged
hami-robot[bot] merged 2 commits into
Project-HAMi:mainfrom
peachest:fix/grpc-register-after-serve
Jun 4, 2026
Merged

Fix device plugin grpc properly stop and restart after kubelet restart#84
hami-robot[bot] merged 2 commits into
Project-HAMi:mainfrom
peachest:fix/grpc-register-after-serve

Conversation

@peachest

Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fix three defects in PluginServer gRPC restart logic, ensuring Stop+Start cycles do not cause panics, data races, or goroutine leaks.

Defect fixes

Defect 1: Start() reuses the grpcServer created in the constructor. On the second Start() call, serve() triggers the gRPC fatal error RegisterService after Serve (calls os.Exit(1)).

Defect 2: Stop() does not wait for background goroutines to exit — it calls close(stopCh) and returns immediately. When Start() creates new goroutines that read the new stopCh, stale goroutines may still hold references to the old stopCh, causing a data race.

Defect 3: Stop() does not clean up the Unix socket file, so net.Listen may fail on restart.

Fix summary

File Changes
server.go 1. Rebuild ps.grpcServer = grpc.NewServer() in Start() (Defect 1);
2. Add wg sync.WaitGroup to track all goroutines (Defect 2);
3. Stop(): close(stopCh)grpcServer.Stop()wg.Wait()os.Remove(socket) (Defect 2, 3);
4. All three goroutines call wg.Add(1)/Done();
5. Add private test hook fields (dialFunc, registerKubeletFunc, prepareHostResourcesFunc) for test isolation;
register.go Add wg.Add(1)/Done() to watchAndRegister; add registerKubeletFunc and dialFunc hook checks;

Which issue(s) this PR fixes:

Fixes Project-HAMi/HAMi#1911

Special notes for your reviewer:

  • All five restart tests correctly detect the original three defects (verified by reverting the fix and confirming test failures).
  • When run under go test -race, the five restart tests may still report flaky false positives due to a Go 1.26 SSA optimization that replaces function parameter reads with direct struct field reads in go statement argument evaluation. See race-detector-false-positive.md for a full root-cause analysis.

Does this PR introduce a user-facing change?:

No

houyuxi added 2 commits May 29, 2026 17:36
…ver lifecycle

- Add sync.WaitGroup for goroutine lifecycle tracking
- Add dialFunc/registerKubeletFunc/prepareHostResourcesFunc test hooks
- Defer grpcServer initialization to Start() for testability
- Guard stopCh close and grpcServer.Stop() against nil/closed states
- Remove socket file on shutdown
- Add Apache License header to register.go

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
…urces

- Wrap prepareHostResources as a method to allow test hook injection
- Add panicOnFatalLogger to detect grpc Fatalf calls
- Add setupRestartablePluginServer helper with all hooks injected
- Add TestGrpcServer_RestartDoesNotPanic for Stop+Start cycle

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
@hami-robot

hami-robot Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

@peachest: The label(s) kind/bug cannot be applied, because the repository doesn't have them.

Details

In response to this:

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fix three defects in PluginServer gRPC restart logic, ensuring Stop+Start cycles do not cause panics, data races, or goroutine leaks.

Defect fixes

Defect 1: Start() reuses the grpcServer created in the constructor. On the second Start() call, serve() triggers the gRPC fatal error RegisterService after Serve (calls os.Exit(1)).

Defect 2: Stop() does not wait for background goroutines to exit — it calls close(stopCh) and returns immediately. When Start() creates new goroutines that read the new stopCh, stale goroutines may still hold references to the old stopCh, causing a data race.

Defect 3: Stop() does not clean up the Unix socket file, so net.Listen may fail on restart.

Fix summary

File Changes
server.go 1. Rebuild ps.grpcServer = grpc.NewServer() in Start() (Defect 1);
2. Add wg sync.WaitGroup to track all goroutines (Defect 2);
3. Stop(): close(stopCh)grpcServer.Stop()wg.Wait()os.Remove(socket) (Defect 2, 3);
4. All three goroutines call wg.Add(1)/Done();
5. Add private test hook fields (dialFunc, registerKubeletFunc, prepareHostResourcesFunc) for test isolation;
register.go Add wg.Add(1)/Done() to watchAndRegister; add registerKubeletFunc and dialFunc hook checks;

Which issue(s) this PR fixes:

Fixes Project-HAMi/HAMi#1911

Special notes for your reviewer:

  • All five restart tests correctly detect the original three defects (verified by reverting the fix and confirming test failures).
  • When run under go test -race, the five restart tests may still report flaky false positives due to a Go 1.26 SSA optimization that replaces function parameter reads with direct struct field reads in go statement argument evaluation. See race-detector-false-positive.md for a full root-cause analysis.

Does this PR introduce a user-facing change?:

No

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@archlitchi archlitchi left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@hami-robot

hami-robot Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: archlitchi, peachest

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot hami-robot Bot added the approved label Jun 4, 2026
@hami-robot hami-robot Bot merged commit 799eaa3 into Project-HAMi:main Jun 4, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] hami-device-plugin panic with "Server.RegisterService after Server.Serve" on Ascend 910B (Driver 25.5.2, ARM64)

2 participants