Fix device plugin grpc properly stop and restart after kubelet restart #84
…ver lifecycle
- Add sync.WaitGroup for goroutine lifecycle tracking
- Add dialFunc/registerKubeletFunc/prepareHostResourcesFunc test hooks
- Defer grpcServer initialization to Start() for testability
- Guard stopCh close and grpcServer.Stop() against nil/closed states
- Remove socket file on shutdown
- Add Apache License header to register.go

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

…urces
- Wrap prepareHostResources as a method to allow test hook injection
- Add panicOnFatalLogger to detect grpc Fatalf calls
- Add setupRestartablePluginServer helper with all hooks injected
- Add TestGrpcServer_RestartDoesNotPanic for Stop+Start cycle

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: archlitchi, peachest
What type of PR is this?
/kind bug
What this PR does / why we need it:
Fix three defects in PluginServer gRPC restart logic, ensuring Stop+Start cycles do not cause panics, data races, or goroutine leaks.
Defect fixes
Defect 1: `Start()` reuses the `grpcServer` created in the constructor. On the second `Start()` call, `serve()` triggers the gRPC fatal error `RegisterService after Serve` (which calls `os.Exit(1)`).
Defect 2: `Stop()` does not wait for background goroutines to exit; it calls `close(stopCh)` and returns immediately. When `Start()` creates new goroutines that read the new `stopCh`, stale goroutines may still hold references to the old `stopCh`, causing a data race.
Defect 3: `Stop()` does not clean up the Unix socket file, so `net.Listen` may fail on restart.
Fix summary
`server.go`
1. `ps.grpcServer = grpc.NewServer()` in `Start()` (Defect 1);
2. Add `wg sync.WaitGroup` to track all goroutines (Defect 2);
3. `Stop()`: `close(stopCh)` → `grpcServer.Stop()` → `wg.Wait()` → `os.Remove(socket)` (Defect 2, 3);
4. All three goroutines call `wg.Add(1)`/`Done()`;
5. Add private test hook fields (`dialFunc`, `registerKubeletFunc`, `prepareHostResourcesFunc`) for test isolation;
`register.go`
Add `wg.Add(1)`/`Done()` to `watchAndRegister`; add `registerKubeletFunc` and `dialFunc` hook checks;
Which issue(s) this PR fixes:
Fixes Project-HAMi/HAMi#1911
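For reviewers who want the restart sequence from the fix summary in one place, here is a minimal sketch of the pattern, not the actual `PluginServer` code. It assumes a plain `net.Listener` in place of the gRPC server, and `restartableServer`, `acceptLoop`, `watch`, and `runRestartCycles` are hypothetical names chosen for illustration. It shows per-run state created in `Start()`, goroutines tracked by a `sync.WaitGroup`, and `Stop()` following the order close channel, stop serving, wait, remove socket.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"sync"
)

// restartableServer is a hypothetical stand-in for a restartable plugin server.
type restartableServer struct {
	socket   string
	listener net.Listener
	stopCh   chan struct{}
	wg       sync.WaitGroup
}

// Start builds fresh per-run state so nothing from a previous run is reused.
func (s *restartableServer) Start() error {
	l, err := net.Listen("unix", s.socket)
	if err != nil {
		return err
	}
	s.listener = l
	s.stopCh = make(chan struct{})

	// Add before each go statement, and pass per-run values as arguments
	// so the goroutines never read fields that a later Start may replace.
	s.wg.Add(1)
	go s.acceptLoop(l)
	s.wg.Add(1)
	go s.watch(s.stopCh)
	return nil
}

// acceptLoop exits once the listener is closed.
func (s *restartableServer) acceptLoop(l net.Listener) {
	defer s.wg.Done()
	for {
		conn, err := l.Accept()
		if err != nil {
			return
		}
		conn.Close()
	}
}

// watch exits once its own stop channel is closed.
func (s *restartableServer) watch(stopCh <-chan struct{}) {
	defer s.wg.Done()
	<-stopCh
}

// Stop signals, stops serving, waits for goroutines, then removes the socket.
func (s *restartableServer) Stop() {
	if s.stopCh == nil {
		return // never started or already stopped
	}
	close(s.stopCh)
	s.stopCh = nil
	s.listener.Close()
	s.wg.Wait()
	// Error ignored on purpose: the file may already be gone.
	os.Remove(s.socket)
}

// runRestartCycles performs n Start/Stop cycles on one server value.
func runRestartCycles(n int) error {
	dir, err := os.MkdirTemp("", "plugin")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	s := &restartableServer{socket: filepath.Join(dir, "demo.sock")}
	for i := 0; i < n; i++ {
		if err := s.Start(); err != nil {
			return fmt.Errorf("start %d: %w", i, err)
		}
		s.Stop()
		if _, err := os.Stat(s.socket); !os.IsNotExist(err) {
			return fmt.Errorf("socket still present after stop %d", i)
		}
	}
	s.Stop() // a second Stop is a no-op thanks to the nil guard
	return nil
}

func main() {
	if err := runRestartCycles(3); err != nil {
		fmt.Println("restart failed:", err)
		return
	}
	fmt.Println("restart cycles completed")
}
```

Because `wg.Wait()` returns only after both goroutines have exited, the next `Start()` cannot overlap with goroutines from the previous run, which is the property Defect 2 was missing.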
Special notes for your reviewer:
Under `go test -race`, the five restart tests may still report flaky false positives due to a Go 1.26 SSA optimization that replaces function parameter reads with direct struct field reads in `go` statement argument evaluation. See `race-detector-false-positive.md` for a full root-cause analysis.
Does this PR introduce a user-facing change?:
No
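As a reference for the test hook fields mentioned in the fix summary, the following is a rough sketch of the general pattern: a private function-typed field that, when non-nil, replaces the real call. The names `registrar`, `registerFunc`, and `realRegister` are hypothetical and do not correspond to the PR's actual types or signatures.

```go
package main

import (
	"errors"
	"fmt"
)

// registrar is a hypothetical type with one injectable dependency.
type registrar struct {
	// registerFunc is a test hook; nil means "use the real implementation".
	registerFunc func(endpoint string) error
}

// register calls the hook when one is set and the real path otherwise.
func (r *registrar) register(endpoint string) error {
	if r.registerFunc != nil {
		return r.registerFunc(endpoint)
	}
	return realRegister(endpoint)
}

// realRegister stands in for a call that needs a live kubelet.
// In this sketch it always fails, since no kubelet exists here.
func realRegister(endpoint string) error {
	return errors.New("no kubelet available for " + endpoint)
}

func main() {
	// Production-style value: no hook, so the real path runs and fails here.
	fmt.Println("without hook:", (&registrar{}).register("demo.sock"))

	// Test-style value: the hook records the call and reports success.
	var seen string
	r := &registrar{registerFunc: func(e string) error { seen = e; return nil }}
	fmt.Println("with hook:", r.register("demo.sock"), "endpoint seen:", seen)
}
```

Keeping the hook as an unexported field means production callers cannot set it, while tests in the same package can swap in a fake without a kubelet or a real socket.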