Add automatic reconnect on Ironic authentication failure#7
Conversation
WalkthroughNewClient now requires a context and retains a newServiceClient factory. The client classifies auth/reauth errors, can reconnect by recreating and validating its ServiceClient, and GetNode/SetPowerState retry once after a successful reconnect. Tests and callsites updated to pass a Context (cmd/main uses a 30s timeout). ChangesAuthentication recovery mechanism for Ironic client
Estimated code review effort🎯 4 (Complex) | ⏱️ ~40 minutes Risk AssessmentSeverity: Medium-high. Impact: Authentication and availability. The runtime swap of ServiceClient can cause connectivity to an unintended endpoint or leave the client unauthorized if the factory or returned client is misconfigured. Error-classification gaps may lead to missed reconnects or unnecessary retries. Verify factory credentials, Endpoint validation, and that isAuthError covers production auth error shapes.
Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error, 2 warnings)
✅ Passed checks (8 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/ironic/client_internal_test.go`:
- Around line 129-143: Add a test to cover the case where
Client.newServiceClient returns (nil, nil) to ensure reconnect() handles a nil
*gophercloud.ServiceClient without panicking; update or add a test in
internal/ironic/client_internal_test.go that constructs a Client with
serviceClient set (e.g., oldSC) and newServiceClient returning (nil, nil), then
call c.reconnect() and assert it returns an error containing "no endpoint
configured" (or other expected error) rather than causing a panic, referencing
the Client type and reconnect method so the nil-check path is exercised.
In `@internal/ironic/client.go`:
- Around line 42-43: The code has a data race on Client.serviceClient during
reconnect(): add a synchronization primitive (e.g., a mu sync.RWMutex field on
Client) and use it to protect access: in reconnect() acquire mu.Lock() before
assigning c.serviceClient = sc and mu.Unlock(), and in readers (GetNode,
GetEndpoint, SetPowerState and any immediate retry logic that reads
c.serviceClient) use mu.RLock()/mu.RUnlock() when reading the pointer;
alternatively you can store serviceClient via an atomic.Pointer and do an
atomic.Store/Load around newServiceClient/reads (use
atomic.Pointer[*gophercloud.ServiceClient] if choosing atomics). Ensure the
newServiceClient func and reconnect() still return errors unchanged and that all
reads/writes to serviceClient use the chosen synchronization.
- Around line 65-67: The reconnect path currently uses context.TODO() inside the
factory (clientconfig.NewServiceClient(context.TODO(), "baremetal", nil)) and
reconnect() has no ctx, so reconnection can ignore caller cancellation; modify
reconnect to accept a context (e.g., reconnect(ctx context.Context)), change the
factory in client.go to use the passed ctx instead of context.TODO(), and update
all callers (notably where GetNode and SetPowerState call c.reconnect()) to pass
their request-scoped ctx through so client recreation respects
cancellations/deadlines.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 0c2f20cd-d639-4513-bbb4-6f5f9885083e
📒 Files selected for processing (2)
internal/ironic/client.gointernal/ironic/client_internal_test.go
3074ead to
c39b355
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
internal/ironic/client.go (2)
42-43:⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy liftSynchronize
serviceClientaccess to prevent auth-path data races.Critical risk / availability impact:
reconnectwritesc.serviceClientwhile request paths read it concurrently. Under parallel reconciles, this is a true Go race and can cause undefined behavior during 401 recovery.Suggested fix
import ( "context" "errors" "fmt" "log/slog" "net/http" + "sync" @@ Client struct { + mu sync.RWMutex serviceClient *gophercloud.ServiceClient newServiceClient func(ctx context.Context) (*gophercloud.ServiceClient, error) } @@ func (c *Client) reconnect(ctx context.Context) error { @@ - c.serviceClient = sc + c.mu.Lock() + c.serviceClient = sc + c.mu.Unlock() @@ func (c *Client) GetNode(ctx context.Context, nodeID string) (*nodes.Node, error) { - node, err := nodes.Get(ctx, c.serviceClient, nodeID).Extract() + c.mu.RLock() + sc := c.serviceClient + c.mu.RUnlock() + node, err := nodes.Get(ctx, sc, nodeID).Extract() @@ - node, err = nodes.Get(ctx, c.serviceClient, nodeID).Extract() + c.mu.RLock() + sc = c.serviceClient + c.mu.RUnlock() + node, err = nodes.Get(ctx, sc, nodeID).Extract() @@ func (c *Client) GetEndpoint() string { - return c.serviceClient.Endpoint + c.mu.RLock() + defer c.mu.RUnlock() + return c.serviceClient.Endpoint } @@ func (c *Client) SetPowerState(ctx context.Context, nodeID string, target TargetPowerState) error { - res := nodes.ChangePowerState(ctx, c.serviceClient, nodeID, nodes.PowerStateOpts{Target: target.powerstate}) + c.mu.RLock() + sc := c.serviceClient + c.mu.RUnlock() + res := nodes.ChangePowerState(ctx, sc, nodeID, nodes.PowerStateOpts{Target: target.powerstate}) @@ - res = nodes.ChangePowerState(ctx, c.serviceClient, nodeID, nodes.PowerStateOpts{Target: target.powerstate}) + c.mu.RLock() + sc = c.serviceClient + c.mu.RUnlock() + res = nodes.ChangePowerState(ctx, sc, nodeID, nodes.PowerStateOpts{Target: target.powerstate})#!/bin/bash # Verify unsynchronized serviceClient access sites in this file. rg -n 'serviceClient' internal/ironic/client.go rg -n 'sync\.RWMutex|mu\.(RLock|RUnlock|Lock|Unlock)' internal/ironic/client.goAlso applies to: 96-107, 113-114, 120-120, 132-132, 143-143, 150-150
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/ironic/client.go` around lines 42 - 43, The code currently performs unsynchronized reads/writes of the c.serviceClient field causing data races; add a sync.RWMutex (e.g., mu) to the client struct and wrap all accesses to serviceClient: use mu.Lock()/mu.Unlock() when assigning in reconnect (or any writer like newServiceClient usage) and mu.RLock()/mu.RUnlock() for all readers in request paths and helper methods that reference serviceClient (search for uses in reconnect, methods that call c.serviceClient, and any code paths around token/401 recovery) to ensure safe concurrent access.
68-73:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd nil guards before dereferencing recreated/created service clients.
Major risk / reliability impact: both constructor and reconnect dereference
sc.Endpointwithout checkingsc != nil; a(nil, nil)factory return panics and can crash the process on the recovery path.Suggested fix
sc, err := factory(ctx) if err != nil { return nil, fmt.Errorf("failed to create baremetal client: %w", err) } + if sc == nil { + return nil, fmt.Errorf("failed to create baremetal client: nil service client") + } if sc.Endpoint == "" { return nil, fmt.Errorf("baremetal client has no endpoint configured") } @@ sc, err := c.newServiceClient(ctx) if err != nil { slog.Error("failed to recreate ironic service client", "error", err) return fmt.Errorf("failed to recreate baremetal client: %w", err) } + if sc == nil { + return fmt.Errorf("failed to recreate baremetal client: nil service client") + } if sc.Endpoint == "" { return fmt.Errorf("recreated baremetal client has no endpoint configured") }#!/bin/bash # Verify endpoint dereferences and nil guards around ServiceClient creation/recreation. rg -n 'factory\(ctx\)|newServiceClient\(ctx\)|Endpoint == ""|sc == nil' internal/ironic/client.goAlso applies to: 98-105
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/ironic/client.go` around lines 68 - 73, The code dereferences sc.Endpoint without ensuring sc is non-nil (from factory(ctx) and in the reconnect/newServiceClient paths), so add nil guards: after calling factory(ctx) or newServiceClient(ctx) check if sc == nil and return a descriptive error (e.g., "baremetal client creation returned nil service client") before any use of sc.Endpoint; likewise update the reconnect logic to verify sc != nil before accessing sc.Endpoint or other fields to avoid panics. Ensure changes reference the factory(ctx) call, the sc variable, sc.Endpoint usage, and the reconnect/newServiceClient functions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/main.go`:
- Line 237: Replace the unbounded context passed to ironic.NewClient with a
timeout-bound context: create a context via
context.WithTimeout(context.Background(), <timeout>), call ironic.NewClient with
that ctx, and ensure you call cancel() (defer cancel()) to avoid leaks; choose a
sensible timeout (e.g., 15–30s) and add the time import if missing so startup
cannot block indefinitely on network/auth failures when initializing
ironicClient.
---
Duplicate comments:
In `@internal/ironic/client.go`:
- Around line 42-43: The code currently performs unsynchronized reads/writes of
the c.serviceClient field causing data races; add a sync.RWMutex (e.g., mu) to
the client struct and wrap all accesses to serviceClient: use
mu.Lock()/mu.Unlock() when assigning in reconnect (or any writer like
newServiceClient usage) and mu.RLock()/mu.RUnlock() for all readers in request
paths and helper methods that reference serviceClient (search for uses in
reconnect, methods that call c.serviceClient, and any code paths around
token/401 recovery) to ensure safe concurrent access.
- Around line 68-73: The code dereferences sc.Endpoint without ensuring sc is
non-nil (from factory(ctx) and in the reconnect/newServiceClient paths), so add
nil guards: after calling factory(ctx) or newServiceClient(ctx) check if sc ==
nil and return a descriptive error (e.g., "baremetal client creation returned
nil service client") before any use of sc.Endpoint; likewise update the
reconnect logic to verify sc != nil before accessing sc.Endpoint or other fields
to avoid panics. Ensure changes reference the factory(ctx) call, the sc
variable, sc.Endpoint usage, and the reconnect/newServiceClient functions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: b3e2bc1f-0ea4-4172-9671-e1d1393f32d9
📒 Files selected for processing (4)
cmd/main.gointernal/ironic/client.gointernal/ironic/client_internal_test.gointernal/ironic/client_test.go
c39b355 to
c1c260e
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/ironic/client.go`:
- Around line 68-73: The code calls factory(ctx) and immediately accesses
sc.Endpoint, which dereferences sc without checking for nil; update both factory
call sites in internal/ironic/client.go to first check if sc == nil after sc,
err := factory(ctx) and if so return a descriptive error (e.g.,
fmt.Errorf("baremetal client factory returned nil ServiceClient")) before any
use of sc.Endpoint or other fields, and keep the existing error wrap for err;
apply the same nil-check guard to the other occurrence where sc is dereferenced
(the second factory(...) call block) to prevent panics during startup or
auth-recovery.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: beb658ec-33a2-4135-bf6a-31c439b0b2a6
📒 Files selected for processing (4)
cmd/main.gointernal/ironic/client.gointernal/ironic/client_internal_test.gointernal/ironic/client_test.go
0cc57ff to
636c785
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
internal/ironic/client_internal_test.go (1)
135-149:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMissing test coverage for nil ServiceClient return remains unaddressed.
Security risk: If
newServiceClientfactory returns(nil, nil)and the implementation dereferencesclient.Endpointwithout a nil-check, this causes a panic. In production, panics reduce availability and create a potential denial-of-service vector if the factory has a bug.Impact: Availability degradation; untested edge case could cause crash loops during auth failures.
🛡️ Proposed additional test case
+ It("should return error when factory returns nil client", func() { + oldSC := &gophercloud.ServiceClient{Endpoint: oldEndpoint} + + c := &Client{ + serviceClient: oldSC, + newServiceClient: func(context.Context) (*gophercloud.ServiceClient, error) { + return nil, nil + }, + } + + err := c.reconnect(context.Background()) + Expect(err).To(HaveOccurred()) + Expect(err.Error()).To(ContainSubstring("nil service client")) + Expect(c.serviceClient).To(Equal(oldSC), "should keep old client on failure") + }) + It("should return error when new client has empty endpoint", func() {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/ironic/client_internal_test.go` around lines 135 - 149, The test suite and reconnect logic must handle the case where newServiceClient returns (nil, nil); add a test in internal/ironic/client_internal_test.go that sets Client.newServiceClient to return (nil, nil) and assert that Client.reconnect(ctx) returns a non-nil error containing a clear message (e.g., "no endpoint configured" or "service client is nil"); then update the Client.reconnect implementation to check the returned *gophercloud.ServiceClient for nil before dereferencing client.Endpoint (references: Client.reconnect, Client.newServiceClient, Client.serviceClient, ServiceClient.Endpoint) so it returns a safe error instead of panicking.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/ironic/client.go`:
- Around line 81-94: The isAuthError function currently uses value targets for
errors.As which misses pointer-typed gophercloud reauth errors; change the
errors.As checks in isAuthError to use pointer targets (e.g.,
*gophercloud.ErrUnableToReauthenticate and
*gophercloud.ErrErrorAfterReauthentication) so pointer instances returned by
ProviderClient are matched, and update tests in client_internal_test.go to
construct the actual pointer-shaped errors (use
&gophercloud.ErrUnableToReauthenticate{...} and
&gophercloud.ErrErrorAfterReauthentication{...}) to verify the real runtime
shapes trigger reconnect.
---
Duplicate comments:
In `@internal/ironic/client_internal_test.go`:
- Around line 135-149: The test suite and reconnect logic must handle the case
where newServiceClient returns (nil, nil); add a test in
internal/ironic/client_internal_test.go that sets Client.newServiceClient to
return (nil, nil) and assert that Client.reconnect(ctx) returns a non-nil error
containing a clear message (e.g., "no endpoint configured" or "service client is
nil"); then update the Client.reconnect implementation to check the returned
*gophercloud.ServiceClient for nil before dereferencing client.Endpoint
(references: Client.reconnect, Client.newServiceClient, Client.serviceClient,
ServiceClient.Endpoint) so it returns a safe error instead of panicking.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: osac-project/coderabbit/.coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 343190ff-64d0-4747-9333-bc7f416ce850
📒 Files selected for processing (4)
cmd/main.gointernal/ironic/client.gointernal/ironic/client_internal_test.gointernal/ironic/client_test.go
When gophercloud's built-in token refresh (ReauthFunc) fails — e.g. due to a network interruption or Keystone being temporarily unavailable — the Ironic client was left permanently stuck with a stale token until the pod restarted. This adds a fallback that detects 401/reauth errors, recreates the service client from scratch, and retries the operation once. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
636c785 to
4373790
Compare
Deferring the data race fix to a future PR as agreed with the author.
Port the reconnect logic from the old internal/ironic package (PR osac-project#7) into the new internal/management layer. When gophercloud's built-in token refresh fails, the client detects auth errors (401, reauth failures), recreates the service client from scratch, and retries the operation once. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Port the reconnect logic from the old internal/ironic package (PR #7) into the new internal/management layer. When gophercloud's built-in token refresh fails, the client detects auth errors (401, reauth failures), recreates the service client from scratch, and retries the operation once. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Closes: osac-project/issues#394
When gophercloud's built-in token refresh (ReauthFunc) fails — e.g. due to a network interruption or Keystone being temporarily unavailable — the Ironic client was left permanently stuck with a stale token until the pod restarted. This adds a fallback that detects 401/reauth errors, recreates the service client from scratch, and retries the operation once.
Summary by CodeRabbit
New Features
Bug Fixes
Tests