storage: gRPC client retries indefinitely on Workload Identity auth failures (Unauthenticated wrapped as Unavailable) #14391

@AaronReboot

gRPC xDS Transport Converts Unauthenticated to Unavailable, Causing Infinite Retry

Summary

When using storage.NewGRPCClient() on GKE with a misconfigured Workload Identity binding, the client retries indefinitely instead of returning an Unauthenticated error. The root cause is that the gRPC Go xDS transport and pickfirst balancer discard the original gRPC status code, and the picker wrapper unconditionally assigns codes.Unavailable to all non-status errors. The storage client's retry predicate then treats Unavailable as retryable.

Reproduction

On GKE with Workload Identity enabled:

  1. Create a K8s ServiceAccount annotated with a GCP SA
  2. Create the Workload Identity binding for a different K8s SA name than the one the pod uses
  3. Use storage.NewGRPCClient() and call any method (e.g. Object.Attrs())

The client hangs until the context deadline expires. With a short per-call timeout, the error surfaces as:

retry failed with context deadline exceeded; last error: rpc error: code = Unavailable
  desc = name resolver error: ... xds: error received from xDS stream: rpc error:
  code = Unauthenticated desc = transport: per-RPC creds failed due to error:
  compute: Received 403 `Unable to generate access token; IAM returned 403 Forbidden:
  Permission 'iam.serviceAccounts.getAccessToken' denied on resource`

The inner code is Unauthenticated (not retryable per gRPC spec). The outer code is Unavailable (retryable). However, the Google Cloud Storage client's retry logic reads only the outer code.
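To illustrate why the call appears to hang, here is a minimal stdlib-only sketch of a retry loop bounded only by the context deadline. The names `retry` and `demo` are hypothetical and this is not the gax-go or storage implementation; it only models the behavior described above, where an error classified as retryable is re-attempted until the context expires.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retry is a simplified stand-in for a client retry loop (hypothetical, not
// the gax-go or storage code). It re-invokes call until it succeeds, returns
// a non-retryable error, or ctx is done.
func retry(ctx context.Context, backoff time.Duration, call func() error, retryable func(error) bool) error {
	for {
		err := call()
		if err == nil || !retryable(err) {
			return err
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("retry failed with %v; last error: %w", ctx.Err(), err)
		case <-time.After(backoff):
		}
	}
}

// demo runs a call that always fails and reports how many attempts were made.
func demo(treatAsRetryable bool) (attempts int, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	call := func() error {
		attempts++
		return errors.New("rpc error: code = Unavailable desc = name resolver error: ...")
	}
	err = retry(ctx, 5*time.Millisecond, call, func(error) bool { return treatAsRetryable })
	return attempts, err
}

func main() {
	n, err := demo(true)
	fmt.Printf("retryable: %d attempts, err: %v\n", n, err)
	n, err = demo(false)
	fmt.Printf("not retryable: %d attempts, err: %v\n", n, err)
}
```

In this model, a caller without a deadline on the context would never see the error at all, which matches the "hangs" symptom.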

Error Transformation Chain

IAM 403 "Permission denied"                              (raw HTTP error)
  -> codes.Unauthenticated                                grpc/internal/transport/http2_client.go
  -> xdsresource.NewErrorf("xds: ... %v", err)            grpc/internal/xds/clients/xdsclient/authority.go
                                                           *** gRPC status code DISCARDED (fmt %v) ***
  -> fmt.Errorf("listener resource error: %v", err)        grpc/internal/xds/xdsdepmgr/xds_dependency_manager.go
  -> fmt.Errorf("name resolver error: %v", err)            grpc/balancer/pickfirst/pickfirst.go
  -> status.Error(codes.Unavailable, err.Error())          grpc/picker_wrapper.go
                                                           *** codes.Unavailable ASSIGNED ***
  -> ShouldRetry: Unavailable -> true                      storage/invoke.go
                                                           *** infinite retry ***

Root Cause Analysis

There are three contributing bugs:

1. authority.go discards the gRPC status code

grpc/internal/xds/clients/xdsclient/authority.go ~L244:

watcher.ResourceError(xdsresource.NewErrorf(xdsresource.ErrorTypeConnection,
    "xds: error received from xDS stream: %v", err), func() {})

The %v formatting serializes the original status.Error to a string, discarding its structured code. From this point forward, the original Unauthenticated code exists only as text in the message.
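The effect of `%v` can be shown with a stdlib-only model. `Code`, `StatusError`, `codeOf`, `flatten`, and `wrap` below are hypothetical stand-ins, not grpc-go types, and `fmt.Errorf` stands in for `xdsresource.NewErrorf` (whether that helper supports `%w` is not something this sketch assumes). The `%w` variant is included only to contrast with a chain that stays inspectable.

```go
package main

import (
	"errors"
	"fmt"
)

// Code is a stand-in for a gRPC status code (hypothetical, not codes.Code).
type Code int

const (
	Unavailable     Code = 14
	Unauthenticated Code = 16
)

// StatusError is a hypothetical structured error that carries a code.
type StatusError struct {
	Code Code
	Msg  string
}

func (e *StatusError) Error() string {
	return fmt.Sprintf("rpc error: code = %d desc = %s", e.Code, e.Msg)
}

// codeOf returns the code of the first *StatusError found in err's chain.
func codeOf(err error) (Code, bool) {
	var se *StatusError
	if errors.As(err, &se) {
		return se.Code, true
	}
	return 0, false
}

// flatten formats the cause with %v: only its text survives.
func flatten(err error) error {
	return fmt.Errorf("xds: error received from xDS stream: %v", err)
}

// wrap formats the cause with %w: the cause stays reachable via Unwrap.
func wrap(err error) error {
	return fmt.Errorf("xds: error received from xDS stream: %w", err)
}

func main() {
	orig := &StatusError{Code: Unauthenticated, Msg: "per-RPC creds failed"}
	_, ok := codeOf(flatten(orig))
	fmt.Println("flattened keeps code:", ok)
	c, ok := codeOf(wrap(orig))
	fmt.Println("wrapped keeps code:", ok, c)
}
```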

2. pickfirst.go wraps with fmt.Errorf, not status.Errorf

grpc/balancer/pickfirst/pickfirst.go ~L234:

b.updateBalancerState(balancer.State{
    ConnectivityState: connectivity.TransientFailure,
    Picker:            &picker{err: fmt.Errorf("name resolver error: %v", err)},
})

The picker stores a plain error, not a status.Error. This means the subsequent status extraction will fail.

3. picker_wrapper.go defaults all non-status errors to Unavailable

grpc/picker_wrapper.go ~L176:

return pick{}, status.Error(codes.Unavailable, err.Error())

When the picker returns a plain error (not a status.Error), this line unconditionally assigns codes.Unavailable. This is the line that makes the error retryable.
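A rough stdlib-only model of this defaulting step, for illustration. `StatusError`, `resolverError`, and `toStatus` are hypothetical names, not the grpc-go code; the sketch assumes only what is described above: an error that is already a status passes through, and anything else is labeled Unavailable.

```go
package main

import "fmt"

// Code is a stand-in for a gRPC status code (hypothetical, not codes.Code).
type Code int

const (
	Unavailable     Code = 14
	Unauthenticated Code = 16
)

// StatusError is a hypothetical structured error that carries a code.
type StatusError struct {
	Code Code
	Msg  string
}

func (e *StatusError) Error() string {
	return fmt.Sprintf("rpc error: code = %d desc = %s", e.Code, e.Msg)
}

// resolverError models the pickfirst wrapping: a plain error whose message
// contains the cause's text but not its structure.
func resolverError(cause error) error {
	return fmt.Errorf("name resolver error: %v", cause)
}

// toStatus models the defaulting step: status errors pass through unchanged,
// every other error becomes Unavailable with the original text as message.
func toStatus(err error) *StatusError {
	if se, ok := err.(*StatusError); ok {
		return se
	}
	return &StatusError{Code: Unavailable, Msg: err.Error()}
}

func main() {
	inner := &StatusError{Code: Unauthenticated, Msg: "per-RPC creds failed"}
	fmt.Println("direct:", toStatus(inner).Code)
	fmt.Println("via resolver error:", toStatus(resolverError(inner)).Code)
}
```

In this model the same underlying failure yields two different codes depending solely on whether an intermediate layer flattened it to text first.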

Storage Retry Predicate

storage/invoke.go ~L237:

if st, ok := status.FromError(err); ok {
    if code := st.Code(); code == codes.Unavailable || ... {
        return true
    }
}

ShouldRetry does have recursive unwrap logic (L243), but the error returned by status.Error does not implement Unwrap(): the original error is embedded as a string in the status message, not kept as a structured error chain. As a result, the recursion never reaches the inner Unauthenticated error.
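The following stdlib-only sketch models why the unwrap recursion has nothing to recurse into. `shouldRetry` and `rewrapAsUnavailable` are hypothetical simplifications, not the storage or grpc-go code; the assumption is only that the outer error carries the inner one as message text and exposes no Unwrap method.

```go
package main

import (
	"errors"
	"fmt"
)

// Code is a stand-in for a gRPC status code (hypothetical, not codes.Code).
type Code int

const (
	Unavailable     Code = 14
	Unauthenticated Code = 16
)

// StatusError is a hypothetical status-like error. It deliberately has no
// Unwrap method, mirroring the behavior described in the issue.
type StatusError struct {
	Code Code
	Msg  string
}

func (e *StatusError) Error() string {
	return fmt.Sprintf("rpc error: code = %d desc = %s", e.Code, e.Msg)
}

// rewrapAsUnavailable keeps the inner error only as message text.
func rewrapAsUnavailable(err error) error {
	return &StatusError{Code: Unavailable, Msg: err.Error()}
}

// shouldRetry checks the outermost code, then recurses through Unwrap.
func shouldRetry(err error) bool {
	if err == nil {
		return false
	}
	if se, ok := err.(*StatusError); ok {
		return se.Code == Unavailable
	}
	if u := errors.Unwrap(err); u != nil {
		return shouldRetry(u)
	}
	return false
}

func main() {
	inner := &StatusError{Code: Unauthenticated, Msg: "per-RPC creds failed"}
	outer := rewrapAsUnavailable(inner)
	fmt.Println("outer has a wrapped cause:", errors.Unwrap(outer) != nil)
	fmt.Println("retry inner:", shouldRetry(inner))
	fmt.Println("retry outer:", shouldRetry(outer))
}
```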

Impact

  • All Google Cloud Go client libraries that retry on Unavailable are affected when used with the gRPC/xDS transport (DirectPath). This includes Storage, BigQuery, Pub/Sub, and Spanner.
  • Storage is the most commonly affected because it defaults to NewGRPCClient (with xDS) on GKE.
  • The failure mode is silent -- no error is logged, the call simply hangs until context deadline.
  • Unauthenticated and PermissionDenied errors from the credential provider are both affected.

Correct Behavior per gRPC Spec

Per the gRPC status code documentation:

  • UNAVAILABLE (14): "transient condition, which can be corrected by retrying" -- retryable
  • UNAUTHENTICATED (16): "does not have valid authentication credentials" -- not retryable (requires fixing credentials)

Per gRFC A6: "only status codes that indicate the service did not process the request should be retried."

Suggested Fixes

In grpc-go (root cause):

  1. authority.go should propagate the gRPC status code through the xDS error types rather than formatting with %v
  2. pickfirst.go should use status.Errorf to preserve structured error codes
  3. picker_wrapper.go should inspect inner errors before defaulting to Unavailable, or at minimum not assign Unavailable when the error chain contains Unauthenticated

In google-cloud-go/storage (defense in depth):

  1. ShouldRetry could inspect the error message for auth-related keywords ("Unauthenticated", "PermissionDenied", "iam.serviceAccounts") and refuse to retry, similar to the existing x509 check in gax-go
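A minimal sketch of what such a message check might look like, using the keywords listed above. `authMarkers` and `looksLikeAuthFailure` are hypothetical names, and string matching on error text is inherently fragile; this is meant as a stopgap illustration, not a proposed patch.

```go
package main

import (
	"fmt"
	"strings"
)

// authMarkers are substrings taken from the suggestion above. They are
// heuristics, not a stable API contract.
var authMarkers = []string{"Unauthenticated", "PermissionDenied", "iam.serviceAccounts"}

// looksLikeAuthFailure reports whether msg contains any auth-related marker.
func looksLikeAuthFailure(msg string) bool {
	for _, m := range authMarkers {
		if strings.Contains(msg, m) {
			return true
		}
	}
	return false
}

func main() {
	msg := "rpc error: code = Unavailable desc = name resolver error: ... " +
		"code = Unauthenticated desc = transport: per-RPC creds failed"
	fmt.Println(looksLikeAuthFailure(msg))
	fmt.Println(looksLikeAuthFailure("rpc error: code = Unavailable desc = connection refused"))
}
```

A retry predicate could call a check like this on `err.Error()` before treating Unavailable as retryable.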

Labels

api: storage (Issues related to the Cloud Storage API.)