gRPC xDS Transport Converts Unauthenticated to Unavailable, Causing Infinite Retry
Summary
When using storage.NewGRPCClient() on GKE with a misconfigured Workload Identity binding, the client retries indefinitely instead of returning an Unauthenticated error. The root cause is that the gRPC Go xDS transport and pickfirst balancer discard the original gRPC status code, and the picker wrapper unconditionally assigns codes.Unavailable to all non-status errors. The storage client's retry predicate then treats Unavailable as retryable.
Reproduction
On GKE with Workload Identity enabled:
- Create a K8s ServiceAccount annotated with a GCP SA
- Create the Workload Identity binding for a different K8s SA name than the one the pod uses
- Use storage.NewGRPCClient() and call any method (e.g. Object.Attrs())
The client hangs until the context deadline expires. With a short per-call timeout, the error surfaces as:
retry failed with context deadline exceeded; last error: rpc error: code = Unavailable
desc = name resolver error: ... xds: error received from xDS stream: rpc error:
code = Unauthenticated desc = transport: per-RPC creds failed due to error:
compute: Received 403 `Unable to generate access token; IAM returned 403 Forbidden:
Permission 'iam.serviceAccounts.getAccessToken' denied on resource`
The inner code is Unauthenticated (not retryable per gRPC spec). The outer code is Unavailable (retryable). However, the Google Cloud Storage client's retry logic reads only the outer code.
Error Transformation Chain
IAM 403 "Permission denied" (raw HTTP error)
-> codes.Unauthenticated (grpc/internal/transport/http2_client.go)
-> xdsresource.NewErrorf("xds: ... %v", err) (grpc/internal/xds/clients/xdsclient/authority.go)
   *** gRPC status code DISCARDED (fmt %v) ***
-> fmt.Errorf("listener resource error: %v", err) (grpc/internal/xds/xdsdepmgr/xds_dependency_manager.go)
-> fmt.Errorf("name resolver error: %v", err) (grpc/balancer/pickfirst/pickfirst.go)
-> status.Error(codes.Unavailable, err.Error()) (grpc/picker_wrapper.go)
   *** codes.Unavailable ASSIGNED ***
-> ShouldRetry: Unavailable -> true (storage/invoke.go)
   *** infinite retry ***
Root Cause Analysis
There are three contributing bugs:
1. authority.go discards the gRPC status code
grpc/internal/xds/clients/xdsclient/authority.go ~L244:
watcher.ResourceError(xdsresource.NewErrorf(xdsresource.ErrorTypeConnection,
"xds: error received from xDS stream: %v", err), func() {})
The %v formatting serializes the original status.Error to a string, discarding its structured code. From this point forward, the original Unauthenticated code exists only as text in the message.
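The effect of %v formatting can be reproduced with the standard library alone. The sketch below is only a model: codedError, flatten, wrap, and codeOf are hypothetical names standing in for a gRPC status error and for the wrapping step, not grpc-go types or functions. It compares formatting with %v against wrapping with %w, and does not claim that xdsresource.NewErrorf could adopt %w as-is.

```go
package main

import (
	"errors"
	"fmt"
)

// codedError is a hypothetical stand-in for a gRPC status error:
// an error value that carries a structured code next to its message.
type codedError struct {
	code string
	msg  string
}

func (e *codedError) Error() string {
	return fmt.Sprintf("rpc error: code = %s desc = %s", e.code, e.msg)
}

// flatten formats the inner error with %v, so only its text survives.
func flatten(err error) error {
	return fmt.Errorf("xds: error received from xDS stream: %v", err)
}

// wrap formats the inner error with %w, so it stays reachable in the chain.
func wrap(err error) error {
	return fmt.Errorf("xds: error received from xDS stream: %w", err)
}

// codeOf returns the structured code found anywhere in err's chain,
// or "" when no codedError is reachable via errors.As.
func codeOf(err error) string {
	var ce *codedError
	if errors.As(err, &ce) {
		return ce.code
	}
	return ""
}

func main() {
	orig := &codedError{code: "Unauthenticated", msg: "per-RPC creds failed"}

	fmt.Printf("after %%v: code=%q\n", codeOf(flatten(orig)))
	fmt.Printf("after %%w: code=%q\n", codeOf(wrap(orig)))
}
```

In this model the %v path yields an empty code while the %w path still yields "Unauthenticated", even though both error strings are identical.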
2. pickfirst.go wraps with fmt.Errorf, not status.Errorf
grpc/balancer/pickfirst/pickfirst.go ~L234:
b.updateBalancerState(balancer.State{
ConnectivityState: connectivity.TransientFailure,
Picker: &picker{err: fmt.Errorf("name resolver error: %v", err)},
})
The picker stores a plain error, not a status.Error. This means the subsequent status extraction will fail.
3. picker_wrapper.go defaults all non-status errors to Unavailable
grpc/picker_wrapper.go ~L176:
return pick{}, status.Error(codes.Unavailable, err.Error())
When the picker returns a plain error (not a status.Error), this line unconditionally assigns codes.Unavailable. This is the line that makes the error retryable.
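To make the defaulting step concrete, here is a minimal stdlib-only model of it. Everything in it is an assumption made for illustration: statusErr, finishPick, and resolverError are hypothetical names, and finishPick only approximates the behaviour described above (status errors pass through, everything else becomes Unavailable). It is not the grpc-go implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// code is a hypothetical stand-in for a gRPC status code.
type code string

const (
	unavailable     code = "Unavailable"
	unauthenticated code = "Unauthenticated"
)

// statusErr is a hypothetical stand-in for a gRPC status error.
type statusErr struct {
	c   code
	msg string
}

func (e *statusErr) Error() string {
	return fmt.Sprintf("rpc error: code = %s desc = %s", e.c, e.msg)
}

// resolverError models the pickfirst wrapping with %v, which yields a
// plain error whose code survives only as text.
func resolverError(err error) error {
	return fmt.Errorf("name resolver error: %v", err)
}

// finishPick models the defaulting: a picker error that already is a
// status error passes through unchanged; anything else becomes Unavailable
// with the original text as its message.
func finishPick(pickErr error) *statusErr {
	var se *statusErr
	if errors.As(pickErr, &se) {
		return se
	}
	return &statusErr{c: unavailable, msg: pickErr.Error()}
}

func main() {
	auth := &statusErr{c: unauthenticated, msg: "per-RPC creds failed"}

	direct := finishPick(auth)
	viaResolver := finishPick(resolverError(auth))

	fmt.Println("direct code:      ", direct.c)
	fmt.Println("via resolver code:", viaResolver.c)
	fmt.Println("via resolver msg: ", viaResolver.msg)
}
```

The same credential failure ends up with two different codes depending only on whether it passed through the %v wrapping first, which matches the inner/outer mismatch visible in the reproduction output.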
Storage Retry Predicate
storage/invoke.go ~L237:
if st, ok := status.FromError(err); ok {
if code := st.Code(); code == codes.Unavailable || ... {
return true
}
}
ShouldRetry does have recursive unwrap logic (L243), but status.Error doesn't implement Unwrap() -- it embeds the original error as a string in the status message, not as a structured error chain. The recursion never fires.
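The resulting hang can be modelled with a small retry loop. This is a sketch under stated assumptions, not the storage client's code: shouldRetry, retry, and simulate are hypothetical, the predicate looks only at the outermost error as described above, and the backoff is a fixed delay rather than the exponential backoff a real client would use.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errUnavailable is a hypothetical stand-in for the outer Unavailable
// status that every attempt returns.
var errUnavailable = errors.New("rpc error: code = Unavailable desc = name resolver error: ...")

// shouldRetry is a simplified predicate that considers only the
// outermost error.
func shouldRetry(err error) bool {
	return errors.Is(err, errUnavailable)
}

// retry calls op until it succeeds, the predicate rejects the error, or
// ctx is done. It returns the number of attempts and the final error.
func retry(ctx context.Context, backoff time.Duration, op func() error) (int, error) {
	attempts := 0
	for {
		attempts++
		err := op()
		if err == nil || !shouldRetry(err) {
			return attempts, err
		}
		select {
		case <-ctx.Done():
			return attempts, fmt.Errorf("retry failed with %v; last error: %w", ctx.Err(), err)
		case <-time.After(backoff):
		}
	}
}

// simulate runs an operation that always fails with errUnavailable under
// the given overall timeout.
func simulate(timeout, backoff time.Duration) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return retry(ctx, backoff, func() error { return errUnavailable })
}

func main() {
	attempts, err := simulate(50*time.Millisecond, 5*time.Millisecond)
	fmt.Println("attempts:", attempts)
	fmt.Println("error:", err)
}
```

Because the predicate never sees a non-retryable code, the only exit from the loop is the context deadline; without a deadline on the context, the loop in this model would not terminate.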
Impact
- All Google Cloud Go client libraries that retry on Unavailable are affected when used with the gRPC/xDS transport (DirectPath). This includes Storage, BigQuery, Pub/Sub, and Spanner.
- Storage is the most commonly affected because it defaults to NewGRPCClient (with xDS) on GKE.
- The failure mode is silent -- no error is logged, the call simply hangs until context deadline.
- Unauthenticated and PermissionDenied errors from the credential provider are both affected.
Correct Behavior per gRPC Spec
Per the gRPC status code documentation:
- UNAVAILABLE (14): "transient condition, which can be corrected by retrying" -- retryable
- UNAUTHENTICATED (16): "does not have valid authentication credentials" -- not retryable (requires fixing credentials)
Per gRFC A6: "only status codes that indicate the service did not process the request should be retried."
Suggested Fixes
In grpc-go (root cause):
- authority.go should propagate the gRPC status code through the xDS error types rather than formatting with %v
- pickfirst.go should use status.Errorf to preserve structured error codes
- picker_wrapper.go should inspect inner errors before defaulting to Unavailable, or at minimum not assign Unavailable when the error chain contains Unauthenticated
In google-cloud-go/storage (defense in depth):
- ShouldRetry could inspect the error message for auth-related keywords ("Unauthenticated", "PermissionDenied", "iam.serviceAccounts") and refuse to retry, similar to the existing x509 check in gax-go
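A rough sketch of the defense-in-depth idea, using only the keywords listed above. The function name retryableUnavailable and its string-only signature are hypothetical; this is not a patch against ShouldRetry, and a real change would have to fit the existing predicate and its status handling.

```go
package main

import (
	"fmt"
	"strings"
)

// authMarkers are the substrings proposed in this issue. Matching on
// message text is a heuristic and may need tuning.
var authMarkers = []string{"Unauthenticated", "PermissionDenied", "iam.serviceAccounts"}

// retryableUnavailable reports whether an Unavailable error with the given
// message should still be retried. It refuses when the message contains
// any auth-related marker.
func retryableUnavailable(msg string) bool {
	for _, m := range authMarkers {
		if strings.Contains(msg, m) {
			return false
		}
	}
	return true
}

func main() {
	authMsg := "name resolver error: ... xds: error received from xDS stream: " +
		"rpc error: code = Unauthenticated desc = transport: per-RPC creds failed"
	netMsg := "connection error: desc = transport is closing"

	fmt.Println("auth failure retryable:   ", retryableUnavailable(authMsg))
	fmt.Println("network failure retryable:", retryableUnavailable(netMsg))
}
```

String matching is brittle: it depends on message wording that upstream can change, and it could misclassify a transient error whose text happens to mention one of the markers. That is why it is suggested only as a fallback behind the grpc-go fixes.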