storage: gRPC client retries indefinitely on Workload Identity auth failures (Unauthenticated wrapped as Unavailable) #14391

# gRPC xDS Transport Converts `Unauthenticated` to `Unavailable`, Causing Infinite Retry

## Summary

When `storage.NewGRPCClient()` is used on GKE with a misconfigured Workload Identity binding, the client retries until the context deadline expires (indefinitely if no deadline is set) instead of returning an `Unauthenticated` error. The root cause is that the gRPC Go xDS transport and pickfirst balancer discard the original gRPC status code, after which the picker wrapper _unconditionally_ assigns `codes.Unavailable` to all non-status errors. The storage client's retry predicate then treats `Unavailable` as retryable.

## Reproduction

On GKE with Workload Identity enabled:

1. Create a K8s ServiceAccount annotated with a GCP SA
2. Create the Workload Identity binding for a **different** K8s SA name than the one the pod uses
3. Use `storage.NewGRPCClient()` and call any method (e.g. `Object.Attrs()`)

The client hangs until the context deadline expires. With a short per-call timeout, the error surfaces as:

```
retry failed with context deadline exceeded; last error: rpc error: code = Unavailable
  desc = name resolver error: ... xds: error received from xDS stream: rpc error:
  code = Unauthenticated desc = transport: per-RPC creds failed due to error:
  compute: Received 403 `Unable to generate access token; IAM returned 403 Forbidden:
  Permission 'iam.serviceAccounts.getAccessToken' denied on resource`
```

The inner code is `Unauthenticated` (not retryable per the gRPC spec). The outer code is `Unavailable` (retryable). However, the Google Cloud Storage client's retry logic reads only the outer code.

## Error Transformation Chain

```
IAM 403 "Permission denied"                              (raw HTTP error)
  -> codes.Unauthenticated                                grpc/internal/transport/http2_client.go
  -> xdsresource.NewErrorf("xds: ... %v", err)            grpc/internal/xds/clients/xdsclient/authority.go
                                                           *** gRPC status code DISCARDED (fmt %v) ***
  -> fmt.Errorf("listener resource error: %v", err)        grpc/internal/xds/xdsdepmgr/xds_dependency_manager.go
  -> fmt.Errorf("name resolver error: %v", err)            grpc/balancer/pickfirst/pickfirst.go
  -> status.Error(codes.Unavailable, err.Error())          grpc/picker_wrapper.go
                                                           *** codes.Unavailable ASSIGNED ***
  -> ShouldRetry: Unavailable -> true                      storage/invoke.go
                                                           *** infinite retry ***
```

## Root Cause Analysis

There are three contributing bugs:

### 1. `authority.go` discards the gRPC status code

[`grpc/internal/xds/clients/xdsclient/authority.go`](https://github.com/grpc/grpc-go/blob/master/internal/xds/clients/xdsclient/authority.go) ~L244:

```go
watcher.ResourceError(xdsresource.NewErrorf(xdsresource.ErrorTypeConnection,
    "xds: error received from xDS stream: %v", err), func() {})
```

The `%v` formatting serializes the original `status.Error` to a string, discarding its structured code. From this point forward, the original `Unauthenticated` code exists only as text in the message.

### 2. `pickfirst.go` wraps with `fmt.Errorf`, not `status.Errorf`

[`grpc/balancer/pickfirst/pickfirst.go`](https://github.com/grpc/grpc-go/blob/master/balancer/pickfirst/pickfirst.go) ~L234:

```go
b.updateBalancerState(balancer.State{
    ConnectivityState: connectivity.TransientFailure,
    Picker:            &picker{err: fmt.Errorf("name resolver error: %v", err)},
})
```

The picker stores a plain `error`, not a `status.Error`, so the status extraction that follows in `picker_wrapper.go` fails.

### 3. `picker_wrapper.go` defaults all non-status errors to `Unavailable`

[`grpc/picker_wrapper.go`](https://github.com/grpc/grpc-go/blob/master/picker_wrapper.go) ~L176:

```go
return pick{}, status.Error(codes.Unavailable, err.Error())
```

When the picker returns a plain `error` (not a `status.Error`), this line unconditionally assigns `codes.Unavailable`. This is the line that makes the error retryable.

## Storage Retry Predicate

[`storage/invoke.go`](https://github.com/googleapis/google-cloud-go/blob/main/storage/invoke.go) ~L237:

```go
if st, ok := status.FromError(err); ok {
    if code := st.Code(); code == codes.Unavailable || ... {
        return true
    }
}
```

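`ShouldRetry` does have recursive unwrap logic (L243), but it never fires here, for two reasons. First, the `Unavailable` check above returns `true` before the unwrap step is reached. Second, `status.Error` doesn't implement `Unwrap()` -- it embeds the original error as a string in the status message, not as a structured error chain, so there would be nothing to unwrap.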
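The predicate's behavior on this error can be modeled in a few lines. This is a simplified sketch under stated assumptions: `statusErr` and `shouldRetry` are hypothetical names, and the real `ShouldRetry` checks more error types and more codes than shown here.

```go
package main

import "fmt"

// code is a hypothetical stand-in for gRPC's codes.Code.
type code int

const (
	unavailable     code = 14
	unauthenticated code = 16
)

// statusErr is a hypothetical stand-in for a gRPC status error. As described
// in the text, it deliberately has no Unwrap method.
type statusErr struct {
	c   code
	msg string
}

func (e *statusErr) Error() string {
	return fmt.Sprintf("rpc error: code = %d desc = %s", e.c, e.msg)
}

// shouldRetry is a simplified model of the predicate: check the outer code,
// then recurse through Unwrap if the error offers one.
func shouldRetry(err error) bool {
	if err == nil {
		return false
	}
	if se, ok := err.(*statusErr); ok && se.c == unavailable {
		return true
	}
	if u, ok := err.(interface{ Unwrap() error }); ok {
		return shouldRetry(u.Unwrap())
	}
	return false
}

func main() {
	inner := &statusErr{c: unauthenticated, msg: "per-RPC creds failed"}
	// The inner error survives only as text inside the outer description.
	outer := &statusErr{c: unavailable, msg: "name resolver error: " + inner.Error()}

	fmt.Println(shouldRetry(inner)) // false
	fmt.Println(shouldRetry(outer)) // true: only the outer code is consulted
}
```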

## Impact

- **All Google Cloud Go client libraries** that retry on `Unavailable` are affected when used with the gRPC/xDS transport (DirectPath). This includes Storage, BigQuery, Pub/Sub, and Spanner.
- Storage is the most commonly affected because it defaults to `NewGRPCClient` (with xDS) on GKE.
- The failure mode is silent: no error is logged, and the call simply hangs until the context deadline expires.
- `Unauthenticated` and `PermissionDenied` errors from the credential provider are both affected.

## Correct Behavior per gRPC Spec

Per the [gRPC status code documentation](https://grpc.io/docs/guides/status-codes/):
- `UNAVAILABLE` (14): "transient condition, which can be corrected by retrying" -- **retryable**
- `UNAUTHENTICATED` (16): "does not have valid authentication credentials" -- **not retryable** (requires fixing credentials)

Per [gRFC A6](https://github.com/grpc/proposal/blob/master/A6-client-retries.md): "only status codes that indicate the service did not process the request should be retried."

## Suggested Fixes

**In `grpc-go` (root cause):**

1. `authority.go` should propagate the gRPC status code through the xDS error types rather than formatting with `%v`
2. `pickfirst.go` should use `status.Errorf` to preserve structured error codes
3. `picker_wrapper.go` should inspect inner errors before defaulting to `Unavailable`, or at minimum not assign `Unavailable` when the error chain contains `Unauthenticated`

**In `google-cloud-go/storage` (defense in depth):**

4. `ShouldRetry` could inspect the error message for auth-related keywords (`"Unauthenticated"`, `"PermissionDenied"`, `"iam.serviceAccounts"`) and refuse to retry, similar to the existing `x509` check in `gax-go`
