storage: gRPC Objects() iterator hangs indefinitely when connection drops during paginated listing #14417

@AaronReboot

Description

Bug

Bucket.Objects() using the gRPC transport (storage.NewGRPCClient) hangs forever if the underlying TCP connection dies mid-listing. itr.Next() blocks indefinitely — no error, no timeout. The client cannot detect the dead connection.

The JSON/HTTP transport (storage.NewClient) is not affected.

Reproduction

client, err := storage.NewGRPCClient(ctx)
if err != nil {
    log.Fatal(err)
}
q := storage.Query{Prefix: "prefix-with-many-objects/"}
q.SetAttrSelection([]string{"Name"})

itr := client.Bucket("bucket").Objects(ctx, &q)
for {
    _, err := itr.Next() // blocks forever after connection drops
    if err == iterator.Done {
        break
    }
    if err != nil {
        log.Fatal(err) // never reached
    }
}

Trigger: list a prefix large enough that the paginated listing runs for several minutes. The connection will eventually drop (server GOAWAY from max_connection_age, infrastructure timeout, etc.). After the drop, itr.Next() hangs permanently.

Why it hangs

Two things prevent the client from detecting the dead connection:

1. Per-RPC timeouts are disabled. The gapic layer sets a 60-second timeout on ListObjects:

// storage/internal/apiv2/storage_client.go:224-236
ListObjects: []gax.CallOption{
    gax.WithTimeout(60000 * time.Millisecond),
    // ...
},

But grpc_client.go:84 overrides all gapic timeouts globally:

s.gax = append(s.gax, gax.WithRetry(nil), gax.WithTimeout(0))

gax.Invoke (gax-go/v2/invoke.go:86) treats timeout == 0 as "no timeout." The veneer's run() function (storage/invoke.go) replaces gax retry logic but does not replace the per-RPC timeout. Result: each paginated RPC has no deadline beyond the caller's context.

The global override exists because s.gax applies to all methods — including ReadObject/WriteObject where 60 seconds is too short. ListObjects is collateral damage. Note: the 60-second gapic timeout is per page RPC, not per listing. ListObjects uses paginated unary RPCs via InternalFetch(pageSize, pageToken) (grpc_client.go:547-556), issuing a separate RPC for each page of ~1000 results. A single page should return in well under 60 seconds, so restoring this timeout would not break large listings.

2. gRPC keepalive is not configured. No grpc.WithKeepaliveParams is set anywhere in the storage package. The grpc-go default keepalive time is infinity (grpc-go/internal/transport/defaults.go), so the keepalive goroutine is never started (grpc-go/internal/transport/http2_client.go:269-276). Dead TCP connections are invisible.
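A caller can already inject keepalive via a dial option without waiting for a library fix. This is a configuration sketch that assumes the storage gRPC transport honors option.WithGRPCDialOption; the parameter values are illustrative, not tuned recommendations:

```go
package main

import (
	"context"
	"time"

	"cloud.google.com/go/storage"
	"google.golang.org/api/option"
	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newClientWithKeepalive constructs a gRPC storage client whose transport
// sends HTTP/2 PING frames, so a dead TCP connection is detected and torn
// down instead of remaining invisible.
func newClientWithKeepalive(ctx context.Context) (*storage.Client, error) {
	return storage.NewGRPCClient(ctx, option.WithGRPCDialOption(
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // send a PING after 30s of inactivity
			Timeout:             10 * time.Second, // declare the connection dead if no ACK within 10s
			PermitWithoutStream: false,            // only ping while RPCs are in flight
		}),
	))
}
```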

Without per-RPC timeouts or keepalive, the only protection is the caller's context deadline — which is typically hours for batch workloads.

HTTP transport comparison

The HTTP transport does not have this problem. Both transports use the same run() function (storage/invoke.go:97) for retry logic, and both paginate with fetch(pageSize, pageToken). The structural difference:

HTTP path (http_client.go:347-389): each page calls req.Context(ctx).Do(), which issues an HTTP request with Go's net/http client. HTTP requests have natural timeouts — TCP idle timeouts, HTTP/2 PING frames from the Go standard library, and server-side response deadlines all bound the wait. A stalled connection surfaces as an I/O error, which run() can retry.

gRPC path (grpc_client.go:547-558): each page calls c.raw.ListObjects(ctx, req, s.gax...), which dispatches through gax.Invoke(). With gax.WithTimeout(0) in s.gax, there is no per-RPC deadline. The gRPC transport has no keepalive configured, and grpc-go does not impose its own idle timeout. A stalled connection blocks forever — run() never gets a chance to retry because the call never returns.

The fix is to make the gRPC path behave like HTTP: ensure each per-page RPC is bounded so that a stalled connection surfaces as an error that run() can retry.

Suggested fix

Scope the timeout override to data operations only. gax.WithTimeout(0) is correct for ReadObject/WriteObject but should not apply to metadata operations like ListObjects. Either:

  • Apply gax.WithTimeout(0) per-method (only to ReadObject/WriteObject) instead of globally via s.gax
  • Or implement per-attempt timeouts in the veneer's run() retry loop

Additionally, configuring gRPC keepalive would detect dead connections independently of timeouts.

Workaround

Use storage.NewClient() (JSON/HTTP transport) for listing operations.

Metadata

Labels: api: storage (Issues related to the Cloud Storage API)