storage: gRPC Objects() iterator hangs indefinitely when connection drops during paginated listing #14417

@AaronReboot

Description

Bug

Bucket.Objects() using the gRPC transport (storage.NewGRPCClient) hangs forever if the underlying TCP connection dies mid-listing. itr.Next() blocks indefinitely — no error, no timeout. The client cannot detect the dead connection.

The JSON/HTTP transport (storage.NewClient) is not affected.

Reproduction

client, err := storage.NewGRPCClient(ctx)
if err != nil {
    log.Fatal(err)
}
q := storage.Query{Prefix: "prefix-with-many-objects/"}
q.SetAttrSelection([]string{"Name"})

itr := client.Bucket("bucket").Objects(ctx, &q)
for {
    _, err := itr.Next() // blocks forever after connection drops
    if err == iterator.Done {
        break
    }
    if err != nil {
        log.Fatal(err) // never reached
    }
}

Trigger: list a prefix large enough that the paginated listing runs for several minutes. The connection will eventually drop (server GOAWAY from max_connection_age, infrastructure timeout, etc.). After the drop, itr.Next() hangs permanently.

Why it hangs

Two things prevent the client from detecting the dead connection:

1. Per-RPC timeouts are disabled. The gapic layer sets a 60-second timeout on ListObjects:

// storage/internal/apiv2/storage_client.go:224-236
ListObjects: []gax.CallOption{
    gax.WithTimeout(60000 * time.Millisecond),
    // ...
},

But grpc_client.go:84 overrides all gapic timeouts globally:

s.gax = append(s.gax, gax.WithRetry(nil), gax.WithTimeout(0))

gax.Invoke (gax-go/v2/invoke.go:86) treats timeout == 0 as "no timeout." The veneer's run() function (storage/invoke.go) replaces gax retry logic but does not replace the per-RPC timeout. Result: each paginated RPC has no deadline beyond the caller's context.

The global override exists because s.gax applies to all methods — including ReadObject/WriteObject where 60 seconds is too short. ListObjects is collateral damage. Note: the 60-second gapic timeout is per page RPC, not per listing. ListObjects uses paginated unary RPCs via InternalFetch(pageSize, pageToken) (grpc_client.go:547-556), issuing a separate RPC for each page of ~1000 results. A single page should return in well under 60 seconds, so restoring this timeout would not break large listings.

2. gRPC keepalive is not configured. No grpc.WithKeepaliveParams is set anywhere in the storage package. The grpc-go default keepalive time is infinity (grpc-go/internal/transport/defaults.go), so the keepalive goroutine is never started (grpc-go/internal/transport/http2_client.go:269-276). Dead TCP connections are invisible.
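A caller can already inject keepalive via a dial option without waiting for a library fix. This is a configuration sketch that assumes the storage gRPC transport honors option.WithGRPCDialOption; the parameter values are illustrative, not tuned recommendations:

```go
package main

import (
	"context"
	"time"

	"cloud.google.com/go/storage"
	"google.golang.org/api/option"
	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newClientWithKeepalive constructs a gRPC storage client whose transport
// sends HTTP/2 PING frames, so a dead TCP connection is detected and torn
// down instead of remaining invisible.
func newClientWithKeepalive(ctx context.Context) (*storage.Client, error) {
	return storage.NewGRPCClient(ctx, option.WithGRPCDialOption(
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // send a PING after 30s of inactivity
			Timeout:             10 * time.Second, // declare the connection dead if no ACK within 10s
			PermitWithoutStream: false,            // only ping while RPCs are in flight
		}),
	))
}
```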

Without per-RPC timeouts or keepalive, the only protection is the caller's context deadline — which is typically hours for batch workloads.

HTTP transport comparison

The HTTP transport does not have this problem. Both transports use the same run() function (storage/invoke.go:97) for retry logic, and both paginate with fetch(pageSize, pageToken). The structural difference:

HTTP path (http_client.go:347-389): each page calls req.Context(ctx).Do(), which issues an HTTP request with Go's net/http client. HTTP requests have natural timeouts — TCP idle timeouts, HTTP/2 PING frames from the Go standard library, and server-side response deadlines all bound the wait. A stalled connection surfaces as an I/O error, which run() can retry.

gRPC path (grpc_client.go:547-558): each page calls c.raw.ListObjects(ctx, req, s.gax...), which dispatches through gax.Invoke(). With gax.WithTimeout(0) in s.gax, there is no per-RPC deadline. The gRPC transport has no keepalive configured, and grpc-go does not impose its own idle timeout. A stalled connection blocks forever — run() never gets a chance to retry because the call never returns.

The fix is to make the gRPC path behave like HTTP: ensure each per-page RPC is bounded so that a stalled connection surfaces as an error that run() can retry.

Suggested fix

Scope the timeout override to data operations only. gax.WithTimeout(0) is correct for ReadObject/WriteObject but should not apply to metadata operations like ListObjects. Either:

  • Apply gax.WithTimeout(0) per-method (only to ReadObject/WriteObject) instead of globally via s.gax
  • Or implement per-attempt timeouts in the veneer's run() retry loop

Additionally, configuring gRPC keepalive would detect dead connections independently of timeouts.

Workaround

Use storage.NewClient() (JSON/HTTP transport) for listing operations.

Metadata

Labels: api: storage (Issues related to the Cloud Storage API)