Bug
Bucket.Objects() using the gRPC transport (storage.NewGRPCClient) hangs forever if the underlying TCP connection dies mid-listing. itr.Next() blocks indefinitely — no error, no timeout. The client cannot detect the dead connection.
The JSON/HTTP transport (storage.NewClient) is not affected.
Reproduction
ctx := context.Background()
client, err := storage.NewGRPCClient(ctx)
if err != nil { log.Fatal(err) }
q := storage.Query{Prefix: "prefix-with-many-objects/"}
q.SetAttrSelection([]string{"Name"})
itr := client.Bucket("bucket").Objects(ctx, &q)
for {
    _, err := itr.Next() // blocks forever after the connection drops
    if err == iterator.Done { break }
    if err != nil { log.Fatal(err) } // never reached
}
Trigger: list a prefix large enough that the paginated listing runs for several minutes. The connection will eventually drop (server GOAWAY from max_connection_age, infrastructure timeout, etc.). After the drop, itr.Next() hangs permanently.
Why it hangs
Two things prevent the client from detecting the dead connection:
1. Per-RPC timeouts are disabled. The gapic layer sets a 60-second timeout on ListObjects:
// storage/internal/apiv2/storage_client.go:224-236
ListObjects: []gax.CallOption{
gax.WithTimeout(60000 * time.Millisecond),
// ...
},
But grpc_client.go:84 overrides all gapic timeouts globally:
s.gax = append(s.gax, gax.WithRetry(nil), gax.WithTimeout(0))
gax.Invoke (gax-go/v2/invoke.go:86) treats timeout == 0 as "no timeout." The veneer's run() function (storage/invoke.go) replaces gax retry logic but does not replace the per-RPC timeout. Result: each paginated RPC has no deadline beyond the caller's context.
The global override exists because s.gax applies to all methods — including ReadObject/WriteObject where 60 seconds is too short. ListObjects is collateral damage. Note: the 60-second gapic timeout is per page RPC, not per listing. ListObjects uses paginated unary RPCs via InternalFetch(pageSize, pageToken) (grpc_client.go:547-556), issuing a separate RPC for each page of ~1000 results. A single page should return in well under 60 seconds, so restoring this timeout would not break large listings.
2. gRPC keepalive is not configured. No grpc.WithKeepaliveParams is set anywhere in the storage package. The grpc-go default keepalive time is infinity (grpc-go/internal/transport/defaults.go), so the keepalive goroutine is never started (grpc-go/internal/transport/http2_client.go:269-276). Dead TCP connections are invisible.
Without per-RPC timeouts or keepalive, the only protection is the caller's context deadline — which is typically hours for batch workloads.
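For the keepalive gap specifically, a hypothetical mitigation sketch, assuming storage.NewGRPCClient forwards client options to the underlying gRPC dial (the parameter values here are illustrative, not tested recommendations):

```go
import (
	"context"
	"time"

	"cloud.google.com/go/storage"
	"google.golang.org/api/option"
	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newKeepaliveGRPCClient is a hypothetical constructor that enables client-side
// keepalive so dead TCP connections surface as transport errors.
func newKeepaliveGRPCClient(ctx context.Context) (*storage.Client, error) {
	return storage.NewGRPCClient(ctx, option.WithGRPCDialOption(
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // send a PING after 30s of inactivity
			Timeout:             10 * time.Second, // declare the conn dead if no ack within 10s
			PermitWithoutStream: true,             // probe even between RPCs
		}),
	))
}
```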
HTTP transport comparison
The HTTP transport does not have this problem. Both transports use the same run() function (storage/invoke.go:97) for retry logic, and both paginate with fetch(pageSize, pageToken). The structural difference:
HTTP path (http_client.go:347-389): each page calls req.Context(ctx).Do(), which issues an HTTP request with Go's net/http client. HTTP requests have natural timeouts — TCP idle timeouts, HTTP/2 PING frames from the Go standard library, and server-side response deadlines all bound the wait. A stalled connection surfaces as an I/O error, which run() can retry.
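For comparison, the HTTP/2 PING behavior on the HTTP side is tunable. A sketch using golang.org/x/net/http2 (an assumption about how one would enable it explicitly; ReadIdleTimeout triggers a PING after idle reads, PingTimeout bounds the wait for the ack):

```go
import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newPingingTransport configures an HTTP transport whose HTTP/2 connections
// are health-checked with PING frames, so a dead connection is closed and
// the next request fails fast instead of hanging.
func newPingingTransport() (*http.Transport, error) {
	tr := &http.Transport{}
	h2, err := http2.ConfigureTransports(tr) // upgrades tr's HTTP/2 support in place
	if err != nil {
		return nil, err
	}
	h2.ReadIdleTimeout = 30 * time.Second // send a PING after 30s with no frames read
	h2.PingTimeout = 10 * time.Second     // close the conn if the ack never arrives
	return tr, nil
}
```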
gRPC path (grpc_client.go:547-558): each page calls c.raw.ListObjects(ctx, req, s.gax...) → gax.Invoke(). With gax.WithTimeout(0) in s.gax, there is no per-RPC deadline. The gRPC transport has no keepalive configured, and grpc-go does not impose its own idle timeout. A stalled connection blocks forever — run() never gets a chance to retry because the call never returns.
The fix is to make the gRPC path behave like HTTP: ensure each per-page RPC is bounded so that a stalled connection surfaces as an error that run() can retry.
Suggested fix
Scope the timeout override to data operations only. gax.WithTimeout(0) is correct for ReadObject/WriteObject but should not apply to metadata operations like ListObjects. Either:
- Apply gax.WithTimeout(0) per-method (only to ReadObject/WriteObject) instead of globally via s.gax
- Or implement per-attempt timeouts in the veneer's run() retry loop
Additionally, configuring gRPC keepalive would detect dead connections independently of timeouts.
Workaround
Use storage.NewClient() (JSON/HTTP transport) for listing operations.