Description
We observed a set of issues when using rueidis v1.0.69 with a Redis Cluster (v7.4).
During a failover in one shard of the cluster, the application experienced an increase in Redis request latency, followed by multiple client-side errors. No application code changes were deployed during this time.
The following issues were observed:
- Intermittent transaction errors:
    EXEC was aborted by redis or connection closed
- Redis protocol parse errors:
    rueidis: parse error: redis message type simple string is not a array
    rueidis: parse error: redis message type array is not a string
    EOF
    dial tcp <ip>:<port>: operation was canceled
Once the parse errors started, the client did not recover and required restarting application pods.
Panic observed during cluster refresh
In the same time window, we observed a panic originating from the cluster topology refresh path.
panic: runtime error: index out of range [2] with length 0

goroutine 258842186 [running]:
github.com/redis/rueidis.parseSlots(...)
	/go/pkg/mod/github.com/redis/rueidis@v1.0.69/cluster.go:378
github.com/redis/rueidis.clusterslots.parse(...)
	/go/pkg/mod/github.com/redis/rueidis@v1.0.69/cluster.go:378
github.com/redis/rueidis.(*clusterClient)._refresh(...)
	/go/pkg/mod/github.com/redis/rueidis@v1.0.69/cluster.go:167
github.com/redis/rueidis.(*call).do(...)
	/go/pkg/mod/github.com/redis/rueidis@v1.0.69/cluster.go:217
github.com/redis/rueidis.(*call).LazyDo.func1(...)
	/go/pkg/mod/github.com/redis/rueidis@v1.0.69/singleflight.go:58
github.com/redis/rueidis.(*call).LazyDo(...)
	/go/pkg/mod/github.com/redis/rueidis@v1.0.69/singleflight.go:53
Client Code
- Connection dial timeout: default
- Connection write timeout: default
Client Initialization
client, err := rueidisotel.NewClient(
	rueidis.ClientOption{
		InitAddress:       conf.NodeAddresses,
		Username:          conf.UserName,
		Password:          conf.Password,
		DisableCache:      conf.DisableCache,
		CacheSizeEachConn: cacheSizeEachConnection,
	},
)
if err != nil {
	return err
}
if err := client.Do(
	context.Background(),
	client.B().Ping().Build(),
).Error(); err != nil {
	return err
}
Client Usage
func (c *Cache) MGetWithClientSideCache(
	ctx context.Context,
	keys []string,
) (map[string]string, error) {
	results, err := rueidis.MGetCache(
		c.client,
		ctx,
		30*time.Second,
		keys,
	)
	if err != nil {
		return nil, err
	}
	finalResult := make(map[string]string, len(results))
	for key, result := range results {
		value, err := result.ToString()
		if err != nil && !rueidis.IsRedisNil(err) {
			return nil, err
		}
		finalResult[key] = value
	}
	return finalResult, nil
}
Environment
- Rueidis v1.0.69
- Go 1.23.0
- Redis Cluster v7.4
- Auto-Pipelining enabled
Questions
- When does rueidis refresh the cluster topology?
  If no refresh interval is configured, what situations cause a topology refresh to happen automatically?
- Is it normal to see EXEC abort errors during a shard failover?
  Should applications expect these errors during cluster changes?
- What situations can lead to Redis protocol parse errors in rueidis?
  For example, can they happen due to connection interruptions or cluster changes?
- After a protocol parse error, should the client recover on its own or be restarted?