feat: Add AWS Valkey (ElastiCache) support to the stack #389

@gandalf-at-lerian

Description


Background & Context

Why Valkey

Valkey is an open-source, Apache 2.0-licensed fork of Redis 7.x, created after Redis Ltd. changed Redis's license to non-OSS (SSPL) in March 2024. Valkey is maintained by the Linux Foundation and backed by AWS, Google, Oracle, and others. It is wire-compatible with Redis 7, meaning existing go-redis/v9 clients work without code changes.

Valkey 8.x key improvements over Redis 7:

  • 36–91% lower latency in I/O-bound workloads (I/O threads redesign)
  • Active memory defragmentation improvements reducing memory bloat by up to 15%
  • Official clustering improvements reducing split-brain scenarios
  • First-class support for WAIT and WAITAOF in cluster mode
  • Continued development of the modules API (which Redis is moving away from)

Why AWS ElastiCache (Managed) over Self-hosted

Dimension       Managed ElastiCache                   Self-hosted (e.g., Helm chart on EKS)
Failover        Automatic (<60s)                      Manual or operator-managed
Persistence     Snapshots + backups out of the box    Custom setup required
Patching        AWS applies security patches          Manual
Networking      VPC-native, SG-based isolation        Pod network, harder to restrict
Cost at scale   Predictable                           Underestimated (ops + k8s overhead)
Observability   CloudWatch metrics built-in           Prometheus exporter + dashboards
TLS             Managed certs, in-transit encryption  Self-signed or cert-manager

Redis Protocol Compatibility

ElastiCache Valkey speaks the Redis protocol. The existing go-redis/v9 usage in lib-commons/commons/redis and in pool-manager requires zero client-side changes to connect to ElastiCache Valkey. The lib-commons redis.Config Topology.Cluster field already supports cluster-mode addresses.


Current State (Based on Repo Analysis)

pool-manager (module: github.com/LerianStudio/tenant-manager)

Tech stack:

  • Go 1.25.7, Fiber v2, OpenTelemetry
  • Storage: MongoDB 7.0 (primary DB, via DocumentDB on AWS)
  • Cache/Lock: Valkey (Redis-compatible) — already in use via redis/go-redis/v9 v9.18.0
  • Messaging: RabbitMQ (via rabbitmq/amqp091-go)
  • Auth: WorkOS + JWT
  • AWS SDKs: CloudFormation, DocDB, EC2, RDS, S3, SecretsManager (full AWS orchestration)

Current Valkey usage (ACTIVE, single-node, standalone topology):

The service already uses Valkey in docker-compose (valkey/valkey:8-alpine) and in .env.example:

REDIS_HOST=valkey
REDIS_PORT=6379
REDIS_USERNAME=
REDIS_PASSWORD=
REDIS_DB=0
REDIS_TLS=false

Use cases confirmed in source code:

  1. API key caching (internal/adapters/http/middleware/apikey.go): Redis GET to cache API key validation results, reducing MongoDB round-trips per request
  2. Rate limiting (internal/adapters/http/middleware/ratelimit.go): Redis-backed per-tenant/per-tier rate limiter
  3. Idempotency (internal/adapters/http/middleware/idempotency.go): Redis SET NX for request deduplication (TTL 300s)
  4. Multi-tenant config caching (internal/bootstrap/wire_infra_redis.go): SecretsCache backed by Redis, TTL driven by MULTI_TENANT_CACHE_TTL (default 24h)
  5. Cache invalidation API (internal/adapters/http/handler/cache_handler.go): SCAN + DEL for pattern-based cache clearing
  6. Tenant settings caching (internal/adapters/http/handler/tenant_service_handler.go): Redis-backed tenant settings with TTL

Connection pattern: lib-commons/v4/commons/redis.Client (standalone topology) OR direct go-redis.Client when REDIS_USERNAME is set (ACL auth path). Both paths ping Valkey at startup.

AWS resources in use:

  • AWS CloudFormation (tenant infrastructure provisioning)
  • AWS DocumentDB (MongoDB-compatible, for tenant DBs)
  • AWS RDS (PostgreSQL provisioning for tenants)
  • AWS S3 (migration files, Casdoor templates)
  • AWS Secrets Manager (credentials for all tenant clusters)
  • AWS EC2 (cluster management)

No ElastiCache yet. Valkey is self-hosted (Docker Compose locally, deployment method in production TBD).

backoffice-console (module: Next.js/TypeScript monorepo)

Tech stack:

  • Next.js 14 (App Router), TypeScript, Turborepo
  • No direct Redis dependency — it is a pure frontend/BFF
  • Auth: WorkOS (cookie-based sessions)

Cache interaction: The backoffice-console has a CacheRepository (apps/backoffice/src/infra/repositories/cache-repository.impl.ts) that calls the pool-manager API (/cache, /cache/keys, /cache/pattern) to view and invalidate Valkey cache entries. It has no direct Redis connection.

No Redis/cache infrastructure changes needed in backoffice-console beyond updating environment variables that point to the pool-manager API.

lib-commons (module: github.com/LerianStudio/lib-commons/v4)

Tech stack:

  • Go 1.25.7, shared library used by all Lerian services
  • Already has redis/go-redis/v9 v9.18.0
  • Already has go-redsync/redsync/v4 (distributed locks)
  • Already has alicebob/miniredis/v2 (in-memory Redis for tests)

Existing packages:

  • commons/redis/: Full Redis client wrapper (Client struct) supporting Standalone, Sentinel, and Cluster topologies, TLS, GCP IAM auth, circuit breaker, reconnection, OpenTelemetry metrics. Already cluster-mode capable.
  • commons/tenant-manager/cache/: ConfigCache interface (in-memory, process-local) for tenant config caching
  • commons/tenant-manager/valkey/: Key helpers for tenant-namespaced keys (tenant:{tenantID}:{key})

Gap: The commons/redis package wraps go-redis.UniversalClient and supports cluster mode via ClusterTopology, but there is no higher-level CacheClient interface abstraction that services can depend on without importing go-redis directly. Services currently couple to redis.UniversalClient from go-redis. A clean CacheClient interface in lib-commons would decouple services and make testing trivial.


Proposed Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        AWS EKS Cluster                          │
│                                                                  │
│  ┌─────────────────────┐    ┌──────────────────────────────┐   │
│  │    pool-manager      │    │      backoffice-console       │   │
│  │  (Go / Fiber)        │    │  (Next.js / TypeScript)       │   │
│  │                      │    │                               │   │
│  │  Rate Limiter ──┐    │    │  CacheRepository ─── HTTP ───┼──►│
│  │  API Key Cache ─┼──► │    │  (calls pool-manager /cache)  │   │
│  │  Idempotency ───┘    │    └──────────────────────────────┘   │
│  │  Settings Cache      │                                        │
│  │         │            │                                        │
│  │  lib-commons/redis   │                                        │
│  │  CacheClient (new)   │                                        │
│  └────────────┬─────────┘                                        │
│               │  TLS / port 6379 (single) or 6380 (cluster)     │
│               │  SG: allow EKS node CIDR → ElastiCache SG       │
│               ▼                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │         AWS ElastiCache (Valkey 8)                       │    │
│  │                                                          │    │
│  │  Staging:    Single Node (cache.t4g.small)               │    │
│  │  Production: Cluster Mode (3 shards × 2 nodes each)      │    │
│  └─────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘

Legend:
  → HTTP proxy (backoffice → pool-manager)
  ▼ Redis protocol (TLS, go-redis UniversalClient)

Option A: Single Instance (Standalone / Replication Group with 1 shard)

Configuration

  • ElastiCache node type: cache.t4g.small (2 vCPU, 1.37 GB RAM) or cache.r7g.large for production-grade single
  • Replication group: 1 primary + 1 optional read replica (for HA without cluster mode)
  • Failover: AWS promotes replica automatically (~60s downtime window)
  • Endpoint: Single primary endpoint (e.g., my-valkey.xxxxx.use2.cache.amazonaws.com:6379)

Client configuration

// Single instance — existing lib-commons redis.Config works unchanged
cfg := redis.Config{
    Topology: redis.Topology{
        Standalone: &redis.StandaloneTopology{
            Address: os.Getenv("CACHE_ADDRS"), // "host:6379"
        },
    },
}

When to use

  • Development / Staging environments — simpler ops, lower cost
  • Single-region deployments where HA is not critical
  • Low-throughput workloads (<10k ops/sec sustained)

Pros

  • ✅ Zero client-side sharding logic
  • ✅ All Redis commands work (including SCAN, KEYS, MULTI/EXEC without cross-slot issues)
  • ✅ Cheaper (~$25–50/month for t4g.small)
  • ✅ Existing lib-commons StandaloneTopology works as-is

Cons

  • ❌ Single point of failure (until replica promotes, ~60s gap)
  • ❌ Vertical scaling only (instance resize = brief downtime)
  • ❌ Max ~26 GB RAM per node (r7g.2xlarge)

Approximate cost (us-east-2)

  • cache.t4g.small: ~$24/month (on-demand), ~$15/month (reserved 1yr)
  • cache.r7g.large (recommended staging): ~$110/month

Option B: Cluster Mode (ElastiCache Valkey Cluster)

Configuration

  • Topology: N shards, each with 1 primary + 1 replica
  • Recommended production: 3 shards × 2 nodes = 6 nodes total
  • Endpoints: Cluster configuration endpoint (e.g., my-valkey-cluster.xxxxx.clustercfg.use2.cache.amazonaws.com:6379)
  • go-redis client: ClusterClient (or UniversalClient with cluster addresses)

Client configuration

// Cluster mode — lib-commons redis.Config ClusterTopology
cfg := redis.Config{
    Topology: redis.Topology{
        Cluster: &redis.ClusterTopology{
            Addresses: strings.Split(os.Getenv("CACHE_ADDRS"), ","),
        },
    },
}
// go-redis UniversalClient auto-detects cluster when >1 address is given
// or when Cluster topology is selected.

When to use

  • Production environments requiring high availability
  • High-throughput workloads (>10k ops/sec)
  • Data sets >26 GB that must be horizontally sharded

Pros

  • ✅ Horizontal scaling (add shards without downtime)
  • ✅ Automatic failover per shard (<30s)
  • ✅ Total memory = N shards × node RAM (e.g., 3 × r7g.large ≈ 39 GB usable on primaries)
  • ✅ lib-commons ClusterTopology already implemented

Cons

  • ❌ Multi-key operations (MGET, DEL with multiple keys) require all keys to hash to the same slot
  • ❌ Transactions (MULTI/EXEC) only work within a single slot
  • ❌ SCAN iterates one node at a time — the SCAN pattern-delete in pool-manager must iterate all nodes
  • ❌ Higher cost (~3× single instance for 3 shards)
  • ❌ Key space must use {} hash tags for cross-key operations on the same slot

Key space constraint

Pool-manager currently uses SCAN for cache invalidation. In cluster mode, SCAN only covers one shard. The invalidation logic must be updated to scan all cluster nodes. go-redis ClusterClient.ForEachMaster is the correct API.

Approximate cost (us-east-2, 3 shards × 1 replica each)

  • cache.r7g.large × 6: ~$660/month (on-demand), ~$420/month (reserved 1yr)
  • cache.t4g.medium × 6 (smaller prod): ~$180/month

Recommendation

Staging: Option A (Single Instance, cache.r7g.large, 1 replica)
Production: Option B (Cluster Mode, 3 shards × cache.r7g.large, 1 replica each)

Rationale: pool-manager's primary use cases (API key caching, rate limiting, idempotency, settings cache) are read-heavy with small values. The key space is tenant-namespaced (tenant:{tenantID}:{key}) which maps cleanly to cluster sharding via hash tags if needed. The existing SCAN-based invalidation must be updated for cluster mode, but that is a one-time fix in cache_handler.go. The lib-commons/commons/redis package already supports both topologies via redis.UniversalClient — no new library code needed for connectivity.


Changes Required in lib-commons

New package: commons/cache

Create a clean CacheClient interface that services depend on instead of redis.UniversalClient. This enables:

  • Easy mocking in unit tests (no miniredis needed for simple tests)
  • Pluggable implementations (in-process, Valkey, future: DynamoDB DAX)
  • Decoupling services from go-redis types

File: commons/cache/cache.go

// Copyright (c) 2026 Lerian Studio. All rights reserved.
// Use of this source code is governed by the Elastic License 2.0
// that can be found in the LICENSE file.

// Package cache provides a unified interface for distributed cache operations
// backed by Valkey (Redis-compatible) via AWS ElastiCache.
package cache

import (
    "context"
    "time"
)

// CacheMode selects the Valkey deployment topology.
type CacheMode string

const (
    // SingleInstance connects to a single Valkey node or replication group primary endpoint.
    SingleInstance CacheMode = "single"
    // Cluster connects to a Valkey cluster via cluster configuration endpoint.
    Cluster CacheMode = "cluster"
)

// CacheClient defines the contract for cache operations.
// All implementations must be safe for concurrent use by multiple goroutines.
//
// Available implementations:
//   - NewCacheClient(cfg CacheConfig): factory for SingleInstance and Cluster modes
//   - MockCacheClient: test double for unit tests (use go:generate with mockgen)
type CacheClient interface {
    // Get retrieves a string value by key.
    // Returns ErrCacheMiss if the key does not exist or has expired.
    Get(ctx context.Context, key string) (string, error)

    // Set stores a value with the given TTL.
    // A zero TTL means the key never expires.
    Set(ctx context.Context, key string, value interface{}, ttl time.Duration) error

    // Del removes one or more keys. Returns nil if keys do not exist.
    Del(ctx context.Context, keys ...string) error

    // Exists reports how many of the given keys exist in the cache.
    Exists(ctx context.Context, keys ...string) (int64, error)

    // Incr atomically increments an integer value by 1.
    Incr(ctx context.Context, key string) (int64, error)

    // Expire sets (or resets) the TTL on an existing key.
    Expire(ctx context.Context, key string, ttl time.Duration) error

    // Close releases all resources held by this client.
    Close() error
}

// CacheConfig configures a CacheClient.
type CacheConfig struct {
    // Addrs is the list of Valkey addresses.
    //   SingleInstance: ["host:port"]
    //   Cluster:        ["host1:port", "host2:port", ...]  (cluster cfg endpoint or seed nodes)
    Addrs []string

    // Password is the Valkey AUTH password (or ACL password when Username is also set).
    Password string

    // Username is the Valkey ACL username. Leave empty for default AUTH.
    Username string

    // TLSEnabled enables TLS for the connection to ElastiCache.
    TLSEnabled bool

    // CACertBase64 is the Base64-encoded PEM CA certificate for TLS verification.
    // Required when TLSEnabled is true and using a custom CA (e.g., ElastiCache in-transit).
    CACertBase64 string

    // Mode selects SingleInstance or Cluster topology.
    Mode CacheMode

    // PoolSize is the maximum number of connections per node.
    // Defaults to 10 when zero.
    PoolSize int

    // MaxRetries is the maximum number of retries on command failure.
    // Defaults to 3 when zero.
    MaxRetries int

    // DialTimeout is the timeout for establishing a connection.
    // Defaults to 5s when zero.
    DialTimeout time.Duration

    // ReadTimeout is the timeout for socket reads.
    // Defaults to 3s when zero.
    ReadTimeout time.Duration

    // WriteTimeout is the timeout for socket writes.
    // Defaults to 3s when zero.
    WriteTimeout time.Duration
}

// ErrCacheMiss is returned by Get when the key does not exist or has expired.
var ErrCacheMiss = errCacheMiss("cache miss")

type errCacheMiss string

func (e errCacheMiss) Error() string { return string(e) }

File: commons/cache/valkey.go — SingleInstance implementation:

package cache

import (
    "context"
    "errors"
    "fmt"
    "time"

    goredis "github.com/redis/go-redis/v9"
)

type valkeyClient struct {
    client goredis.UniversalClient
}

// NewCacheClient returns a CacheClient backed by Valkey (AWS ElastiCache or local).
// It uses go-redis UniversalClient which transparently handles both single and cluster modes.
func NewCacheClient(cfg CacheConfig) (CacheClient, error) {
    if len(cfg.Addrs) == 0 {
        return nil, fmt.Errorf("cache: at least one address is required")
    }

    opts := &goredis.UniversalOptions{
        Addrs:        cfg.Addrs,
        Password:     cfg.Password,
        Username:     cfg.Username,
        PoolSize:     cfg.PoolSize,
        MaxRetries:   cfg.MaxRetries,
        DialTimeout:  cfg.DialTimeout,
        ReadTimeout:  cfg.ReadTimeout,
        WriteTimeout: cfg.WriteTimeout,
    }

    if cfg.Mode == Cluster {
        // Ensure a cluster client even when only the single cluster
        // configuration endpoint is supplied (UniversalOptions.IsClusterMode,
        // available since go-redis v9.5).
        opts.IsClusterMode = true
    }

    if cfg.TLSEnabled {
        tlsCfg, err := buildTLSConfig(cfg.CACertBase64)
        if err != nil {
            return nil, fmt.Errorf("cache: failed to build TLS config: %w", err)
        }
        opts.TLSConfig = tlsCfg
    }

    client := goredis.NewUniversalClient(opts)

    if err := client.Ping(context.Background()).Err(); err != nil {
        _ = client.Close()
        return nil, fmt.Errorf("cache: failed to connect to Valkey at %v: %w", cfg.Addrs, err)
    }

    return &valkeyClient{client: client}, nil
}

func (v *valkeyClient) Get(ctx context.Context, key string) (string, error) {
    val, err := v.client.Get(ctx, key).Result()
    if errors.Is(err, goredis.Nil) {
        return "", ErrCacheMiss
    }
    return val, err
}

func (v *valkeyClient) Set(ctx context.Context, key string, value interface{}, ttl time.Duration) error {
    return v.client.Set(ctx, key, value, ttl).Err()
}

func (v *valkeyClient) Del(ctx context.Context, keys ...string) error {
    return v.client.Del(ctx, keys...).Err()
}

func (v *valkeyClient) Exists(ctx context.Context, keys ...string) (int64, error) {
    return v.client.Exists(ctx, keys...).Result()
}

func (v *valkeyClient) Incr(ctx context.Context, key string) (int64, error) {
    return v.client.Incr(ctx, key).Result()
}

func (v *valkeyClient) Expire(ctx context.Context, key string, ttl time.Duration) error {
    return v.client.Expire(ctx, key, ttl).Err()
}

func (v *valkeyClient) Close() error {
    return v.client.Close()
}

Generate mock:

go generate ./commons/cache/...
# //go:generate go run go.uber.org/mock/mockgen -source=cache.go -destination=mock_cache.go -package=cache

go.mod: No new dependency needed — github.com/redis/go-redis/v9 is already in go.mod.


Changes Required in pool-manager

Wire up cache.CacheClient

Currently pool-manager uses redis.UniversalClient directly from go-redis. This should be replaced with the lib-commons cache.CacheClient interface for testability.

internal/bootstrap/wire.go — add CacheClient field to Application:

CacheClient lcache.CacheClient // lib-commons cache interface

internal/bootstrap/wire_infra_redis.go — after connecting, wrap in CacheClient:

import lcache "github.com/LerianStudio/lib-commons/v4/commons/cache"

// After obtaining redisClient (UniversalClient), wrap it.
// (NewCacheClientFromUniversal is a small adapter that would be added to
// commons/cache alongside NewCacheClient.)
app.CacheClient, err = lcache.NewCacheClientFromUniversal(redisClient)

Or simply pass CacheConfig derived from env vars and call lcache.NewCacheClient(cfg) directly.

Specific use cases to wire:

Component       File                               Current                After
API key cache   middleware/apikey.go               redis.UniversalClient  cache.CacheClient
Rate limiter    middleware/ratelimit.go            redis.UniversalClient  cache.CacheClient
Idempotency     middleware/idempotency.go          redis.UniversalClient  cache.CacheClient
Settings cache  handler/tenant_service_handler.go  redis.UniversalClient  cache.CacheClient
Cache handler   handler/cache_handler.go           direct SCAN/DEL        keep redis.UniversalClient for SCAN (cluster: ForEachMaster)

Note on SCAN in cluster mode: The cache invalidation handler uses SCAN + DEL to clear patterns. In cluster mode, SCAN only covers the connected shard. The handler must be updated to use ClusterClient.ForEachMaster when cluster mode is active. This is the only code change required specifically for cluster mode compatibility.

Helm values update (charts/pool-manager/values.yaml):

env:
  # Existing Redis config (replace with CACHE_* vars)
  CACHE_MODE: "single"
  CACHE_ADDRS: ""        # populated from k8s Secret
  CACHE_PASSWORD: ""     # populated from k8s Secret
  CACHE_TLS_ENABLED: "true"
  CACHE_POOL_SIZE: "10"
  CACHE_MAX_RETRIES: "3"

envFrom:
  - secretRef:
      name: valkey-credentials   # k8s Secret with CACHE_ADDRS, CACHE_PASSWORD

Changes Required in backoffice-console

The backoffice-console has no direct Redis connection. It calls the pool-manager API to manage cache entries (/cache, /cache/keys, /cache/pattern). No Redis-specific changes are required.

Only changes needed:

  • Update API_URL / NEXT_PUBLIC_TENANT_MANAGER_API_URL in Helm values to point to the correct pool-manager service endpoint (no change in logic, just confirming the endpoint is reachable)
  • Verify /cache/keys list endpoint works correctly when pool-manager is connected to ElastiCache (SCAN pagination — confirm cursor handling is compatible)

Helm values update (charts/backoffice-console/values.yaml — no cache vars needed, just confirm):

env:
  NEXT_PUBLIC_TENANT_MANAGER_API_URL: "https://api.your-domain.com"
  # No CACHE_* vars needed — console is a pure frontend

Infrastructure / Helm

Kubernetes Secret

Create a valkey-credentials Secret in each namespace (staging, production):

apiVersion: v1
kind: Secret
metadata:
  name: valkey-credentials
  namespace: pool-manager
type: Opaque
stringData:
  CACHE_ADDRS: "my-valkey.xxxxx.use2.cache.amazonaws.com:6379"
  CACHE_PASSWORD: "<auth-token-from-secrets-manager>"

Reference in Helm values:

envFrom:
  - secretRef:
      name: valkey-credentials

Terraform Module

Use the AWS terraform-aws-modules/elasticache/aws module:

module "valkey_staging" {
  source  = "terraform-aws-modules/elasticache/aws"
  version = "~> 1.0"

  cluster_id           = "lerian-valkey-staging"
  engine               = "valkey"
  engine_version       = "8.0"
  node_type            = "cache.r7g.large"
  num_cache_nodes      = 1
  parameter_group_name = "default.valkey8"
  port                 = 6379

  subnet_ids         = module.vpc.elasticache_subnets
  security_group_ids = [aws_security_group.valkey.id]

  at_rest_encryption_enabled  = true
  transit_encryption_enabled  = true
  auth_token                  = random_password.valkey_auth.result
  auth_token_update_strategy  = "ROTATE"

  automatic_failover_enabled = false  # staging: single node
}

module "valkey_production" {
  source  = "terraform-aws-modules/elasticache/aws"
  version = "~> 1.0"

  replication_group_id       = "lerian-valkey-prod"
  description                = "Lerian Valkey production cluster"
  engine                     = "valkey"
  engine_version             = "8.0"
  node_type                  = "cache.r7g.large"
  num_node_groups            = 3   # 3 shards
  replicas_per_node_group    = 1   # 1 replica per shard

  cluster_mode_enabled        = true
  automatic_failover_enabled  = true
  multi_az_enabled            = true

  subnet_ids         = module.vpc.elasticache_subnets
  security_group_ids = [aws_security_group.valkey.id]

  at_rest_encryption_enabled  = true
  transit_encryption_enabled  = true
  auth_token                  = random_password.valkey_auth_prod.result
}

Security Group

resource "aws_security_group" "valkey" {
  name   = "lerian-valkey-sg"
  vpc_id = module.vpc.vpc_id

  ingress {
    description     = "Valkey from EKS nodes"
    from_port       = 6379
    to_port         = 6380  # 6380 for cluster TLS
    protocol        = "tcp"
    security_groups = [module.eks.node_security_group_id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Environment Variables

Complete Reference

Variable              Description                              Required
CACHE_MODE            Topology: single or cluster              Yes
CACHE_ADDRS           Comma-separated host:port list           Yes
CACHE_PASSWORD        Valkey AUTH token                        Yes (ElastiCache)
CACHE_USERNAME        Valkey ACL username (optional)           No
CACHE_TLS_ENABLED     Enable TLS (required for ElastiCache)    Yes (ElastiCache)
CACHE_CA_CERT_BASE64  Base64 PEM CA cert for TLS verification  No (uses system CA by default)
CACHE_POOL_SIZE       Connection pool size per node            No (default: 10)
CACHE_MAX_RETRIES     Max retries on transient errors          No (default: 3)
CACHE_DIAL_TIMEOUT    Connection timeout (e.g., 5s)            No (default: 5s)
CACHE_READ_TIMEOUT    Read timeout (e.g., 3s)                  No (default: 3s)
CACHE_WRITE_TIMEOUT   Write timeout (e.g., 3s)                 No (default: 3s)

Examples by Environment

Local development (docker-compose):

CACHE_MODE=single
CACHE_ADDRS=localhost:6379
CACHE_PASSWORD=
CACHE_TLS_ENABLED=false
CACHE_POOL_SIZE=5

Staging (ElastiCache single node):

CACHE_MODE=single
CACHE_ADDRS=lerian-valkey-staging.xxxxx.use2.cache.amazonaws.com:6379
CACHE_PASSWORD=<from-secrets-manager>
CACHE_TLS_ENABLED=true
CACHE_POOL_SIZE=10
CACHE_MAX_RETRIES=3

Production (ElastiCache cluster):

CACHE_MODE=cluster
CACHE_ADDRS=lerian-valkey-prod.xxxxx.clustercfg.use2.cache.amazonaws.com:6379
CACHE_PASSWORD=<from-secrets-manager>
CACHE_TLS_ENABLED=true
CACHE_POOL_SIZE=20
CACHE_MAX_RETRIES=3
CACHE_DIAL_TIMEOUT=5s
CACHE_READ_TIMEOUT=3s
CACHE_WRITE_TIMEOUT=3s

Local Development

Add to docker-compose.yml in pool-manager (already done — using valkey/valkey:8-alpine). For other services that adopt the cache client, use:

services:
  valkey:
    image: valkey/valkey:8
    container_name: valkey
    ports:
      - "6379:6379"
    command: ["valkey-server", "--appendonly", "yes"]
    volumes:
      - valkey_data:/data
    healthcheck:
      test: ["CMD", "valkey-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  valkey_data:

For local cluster simulation (optional, for cluster-mode testing — note that a functional cluster needs at least three primary nodes plus a valkey-cli --cluster create step; the single container below only enables cluster mode):

  valkey-cluster:
    image: valkey/valkey:8
    command: >
      valkey-server
      --cluster-enabled yes
      --cluster-config-file nodes.conf
      --cluster-node-timeout 5000
      --appendonly yes
    ports:
      - "7000-7005:7000-7005"

Migration Path

Current situation

pool-manager already uses Valkey (self-hosted via docker-compose locally). The production deployment method was not found in the cloned repo (no Helm charts in the repo — likely in a separate gitops/infra repo). The migration path assumes moving from self-hosted (container) to managed ElastiCache.

Steps

  1. Audit existing data — Valkey is used for ephemeral cache only (API keys, rate limits, idempotency, settings). All data is reconstructible. No data migration needed.

  2. Provision ElastiCache (staging) via Terraform. Verify connectivity from EKS nodes using a debug pod:

    kubectl run valkey-test --image=valkey/valkey:8 --rm -it -- \
      valkey-cli -h <elasticache-endpoint> -p 6379 -a <auth-token> --tls ping
  3. Update Helm values for pool-manager in staging to use new CACHE_* env vars pointing to ElastiCache.

  4. Deploy and validate — monitor:

    • checks["redis"] in /health/ready endpoint
    • Rate limiter functionality (smoke test: hit an endpoint >RATE_LIMIT_MAX times)
    • Idempotency (replay same request ID twice, expect 200 on replay)
    • Cache hit metrics in CloudWatch (ElastiCache CacheHits, CacheMisses)
  5. Switch production — deploy production ElastiCache (cluster mode), update Helm values, deploy.

  6. Decommission self-hosted — remove old Valkey container/pod from infrastructure.

Env var renaming

The existing REDIS_* env vars in pool-manager will be replaced by CACHE_* vars. During the transition, pool-manager can support both for one release cycle by checking CACHE_ADDRS first, falling back to REDIS_HOST:REDIS_PORT if absent.


Testing Requirements

Unit Tests (lib-commons commons/cache)

  • Use the generated MockCacheClient (mockgen) to test all callers without a real server
  • Test factory function NewCacheClient with invalid configs (empty addrs, bad TLS cert)
  • Test ErrCacheMiss is returned correctly when go-redis returns Nil
  • Coverage requirement: ≥ 80%

// Example mock usage in pool-manager tests
func TestAPIKeyMiddleware_CacheHit(t *testing.T) {
    ctrl := gomock.NewController(t)
    defer ctrl.Finish()
    
    mockCache := cachemock.NewMockCacheClient(ctrl)
    mockCache.EXPECT().Get(gomock.Any(), "apikey:sha256:abc123").Return(`{"valid":true}`, nil)
    
    // ... test middleware
}

Integration Tests

Use testcontainers-go with a real Valkey container:

func TestCacheClient_Integration(t *testing.T) {
    ctx := context.Background()
    
    req := testcontainers.ContainerRequest{
        Image:        "valkey/valkey:8",
        ExposedPorts: []string{"6379/tcp"},
        WaitingFor:   wait.ForLog("Ready to accept connections"),
    }
    
    container, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
        ContainerRequest: req,
        Started: true,
    })
    require.NoError(t, err)
    defer container.Terminate(ctx)
    
    host, _ := container.Host(ctx)
    port, _ := container.MappedPort(ctx, "6379")
    
    client, err := cache.NewCacheClient(cache.CacheConfig{
        Addrs: []string{host + ":" + port.Port()},
        Mode:  cache.SingleInstance,
    })
    require.NoError(t, err)
    defer client.Close()
    
    // Test Get/Set/Del/Exists/Incr/Expire
}

Cluster Mode Tests

Use a 3-node Valkey cluster container (or testcontainers-go Redis cluster module) to verify:

  • ForEachMaster SCAN works across all shards
  • Key routing works correctly for tenant-namespaced keys
  • Exists with keys on different shards returns correct count

Definition of Done

  • lib-commons: commons/cache package created with CacheClient interface (cache.go)
  • lib-commons: CacheMode type (SingleInstance | Cluster) defined
  • lib-commons: CacheConfig struct defined with all fields documented
  • lib-commons: NewCacheClient(cfg CacheConfig) (CacheClient, error) factory implemented using go-redis UniversalClient
  • lib-commons: SingleInstance mode working with standalone ElastiCache endpoint
  • lib-commons: Cluster mode working with cluster configuration endpoint
  • lib-commons: ErrCacheMiss sentinel error defined and returned by Get on cache miss
  • lib-commons: MockCacheClient generated via go:generate mockgen
  • lib-commons: Unit tests with mock, coverage ≥ 80%
  • lib-commons: Integration tests with real Valkey container (testcontainers-go)
  • lib-commons: go.mod unchanged (go-redis/v9 already present)
  • lib-commons: CHANGELOG and MIGRATION_MAP updated
  • pool-manager: Application.CacheClient field added (type cache.CacheClient)
  • pool-manager: initRedis updated to call cache.NewCacheClient with env-driven config
  • pool-manager: CACHE_* env vars defined in Config struct (replacing / aliasing REDIS_*)
  • pool-manager: API key middleware wired to cache.CacheClient
  • pool-manager: Rate limiter middleware wired to cache.CacheClient
  • pool-manager: Idempotency middleware wired to cache.CacheClient
  • pool-manager: Settings/secrets cache wired to cache.CacheClient
  • pool-manager: cache_handler.go SCAN updated to use ForEachMaster in cluster mode
  • pool-manager: Helm values updated with CACHE_* env vars and valkey-credentials secretRef
  • pool-manager: Integration tests updated to use new CacheClient interface
  • pool-manager: .env.example updated with CACHE_* variables
  • backoffice-console: Verified /cache API endpoints work with ElastiCache-backed pool-manager
  • backoffice-console: Helm values confirmed (no CACHE_* vars needed — BFF only)
  • AWS ElastiCache Valkey provisioned in staging (single node, cache.r7g.large, TLS enabled)
  • AWS ElastiCache Valkey provisioned in production (cluster mode, 3 shards × cache.r7g.large, 1 replica each, TLS enabled, multi-AZ)
  • Kubernetes Secret valkey-credentials created in staging and production namespaces
  • Terraform module reviewed and applied via CI
  • Security group rules verified: EKS node SG → ElastiCache SG on port 6379/6380
  • CloudWatch alarms configured: EngineCPUUtilization > 80%, CurrConnections > 1000, DatabaseMemoryUsagePercentage > 80%
  • Runbook documented covering:
    • Cache flush procedure (pattern-based via /cache/pattern API or valkey-cli FLUSHDB)
    • Failover test (promote replica, verify pool-manager reconnects)
    • Scaling (add shard in cluster mode, update CACHE_ADDRS)
    • Auth token rotation (auth_token_update_strategy = ROTATE in Terraform)

Metadata

Labels: aws (AWS resources), cache (Cache layer), enhancement (New feature or request), infrastructure (Infrastructure related)