feat: Add AWS Valkey (ElastiCache) support to the stack #389

@gandalf-at-lerian

Description


Background & Context

Why Valkey

Valkey is an open-source, Apache 2.0-licensed fork of Redis 7.x, created after Redis Ltd. changed Redis's license to non-OSS (SSPL) in March 2024. Valkey is maintained by the Linux Foundation and backed by AWS, Google, Oracle, and others. It is wire-compatible with Redis 7, meaning existing go-redis/v9 clients work without code changes.

Valkey 8.x key improvements over Redis 7:

  • 36–91% lower latency in I/O-bound workloads (I/O threads redesign)
  • Active memory defragmentation improvements reducing memory bloat by up to 15%
  • Official clustering improvements reducing split-brain scenarios
  • First-class support for WAIT and WAITAOF in cluster mode
  • Continued development of the modules API (which Redis is moving away from)

Why AWS ElastiCache (Managed) over Self-hosted

Dimension       Managed ElastiCache                   Self-hosted (e.g., Helm chart on EKS)
Failover        Automatic (<60s)                      Manual or operator-managed
Persistence     Snapshots + backups out of the box    Custom setup required
Patching        AWS applies security patches          Manual
Networking      VPC-native, SG-based isolation        Pod network, harder to restrict
Cost at scale   Predictable                           Underestimated (ops + k8s overhead)
Observability   CloudWatch metrics built-in           Prometheus exporter + dashboards
TLS             Managed certs, in-transit encryption  Self-signed or cert-manager

Redis Protocol Compatibility

ElastiCache Valkey speaks the Redis protocol. The existing go-redis/v9 usage in lib-commons/commons/redis and in pool-manager requires zero client-side changes to connect to ElastiCache Valkey. The lib-commons redis.Config Topology.Cluster field already supports cluster-mode addresses.


Current State (Based on Repo Analysis)

pool-manager (module: github.com/LerianStudio/tenant-manager)

Tech stack:

  • Go 1.25.7, Fiber v2, OpenTelemetry
  • Storage: MongoDB 7.0 (primary DB, via DocumentDB on AWS)
  • Cache/Lock: Valkey (Redis-compatible) — already in use via redis/go-redis/v9 v9.18.0
  • Messaging: RabbitMQ (via rabbitmq/amqp091-go)
  • Auth: WorkOS + JWT
  • AWS SDKs: CloudFormation, DocDB, EC2, RDS, S3, SecretsManager (full AWS orchestration)

Current Valkey usage (ACTIVE, single-node, standalone topology):

The service already uses Valkey in docker-compose (valkey/valkey:8-alpine) and in .env.example:

REDIS_HOST=valkey
REDIS_PORT=6379
REDIS_USERNAME=
REDIS_PASSWORD=
REDIS_DB=0
REDIS_TLS=false

Use cases confirmed in source code:

  1. API key caching (internal/adapters/http/middleware/apikey.go): Redis GET to cache API key validation results, reducing MongoDB round-trips per request
  2. Rate limiting (internal/adapters/http/middleware/ratelimit.go): Redis-backed per-tenant/per-tier rate limiter
  3. Idempotency (internal/adapters/http/middleware/idempotency.go): Redis SET NX for request deduplication (TTL 300s)
  4. Multi-tenant config caching (internal/bootstrap/wire_infra_redis.go): SecretsCache backed by Redis, TTL driven by MULTI_TENANT_CACHE_TTL (default 24h)
  5. Cache invalidation API (internal/adapters/http/handler/cache_handler.go): SCAN + DEL for pattern-based cache clearing
  6. Tenant settings caching (internal/adapters/http/handler/tenant_service_handler.go): Redis-backed tenant settings with TTL

Connection pattern: lib-commons/v4/commons/redis.Client (standalone topology) OR direct go-redis.Client when REDIS_USERNAME is set (ACL auth path). Both paths ping Valkey at startup.

AWS resources in use:

  • AWS CloudFormation (tenant infrastructure provisioning)
  • AWS DocumentDB (MongoDB-compatible, for tenant DBs)
  • AWS RDS (PostgreSQL provisioning for tenants)
  • AWS S3 (migration files, Casdoor templates)
  • AWS Secrets Manager (credentials for all tenant clusters)
  • AWS EC2 (cluster management)

No ElastiCache yet. Valkey is self-hosted (Docker Compose locally, deployment method in production TBD).

backoffice-console (module: Next.js/TypeScript monorepo)

Tech stack:

  • Next.js 14 (App Router), TypeScript, Turborepo
  • No direct Redis dependency — it is a pure frontend/BFF
  • Auth: WorkOS (cookie-based sessions)

Cache interaction: The backoffice-console has a CacheRepository (apps/backoffice/src/infra/repositories/cache-repository.impl.ts) that calls the pool-manager API (/cache, /cache/keys, /cache/pattern) to view and invalidate Valkey cache entries. It has no direct Redis connection.

No Redis/cache infrastructure changes needed in backoffice-console beyond updating environment variables that point to the pool-manager API.

lib-commons (module: github.com/LerianStudio/lib-commons/v4)

Tech stack:

  • Go 1.25.7, shared library used by all Lerian services
  • Already has redis/go-redis/v9 v9.18.0
  • Already has go-redsync/redsync/v4 (distributed locks)
  • Already has alicebob/miniredis/v2 (in-memory Redis for tests)

Existing packages:

  • commons/redis/: Full Redis client wrapper (Client struct) supporting Standalone, Sentinel, and Cluster topologies, TLS, GCP IAM auth, circuit breaker, reconnection, OpenTelemetry metrics. Already cluster-mode capable.
  • commons/tenant-manager/cache/: ConfigCache interface (in-memory, process-local) for tenant config caching
  • commons/tenant-manager/valkey/: Key helpers for tenant-namespaced keys (tenant:{tenantID}:{key})

Gap: The commons/redis package wraps go-redis.UniversalClient and supports cluster mode via ClusterTopology, but there is no higher-level CacheClient interface abstraction that services can depend on without importing go-redis directly. Services currently couple to redis.UniversalClient from go-redis. A clean CacheClient interface in lib-commons would decouple services and make testing trivial.


Proposed Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        AWS EKS Cluster                          │
│                                                                  │
│  ┌─────────────────────┐    ┌──────────────────────────────┐   │
│  │    pool-manager      │    │      backoffice-console       │   │
│  │  (Go / Fiber)        │    │  (Next.js / TypeScript)       │   │
│  │                      │    │                               │   │
│  │  Rate Limiter ──┐    │    │  CacheRepository ─── HTTP ───┼──►│
│  │  API Key Cache ─┼──► │    │  (calls pool-manager /cache)  │   │
│  │  Idempotency ───┘    │    └──────────────────────────────┘   │
│  │  Settings Cache      │                                        │
│  │         │            │                                        │
│  │  lib-commons/redis   │                                        │
│  │  CacheClient (new)   │                                        │
│  └────────────┬─────────┘                                        │
│               │  TLS / port 6379 (single) or 6380 (cluster)     │
│               │  SG: allow EKS node CIDR → ElastiCache SG       │
│               ▼                                                  │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │         AWS ElastiCache (Valkey 8)                       │    │
│  │                                                          │    │
│  │  Staging:    Single Node (cache.t4g.small)               │    │
│  │  Production: Cluster Mode (3 shards × 2 nodes each)      │    │
│  └─────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘

Legend:
  → HTTP proxy (backoffice → pool-manager)
  ▼ Redis protocol (TLS, go-redis UniversalClient)

Option A: Single Instance (Standalone / Replication Group with 1 shard)

Configuration

  • ElastiCache node type: cache.t4g.small (2 vCPU, 1.37 GB RAM) or cache.r7g.large for production-grade single
  • Replication group: 1 primary + 1 optional read replica (for HA without cluster mode)
  • Failover: AWS promotes replica automatically (~60s downtime window)
  • Endpoint: Single primary endpoint (e.g., my-valkey.xxxxx.use2.cache.amazonaws.com:6379)

Client configuration

// Single instance — existing lib-commons redis.Config works unchanged
cfg := redis.Config{
    Topology: redis.Topology{
        Standalone: &redis.StandaloneTopology{
            Address: os.Getenv("CACHE_ADDRS"), // "host:6379"
        },
    },
}

When to use

  • Development / Staging environments — simpler ops, lower cost
  • Single-region deployments where HA is not critical
  • Low-throughput workloads (<10k ops/sec sustained)

Pros

  • ✅ Zero client-side sharding logic
  • ✅ All Redis commands work (including SCAN, KEYS, MULTI/EXEC without cross-slot issues)
  • ✅ Cheaper (~$25–50/month for t4g.small)
  • ✅ Existing lib-commons StandaloneTopology works as-is

Cons

  • ❌ Single point of failure (until replica promotes, ~60s gap)
  • ❌ Vertical scaling only (instance resize = brief downtime)
  • ❌ Max ~26 GB RAM per node (r7g.2xlarge)

Approximate cost (us-east-2)

  • cache.t4g.small: ~$24/month (on-demand), ~$15/month (reserved 1yr)
  • cache.r7g.large (recommended staging): ~$110/month

Option B: Cluster Mode (ElastiCache Valkey Cluster)

Configuration

  • Topology: N shards, each with 1 primary + 1 replica
  • Recommended production: 3 shards × 2 nodes = 6 nodes total
  • Endpoints: Cluster configuration endpoint (e.g., my-valkey-cluster.xxxxx.clustercfg.use2.cache.amazonaws.com:6379)
  • go-redis client: ClusterClient (or UniversalClient with cluster addresses)

Client configuration

// Cluster mode — lib-commons redis.Config ClusterTopology
cfg := redis.Config{
    Topology: redis.Topology{
        Cluster: &redis.ClusterTopology{
            Addresses: strings.Split(os.Getenv("CACHE_ADDRS"), ","),
        },
    },
}
// go-redis UniversalClient auto-detects cluster when >1 address is given
// or when Cluster topology is selected.

When to use

  • Production environments requiring high availability
  • High-throughput workloads (>10k ops/sec)
  • Data sets >26 GB that must be horizontally sharded

Pros

  • ✅ Horizontal scaling (add shards without downtime)
  • ✅ Automatic failover per shard (<30s)
  • ✅ Total memory = N shards × node RAM (e.g., 3 × r7g.large ≈ 39 GB usable on primaries)
  • ✅ lib-commons ClusterTopology already implemented

Cons

  • ❌ Multi-key operations (MGET, DEL with multiple keys) require all keys to hash to the same slot
  • ❌ Transactions (MULTI/EXEC) only work within a single slot
  • ❌ SCAN iterates one node at a time — the SCAN pattern-delete in pool-manager must iterate all nodes
  • ❌ Higher cost (~3× single instance for 3 shards)
  • ❌ Key space must use {} hash tags for cross-key operations on the same slot

Key space constraint

Pool-manager currently uses SCAN for cache invalidation. In cluster mode, SCAN only covers one shard. The invalidation logic must be updated to scan all cluster nodes. go-redis ClusterClient.ForEachMaster is the correct API.

Approximate cost (us-east-2, 3 shards × 1 replica each)

  • cache.r7g.large × 6: ~$660/month (on-demand), ~$420/month (reserved 1yr)
  • cache.t4g.medium × 6 (smaller prod): ~$180/month

Recommendation

Staging: Option A (Single Instance, cache.r7g.large, 1 replica)
Production: Option B (Cluster Mode, 3 shards × cache.r7g.large, 1 replica each)

Rationale: pool-manager's primary use cases (API key caching, rate limiting, idempotency, settings cache) are read-heavy with small values. The key space is tenant-namespaced (tenant:{tenantID}:{key}) which maps cleanly to cluster sharding via hash tags if needed. The existing SCAN-based invalidation must be updated for cluster mode, but that is a one-time fix in cache_handler.go. The lib-commons/commons/redis package already supports both topologies via redis.UniversalClient — no new library code needed for connectivity.


Changes Required in lib-commons

New package: commons/cache

Create a clean CacheClient interface that services depend on instead of redis.UniversalClient. This enables:

  • Easy mocking in unit tests (no miniredis needed for simple tests)
  • Pluggable implementations (in-process, Valkey, future: DynamoDB DAX)
  • Decoupling services from go-redis types

File: commons/cache/cache.go

// Copyright (c) 2026 Lerian Studio. All rights reserved.
// Use of this source code is governed by the Elastic License 2.0
// that can be found in the LICENSE file.

// Package cache provides a unified interface for distributed cache operations
// backed by Valkey (Redis-compatible) via AWS ElastiCache.
package cache

import (
    "context"
    "time"
)

// CacheMode selects the Valkey deployment topology.
type CacheMode string

const (
    // SingleInstance connects to a single Valkey node or replication group primary endpoint.
    SingleInstance CacheMode = "single"
    // Cluster connects to a Valkey cluster via cluster configuration endpoint.
    Cluster CacheMode = "cluster"
)

// CacheClient defines the contract for cache operations.
// All implementations must be safe for concurrent use by multiple goroutines.
//
// Available implementations:
//   - NewCacheClient(cfg CacheConfig): factory for SingleInstance and Cluster modes
//   - MockCacheClient: test double for unit tests (use go:generate with mockgen)
type CacheClient interface {
    // Get retrieves a string value by key.
    // Returns ErrCacheMiss if the key does not exist or has expired.
    Get(ctx context.Context, key string) (string, error)

    // Set stores a value with the given TTL.
    // A zero TTL means the key never expires.
    Set(ctx context.Context, key string, value interface{}, ttl time.Duration) error

    // Del removes one or more keys. Returns nil if keys do not exist.
    Del(ctx context.Context, keys ...string) error

    // Exists reports how many of the given keys exist in the cache.
    Exists(ctx context.Context, keys ...string) (int64, error)

    // Incr atomically increments an integer value by 1.
    Incr(ctx context.Context, key string) (int64, error)

    // Expire sets (or resets) the TTL on an existing key.
    Expire(ctx context.Context, key string, ttl time.Duration) error

    // Close releases all resources held by this client.
    Close() error
}

// CacheConfig configures a CacheClient.
type CacheConfig struct {
    // Addrs is the list of Valkey addresses.
    //   SingleInstance: ["host:port"]
    //   Cluster:        ["host1:port", "host2:port", ...]  (cluster cfg endpoint or seed nodes)
    Addrs []string

    // Password is the Valkey AUTH password (or ACL password when Username is also set).
    Password string

    // Username is the Valkey ACL username. Leave empty for default AUTH.
    Username string

    // TLSEnabled enables TLS for the connection to ElastiCache.
    TLSEnabled bool

    // CACertBase64 is the Base64-encoded PEM CA certificate for TLS verification.
    // Required when TLSEnabled is true and using a custom CA (e.g., ElastiCache in-transit).
    CACertBase64 string

    // Mode selects SingleInstance or Cluster topology.
    Mode CacheMode

    // PoolSize is the maximum number of connections per node.
    // Defaults to 10 when zero.
    PoolSize int

    // MaxRetries is the maximum number of retries on command failure.
    // Defaults to 3 when zero.
    MaxRetries int

    // DialTimeout is the timeout for establishing a connection.
    // Defaults to 5s when zero.
    DialTimeout time.Duration

    // ReadTimeout is the timeout for socket reads.
    // Defaults to 3s when zero.
    ReadTimeout time.Duration

    // WriteTimeout is the timeout for socket writes.
    // Defaults to 3s when zero.
    WriteTimeout time.Duration
}

// ErrCacheMiss is returned by Get when the key does not exist or has expired.
var ErrCacheMiss = errCacheMiss("cache miss")

type errCacheMiss string

func (e errCacheMiss) Error() string { return string(e) }

File: commons/cache/valkey.go — SingleInstance implementation:

package cache

import (
    "context"
    "errors"
    "fmt"
    "time"

    goredis "github.com/redis/go-redis/v9"
)

type valkeyClient struct {
    client goredis.UniversalClient
}

// NewCacheClient returns a CacheClient backed by Valkey (AWS ElastiCache or local).
// It uses go-redis UniversalClient which transparently handles both single and cluster modes.
func NewCacheClient(cfg CacheConfig) (CacheClient, error) {
    if len(cfg.Addrs) == 0 {
        return nil, fmt.Errorf("cache: at least one address is required")
    }

    opts := &goredis.UniversalOptions{
        Addrs:        cfg.Addrs,
        Password:     cfg.Password,
        Username:     cfg.Username,
        PoolSize:     cfg.PoolSize,
        MaxRetries:   cfg.MaxRetries,
        DialTimeout:  cfg.DialTimeout,
        ReadTimeout:  cfg.ReadTimeout,
        WriteTimeout: cfg.WriteTimeout,
    }

    if cfg.Mode == Cluster {
        // Ensure a cluster client even when only the single cluster
        // configuration endpoint is supplied (UniversalOptions.IsClusterMode,
        // available since go-redis v9.5).
        opts.IsClusterMode = true
    }

    if cfg.TLSEnabled {
        tlsCfg, err := buildTLSConfig(cfg.CACertBase64)
        if err != nil {
            return nil, fmt.Errorf("cache: failed to build TLS config: %w", err)
        }
        opts.TLSConfig = tlsCfg
    }

    client := goredis.NewUniversalClient(opts)

    if err := client.Ping(context.Background()).Err(); err != nil {
        _ = client.Close()
        return nil, fmt.Errorf("cache: failed to connect to Valkey at %v: %w", cfg.Addrs, err)
    }

    return &valkeyClient{client: client}, nil
}

func (v *valkeyClient) Get(ctx context.Context, key string) (string, error) {
    val, err := v.client.Get(ctx, key).Result()
    if errors.Is(err, goredis.Nil) {
        return "", ErrCacheMiss
    }
    return val, err
}

func (v *valkeyClient) Set(ctx context.Context, key string, value interface{}, ttl time.Duration) error {
    return v.client.Set(ctx, key, value, ttl).Err()
}

func (v *valkeyClient) Del(ctx context.Context, keys ...string) error {
    return v.client.Del(ctx, keys...).Err()
}

func (v *valkeyClient) Exists(ctx context.Context, keys ...string) (int64, error) {
    return v.client.Exists(ctx, keys...).Result()
}

func (v *valkeyClient) Incr(ctx context.Context, key string) (int64, error) {
    return v.client.Incr(ctx, key).Result()
}

func (v *valkeyClient) Expire(ctx context.Context, key string, ttl time.Duration) error {
    return v.client.Expire(ctx, key, ttl).Err()
}

func (v *valkeyClient) Close() error {
    return v.client.Close()
}

Generate mock:

go generate ./commons/cache/...
# //go:generate go run go.uber.org/mock/mockgen -source=cache.go -destination=mock_cache.go -package=cache

go.mod: No new dependency needed — github.com/redis/go-redis/v9 is already in go.mod.


Changes Required in pool-manager

Wire up cache.CacheClient

Currently pool-manager uses redis.UniversalClient directly from go-redis. This should be replaced with the lib-commons cache.CacheClient interface for testability.

internal/bootstrap/wire.go — add CacheClient field to Application:

CacheClient lcache.CacheClient // lib-commons cache interface

internal/bootstrap/wire_infra_redis.go — after connecting, wrap in CacheClient:

import lcache "github.com/LerianStudio/lib-commons/v4/commons/cache"

// After obtaining redisClient (UniversalClient), wrap it.
// (NewCacheClientFromUniversal is a small adapter that would be added to
// commons/cache alongside NewCacheClient.)
app.CacheClient, err = lcache.NewCacheClientFromUniversal(redisClient)

Or simply pass CacheConfig derived from env vars and call lcache.NewCacheClient(cfg) directly.

Specific use cases to wire:

Component       File                               Current                After
API key cache   middleware/apikey.go               redis.UniversalClient  cache.CacheClient
Rate limiter    middleware/ratelimit.go            redis.UniversalClient  cache.CacheClient
Idempotency     middleware/idempotency.go          redis.UniversalClient  cache.CacheClient
Settings cache  handler/tenant_service_handler.go  redis.UniversalClient  cache.CacheClient
Cache handler   handler/cache_handler.go           direct SCAN/DEL        keep redis.UniversalClient for SCAN (cluster: ForEachMaster)

Note on SCAN in cluster mode: The cache invalidation handler uses SCAN + DEL to clear patterns. In cluster mode, SCAN only covers the connected shard. The handler must be updated to use ClusterClient.ForEachMaster when cluster mode is active. This is the only code change required specifically for cluster mode compatibility.

Helm values update (charts/pool-manager/values.yaml):

env:
  # Existing Redis config (replace with CACHE_* vars)
  CACHE_MODE: "single"
  CACHE_ADDRS: ""        # populated from k8s Secret
  CACHE_PASSWORD: ""     # populated from k8s Secret
  CACHE_TLS_ENABLED: "true"
  CACHE_POOL_SIZE: "10"
  CACHE_MAX_RETRIES: "3"

envFrom:
  - secretRef:
      name: valkey-credentials   # k8s Secret with CACHE_ADDRS, CACHE_PASSWORD

Changes Required in backoffice-console

The backoffice-console has no direct Redis connection. It calls the pool-manager API to manage cache entries (/cache, /cache/keys, /cache/pattern). No Redis-specific changes are required.

Only changes needed:

  • Update API_URL / NEXT_PUBLIC_TENANT_MANAGER_API_URL in Helm values to point to the correct pool-manager service endpoint (no change in logic, just confirming the endpoint is reachable)
  • Verify /cache/keys list endpoint works correctly when pool-manager is connected to ElastiCache (SCAN pagination — confirm cursor handling is compatible)

Helm values update (charts/backoffice-console/values.yaml — no cache vars needed, just confirm):

env:
  NEXT_PUBLIC_TENANT_MANAGER_API_URL: "https://api.your-domain.com"
  # No CACHE_* vars needed — console is a pure frontend

Infrastructure / Helm

Kubernetes Secret

Create a valkey-credentials Secret in each namespace (staging, production):

apiVersion: v1
kind: Secret
metadata:
  name: valkey-credentials
  namespace: pool-manager
type: Opaque
stringData:
  CACHE_ADDRS: "my-valkey.xxxxx.use2.cache.amazonaws.com:6379"
  CACHE_PASSWORD: "<auth-token-from-secrets-manager>"

Reference in Helm values:

envFrom:
  - secretRef:
      name: valkey-credentials

Terraform Module

Use the AWS terraform-aws-modules/elasticache/aws module:

module "valkey_staging" {
  source  = "terraform-aws-modules/elasticache/aws"
  version = "~> 1.0"

  cluster_id           = "lerian-valkey-staging"
  engine               = "valkey"
  engine_version       = "8.0"
  node_type            = "cache.r7g.large"
  num_cache_nodes      = 1
  parameter_group_name = "default.valkey8"
  port                 = 6379

  subnet_ids         = module.vpc.elasticache_subnets
  security_group_ids = [aws_security_group.valkey.id]

  at_rest_encryption_enabled  = true
  transit_encryption_enabled  = true
  auth_token                  = random_password.valkey_auth.result
  auth_token_update_strategy  = "ROTATE"

  automatic_failover_enabled = false  # staging: single node
}

module "valkey_production" {
  source  = "terraform-aws-modules/elasticache/aws"
  version = "~> 1.0"

  replication_group_id       = "lerian-valkey-prod"
  description                = "Lerian Valkey production cluster"
  engine                     = "valkey"
  engine_version             = "8.0"
  node_type                  = "cache.r7g.large"
  num_node_groups            = 3   # 3 shards
  replicas_per_node_group    = 1   # 1 replica per shard

  cluster_mode_enabled        = true
  automatic_failover_enabled  = true
  multi_az_enabled            = true

  subnet_ids         = module.vpc.elasticache_subnets
  security_group_ids = [aws_security_group.valkey.id]

  at_rest_encryption_enabled  = true
  transit_encryption_enabled  = true
  auth_token                  = random_password.valkey_auth_prod.result
}

Security Group

resource "aws_security_group" "valkey" {
  name   = "lerian-valkey-sg"
  vpc_id = module.vpc.vpc_id

  ingress {
    description     = "Valkey from EKS nodes"
    from_port       = 6379
    to_port         = 6380  # 6380 for cluster TLS
    protocol        = "tcp"
    security_groups = [module.eks.node_security_group_id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Environment Variables

Complete Reference

Variable              Description                              Required
CACHE_MODE            Topology: single or cluster              Yes
CACHE_ADDRS           Comma-separated host:port list           Yes
CACHE_PASSWORD        Valkey AUTH token                        Yes (ElastiCache)
CACHE_USERNAME        Valkey ACL username (optional)           No
CACHE_TLS_ENABLED     Enable TLS (required for ElastiCache)    Yes (ElastiCache)
CACHE_CA_CERT_BASE64  Base64 PEM CA cert for TLS verification  No (uses system CA by default)
CACHE_POOL_SIZE       Connection pool size per node            No (default: 10)
CACHE_MAX_RETRIES     Max retries on transient errors          No (default: 3)
CACHE_DIAL_TIMEOUT    Connection timeout (e.g., 5s)            No (default: 5s)
CACHE_READ_TIMEOUT    Read timeout (e.g., 3s)                  No (default: 3s)
CACHE_WRITE_TIMEOUT   Write timeout (e.g., 3s)                 No (default: 3s)

Examples by Environment

Local development (docker-compose):

CACHE_MODE=single
CACHE_ADDRS=localhost:6379
CACHE_PASSWORD=
CACHE_TLS_ENABLED=false
CACHE_POOL_SIZE=5

Staging (ElastiCache single node):

CACHE_MODE=single
CACHE_ADDRS=lerian-valkey-staging.xxxxx.use2.cache.amazonaws.com:6379
CACHE_PASSWORD=<from-secrets-manager>
CACHE_TLS_ENABLED=true
CACHE_POOL_SIZE=10
CACHE_MAX_RETRIES=3

Production (ElastiCache cluster):

CACHE_MODE=cluster
CACHE_ADDRS=lerian-valkey-prod.xxxxx.clustercfg.use2.cache.amazonaws.com:6379
CACHE_PASSWORD=<from-secrets-manager>
CACHE_TLS_ENABLED=true
CACHE_POOL_SIZE=20
CACHE_MAX_RETRIES=3
CACHE_DIAL_TIMEOUT=5s
CACHE_READ_TIMEOUT=3s
CACHE_WRITE_TIMEOUT=3s

Local Development

Add to docker-compose.yml in pool-manager (already done — using valkey/valkey:8-alpine). For other services that adopt the cache client, use:

services:
  valkey:
    image: valkey/valkey:8
    container_name: valkey
    ports:
      - "6379:6379"
    command: ["valkey-server", "--appendonly", "yes"]
    volumes:
      - valkey_data:/data
    healthcheck:
      test: ["CMD", "valkey-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  valkey_data:

For local cluster simulation (optional, for cluster-mode testing — note that a functional cluster needs at least three primary nodes plus a valkey-cli --cluster create step; the single container below only enables cluster mode):

  valkey-cluster:
    image: valkey/valkey:8
    command: >
      valkey-server
      --cluster-enabled yes
      --cluster-config-file nodes.conf
      --cluster-node-timeout 5000
      --appendonly yes
    ports:
      - "7000-7005:7000-7005"

Migration Path

Current situation

pool-manager already uses Valkey (self-hosted via docker-compose locally). The production deployment method was not found in the cloned repo (no Helm charts in the repo — likely in a separate gitops/infra repo). The migration path assumes moving from self-hosted (container) to managed ElastiCache.

Steps

  1. Audit existing data — Valkey is used for ephemeral cache only (API keys, rate limits, idempotency, settings). All data is reconstructible. No data migration needed.

  2. Provision ElastiCache (staging) via Terraform. Verify connectivity from EKS nodes using a debug pod:

    kubectl run valkey-test --image=valkey/valkey:8 --rm -it -- \
      valkey-cli -h <elasticache-endpoint> -p 6379 -a <auth-token> --tls ping
  3. Update Helm values for pool-manager in staging to use new CACHE_* env vars pointing to ElastiCache.

  4. Deploy and validate — monitor:

    • checks["redis"] in /health/ready endpoint
    • Rate limiter functionality (smoke test: hit an endpoint >RATE_LIMIT_MAX times)
    • Idempotency (replay same request ID twice, expect 200 on replay)
    • Cache hit metrics in CloudWatch (ElastiCache CacheHits, CacheMisses)
  5. Switch production — deploy production ElastiCache (cluster mode), update Helm values, deploy.

  6. Decommission self-hosted — remove old Valkey container/pod from infrastructure.

Env var renaming

The existing REDIS_* env vars in pool-manager will be replaced by CACHE_* vars. During the transition, pool-manager can support both for one release cycle by checking CACHE_ADDRS first, falling back to REDIS_HOST:REDIS_PORT if absent.


Testing Requirements

Unit Tests (lib-commons commons/cache)

  • Use the generated MockCacheClient (mockgen) to test all callers without a real server
  • Test factory function NewCacheClient with invalid configs (empty addrs, bad TLS cert)
  • Test ErrCacheMiss is returned correctly when go-redis returns Nil
  • Coverage requirement: ≥ 80%

// Example mock usage in pool-manager tests
func TestAPIKeyMiddleware_CacheHit(t *testing.T) {
    ctrl := gomock.NewController(t)
    defer ctrl.Finish()
    
    mockCache := cachemock.NewMockCacheClient(ctrl)
    mockCache.EXPECT().Get(gomock.Any(), "apikey:sha256:abc123").Return(`{"valid":true}`, nil)
    
    // ... test middleware
}

Integration Tests

Use testcontainers-go with a real Valkey container:

func TestCacheClient_Integration(t *testing.T) {
    ctx := context.Background()
    
    req := testcontainers.ContainerRequest{
        Image:        "valkey/valkey:8",
        ExposedPorts: []string{"6379/tcp"},
        WaitingFor:   wait.ForLog("Ready to accept connections"),
    }
    
    container, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
        ContainerRequest: req,
        Started: true,
    })
    require.NoError(t, err)
    defer container.Terminate(ctx)
    
    host, _ := container.Host(ctx)
    port, _ := container.MappedPort(ctx, "6379")
    
    client, err := cache.NewCacheClient(cache.CacheConfig{
        Addrs: []string{host + ":" + port.Port()},
        Mode:  cache.SingleInstance,
    })
    require.NoError(t, err)
    defer client.Close()
    
    // Test Get/Set/Del/Exists/Incr/Expire
}

Cluster Mode Tests

Use a 3-node Valkey cluster container (or testcontainers-go Redis cluster module) to verify:

  • ForEachMaster SCAN works across all shards
  • Key routing works correctly for tenant-namespaced keys
  • Exists with keys on different shards returns correct count

Definition of Done

  • lib-commons: commons/cache package created with CacheClient interface (cache.go)
  • lib-commons: CacheMode type (SingleInstance | Cluster) defined
  • lib-commons: CacheConfig struct defined with all fields documented
  • lib-commons: NewCacheClient(cfg CacheConfig) (CacheClient, error) factory implemented using go-redis UniversalClient
  • lib-commons: SingleInstance mode working with standalone ElastiCache endpoint
  • lib-commons: Cluster mode working with cluster configuration endpoint
  • lib-commons: ErrCacheMiss sentinel error defined and returned by Get on cache miss
  • lib-commons: MockCacheClient generated via go:generate mockgen
  • lib-commons: Unit tests with mock, coverage ≥ 80%
  • lib-commons: Integration tests with real Valkey container (testcontainers-go)
  • lib-commons: go.mod unchanged (go-redis/v9 already present)
  • lib-commons: CHANGELOG and MIGRATION_MAP updated
  • pool-manager: Application.CacheClient field added (type cache.CacheClient)
  • pool-manager: initRedis updated to call cache.NewCacheClient with env-driven config
  • pool-manager: CACHE_* env vars defined in Config struct (replacing / aliasing REDIS_*)
  • pool-manager: API key middleware wired to cache.CacheClient
  • pool-manager: Rate limiter middleware wired to cache.CacheClient
  • pool-manager: Idempotency middleware wired to cache.CacheClient
  • pool-manager: Settings/secrets cache wired to cache.CacheClient
  • pool-manager: cache_handler.go SCAN updated to use ForEachMaster in cluster mode
  • pool-manager: Helm values updated with CACHE_* env vars and valkey-credentials secretRef
  • pool-manager: Integration tests updated to use new CacheClient interface
  • pool-manager: .env.example updated with CACHE_* variables
  • backoffice-console: Verified /cache API endpoints work with ElastiCache-backed pool-manager
  • backoffice-console: Helm values confirmed (no CACHE_* vars needed — BFF only)
  • AWS ElastiCache Valkey provisioned in staging (single node, cache.r7g.large, TLS enabled)
  • AWS ElastiCache Valkey provisioned in production (cluster mode, 3 shards × cache.r7g.large, 1 replica each, TLS enabled, multi-AZ)
  • Kubernetes Secret valkey-credentials created in staging and production namespaces
  • Terraform module reviewed and applied via CI
  • Security group rules verified: EKS node SG → ElastiCache SG on port 6379/6380
  • CloudWatch alarms configured: EngineCPUUtilization > 80%, CurrConnections > 1000, DatabaseMemoryUsagePercentage > 80%
  • Runbook documented covering:
    • Cache flush procedure (pattern-based via /cache/pattern API or valkey-cli FLUSHDB)
    • Failover test (promote replica, verify pool-manager reconnects)
    • Scaling (add shard in cluster mode, update CACHE_ADDRS)
    • Auth token rotation (auth_token_update_strategy = ROTATE in Terraform)

Metadata

Labels: aws (AWS resources), cache (Cache layer), enhancement (New feature or request), infrastructure (Infrastructure related)