Production-grade, cloud-agnostic Redis 8.4 cluster automation with Envoy, TLS, and ruthless operational discipline.
- Purpose & How It Works
- System Architecture
- Redis Cluster Deep-Dive
- Auto-Failover & Persistence
- Production Deployment Guide
- Script Catalog & Usage
- Monitoring & Observability
- Operations & Runbooks
- Troubleshooting Playbook
- Integration Testing
- Repository Layout
- Contributing · License · Support
| Goal | Reality check |
|---|---|
| Provide a single, TLS-terminated Redis endpoint that hides all cluster complexity from applications. | Envoy’s redis_proxy terminates TLS, enforces ACL auth, understands cluster slots, and retries transparently. |
| Ship hardened Redis nodes with AOF durability and battle-tested defaults. | Custom Redis image + config templates enforce ACLs, persistence, and high file descriptor limits. |
| Offer ruthless operational automation. | Scripts bootstrap hosts, initialize clusters, scale slots, rotate logs, take backups, and simulate chaos. |
| Avoid “black box” magic. | Everything is plain Docker + Bash. You get auditable configs, no proprietary agents, and zero vendor lock-in. |
Flow summary

- Envoy node exposes `:6379` (or `:6380` with TLS) as the only endpoint apps ever see.
- Three Redis masters own the 16,384 cluster slots; each master has at least one replica in another availability zone / failure domain.
- Cluster metadata (slots + node health) is polled by Envoy every 10s, so applications don’t chase redirects.
- Persistence uses AOF (`appendfsync everysec`) with off-host snapshots uploaded to your object store (S3/GCS/Azure Blob/MinIO/etc.).
- Monitoring hooks (redis_exporter, node_exporter, Envoy metrics) feed whatever Prometheus + Grafana stack you already run.
```mermaid
flowchart LR
    subgraph Clients
        A[Apps / Services]
    end
    subgraph Proxy
        B[Envoy redis_proxy<br/>TLS termination + ACL auth]
    end
    subgraph RedisCluster["Redis Cluster: 3 masters + replicas"]
        C1[Master A<br/>slots 0-5460]
        C2[Master B<br/>slots 5461-10922]
        C3[Master C<br/>slots 10923-16383]
        R1[Replica A']
        R2[Replica B']
        R3[Replica C']
    end
    subgraph Observability
        P1[redis_exporter<br/>9121]
        P2[node_exporter<br/>9100]
        P3[Envoy /stats<br/>9901]
        Grafana[(Grafana)]
        Prometheus[(Prometheus)]
    end
    A -->|TLS| B
    B -->|slot-aware TCP| C1
    B -->|slot-aware TCP| C2
    B -->|slot-aware TCP| C3
    C1 --> R1
    C2 --> R2
    C3 --> R3
    P1 --> Prometheus
    P2 --> Prometheus
    P3 --> Prometheus
    Prometheus --> Grafana
```
Key traits

- Only Envoy is internet/ELB facing. Redis nodes stay in private subnets.
- Envoy holds a connection pool per master, retries failed ops, and evenly balances keys via Maglev hashing.
- You run redis_exporter + node_exporter locally (per node) and scrape Envoy’s `/stats/prometheus`; Prometheus pulls everything over private IPs.
```mermaid
graph TB
    subgraph AZ-a
        M1[(Master A)]
        R1[(Replica A')]
    end
    subgraph AZ-b
        M2[(Master B)]
        R2[(Replica B')]
    end
    subgraph AZ-c
        M3[(Master C)]
        R3[(Replica C')]
    end
    M1 -- cluster bus 16379 --> M2
    M2 -- cluster bus 16379 --> M3
    M3 -- cluster bus 16379 --> M1
    M1 -- async repl --> R1
    M2 -- async repl --> R2
    M3 -- async repl --> R3
    Envoy["Envoy redis_proxy<br/>cluster refresh 10s<br/>downstream ACL auth<br/>upstream TLS<br/>request retries"]
    Envoy --> M1
    Envoy --> M2
    Envoy --> M3
```
Cluster mechanics

- Slot map: 0-16383 split evenly across the three masters. Cluster metadata is stored in `nodes.conf` and polled by Envoy.
- Replication: asynchronous, TLS-enabled, uses a dedicated ACL user to keep secrets scoped.
- Health: `cluster-node-timeout` = 5s; nodes mark peers down fast, so failover finishes in ~10-15 seconds.
- Sentinel: optional; Redis native cluster failover already covers master promotion, but Sentinel can be layered on if you want deterministic quorum voting.
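The slot math above is easy to sanity-check by hand. Redis assigns a key to slot `CRC16(key) mod 16384` (CRC16-XMODEM), hashing only the `{hash tag}` when one is present. A minimal Bash sketch for illustration (`crc16` and `keyslot` are hypothetical helper names, not RedisForge scripts):

```bash
# CRC16-XMODEM (poly 0x1021), the checksum Redis uses for slot assignment.
crc16() {
  local s=$1 crc=0 c i j
  for ((i = 0; i < ${#s}; i++)); do
    printf -v c '%d' "'${s:i:1}"            # byte value of current character
    crc=$((crc ^ (c << 8)))
    for ((j = 0; j < 8; j++)); do
      if ((crc & 0x8000)); then
        crc=$((((crc << 1) ^ 0x1021) & 0xFFFF))
      else
        crc=$(((crc << 1) & 0xFFFF))
      fi
    done
  done
  echo "$crc"
}

# keyslot: honor {hash tags} so related keys land on the same master.
keyslot() {
  local key=$1 tag
  if [[ $key == *\{*\}* ]]; then
    tag=${key#*\{}; tag=${tag%%\}*}
    [[ -n $tag ]] && key=$tag               # hash only the tag if non-empty
  fi
  echo $(( $(crc16 "$key") % 16384 ))
}

keyslot "123456789"   # 12739 per the Redis cluster spec -> Master C (10923-16383)
```

Keys sharing a tag, e.g. `{user}.name` and `{user}.email`, map to the same slot, which is what makes multi-key operations on them possible in cluster mode.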
| Risk | Mitigation inside RedisForge |
|---|---|
| Master dies (hardware/OS fault) | Redis cluster flags it as failing and promotes a replica within the same slot range. Envoy refreshes topology every 10s and drops dead upstream connections automatically. |
| Envoy sees stale topology | `cluster_refresh_rate` + redirect refresh handles MOVED/ASK replies immediately. |
| Data loss | AOF (`appendonly yes`, `appendfsync everysec`) + object-storage archive via `backup.sh`. Optional mixed mode (`aof-use-rdb-preamble yes`) keeps restart time low. |
| Replica lag spikes | Grafana board surfaces `redis_replication_lag_seconds`; run `scripts/scale.sh add` to add replicas in overloaded AZs. |
```mermaid
sequenceDiagram
    participant Client
    participant Envoy
    participant MasterA
    participant ReplicaA
    participant ObjectStore
    Client->>Envoy: SET key:value
    Envoy->>MasterA: Write (slot 1024)
    MasterA-->>ReplicaA: Async replicate (AOF)
    MasterA-->>ObjectStore: (via backup.sh) Upload tarred AOF hourly
    Note over MasterA,ReplicaA: Master failure
    ReplicaA->>ReplicaA: Promote to Master (cluster failover)
    Envoy->>ReplicaA: Rebuild pool after refresh
    Client->>Envoy: Subsequent GET routed to new Master
```
- Do not run replicas on the same hypervisor. Spread masters and replicas across at least three separate failure domains (zones/regions/providers).
- Run `backup.sh` via cron with credentials that can write to your object store of choice.
- Periodically rehearse failover: `./scripts/scale.sh remove <node>` combined with chaos tooling.
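A failover rehearsal should end by confirming the cluster converged. A sketch of that gate (assumes `redis-cli` on PATH and `REDIS_REQUIREPASS` exported; `probe` and `wait_for_ok` are illustrative names, not RedisForge scripts):

```bash
# Poll CLUSTER INFO until the cluster reports healthy again, or give up.
probe() { redis-cli -h "$1" -a "$REDIS_REQUIREPASS" cluster info 2>/dev/null; }

wait_for_ok() {            # usage: wait_for_ok <host> [timeout-seconds]
  local host=$1 deadline=$(( $(date +%s) + ${2:-60} ))
  while (( $(date +%s) < deadline )); do
    probe "$host" | grep -q '^cluster_state:ok' && return 0
    sleep 1
  done
  return 1                 # never converged: stop the rehearsal and investigate
}

# Example rehearsal:
#   ./scripts/scale.sh remove <node>
#   wait_for_ok 10.0.1.10 60 || echo "cluster did not recover in time"
```

The 60-second default lines up with the ~10-15 second failover window plus generous slack for Envoy's 10s topology refresh.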
If you skip any of this, you’re gambling your data. Follow it step-by-step.
| Component | Count | Baseline sizing | Ports | Notes |
|---|---|---|---|---|
| Envoy proxy | 1 (per cluster endpoint) | ≥4 vCPU / 8 GB RAM | 6379/6380, 9901 | Front-end with your load balancer / DNS; lock admin port to monitoring CIDRs. |
| Redis masters | 3 | ≥8 vCPU / 64 GB RAM + NVMe/SSD | 6379, 16379 | Place each in a different availability/failure zone (can even be different clouds). |
| Redis replicas | 3 | Mirror master sizing | 6379, 16379 | Pair each master with at least one replica housed in another zone/provider. |
Mixing providers (e.g., on-prem + AWS + Azure) is fine as long as latency between nodes stays low and security policies permit the ports above.
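When mixing providers, inter-node latency is the figure to check before committing. A tiny helper (hypothetical, not part of the script catalog; the 2 ms cutoff is an illustrative assumption) that pulls the average RTT out of `ping` output so a bootstrap check can refuse slow links:

```bash
# Extract avg RTT (ms) from the summary line of `ping -c N <host>`
# (Linux prints "rtt min/avg/max/mdev = ...", BSD/macOS "round-trip ...").
avg_rtt_ms() { awk -F'/' '/^(rtt|round-trip)/ { print $5; exit }'; }

# Fail when the link is slower than a threshold in ms.
link_ok() {                 # usage: ping -c 5 <peer> | link_ok 2.0
  local max=$1 rtt
  rtt=$(avg_rtt_ms)
  [[ -n $rtt ]] && awk -v r="$rtt" -v m="$max" 'BEGIN { exit !(r <= m) }'
}

# Example: ping -c 5 10.0.2.11 | link_ok 2.0 || echo "inter-node latency too high"
```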
```bash
sudo sysctl -w net.core.somaxconn=65535
sudo sysctl -w vm.overcommit_memory=1
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
sudo tee -a /etc/security/limits.conf <<'EOF'
* soft nofile 100000
* hard nofile 100000
EOF
```

```bash
sudo yum install -y docker git jq redis
sudo systemctl enable --now docker
sudo usermod -aG docker $USER && newgrp docker
```
```bash
git clone https://github.com/siyamsarker/RedisForge.git
cd RedisForge && cp env.example .env
```

For detailed monitoring setup, see `docs/monitoring-setup.md`. For operational procedures, see `docs/operations-runbook.md`.
Populate `.env` with:

- `REDIS_REQUIREPASS`, `REDIS_ACL_PASS`, `REDIS_READONLY_PASS`, etc. (use `openssl rand -base64 32`)
- `REDIS_CLUSTER_ANNOUNCE_IP=<this node private IP>`
- `REDIS_MASTER_{1..3}_HOST` (private DNS / IPs for Envoy)
- `BACKUP_S3_BUCKET`, `AWS_REGION` (use any S3-compatible endpoint; for non-AWS storage supply the correct credentials/endpoint configuration)
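One way to generate those secrets in bulk (an illustrative sketch, not a RedisForge script; the temp-file path is arbitrary):

```bash
# Generate one strong secret per credential; variable names follow env.example.
secrets_file=$(mktemp)
for var in REDIS_REQUIREPASS REDIS_ACL_PASS REDIS_READONLY_PASS; do
  printf '%s=%s\n' "$var" "$(openssl rand -base64 32)" >> "$secrets_file"
done
cat "$secrets_file"   # review, then merge the lines into .env
```

Each value is 32 random bytes base64-encoded (44 characters), which comfortably exceeds typical password-strength requirements for Redis ACL users.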
```bash
./scripts/deploy.sh redis
docker logs -f redis-master
```

```bash
REDIS_REQUIREPASS=$REDIS_REQUIREPASS \
./scripts/init-cluster.sh \
  "10.0.1.10:6379,10.0.2.11:6379,10.0.3.12:6379,10.0.4.10:6379,10.0.5.11:6379,10.0.6.12:6379"
```

Verify:

```bash
redis-cli -h 10.0.1.10 -a "$REDIS_REQUIREPASS" cluster info
redis-cli -h 10.0.1.10 -a "$REDIS_REQUIREPASS" cluster slots
```

```bash
./scripts/generate-certs.sh config/tls/prod
./scripts/deploy.sh envoy
curl -k https://<envoy-ip>:6379 -u app_user:$REDIS_ACL_PASS ping
```

- Run `./scripts/setup-exporters.sh` on every Redis host.
- Schedule `backup.sh` + `log-rotate.sh` via cron (see Operations).
- Import Grafana dashboard JSON and create Prometheus alerts.
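The Verify step can also be scripted as a hard pass/fail gate. A sketch that reads `CLUSTER INFO` output on stdin (`cluster_healthy` is a hypothetical helper; the field names are standard `CLUSTER INFO` output):

```bash
# Healthy means: state ok AND all 16384 slots assigned.
cluster_healthy() {
  local info slots
  info=$(cat)
  grep -q '^cluster_state:ok' <<<"$info" || { echo "cluster_state not ok" >&2; return 1; }
  slots=$(grep -o 'cluster_slots_assigned:[0-9]*' <<<"$info" | cut -d: -f2)
  [[ ${slots:-0} -eq 16384 ]] || { echo "only ${slots:-0}/16384 slots assigned" >&2; return 1; }
}

# Usage:
#   redis-cli -h 10.0.1.10 -a "$REDIS_REQUIREPASS" cluster info | cluster_healthy && echo "cluster OK"
```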
| Script | When to run | What it does | Example |
|---|---|---|---|
| `scripts/deploy.sh redis\|envoy\|monitoring` | Bootstrapping or rebuilding a node | Deploys the chosen component (Redis node, Envoy proxy, or monitoring stack) on the current host. | `./scripts/deploy.sh redis` |
| `scripts/init-cluster.sh <nodes>` | After all nodes are up | Creates the Redis cluster, pairs masters with replicas, validates state. | `./scripts/init-cluster.sh "host1:6379,...,host6:6379"` |
| `scripts/scale.sh add\|remove` | Adding masters/replicas or decommissioning | Adds nodes, rebalances slots, or drains/removes nodes safely. | `./scripts/scale.sh add 10.0.7.10:6379 --role master` |
| `scripts/backup.sh` | Hourly via cron | Archives the newest AOF + `nodes.conf` and uploads to an S3-compatible object store. | `BACKUP_S3_BUCKET=s3://prod-redis ./scripts/backup.sh` |
| `scripts/log-rotate.sh <dir> <sizeMB> <count>` | Daily via cron | Rotates Redis logs, compresses old copies, enforces retention. | `./scripts/log-rotate.sh /var/log/redis 1024 7` |
| `scripts/test-cluster.sh <host> <port>` | Smoke tests after deploy or failover | Runs PING, SET/GET, pub/sub, cluster info checks through Envoy. | `./scripts/test-cluster.sh envoy.company.local 6379` |
| `scripts/setup-exporters.sh` | After Redis deploy on every node | Launches redis_exporter + node_exporter via Docker (host networking). | `./scripts/setup-exporters.sh` |
| `tests/run-integration.sh` | In CI or before large changes | Spins up a full cluster in Docker Compose, initializes slots, runs tests. | `./tests/run-integration.sh` |
| `scripts/generate-certs.sh <dir>` | Rotating TLS certs | Issues self-signed certs for lower envs; replace with CA certs in prod. | `./scripts/generate-certs.sh config/tls/prod` |
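To keep the backup flow auditable, here is roughly the archive step `backup.sh` performs, reduced to a sketch (the function name, data-dir layout, and the `aws s3 cp` upload are assumptions, not the actual script contents):

```bash
# Tar the AOF plus cluster metadata into a timestamped archive; caller uploads it.
backup_once() {            # usage: backup_once <data-dir> <out-dir>
  local data=$1 out=$2 stamp archive
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  archive="$out/redis-backup-${stamp}.tar.gz"
  tar -czf "$archive" -C "$data" appendonly.aof nodes.conf || return 1
  echo "$archive"
  # Upload example (needs write credentials for the bucket):
  #   aws s3 cp "$archive" "$BACKUP_S3_BUCKET/$(hostname)/${stamp}.tar.gz"
}
```

Including `nodes.conf` alongside the AOF matters: it preserves the slot map and node IDs, which makes restores into a rebuilt cluster far less painful.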
If you don’t know exactly why you are running a script, stop. Blind automation is how you lose data.
- Exporters (per Redis host)

  ```bash
  ./scripts/setup-exporters.sh
  curl http://localhost:9121/metrics | head   # redis_exporter
  curl http://localhost:9100/metrics | head   # node_exporter
  ```

- Prometheus
  - Add the scrape jobs shown below to your existing `prometheus.yml`:

    ```yaml
    - job_name: 'redisforge-redis'
      static_configs:
        - targets: ['10.0.1.10:9121','10.0.2.11:9121','10.0.3.12:9121']
    - job_name: 'redisforge-node'
      static_configs:
        - targets: ['10.0.1.10:9100','10.0.2.11:9100','10.0.3.12:9100']
    - job_name: 'redisforge-envoy'
      metrics_path: /stats/prometheus
      static_configs:
        - targets: ['envoy-vpc.internal:9901']
    ```

  - Reload Prometheus: `curl -X POST http://<prom-host>:9090/-/reload`

- Alertmanager (Discord-ready)
  - File: `monitoring/alertmanager/alertmanager.yaml`
  - Steps:
    - Create a Discord webhook in your server (Server Settings → Integrations → Webhooks).
    - Replace every `<YOUR_DISCORD_WEBHOOK_URL>` in the file with the real webhook URL (append `/slack`).
    - Deploy Alertmanager with that file (`docker run prom/alertmanager ... -config.file=/etc/alertmanager/alertmanager.yaml`).
    - Test, then confirm a message hits your Discord channel:

      ```bash
      curl -XPOST http://<alertmanager-host>:9093/api/v1/alerts -d '[{"labels":{"alertname":"Test","severity":"warning"},"annotations":{"summary":"Discord test"}}]'
      ```

- Grafana dashboard
  - File: `monitoring/grafana/dashboards/redisforge-dashboard.json`
  - Import via Grafana UI → Dashboards → Import → Upload JSON.
  - Datasource configuration lives under `monitoring/grafana/provisioning/datasources/`; point it to your Prometheus endpoint.

Key metrics to never ignore:

- `redis_cluster_slots_ok` (<16384 = fire drill)
- `redis_connected_slaves` (should match replica targets)
- `redis_aof_current_size_bytes` vs disk usage
- `envoy_cluster_upstream_rq_xx{response_code_class="5"}`
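Those thresholds translate directly into Prometheus alerting rules. A sketch that emits a rules file for the metrics above (rule names, `for` windows, and the output path are illustrative; wire the file into your Prometheus `rule_files` and ship it to Alertmanager):

```bash
# Emit a Prometheus alerting-rules file covering the must-watch metrics.
cat > /tmp/redisforge-alerts.yml <<'EOF'
groups:
  - name: redisforge
    rules:
      - alert: RedisClusterSlotsNotOk
        expr: redis_cluster_slots_ok < 16384
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster is serving fewer than 16384 slots"
      - alert: RedisReplicaMissing
        expr: redis_connected_slaves < 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "A master has no connected replica"
      - alert: EnvoyUpstream5xx
        expr: rate(envoy_cluster_upstream_rq_xx{response_code_class="5"}[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Envoy is seeing upstream 5xx-class responses"
EOF
```

Validate the file with `promtool check rules /tmp/redisforge-alerts.yml` before reloading Prometheus.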
```bash
# Add an additional master (rebalances slots)
REDIS_REQUIREPASS=$PASS ./scripts/scale.sh add 10.0.7.10:6379 --role master

# Add replica pinned to master ID
REDIS_REQUIREPASS=$PASS ./scripts/scale.sh add 10.0.8.11:6379 --role replica --replica-of <master-node-id>

# Remove node after draining slots
REDIS_REQUIREPASS=$PASS ./scripts/scale.sh remove <node-id>
```

```bash
BACKUP_S3_BUCKET=s3://prod-redisforge ./scripts/backup.sh
# Cron: 0 * * * * BACKUP_S3_BUCKET=s3://prod-redisforge /opt/RedisForge/scripts/backup.sh >> /var/log/redisforge-backup.log 2>&1
```

```bash
./scripts/log-rotate.sh /var/log/redis 1024 7
# Cron: 0 2 * * * /opt/RedisForge/scripts/log-rotate.sh /var/log/redis 1024 7
```

```bash
./scripts/generate-certs.sh config/tls/prod
rsync config/tls/prod/* envoy-host:/etc/envoy/certs
docker restart envoy-proxy
```

```bash
redis-cli -h envoy.company.local -p 6379 --tls --cacert config/tls/prod/ca.crt -a "$REDIS_REQUIREPASS" ping
curl -s https://envoy.company.local:9901/stats/prometheus | grep envoy_cluster_upstream_rq_time_bucket | head
```

| Symptom | Quick triage | Root fix |
|---|---|---|
| `redis-cli` through Envoy returns MOVED constantly | Envoy can’t refresh topology (wrong Redis auth or TLS). Check Envoy logs + `/clusters`. | Ensure `REDIS_REQUIREPASS` matches across Redis + Envoy env vars; verify `/etc/envoy/certs`. |
| Slots <16384 or `cluster_state:fail` | `redis-cli --cluster check <node>` | Replace failed nodes, run `./scripts/scale.sh remove <dead-node-id>`, then add new replica. |
| Backups missing in object store | Check `/var/log/redisforge-backup.log` and credentials. | Ensure the IAM/user/service-account used by the host can write to the bucket, confirm `BACKUP_S3_BUCKET`/endpoint vars, then rerun `backup.sh`. |
| High replication lag | `redis-cli info replication \| grep lag` | Add replicas in overloaded AZs via `scripts/scale.sh add` (see Auto-Failover). |
| Envoy admin port inaccessible | Security group or firewalls blocking 9901. | Allow Prometheus CIDRs; never expose 9901 publicly. |
| Containers crash-loop | `docker logs redis-master` or `docker logs envoy-proxy` | Usually missing secrets or wrong env vars. Re-create `.env`, redeploy with correct TLS paths. |
Golden rule: never restart everything at once. Fix one component, validate, then move on. Chaos is contagious.
See tests/run-integration.sh for the full automated test suite.
Quick test:

```bash
./tests/run-integration.sh
```

Validates:

- ✅ PING through Envoy
- ✅ SET/GET operations
- ⚠️ Pub/Sub (see limitations below)
- ✅ Cluster failover
- ✅ Data persistence
If this fails locally, do not deploy to production. Fix the tests first.
⚠️ IMPORTANT: Redis Pub/Sub is not fully supported when routing through Envoy's `redis_proxy` in cluster mode.
Why: Envoy's Redis proxy is designed for request/response commands (GET, SET, etc.) and doesn't maintain the long-lived connections required for Pub/Sub subscriptions.
Workarounds:
- Direct connection: Connect directly to Redis nodes for Pub/Sub (bypass Envoy)
- Alternative messaging: Use Redis Streams or external message brokers (RabbitMQ, Kafka) for pub/sub patterns
- Separate cluster: Deploy a dedicated Redis instance (non-cluster) for Pub/Sub behind Envoy
Testing: Pub/Sub tests are expected to fail in test-cluster.sh when testing through Envoy.
```text
RedisForge/
├── config/               # Envoy + Redis templates, TLS helper docs
├── docker/               # Hardened Dockerfiles + entrypoints
├── docs/quickstart.md    # Extended deployment walkthrough
├── monitoring/           # Alertmanager + Grafana assets
├── scripts/              # Automation (deploy, scale, backup, etc.)
├── tests/                # Integration harness
├── env.example
├── LICENSE
└── README.md
```
- Fork the repo, branch off `main`.
- Make your changes + update docs.
- Run `./tests/run-integration.sh` (or your CI equivalent).
- Open a PR with logs/screenshots. No tests = no merge.
See CONTRIBUTING.md for full guidelines and CODE_OF_CONDUCT.md for our community standards.
Keep scripts idempotent, don’t hardcode secrets, and document every operational change.
RedisForge is distributed under the MIT License.
- Issues & ideas: GitHub Issues
- Discussions & design questions: open a Discussion or PR.
- Security disclosures: contact the maintainers privately.
Built for operators who refuse to trust luck more than discipline.