┌─────────────────────────────────────────────────────────────┐
│ Cloudflare │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ DNS: api.rendertrust.com → CNAME tunnel.cfargotunnel │ │
│ │ SSL: Full (strict), TLS 1.2+, HSTS preload │ │
│ │ WAF: SQLi, XSS, path traversal rules │ │
│ │ Rate Limit: 100/min API, 20/min auth │ │
│ └──────────────────────────────────────────────────────┘ │
└────────────────────────────┬────────────────────────────────┘
│ (Cloudflare Tunnel, outbound)
▼
┌─────────────────────────────────────────────────────────────┐
│ Hetzner VPS (CX31+) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Coolify (Self-hosted PaaS) │ │
│ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌───────────────┐ │ │
│ │ │FastAPI │ │ PG 16 │ │Redis 7 │ │Cloudflare │ │ │
│ │ │Gateway │ │ │ │ │ │Tunnel Daemon │ │ │
│ │ │ :8000 │ │ :5432 │ │ :6379 │ │ │ │ │
│ │ └────────┘ └────────┘ └────────┘ └───────────────┘ │ │
│ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌───────────────┐ │ │
│ │ │Promethe│ │Grafana │ │ Loki │ │ Promtail │ │ │
│ │ │ :9090 │ │ :3000 │ │ :3100 │ │ │ │ │
│ │ └────────┘ └────────┘ └────────┘ └───────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
| Spec | Minimum | Recommended |
|---|---|---|
| CPU | 2 vCPU | 4 vCPU |
| RAM | 4 GB | 8 GB |
| Disk | 40 GB SSD | 80 GB SSD |
| OS | Ubuntu 22.04 LTS | Ubuntu 22.04 LTS |
| Provider | Hetzner CX21 | Hetzner CX31 |
# Install Coolify (installs Docker, Traefik, Coolify UI)
curl -fsSL https://cdn.coollabs.io/coolify/install.sh | bash
# Access Coolify UI
open http://<server-ip>:8000
- Create project "RenderTrust" in Coolify
- Add Docker Compose resource → `ci/coolify/docker-compose.coolify.yml`
- Set environment variables from `ci/coolify/env.template`
- Configure domain: `api.rendertrust.com` → Let's Encrypt SSL
- Deploy
| File | Purpose | Key Features |
|---|---|---|
| `docker-compose.yml` | Development | Hot reload, no resource limits |
| `docker-compose.prod.yml` | Production | Resource limits, security hardening, read-only rootfs |
| `docker-compose.test.yml` | Testing | Ephemeral DB/Redis (tmpfs), auto-runs pytest |
| `docker-compose.edge.yml` | Edge nodes | Single node service with health check |
| `loadtest/docker-compose.load.yml` | Load testing | k6 + target app + monitoring |
| Feature | Configuration |
|---|---|
| Resource Limits | App: 1GB/1 CPU, DB: 1GB/1 CPU, Redis: 512MB/0.5 CPU |
| Security Options | no-new-privileges: true |
| Filesystem | Read-only rootfs + 100MB tmpfs for app |
| Logging | JSON driver, 50MB max, 5 file rotation |
| Redis | Appendonly, 256MB maxmemory, allkeys-lru |
| PostgreSQL | 128MB shared_buffers, persistent volume |
| Record | Type | Value | Proxy |
|---|---|---|---|
| `api.rendertrust.com` | CNAME | `{tunnel}.cfargotunnel.com` | ON |
| `app.rendertrust.com` | CNAME | `{tunnel}.cfargotunnel.com` | ON |
| `grafana.rendertrust.com` | CNAME | `{tunnel}.cfargotunnel.com` | ON |
| `@` | TXT | `v=spf1 -all` | — |
| `_dmarc` | TXT | `v=DMARC1; p=reject` | — |
- Connection: Outbound-only from VPS to Cloudflare edge
- No inbound ports needed on VPS (except SSH for admin)
- Auth: Credential file at `~/.cloudflared/{tunnel-id}.json` OR token-based
- Catch-all: Returns 404 for unrecognized hostnames
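The catch-all behaviour can be pictured as a first-match routing table whose last rule has no hostname. The sketch below is an illustrative model only, not cloudflared code. The hostnames come from the DNS table; the upstream URLs and the `resolve` helper are assumptions made for the example.

```python
# Illustrative model of first-match ingress routing with a catch-all rule.
# This is NOT cloudflared code; the upstream URLs below are assumptions.
from typing import Optional

INGRESS_RULES: list[tuple[Optional[str], str]] = [
    ("api.rendertrust.com", "http://app:8000"),          # assumed upstream
    ("grafana.rendertrust.com", "http://grafana:3000"),  # assumed upstream
    (None, "http_status:404"),                           # catch-all, must be last
]


def resolve(hostname: str) -> str:
    """Return the service of the first rule that matches the hostname."""
    wanted = hostname.lower()
    for rule_host, service in INGRESS_RULES:
        # A rule without a hostname matches everything (the catch-all).
        if rule_host is None or rule_host == wanted:
            return service
    raise RuntimeError("ingress rules must end with a catch-all")
```

Because the catch-all is evaluated last, a request for any hostname not listed above falls through to the 404 service instead of reaching an application container.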
- Mode: Full (strict) — encrypted client→CF and CF→origin with cert validation
- TLS 1.2 minimum, TLS 1.3 enabled
- HSTS: 6 months, includeSubDomains, preload
- Automatic HTTPS rewrites: enabled
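For reference, the HSTS settings above correspond to a single `Strict-Transport-Security` response header. The sketch below builds that value, assuming "6 months" means 180 days; the `hsts_header` helper is hypothetical and only illustrates the header format.

```python
# Sketch: build the Strict-Transport-Security header value.
# Assumption: "6 months" is taken as 180 days (15,552,000 seconds).
SECONDS_PER_DAY = 86_400


def hsts_header(days: int = 180,
                include_subdomains: bool = True,
                preload: bool = True) -> str:
    """Return the header value for the given HSTS policy."""
    parts = [f"max-age={days * SECONDS_PER_DAY}"]
    if include_subdomains:
        parts.append("includeSubDomains")
    if preload:
        parts.append("preload")
    return "; ".join(parts)


if __name__ == "__main__":
    print(hsts_header())
```

If the domain is meant to be submitted to the browser preload list, check the chosen max-age against that list's current submission requirements, which have asked for at least one year.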
Scrapes metrics from the FastAPI app every 15 seconds:
# ci/grafana/prometheus.yml
scrape_configs:
  - job_name: rendertrust-app
    static_configs:
      - targets: ["app:8000"]
    metrics_path: /metrics
    scrape_interval: 15s

Auto-provisioned dashboards for:
| Dashboard | Panels |
|---|---|
| API Performance | Request rate, latency p50/p95/p99, error rate, status codes |
| Job Pipeline | Dispatch rate, completion rate, failure rate, queue depth |
| Fleet Health | Node count by status, average load, heartbeat freshness |
| Credits | Credit consumption rate, balance distribution, purchase volume |
| Alert | Condition | Severity | Duration |
|---|---|---|---|
| FleetTooFewNodes | healthy nodes < 2 | Critical | 5 min |
| HighErrorRate | 5xx rate > 5% | Critical | 5 min |
| HighJobFailureRate | Failed > 10% | Warning | 10 min |
| APILatencyHigh | p95 > 5s | Warning | 5 min |
| NoWebSocketConnections | connections == 0 | Warning | 10 min |
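The Duration column means a condition must hold continuously for that long before the alert fires; a single dip below the threshold resets the clock. A minimal sketch of that logic follows, assuming samples arrive as `(unix_time, value)` pairs. The `is_firing` helper is hypothetical and simplifies how Prometheus evaluates rules.

```python
# Sketch of "condition must hold for N seconds" alert semantics.
# Simplified model; not how Prometheus is implemented internally.
from typing import Callable, Optional, Sequence


def is_firing(samples: Sequence[tuple[float, float]],
              condition: Callable[[float], bool],
              for_seconds: float) -> bool:
    """samples: (unix_time, value) pairs sorted oldest first."""
    breach_start: Optional[float] = None
    for ts, value in samples:
        if condition(value):
            if breach_start is None:
                breach_start = ts  # start of the current unbroken breach
        else:
            breach_start = None    # condition cleared, reset the timer
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds
```

For HighErrorRate this would be called with `condition=lambda rate: rate > 0.05` and `for_seconds=300`.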
Log Pipeline:
Docker containers → Promtail → Loki → Grafana
Promtail Configuration:
- Scrapes Docker container logs via socket
- Extracts JSON fields: `level`, `event`, `request_id`, `timestamp`
- Labels: `service`, `container_name`
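The JSON extraction step can be approximated in a few lines. This is a sketch of the idea, not Promtail's pipeline code: the field names come from the list above, while the `extract_fields` helper and the sample log line are made up for illustration.

```python
# Sketch of the JSON field extraction Promtail is configured to perform.
import json

FIELDS = ("level", "event", "request_id", "timestamp")


def extract_fields(line: str) -> dict[str, str]:
    """Return the known fields from a JSON log line, or {} if not JSON."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return {}  # non-JSON lines pass through without extracted fields
    if not isinstance(record, dict):
        return {}
    return {key: str(record[key]) for key in FIELDS if key in record}


if __name__ == "__main__":
    sample = '{"level": "error", "event": "stripe.webhook", "request_id": "abc-123"}'
    print(extract_fields(sample))
```

Lines that are not valid JSON yield no fields, so the application needs to emit structured JSON logs for the queries in this section to match.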
Loki Configuration:
- BoltDB shipper + filesystem storage
- 30-day retention (720 hours)
- Max 5000 entries per query
Example LogQL Queries:
# All errors
{service="core"} | json | level="error"
# Trace a specific request
{service="core"} | json | request_id="abc-123"
# Stripe webhook events
{service="core"} | json | event=~"stripe.*"
# Error rate over 5 minutes
rate({service="core"} | json | level="error" [5m])
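The last query returns matching log lines per second, averaged over the window. The sketch below shows the equivalent arithmetic over an in-memory list of entries. The `error_rate` helper is hypothetical, and the exact window boundary handling is an assumption.

```python
# Sketch of what rate(... | json | level="error" [5m]) computes:
# number of matching entries in the window divided by the window length.
import json
from typing import Iterable


def error_rate(entries: Iterable[tuple[float, str]],
               now: float,
               window_seconds: float = 300.0) -> float:
    """entries: (unix_time, raw_json_line) pairs. Returns errors per second."""
    count = 0
    for ts, line in entries:
        if not (now - window_seconds < ts <= now):
            continue  # outside the lookback window (boundary is an assumption)
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparsable lines cannot match level="error"
        if isinstance(record, dict) and record.get("level") == "error":
            count += 1
    return count / window_seconds
```

Thirty error lines in five minutes therefore give a rate of 0.1 per second, which is the unit to keep in mind when choosing alert thresholds.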
Push to dev/PR
│
▼
┌─────────────────────────────────────┐
│ Stage 1: Quality │
│ ┌─────────┐ ┌─────────────────┐ │
│ │ Lint │ │ Type Check │ │
│ │ (ruff) │ │ (mypy) │ │
│ └─────────┘ └─────────────────┘ │
└──────────────────┬──────────────────┘
│
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Unit │ │Integration│ │ E2E │
│ Tests │ │ Tests │ │ Tests │
│ (SQLite) │ │ (PG+Redis)│ │ (Docker) │
└──────────┘ └──────────┘ └──────────┘
│
▼
┌──────────────┐
│ Docker Build │
│ (multi-stage) │
└──────────────┘
| Scan | Tool | Scope |
|---|---|---|
| Dependency Audit | pip-audit | All Python deps |
| SAST | Semgrep | core/auth, core/api, core/config, core/database |
| Secret Scanning | Gitleaks | Entire repo |
./ci/deploy.sh # Standard deploy
./ci/deploy.sh --build # Build from source
./ci/deploy.sh --no-migrate # Skip database migrations
./ci/deploy.sh --rollback # Rollback to previous image

Zero-downtime process:
- Save current image digest (for rollback)
- Pull/build new image
- Run migrations in ephemeral container
- Restart services
- Health check polling (30 retries × 2s)
- If health check fails, automatic rollback
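The polling and rollback steps above can be sketched as follows. This is not the contents of `ci/deploy.sh`: the function names and the injected `restart`, `check`, and `rollback` callables are hypothetical, and only the 30 × 2s defaults are taken from the list.

```python
# Sketch of health-check polling with automatic rollback.
# Hypothetical helpers; the real logic lives in ci/deploy.sh.
import time
from typing import Callable


def wait_until_healthy(check: Callable[[], bool],
                       retries: int = 30,
                       delay_seconds: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() up to `retries` times, sleeping between attempts."""
    for attempt in range(1, retries + 1):
        if check():
            return True
        if attempt < retries:
            sleep(delay_seconds)  # no need to sleep after the final attempt
    return False


def deploy(restart: Callable[[], None],
           check: Callable[[], bool],
           rollback: Callable[[], None],
           **poll_options) -> str:
    """Restart services, then roll back if they never become healthy."""
    restart()
    if wait_until_healthy(check, **poll_options):
        return "deployed"
    rollback()
    return "rolled_back"
```

With the defaults, a failing deployment is rolled back after roughly one minute of polling, plus however long each health check takes.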
# Create new migration
alembic revision --autogenerate -m "add column X"
# Apply all pending migrations
alembic upgrade head
# Rollback one migration
alembic downgrade -1
# View migration history
alembic history

Daily automated backups at 03:00 UTC:
# Manual backup
docker exec $(docker ps -q -f name=db) \
pg_dump -U rendertrust rendertrust | gzip > backup-$(date +%Y%m%d).sql.gz
# Restore from backup
gunzip -c backup-20260313.sql.gz | \
docker exec -i $(docker ps -q -f name=db) psql -U rendertrust rendertrust

Retention: 30 days locally; syncing backups to S3 is recommended for off-site storage.
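A 30-day retention policy amounts to deleting any backup whose date is older than the cutoff. The sketch below selects expired files, assuming the `backup-YYYYMMDD.sql.gz` naming used in the manual backup command. The `expired_backups` helper is hypothetical and only returns names; deleting the files is left to the caller.

```python
# Sketch: pick backups older than the retention window.
# Assumes the backup-YYYYMMDD.sql.gz naming from the manual backup command.
import re
from datetime import date, datetime, timedelta
from typing import Iterable

BACKUP_NAME = re.compile(r"^backup-(\d{8})\.sql\.gz$")


def expired_backups(filenames: Iterable[str],
                    today: date,
                    retention_days: int = 30) -> list[str]:
    """Return backup filenames dated before today - retention_days."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in filenames:
        match = BACKUP_NAME.match(name)
        if not match:
            continue  # ignore files that are not backups
        try:
            taken = datetime.strptime(match.group(1), "%Y%m%d").date()
        except ValueError:
            continue  # eight digits that are not a real date
        if taken < cutoff:
            expired.append(name)
    return sorted(expired)
```

Parsing the date from the filename rather than the file's modification time keeps the result stable if backups are copied or restored from off-site storage.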
- Database Corruption: Restore from latest pg_dump backup
- Application Failure: `./ci/deploy.sh --rollback`
- VPS Failure: Provision new VPS, install Coolify, restore DB backup, deploy
- DNS/CDN Failure: Direct-to-IP fallback (temporary)
| Metric | Target |
|---|---|
| RPO (data loss) | 24 hours (daily backups) |
| RTO (recovery time) | 1 hour (new VPS + restore) |
MIT License | Copyright (c) 2026 ByBren, LLC