This document defines the full context for a coding agent working on the cockroachdb-live-evolution repository, plus human-readable documentation for maintainers and contributors.
The goal: make it trivial for an AI coding agent (and humans) to understand what this repo does, how it’s structured, and how to safely extend it — including adding real Grafana dashboards and a small sample application that participates in the evolution story.
This file is the single source of truth for:
- Repo purpose and narrative
- Folder structure and conventions
- Safe evolution constraints (zero downtime, no destructive ops)
- New deliverables: observability + sample app
This repository demonstrates how to evolve a CockroachDB deployment on Kubernetes:
- From single node (Postgres-like)
- To two nodes (introducing a “reader” path + read-only user)
- To three nodes (full RF=3 distributed cluster)
- With an optional multi-cluster extension (conceptual)
Key properties:
- Zero downtime during topology changes
- CockroachDB Operator managed
- Continuous load (Cockroach `workload` tool)
- Backups between every phase
- Mermaid diagrams for architecture and evolution
- Demo script for live presentations
New in this update:
- Real Prometheus + Grafana observability, with importable dashboards
- A small sample app that:
  - Performs writes via `crdb-writer`
  - Performs reads via `crdb-reader` (with a read-only DB user)
  - Emits metrics and traces (OpenTelemetry optional but preferred)
  - Helps demonstrate “viral login/search spike” behavior safely
  - Uses the same schema and database as the CockroachDB load test application
The repo is meant as:
- A reference implementation for teams migrating from Postgres to CockroachDB
- A live demo for talks, workshops, internal enablement
- A starting point for more advanced multi-cluster/multi-region patterns
- Phase 0 — Single node
  - 1 CockroachDB node
  - `cluster.default_replication_factor = 1`
  - Behaves like a single Postgres instance
  - Sample app runs and functions (no code changes between phases)
- Phase 1 — Two nodes (reader path)
  - 2 CockroachDB nodes
  - `cluster.default_replication_factor = 2`
  - “Writer” path for main app
  - “Reader” path for legacy system (read-only user)
  - Sample app uses:
    - Writes via writer service + writer user
    - Reads via reader service + readonly user (to validate the contract)
- Phase 2 — Three nodes (distributed cluster)
  - 3 CockroachDB nodes
  - `cluster.default_replication_factor = 3`
  - Full RF=3, fault-tolerant cluster
  - Sample app continues uninterrupted
- Phase 3 — Optional multi-cluster
  - Conceptual extension for DR / multi-cluster
  - Diagrammed and described, not required for base demo
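The phase SQL files can be sketched along these lines. Note that in CockroachDB the replication factor is configured through zone configurations, so the `cluster.default_replication_factor` values listed above would typically translate into a statement like this (shown for Phase 2; file name mirrors the repo layout):

```sql
-- Sketch of what k8s/sql/20-phase2-rf3.sql might contain.
-- CockroachDB sets the replication factor via zone configurations.
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 3;

-- Verify the change; under-replicated ranges should trend to zero
-- as rebalancing settles.
SHOW ZONE CONFIGURATION FOR RANGE default;
```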
Core:
- CockroachDB Operator manages:
  - `CrdbCluster` resource (`k8s/crdb-cluster.yaml`)
  - Node count scaling (`spec.nodes`)
- Services:
  - `crdb-writer` — primary application entrypoint
  - `crdb-reader` — legacy read-only entrypoint
- Load:
  - `crdb-loadgen` — Cockroach `workload` tool (TPC-C by default)
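One possible shape for the two Services is sketched below. The selector labels are assumptions and must match what the CockroachDB Operator actually applies to the pods; both Services front the same CockroachDB pods, and the writer/reader split is a logical contract enforced by the DB users the app connects with.

```yaml
# Sketch of k8s/services.yaml (labels and ports are assumptions).
apiVersion: v1
kind: Service
metadata:
  name: crdb-writer
spec:
  selector:
    app.kubernetes.io/name: cockroachdb   # assumed operator-applied label
  ports:
    - name: sql
      port: 26257
      targetPort: 26257
---
apiVersion: v1
kind: Service
metadata:
  name: crdb-reader
spec:
  selector:
    app.kubernetes.io/name: cockroachdb   # same pods; read-only by DB user
  ports:
    - name: sql
      port: 26257
      targetPort: 26257
```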
New:
- Sample app:
  - K8s Deployment + Service
  - Configured to hit CockroachDB via writer/reader services
  - Provides endpoints to simulate a “viral spike” safely
  - Exposes Prometheus metrics for app behavior and DB errors/latency
  - Uses the same database and schema as the CockroachDB load test application
Observability:
- Prometheus scraping:
  - CockroachDB metrics endpoint
  - Sample app metrics endpoint
  - Kubernetes resource metrics (optional; keep minimal)
- Grafana dashboards:
  - CockroachDB overview
  - Evolution-specific panels (RF change, rebalancing, replicas, range health)
  - Sample app behavior (RPS, latency, error rate, read/write split)
cockroachdb-live-evolution/
├── README.md
├── context.md # This file (single source of truth)
├── demo/
│ └── DEMO.md # Live demo script (10–15 min)
├── diagrams/
│ ├── phase0-single-node.md
│ ├── phase1-two-nodes.md
│ ├── phase2-three-nodes.md
│ ├── phase3-multicluster.md
│ └── evolution-timeline.md
├── k8s/
│ ├── crdb-cluster.yaml # Operator-managed Cockroach cluster
│ ├── services.yaml # crdb-writer and crdb-reader Services
│ ├── loadgen-job.yaml # Cockroach workload generator
│ ├── sql/ # SQL files applied in phases
│ │ ├── 00-bootstrap.sql
│ │ ├── 10-phase1-reader-user.sql
│ │ └── 20-phase2-rf3.sql
│ ├── app/ # Sample app manifests
│ │ ├── namespace.yaml
│ │ ├── configmap.yaml
│ │ ├── secret.yaml # Demo-friendly credentials (documented)
│ │ ├── deployment.yaml
│ │ ├── service.yaml
│ │ └── hpa.yaml # Optional, clearly labeled
│ └── observability/ # Prometheus + Grafana setup
│ ├── namespace.yaml
│ ├── prometheus.yaml # Simple Prometheus deployment
│ ├── grafana.yaml # Grafana deployment + service
│ ├── datasources.yaml # Grafana datasource provisioning
│ ├── dashboards.yaml # Grafana dashboards provisioning
│ └── servicemonitors.yaml # If using Prometheus Operator; optional
├── grafana/
│ ├── dashboards/
│ │ ├── cockroachdb-overview.json
│ │ ├── cockroachdb-evolution.json
│ │ └── sample-app.json
│ └── README.md # How to run dashboards locally/in cluster
├── app/ # Sample app source
│ ├── README.md
│ ├── src/
│ ├── Dockerfile
│ └── k8s-local/ # Optional quickstart manifests for local
└── scripts/
├── evolve.sh # Optional: 1→2→3 + RF updates + backups
├── backup.sh # Explicit backup helper
└── check.sh # Health checks + phase verification
Keep the base demo runnable with only `kubectl apply -f k8s/...`.
Any “optional” components must be clearly labeled and not required for the core narrative.
The sample app exists to:
- Provide a realistic “viral login/search spike” scenario
- Demonstrate read/write separation by endpoint AND by DB user permission
- Provide metrics that align with the blog’s narrative:
  - login rate spikes
  - search rate spikes
  - latency changes
  - error rate if misconfigured
- Remain safe and non-destructive
- Use SignalR to visually represent a timeline of the app behavior and metrics
- Implement a simple UI dashboard that shows the app behavior and metrics in real time
The sample app should accept commands like start load, stop load, start readers, stop readers, start writers, stop writers, etc.
The app must expose these HTTP endpoints:
- `POST /login`
  - Writes a login event to the DB through the writer connection
  - Minimal schema: user_id, ts, ip_hash, user_agent_hash (or similar)
  - Must be idempotent-safe (e.g., insert-only event table)
- `GET /search?q=...`
  - Reads via the reader connection (readonly user)
  - Query is safe and indexed (avoid table scans where possible)
  - Reads should be “hot path” and frequently called in spike mode
- `POST /spike`
  - Accepts a JSON body, e.g. `{ "seconds": 30, "rps": 200, "mix": {"login": 0.2, "search": 0.8} }`
  - Generates load internally (async workers) against its own endpoints OR direct DB calls
  - Must have guardrails: max seconds, max rps; safe defaults
- `GET /healthz`
  - Liveness
- `GET /readyz`
  - Readiness checks that:
    - Writer connection works
    - Reader connection works
    - (Optional) confirms the readonly user cannot write (lightweight test)
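The `/spike` guardrails described above can be sketched as a small clamping step before any workers are started. This is illustrative Python (the app itself is .NET); all names and limits here are assumptions, not taken from the repo:

```python
# Hypothetical guardrail logic for POST /spike: clamp the requested load
# to hard caps and normalize the route mix, so the spike stays safe.
MAX_SECONDS = 60   # assumed ceiling; the repo may choose different limits
MAX_RPS = 500

def clamp_spike(body: dict) -> dict:
    """Return a sanitized spike config with safe defaults and hard caps."""
    seconds = min(int(body.get("seconds", 10)), MAX_SECONDS)
    rps = min(int(body.get("rps", 50)), MAX_RPS)
    mix = body.get("mix", {"login": 0.2, "search": 0.8})
    total = sum(mix.values()) or 1.0
    mix = {route: weight / total for route, weight in mix.items()}  # weights sum to 1
    return {"seconds": max(seconds, 1), "rps": max(rps, 1), "mix": mix}

print(clamp_spike({"seconds": 300, "rps": 10_000}))
```

Rejecting out-of-range requests with a 400 instead of clamping is an equally valid choice; the point is that no request can exceed the caps.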
Use a small schema with a clear story:
- `users` (optional)
- `login_events` (write-heavy)
- `search_terms` (seed data; read-heavy)
- `search_events` (optional, can be write-heavy — avoid if it complicates)

Minimum required:
- `login_events` (writes)
- `search_terms` (reads)

Schema must be created via `k8s/sql/00-bootstrap.sql` so the demo is reproducible.
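A minimal sketch of what the bootstrap file might contain. Table and column names follow the minimum schema above; the database name `app` and the exact column types are assumptions:

```sql
-- Sketch of k8s/sql/00-bootstrap.sql (CockroachDB dialect).
CREATE DATABASE IF NOT EXISTS app;

CREATE TABLE IF NOT EXISTS app.login_events (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,  -- insert-only event table
  user_id UUID NOT NULL,
  ts TIMESTAMPTZ NOT NULL DEFAULT now(),
  ip_hash STRING,
  user_agent_hash STRING
);

CREATE TABLE IF NOT EXISTS app.search_terms (
  term STRING PRIMARY KEY,   -- indexed, so GET /search avoids table scans
  weight FLOAT NOT NULL DEFAULT 1.0
);
```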
The app must support:
- `APP_DB_WRITER_URL` (points to `crdb-writer`)
- `APP_DB_READER_URL` (points to `crdb-reader`)
- TLS on by default if Cockroach is TLS-enabled in manifests

For demo simplicity:
- Use Kubernetes Secrets to mount credentials
- Provide documented defaults
- Do not hardcode secrets in the app source
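The Secret carrying both connection URLs could look like the sketch below. The URLs, user names, and database name are placeholders; real values depend on the cluster DNS and the users created by the SQL scripts:

```yaml
# Sketch of k8s/app/secret.yaml (demo-friendly, documented credentials).
apiVersion: v1
kind: Secret
metadata:
  name: sample-app-db
type: Opaque
stringData:
  APP_DB_WRITER_URL: "postgresql://app_writer@crdb-writer:26257/app?sslmode=verify-full"
  APP_DB_READER_URL: "postgresql://app_reader@crdb-reader:26257/app?sslmode=verify-full"
```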
The readonly user creation must occur in Phase 1 SQL:
- `k8s/sql/10-phase1-reader-user.sql`
- Must `GRANT SELECT` only (no INSERT/UPDATE/DELETE)
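A hedged sketch of the Phase 1 file; the user name `app_reader` and the table names are assumptions, and the only hard requirement is SELECT-only access:

```sql
-- Sketch of k8s/sql/10-phase1-reader-user.sql (CockroachDB dialect).
CREATE USER IF NOT EXISTS app_reader;
GRANT SELECT ON TABLE app.login_events, app.search_terms TO app_reader;
-- No INSERT/UPDATE/DELETE grants: any write attempted as this user must fail.
```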
The app must expose Prometheus metrics at:
- `GET /metrics`

Required metrics (names can vary, but must exist):
- HTTP requests total by route/status
- Request duration histogram by route
- DB query duration histogram, split by writer vs reader
- DB errors total, split by writer vs reader
- Spike mode current state (gauges): active_workers, target_rps

Optional (preferred):
- OpenTelemetry tracing (OTLP exporter)
- Trace DB calls and HTTP spans
Keep it simple and common. Recommended:
- .NET minimal API (Npgsql)

Selection criteria:
- A single binary/container is ideal
- Simple metrics library support
- Straightforward Dockerfile
Observability must:
- Be easy to deploy
- Provide dashboards that match the demo narrative
- Avoid requiring a full kube-prometheus-stack unless explicitly labeled optional

Default: provide a minimal in-cluster Prometheus + Grafana setup under `k8s/observability/`.
Optional: provide a second option using the Helm kube-prometheus-stack chart, documented in `grafana/README.md`.
Prometheus must scrape:
- CockroachDB metrics endpoint
  - Cockroach exposes Prometheus metrics; configure the scrape target via Service discovery or a static target
  - Ensure the scrape works in all phases
- Sample app `/metrics`

Scrape interval defaults:
- 15s is fine for demo purposes
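A minimal static-target sketch of the scrape configuration. CockroachDB serves Prometheus metrics at `/_status/vars` on its HTTP port (8080 by default); the service names, namespace, and app port below are assumptions:

```yaml
# Sketch of the scrape_configs section for the minimal Prometheus deployment.
scrape_configs:
  - job_name: cockroachdb
    metrics_path: /_status/vars
    scheme: http            # use https + tls_config if the cluster is TLS-enabled
    static_configs:
      - targets: ["crdb-writer.default.svc:8080"]  # assumed service/namespace
    scrape_interval: 15s
  - job_name: sample-app
    metrics_path: /metrics
    static_configs:
      - targets: ["sample-app.default.svc:8080"]   # assumed app service/port
    scrape_interval: 15s
```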
Grafana should be deployed with:
- A provisioned Prometheus datasource
- Provisioned dashboards loaded from `grafana/dashboards/*.json`

Dashboards must be importable without manual clicks.
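Grafana's file-based provisioning makes the no-manual-clicks requirement straightforward. A sketch of the two provisioning files (mounted via ConfigMaps in `k8s/observability/`); the Prometheus URL and dashboard path are assumptions:

```yaml
# datasources provisioning file (e.g. mounted under
# /etc/grafana/provisioning/datasources/)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # assumed Prometheus Service name
    isDefault: true
---
# dashboards provisioning file (e.g. mounted under
# /etc/grafana/provisioning/dashboards/)
apiVersion: 1
providers:
  - name: default
    type: file
    options:
      path: /var/lib/grafana/dashboards   # JSON files from grafana/dashboards/
```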
Create three real dashboards, stored as JSON:

`grafana/dashboards/cockroachdb-overview.json` — panels must include:
- Node liveness / up
- SQL throughput (QPS / TPS)
- P50/P95/P99 SQL latency
- CPU/memory per node (if available)
- Disk / capacity (if available)
`grafana/dashboards/cockroachdb-evolution.json` — panels must include evolution-specific views:
- Replication factor setting (or a proxy metric showing replicas per range)
- Under-replicated ranges
- Range rebalancing activity
- Replica count by node
- Raft leadership distribution (if available)
- Store capacity and rebalance signals
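A few CockroachDB metric names that could back these panels. These are names CockroachDB exposes at `/_status/vars`; verify them against the version actually deployed:

```promql
# Under-replicated ranges (should drop to 0 once RF=3 settles)
sum(ranges_underreplicated)

# Replica count per node
sum by (instance) (replicas)

# Raft leadership distribution per node
sum by (instance) (replicas_leaders)
```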
`grafana/dashboards/sample-app.json` — panels must include:
- RPS by route (/login, /search)
- Error rate by route/status
- Latency percentiles per route
- DB latency, writer vs reader
- Spike state: target_rps, active_workers
Each dashboard must:
- Use variables where helpful (namespace, pod)
- Avoid environment-specific hardcoding
- Have clear titles and short panel descriptions
Add `grafana/README.md` with:
- How to deploy the observability manifests
- Port-forward instructions for Prometheus and Grafana
- Default credentials (if any) and how to change them
- Dashboard list and what each is for
- Troubleshooting steps (scrape target missing, no data, etc.)
The live demo script (`demo/DEMO.md`) must be updated to include:
- Starting the observability stack (optional step, but recommended)
- Starting the sample app
- Demonstrating spike mode
- Calling out what to look for in dashboards during each phase

Required “story beats” in the script:
- Phase 0: observe baseline + single-node fragility
- Spike: show read pressure rising
- Phase 1: create the readonly user, show the reader path, see stabilization
- Phase 2: scale to 3 nodes, RF=3, watch under-replication drop and rebalancing settle
- “Boring is good”: graphs remain stable while changes occur
The README should link:
- Demo script
- Diagrams
- Dashboards
- App usage
When acting on this repo, the coding agent must prioritize:
- The evolution path (1 → 2 → 3 nodes) must remain clear and reproducible.
- Demo script, diagrams, dashboards, and manifests must stay in sync.
- Any changes must not introduce:
  - Forced restarts of all DB nodes simultaneously
  - Breaking changes to Services (`crdb-writer`, `crdb-reader`)
  - Removal or interruption of load generation during evolution
  - Destructive SQL operations
- Prefer clear YAML and SQL snippets over hidden automation.
- Document any new flags/settings in README and/or DEMO.md.
- The default path should be minimal Prometheus + Grafana. An optional Helm-based setup is allowed but must be clearly labeled optional.
- No destructive actions. Guardrails on load generation. Clear separation of writer/reader connectivity.
The coding agent must not:
- Introduce destructive operations into the main demo flow:
  - No `DROP DATABASE`/`DROP TABLE` in core scripts
  - No `kubectl delete` of core resources without an explicit “optional teardown” section
- Change core semantics of:
  - `crdb-writer` as the primary entrypoint for writes
  - `crdb-reader` as the read-only entrypoint for legacy reads
- Assume non-operator deployments:
  - This repo is Operator-based
  - If adding non-operator examples, keep them in a clearly separated folder
The coding agent should:
- Keep YAML valid and `kubectl apply`-able
- Keep Mermaid diagrams syntactically correct
- Keep SQL compatible with CockroachDB
A PR implementing this update is complete when:
- `kubectl apply -f k8s/crdb-cluster.yaml` brings up Phase 0 CockroachDB
- `kubectl apply -f k8s/services.yaml` creates the writer and reader services
- `kubectl apply -f k8s/loadgen-job.yaml` starts continuous workload
- `kubectl apply -f k8s/app/` deploys the sample app, and it can:
  - `POST /login` successfully writes (writer connection)
  - `GET /search` successfully reads (reader connection)
- Phase 1 SQL creates a readonly user and the app uses it for `/search`
- `kubectl apply -f k8s/observability/` brings up Prometheus and Grafana
- Grafana loads all three dashboards automatically and they show live data
- The demo script is updated to narrate metrics during evolution
- No step requires manual dashboard import, hidden scripts, or destructive ops
“Implement the sample app in .NET 10 minimal API with /login, /search, /spike, /metrics, and deploy it under k8s/app/.”
“Create SQL bootstrap scripts under k8s/sql/ and update demo flow to apply them in phases.”
“Add minimal Prometheus + Grafana manifests under k8s/observability/ and provision dashboards automatically.”
“Build three real Grafana dashboards in JSON and store them under grafana/dashboards/.”
“Update README and demo/DEMO.md to include app usage, spike story beats, and dashboard callouts.”
This repo is a live-evolution demo:
- 1 → 2 → 3 CockroachDB nodes under continuous load
- Backups between phases
- Services separate writer and reader intent
- New: sample app + real Grafana dashboards to make the story visible
If you’re a human: treat this as your orientation guide. If you’re a coding agent: treat this as your operational contract.