
CockroachDB Live Evolution — Coding Agent Context (UPDATED)

This document defines the full context for a coding agent working on the cockroachdb-live-evolution repository, plus human-readable documentation for maintainers and contributors.

The goal: make it trivial for an AI coding agent (and humans) to understand what this repo does, how it’s structured, and how to safely extend it — including adding real Grafana dashboards and a small sample application that participates in the evolution story.

This file is the single source of truth for:

  • Repo purpose and narrative
  • Folder structure and conventions
  • Safe evolution constraints (zero downtime, no destructive ops)
  • New deliverables: observability + sample app

1. Repository purpose

This repository demonstrates how to evolve a CockroachDB deployment on Kubernetes:

  • From single node (Postgres-like)
  • To two nodes (introducing a “reader” path + read-only user)
  • To three nodes (full RF=3 distributed cluster)
  • With an optional multi-cluster extension (conceptual)

Key properties:

  • Zero downtime during topology changes
  • CockroachDB Operator managed
  • Continuous load (Cockroach workload tool)
  • Backups between every phase
  • Mermaid diagrams for architecture and evolution
  • Demo script for live presentations

New in this update:

  • Real Prometheus + Grafana observability, with importable dashboards
  • A small sample app that:
    • Performs writes via crdb-writer
    • Performs reads via crdb-reader (with a read-only DB user)
    • Emits metrics and traces (OpenTelemetry optional but preferred)
    • Helps demonstrate “viral login/search spike” behavior safely
    • Uses the same schema and database as the CockroachDB load test application

The repo is meant as:

  • A reference implementation for teams migrating from Postgres to CockroachDB
  • A live demo for talks, workshops, internal enablement
  • A starting point for more advanced multi-cluster/multi-region patterns

2. High-level architecture

2.1 Logical evolution

  1. Phase 0 — Single node

    • 1 CockroachDB node
    • cluster.default_replication_factor = 1
    • Behaves like a single Postgres instance
    • Sample app runs and functions (no code changes between phases)
  2. Phase 1 — Two nodes (reader path)

    • 2 CockroachDB nodes
    • cluster.default_replication_factor = 2
    • “Writer” path for main app
    • “Reader” path for legacy system (read-only user)
    • Sample app uses:
      • Writes via writer service + writer user
      • Reads via reader service + readonly user (to validate the contract)
  3. Phase 2 — Three nodes (distributed cluster)

    • 3 CockroachDB nodes
    • cluster.default_replication_factor = 3
    • Full RF=3, fault-tolerant cluster
    • Sample app continues uninterrupted
  4. Phase 3 — Optional multi-cluster

    • Conceptual extension for DR / multi-cluster
    • Diagrammed and described, not required for base demo
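This document refers to a default replication factor per phase; in CockroachDB that is typically expressed through zone configurations rather than a single cluster setting. A hedged sketch of how each phase's RF change might be applied (the exact statements for this repo live under k8s/sql/ and may differ):

```sql
-- Illustrative only: adjust the default replication factor per phase
-- via CockroachDB zone configuration.

-- Phase 0: single node, behaves like one Postgres instance
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 1;

-- Phase 1: two nodes
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 2;

-- Phase 2: full fault-tolerant cluster
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 3;
```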

2.2 Kubernetes components

Core:

  • CockroachDB Operator manages:
    • CrdbCluster resource (k8s/crdb-cluster.yaml)
    • Node count scaling (spec.nodes)
  • Services:
    • crdb-writer — primary application entrypoint
    • crdb-reader — legacy read-only entrypoint
  • Load:
    • crdb-loadgen — Cockroach workload tool (TPC-C by default)

New:

  • Sample app:
    • K8s Deployment + Service
    • Configured to hit CockroachDB via writer/reader services
    • Provides endpoints to simulate “viral spike” safely
    • Exposes Prometheus metrics for app behavior and DB errors/latency
    • Uses the same database and schema as the CockroachDB load test application

Observability:

  • Prometheus scraping:
    • CockroachDB metrics endpoint
    • Sample app metrics endpoint
    • Kubernetes resource metrics (optional; keep minimal)
  • Grafana dashboards:
    • CockroachDB overview
    • Evolution-specific panels (RF change, rebalancing, replicas, range health)
    • Sample app behavior (RPS, latency, error rate, read/write split)

3. Repository layout (UPDATED)

cockroachdb-live-evolution/
├── README.md
├── context.md                         # This file (single source of truth)
├── demo/
│   └── DEMO.md                        # Live demo script (10–15 min)
├── diagrams/
│   ├── phase0-single-node.md
│   ├── phase1-two-nodes.md
│   ├── phase2-three-nodes.md
│   ├── phase3-multicluster.md
│   └── evolution-timeline.md
├── k8s/
│   ├── crdb-cluster.yaml              # Operator-managed Cockroach cluster
│   ├── services.yaml                  # crdb-writer and crdb-reader Services
│   ├── loadgen-job.yaml               # Cockroach workload generator
│   ├── sql/                           # SQL files applied in phases
│   │   ├── 00-bootstrap.sql
│   │   ├── 10-phase1-reader-user.sql
│   │   └── 20-phase2-rf3.sql
│   ├── app/                           # Sample app manifests
│   │   ├── namespace.yaml
│   │   ├── configmap.yaml
│   │   ├── secret.yaml                # Demo-friendly credentials (documented)
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── hpa.yaml                   # Optional, clearly labeled
│   └── observability/                 # Prometheus + Grafana setup
│       ├── namespace.yaml
│       ├── prometheus.yaml            # Simple Prometheus deployment
│       ├── grafana.yaml               # Grafana deployment + service
│       ├── datasources.yaml           # Grafana datasource provisioning
│       ├── dashboards.yaml            # Grafana dashboards provisioning
│       └── servicemonitors.yaml       # If using Prometheus Operator; optional
├── grafana/
│   ├── dashboards/
│   │   ├── cockroachdb-overview.json
│   │   ├── cockroachdb-evolution.json
│   │   └── sample-app.json
│   └── README.md                      # How to run dashboards locally/in cluster
├── app/                               # Sample app source
│   ├── README.md
│   ├── src/
│   ├── Dockerfile
│   └── k8s-local/                     # Optional quickstart manifests for local
└── scripts/
    ├── evolve.sh                      # Optional: 1→2→3 + RF updates + backups
    ├── backup.sh                      # Explicit backup helper
    └── check.sh                       # Health checks + phase verification

Notes:

  • Keep the base demo runnable with only kubectl apply -f k8s/....
  • Any “optional” components must be clearly labeled and not required for the core narrative.

4. New deliverable: Sample app (DETAILED REQUIREMENTS)

4.1 Goals

The sample app exists to:

  • Provide a realistic “viral login/search spike” scenario
  • Demonstrate read/write separation by endpoint AND by DB user permission
  • Provide metrics that align with the blog’s narrative:
    • login rate spikes
    • search rate spikes
    • latency changes
    • error rate if misconfigured
  • Remain safe and non-destructive
  • Use SignalR to visually represent a timeline of app behavior and metrics
  • Implement a simple UI dashboard that shows app behavior and metrics in real time
  • Accept commands such as start load, stop load, start readers, stop readers, start writers, stop writers

4.2 App behavior contract

The app must expose these HTTP endpoints:

  • POST /login
    • Writes a login event to the DB through the writer connection
    • Minimal schema: user_id, ts, ip_hash, user_agent_hash (or similar)
    • Must be idempotent-safe (e.g., insert-only event table)
  • GET /search?q=...
    • Reads via the reader connection (readonly user)
    • Query is safe and indexed (avoid table scans where possible)
    • Reads are the “hot path” and frequently called in spike mode
  • POST /spike
    • Accepts a JSON body, e.g. { "seconds": 30, "rps": 200, "mix": {"login": 0.2, "search": 0.8} }
    • Generates load internally (async workers) against its own endpoints OR via direct DB calls
    • Must have guardrails: max seconds, max rps; safe defaults
  • GET /healthz
    • Liveness
  • GET /readyz
    • Readiness checks that:
      • The writer connection works
      • The reader connection works
      • (Optional) the readonly user cannot write (lightweight test)

4.3 Database schema contract

Use a small schema with a clear story:

  • users (optional)
  • login_events (write-heavy)
  • search_terms (seed data; read-heavy)
  • search_events (optional, can be write-heavy; avoid if it complicates the demo)

Minimum required:

  • login_events (writes)
  • search_terms (reads)

Schema must be created via k8s/sql/00-bootstrap.sql so the demo is reproducible.
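A minimal sketch of what 00-bootstrap.sql might contain, assuming CockroachDB-compatible DDL. Table and column names follow the contract above; the database name (appdb), types, and index are illustrative assumptions, not the repo's actual schema:

```sql
-- Illustrative bootstrap schema (actual file: k8s/sql/00-bootstrap.sql).
-- The database name "appdb" is an assumption.
CREATE DATABASE IF NOT EXISTS appdb;

-- Insert-only event table backing POST /login (idempotent-safe)
CREATE TABLE IF NOT EXISTS appdb.login_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL,
    ts TIMESTAMPTZ NOT NULL DEFAULT now(),
    ip_hash STRING NOT NULL,
    user_agent_hash STRING NOT NULL
);

-- Seed data backing GET /search
CREATE TABLE IF NOT EXISTS appdb.search_terms (
    term STRING PRIMARY KEY,
    weight FLOAT NOT NULL DEFAULT 1.0
);

-- Secondary index so recent-logins queries avoid full scans
CREATE INDEX IF NOT EXISTS login_events_user_ts_idx
    ON appdb.login_events (user_id, ts DESC);
```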

4.4 Credentials and security

The app must support:

  • APP_DB_WRITER_URL (points to crdb-writer)
  • APP_DB_READER_URL (points to crdb-reader)
  • TLS on by default if Cockroach is TLS-enabled in the manifests

For demo simplicity:

  • Use Kubernetes Secrets to mount credentials
  • Provide documented defaults
  • Do not hardcode secrets in the app source

The readonly user must be created in the Phase 1 SQL:

  • k8s/sql/10-phase1-reader-user.sql
  • Must GRANT SELECT only (no INSERT/UPDATE/DELETE)
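A sketch of the Phase 1 readonly-user SQL. The user name, password, and table names are assumptions for illustration; the real statements belong in k8s/sql/10-phase1-reader-user.sql:

```sql
-- Illustrative Phase 1 SQL (actual file: k8s/sql/10-phase1-reader-user.sql).
-- User name, password, and table names are assumptions.
CREATE USER IF NOT EXISTS app_reader WITH PASSWORD 'demo-only-change-me';

-- SELECT only: no INSERT/UPDATE/DELETE, so any accidental write
-- through the reader path fails loudly instead of silently succeeding.
GRANT SELECT ON TABLE appdb.login_events, appdb.search_terms TO app_reader;
```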

4.5 Observability in the app

The app must expose Prometheus metrics at GET /metrics.

Required metrics (names can vary, but must exist):

  • HTTP requests total by route/status
  • Request duration histogram by route
  • DB query duration histogram, split by writer vs reader
  • DB errors total, split by writer vs reader
  • Spike mode current state (gauges): active_workers, target_rps

Optional (preferred):

  • OpenTelemetry tracing (OTLP exporter)
  • Trace DB calls and HTTP spans
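Since the contract allows metric names to vary, the following is only an illustration of what the /metrics endpoint might emit in Prometheus exposition format; every name and value here is an example, not a requirement:

```
# Illustrative /metrics output (names and values are examples)
http_requests_total{route="/login",status="200"} 1042
http_request_duration_seconds_bucket{route="/search",le="0.05"} 981
db_query_duration_seconds_bucket{pool="writer",le="0.01"} 733
db_errors_total{pool="reader"} 0
spike_active_workers 8
spike_target_rps 200
```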

4.6 Language + framework choice

Keep it simple and common. Recommended:

  • .NET minimal API (Npgsql)

Selection criteria:

  • Single binary/container is ideal
  • Simple metrics library support
  • Straightforward Dockerfile

5. New deliverable: Observability (Prometheus + Grafana)

5.1 Goals

Observability must:

  • Be easy to deploy
  • Provide dashboards that match the demo narrative
  • Avoid requiring a full kube-prometheus-stack unless explicitly labeled optional

Default: a minimal in-cluster Prometheus + Grafana setup under k8s/observability/.

Optional: a second path using the Helm kube-prometheus-stack, documented in grafana/README.md.

5.2 Prometheus scraping requirements

Prometheus must scrape:

  • The CockroachDB metrics endpoint
    • Cockroach exposes Prometheus metrics; configure the scrape target via Service discovery or a static target
    • Ensure the scrape works in all phases
  • The sample app’s /metrics endpoint

Scrape interval: a default of 15s is fine for demo purposes.
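A sketch of the static-target variant of the Prometheus scrape config. The Service hostnames and ports are assumptions based on this repo's Service names and CockroachDB's default HTTP port; the repo may use Kubernetes service discovery instead:

```yaml
# Illustrative prometheus.yml fragment (static targets).
# Hostnames/ports are assumptions, not confirmed by the repo.
global:
  scrape_interval: 15s          # demo-friendly default

scrape_configs:
  - job_name: cockroachdb
    metrics_path: /_status/vars # CockroachDB's Prometheus endpoint
    static_configs:
      - targets: ["crdb-writer.default.svc:8080"]

  - job_name: sample-app
    metrics_path: /metrics
    static_configs:
      - targets: ["sample-app.default.svc:8080"]
```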

5.3 Grafana provisioning requirements

Grafana should be deployed with:

  • A provisioned Prometheus datasource
  • Provisioned dashboards loaded from grafana/dashboards/*.json

Dashboards must be importable without manual clicks.

5.4 Required dashboards (REAL JSON)

Create three real dashboards, stored as JSON:

  • grafana/dashboards/cockroachdb-overview.json. Panels must include:
    • Node liveness / up
    • SQL throughput (QPS / TPS)
    • P50/P95/P99 SQL latency
    • CPU/memory per node (if available)
    • Disk / capacity (if available)
  • grafana/dashboards/cockroachdb-evolution.json. Panels must include evolution-specific views:
    • Replication factor setting (or a proxy metric showing replicas per range)
    • Under-replicated ranges
    • Range rebalancing activity
    • Replica count by node
    • Raft leadership distribution (if available)
    • Store capacity and rebalance signals
  • grafana/dashboards/sample-app.json. Panels must include:
    • RPS by route (/login, /search)
    • Error rate by route/status
    • Latency percentiles per route
    • DB latency, writer vs reader
    • Spike state: target_rps, active_workers

Each dashboard must:

  • Use variables where helpful (namespace, pod)
  • Avoid environment-specific hardcoding
  • Have clear titles and short panel descriptions
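A few illustrative panel queries for these dashboards. The app metric names are assumptions (the contract lets them vary); ranges_underreplicated is a standard CockroachDB metric name:

```promql
# RPS by route (sample-app dashboard); metric name is an assumption
sum by (route) (rate(http_requests_total[1m]))

# P95 latency per route (sample-app dashboard)
histogram_quantile(0.95,
  sum by (route, le) (rate(http_request_duration_seconds_bucket[5m])))

# Under-replicated ranges (cockroachdb-evolution dashboard)
sum(ranges_underreplicated)
```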

5.5 Documentation requirements

Add grafana/README.md covering:

  • How to deploy the observability manifests
  • Port-forward instructions for Prometheus and Grafana
  • Default credentials (if any) and how to change them
  • The dashboard list and what each is for
  • Troubleshooting steps (missing scrape target, no data, etc.)

6. Demo narrative updates required

The live demo script (demo/DEMO.md) must be updated to include:

  • Starting the observability stack (optional step, but recommended)
  • Starting the sample app
  • Demonstrating spike mode
  • Calling out what to look for in the dashboards during each phase

Required “story beats” in the script:

  • Phase 0: observe baseline + single-node fragility
  • Spike: show read pressure rising
  • Phase 1: create the readonly user, show the reader path, see stabilization
  • Phase 2: scale to 3 nodes, RF=3, watch under-replication drop and rebalancing settle
  • “Boring is good”: graphs remain stable while changes occur

The README should link to:

  • The demo script
  • Diagrams
  • Dashboards
  • App usage

7. Coding agent goals (UPDATED)

When acting on this repo, the coding agent must prioritize:

7.1 Preserve the demo narrative

The evolution path (1 → 2 → 3 nodes) must remain clear and reproducible.

Demo script, diagrams, dashboards, and manifests must stay in sync.

7.2 Maintain zero-downtime semantics

Any changes must not introduce:

  • A forced restart of all DB nodes simultaneously
  • Breaking changes to the Services (crdb-writer, crdb-reader)
  • Removal or interruption of load generation during evolution
  • Destructive SQL operations

7.3 Keep configuration explicit and auditable

Prefer clear YAML and SQL snippets over hidden automation.

Document any new flags/settings in README and/or DEMO.md.

7.4 Add observability without bloat

Default path should be minimal Prometheus + Grafana. Optional helm-based setup is allowed but must be clearly labeled optional.

7.5 Sample app must be demo-safe

No destructive actions. Guardrails on load generation. Clear separation of writer/reader connectivity.

8. Coding agent constraints (UPDATED)

The coding agent must not:

  • Introduce destructive operations into the main demo flow:
    • No DROP DATABASE/TABLE in core scripts
    • No kubectl delete of core resources outside an explicit “optional teardown” section
  • Change the core semantics of:
    • crdb-writer as the primary entrypoint for writes
    • crdb-reader as the read-only entrypoint for legacy reads
  • Assume non-operator deployments:
    • This repo is Operator-based
    • If adding non-operator examples, keep them in a clearly separated folder

The coding agent should:

  • Keep YAML valid and kubectl apply-able
  • Keep Mermaid diagrams syntactically correct
  • Keep SQL compatible with CockroachDB

9. Acceptance criteria (for PR completion)

A PR implementing this update is complete when:

  • kubectl apply -f k8s/crdb-cluster.yaml brings up the Phase 0 CockroachDB cluster
  • kubectl apply -f k8s/services.yaml creates the writer and reader Services
  • kubectl apply -f k8s/loadgen-job.yaml starts the continuous workload
  • kubectl apply -f k8s/app/ deploys the sample app, and it can:
    • POST /login successfully writes (writer connection)
    • GET /search successfully reads (reader connection)
  • Phase 1 SQL creates a readonly user and the app uses it for /search
  • kubectl apply -f k8s/observability/ brings up Prometheus and Grafana
  • Grafana loads all three dashboards automatically and they show live data
  • The demo script is updated to narrate metrics during evolution
  • No step requires manual dashboard import, hidden scripts, or destructive ops

10. Example prompts for the coding agent (UPDATED)

  • “Implement the sample app as a .NET 10 minimal API with /login, /search, /spike, /metrics, and deploy it under k8s/app/.”
  • “Create SQL bootstrap scripts under k8s/sql/ and update the demo flow to apply them in phases.”
  • “Add minimal Prometheus + Grafana manifests under k8s/observability/ and provision dashboards automatically.”
  • “Build three real Grafana dashboards in JSON and store them under grafana/dashboards/.”
  • “Update README and demo/DEMO.md to include app usage, spike story beats, and dashboard callouts.”

11. Summary

This repo is a live-evolution demo:

  • 1 → 2 → 3 CockroachDB nodes under continuous load
  • Backups between phases
  • Services that separate writer and reader intent
  • New: a sample app + real Grafana dashboards to make the story visible

If you’re a human: treat this as your orientation guide. If you’re a coding agent: treat this as your operational contract.