Toygres Control Plane - Implementation Plan

Overview

Toygres is a Rust-based control plane for hosting PostgreSQL containers as a service on Azure Kubernetes Service (AKS). It uses the Duroxide framework for durable workflow orchestration and PostgreSQL for metadata storage.

Project Structure

The project is organized as a Cargo workspace with the following crates:

toygres-models: Shared data structures (instance metadata, deployment config, health status)
toygres-activities: Duroxide activities wrapping Azure/K8s operations
toygres-orchestrations: Duroxide orchestrations coordinating the activities
toygres-server: Main control plane server exposing APIs and running the Duroxide worker

Infrastructure Setup

Prerequisites

Azure Kubernetes Service (AKS) cluster already provisioned (scripts/setup-infra.sh, described below, can provision one)
PostgreSQL database for metadata storage
Azure credentials configured (via environment variables or Azure CLI)

Bootstrap Scripts

Create the following scripts to help with infrastructure setup:

scripts/setup-infra.sh: Terraform/Azure CLI script to provision AKS cluster, networking, storage classes
scripts/db-init.sh: Applies the initial CMS migration and prepares the Duroxide schema
scripts/db-migrate.sh: Runs incremental CMS migrations (none yet, but keeps the pattern consistent with duroxide-pg)

Environment Configuration

The control plane uses environment variables for configuration (see .env.example):

DATABASE_URL: Connection string for metadata PostgreSQL database
AKS_CLUSTER_NAME: Name of the AKS cluster
AKS_RESOURCE_GROUP: Azure resource group containing the AKS cluster
AKS_NAMESPACE: Kubernetes namespace for PostgreSQL deployments (default: toygres)
Azure authentication via DefaultAzureCredential
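
The variables above could be loaded along these lines. This is a minimal sketch using only the standard library. The `Config` struct, the `MissingVar` error, and the `from_lookup` helper are hypothetical names, not part of the existing crates. It assumes the first three variables are required and that AKS_NAMESPACE falls back to `toygres`.

```rust
use std::env;

/// Settings read from the environment variables listed above.
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub database_url: String,
    pub aks_cluster_name: String,
    pub aks_resource_group: String,
    pub aks_namespace: String,
}

/// Hypothetical error type: names the first required variable that is missing or empty.
#[derive(Debug, PartialEq)]
pub struct MissingVar(pub &'static str);

impl Config {
    /// Builds a Config from any key -> value lookup, so tests do not have to
    /// touch the real process environment.
    pub fn from_lookup<F>(lookup: F) -> Result<Config, MissingVar>
    where
        F: Fn(&str) -> Option<String>,
    {
        let required = |key: &'static str| {
            lookup(key).filter(|v| !v.is_empty()).ok_or(MissingVar(key))
        };
        Ok(Config {
            database_url: required("DATABASE_URL")?,
            aks_cluster_name: required("AKS_CLUSTER_NAME")?,
            aks_resource_group: required("AKS_RESOURCE_GROUP")?,
            // Optional: defaults to the "toygres" namespace.
            aks_namespace: lookup("AKS_NAMESPACE")
                .filter(|v| !v.is_empty())
                .unwrap_or_else(|| "toygres".to_string()),
        })
    }

    /// Reads the real process environment (after .env has been loaded, if used).
    pub fn from_env() -> Result<Config, MissingVar> {
        Config::from_lookup(|key| env::var(key).ok())
    }
}
```

Taking the lookup as a closure keeps the validation logic testable without mutating process-wide state. The server would call `Config::from_env()` once at startup.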

Core Activities (toygres-activities)

Implement Duroxide activities for atomic operations:

DeployPostgresActivity: Creates K8s resources (StatefulSet, PVC, Service) for a PostgreSQL pod
DeletePostgresActivity: Removes K8s resources for a PostgreSQL instance
GetInstanceStatusActivity: Queries K8s API for pod status
HealthCheckActivity: Connects to PostgreSQL instance and verifies it's responsive
UpdateMetadataActivity: Updates instance state in metadata database
GenerateConnectionStringActivity: Builds connection string from K8s service endpoint
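
As an illustration of what GenerateConnectionStringActivity might compute, here is a small standard-library sketch. The function names are hypothetical. The in-cluster DNS host used in the usage note and the `postgres` database name are assumptions; the real endpoint may instead be an external address taken from the Service.

```rust
/// Percent-encodes everything except RFC 3986 "unreserved" characters, so that
/// credentials containing '@', ':' or '/' cannot break the URI structure.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            other => out.push_str(&format!("%{:02X}", other)),
        }
    }
    out
}

/// Builds a PostgreSQL connection URI from credentials and a service endpoint.
/// Host and port are passed in as-is; the caller decides where they come from.
pub fn build_connection_string(
    user: &str,
    password: &str,
    host: &str,
    port: u16,
    database: &str,
) -> String {
    format!(
        "postgresql://{}:{}@{}:{}/{}",
        percent_encode(user),
        percent_encode(password),
        host,
        port,
        percent_encode(database)
    )
}
```

For example, `build_connection_string("admin", "p@ss:w/rd", "pg-demo.toygres.svc.cluster.local", 5432, "postgres")` escapes the password so that the `@` separating credentials from the host stays unambiguous.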

Technologies:

kube-rs for Kubernetes operations
sqlx for database operations
Azure SDK for Rust for Azure-specific operations (if needed)

Orchestrations (toygres-orchestrations)

Implement durable orchestrations using the Duroxide framework:

1. CreateInstanceOrchestration

Purpose: Create a new PostgreSQL instance

Flow:

Call DeployPostgresActivity with name, credentials
Poll GetInstanceStatusActivity until ready
Call GenerateConnectionStringActivity
Call UpdateMetadataActivity with "running" state
Start detached HealthCheckOrchestration for this instance
Store health check orchestration ID in metadata
Return connection string

Input: DeploymentConfig (name, username, password, storage size, version)

Output: CreateInstanceResponse (instance_id, connection_string, orchestration_id)
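
The polling step in the flow above can be sketched in plain Rust. This is not Duroxide code: inside a real orchestration the wait would presumably use the framework's durable timer rather than a caller-supplied sleep. `PodPhase`, `WaitError`, and `wait_until_ready` are hypothetical names, and the bounded attempt count is an assumption (the plan does not state a timeout).

```rust
use std::time::Duration;

/// Simplified view of what GetInstanceStatusActivity might report.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PodPhase {
    Pending,
    Running,
    Failed,
}

#[derive(Debug, PartialEq)]
pub enum WaitError {
    PodFailed,
    TimedOut { attempts: u32 },
}

/// Polls until the pod is running, fails, or the attempt budget is exhausted.
/// Returns the number of polls it took. `sleep` is injected so the same logic
/// can run against a real timer or a test double.
pub fn wait_until_ready<P, S>(
    mut poll: P,
    mut sleep: S,
    interval: Duration,
    max_attempts: u32,
) -> Result<u32, WaitError>
where
    P: FnMut() -> PodPhase,
    S: FnMut(Duration),
{
    for attempt in 1..=max_attempts {
        match poll() {
            PodPhase::Running => return Ok(attempt),
            PodPhase::Failed => return Err(WaitError::PodFailed),
            PodPhase::Pending => {
                // Do not sleep after the final attempt; just report the timeout.
                if attempt < max_attempts {
                    sleep(interval);
                }
            }
        }
    }
    Err(WaitError::TimedOut { attempts: max_attempts })
}
```

Bounding the loop means a pod that never becomes ready ends as an explicit error, which the orchestration could record as the "failed" state.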

2. DeleteInstanceOrchestration

Purpose: Delete an existing PostgreSQL instance

Flow:

Call UpdateMetadataActivity with "deleting" state
Retrieve the health check orchestration ID from metadata and cancel that orchestration
Call DeletePostgresActivity
Call UpdateMetadataActivity with "deleted" state

Input: Instance ID

Output: Success/failure status

3. HealthCheckOrchestration

Purpose: Continuous health monitoring for a single PostgreSQL instance

Flow:

Loop forever:
- Call HealthCheckActivity for the instance
- Call UpdateMetadataActivity with health status
- Wait 30 seconds

Input: instance_id

Lifecycle: Started by CreateInstanceOrchestration, cancelled by DeleteInstanceOrchestration

4. MonitorOperationOrchestration

Purpose: Query the status of any orchestration

Flow:

Query orchestration status from Duroxide
Return current state (pending, running, completed, failed)

Input: Orchestration ID

Output: OperationStatus

Cross-Crate Registry Pattern

Use the Duroxide cross-crate registry pattern to register orchestrations and activities across the workspace.

Control Plane Server (toygres-server)

Build the main server binary with the following components:

1. Configuration

Load .env file using dotenvy
Parse database connection string
Validate Azure and AKS configuration

2. Database Client

Initialize sqlx PostgreSQL pool for metadata DB
Run migrations on startup

3. Duroxide Worker

Initialize duroxide-pg worker connecting to metadata DB
Register all orchestrations and activities
Start worker loop

4. API Layer (REST API)

Expose the following endpoints using axum:

POST /instances → Start CreateInstanceOrchestration
- Body: CreateInstanceRequest
- Response: CreateInstanceResponse
DELETE /instances/{id} → Start DeleteInstanceOrchestration
- Response: Operation status
GET /instances → List all from metadata DB
- Response: ListInstancesResponse
GET /instances/{id} → Get single instance details
- Response: InstanceMetadata
GET /operations/{id} → Monitor operation status
- Response: OperationStatus
GET /health → Health check endpoint for the control plane itself
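
The endpoint table above can be modelled without any web framework, which is handy for checking the route design before wiring it into axum. The sketch below is a dependency-free stand-in, not axum code; `Route` and `match_route` are hypothetical names.

```rust
/// One variant per endpoint in the table above. Path parameters borrow from
/// the request path.
#[derive(Debug, PartialEq)]
pub enum Route<'a> {
    CreateInstance,
    ListInstances,
    GetInstance(&'a str),
    DeleteInstance(&'a str),
    GetOperation(&'a str),
    Health,
}

/// Maps an HTTP method and path to a route, or None when nothing matches.
pub fn match_route<'a>(method: &str, path: &'a str) -> Option<Route<'a>> {
    let segments: Vec<&'a str> = path.trim_matches('/').split('/').collect();
    match (method, segments.as_slice()) {
        ("POST", ["instances"]) => Some(Route::CreateInstance),
        ("GET", ["instances"]) => Some(Route::ListInstances),
        ("GET", ["instances", id]) => Some(Route::GetInstance(*id)),
        ("DELETE", ["instances", id]) => Some(Route::DeleteInstance(*id)),
        ("GET", ["operations", id]) => Some(Route::GetOperation(*id)),
        ("GET", ["health"]) => Some(Route::Health),
        _ => None,
    }
}
```

In the real server each variant would correspond to an axum handler; the enum simply makes the method/path combinations explicit and easy to unit test.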

5. Health Check Scheduler

Background service that ensures all running instances have active health check orchestrations. On startup:

Query metadata DB for all instances in "running" state
Verify they have health check orchestration IDs
Start new health check orchestrations if missing
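
The startup check described above reduces to a filter over the rows read from the metadata DB. A minimal sketch follows, with a hypothetical `InstanceRow` struct standing in for the real model and the state kept as a plain string for brevity. It only detects a missing ID; confirming with Duroxide that a stored ID still refers to a live orchestration is left out.

```rust
/// Hypothetical, trimmed-down view of a row in the `instances` table.
#[derive(Debug, Clone, PartialEq)]
pub struct InstanceRow {
    pub id: String,
    pub state: String,
    pub health_check_orchestration_id: Option<String>,
}

/// Returns the running instances that have no health check orchestration ID
/// recorded (NULL or empty), i.e. the ones the scheduler should start a new
/// HealthCheckOrchestration for.
pub fn instances_missing_health_check(rows: &[InstanceRow]) -> Vec<&InstanceRow> {
    rows.iter()
        .filter(|row| row.state == "running")
        .filter(|row| {
            row.health_check_orchestration_id
                .as_deref()
                .map_or(true, str::is_empty)
        })
        .collect()
}
```

Keeping this as a pure function separates the decision (which instances need a health check) from the side effects (starting orchestrations and writing the new IDs back).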

Dependencies

Key Rust crates to include:

Duroxide Framework

duroxide - Core durable workflow framework
duroxide-pg - PostgreSQL provider for Duroxide

Kubernetes

kube - Kubernetes client
k8s-openapi - Kubernetes API types

Azure

azure_core - Azure SDK core
azure_identity - Azure authentication

Database

sqlx - Async SQL with compile-time query checking
Features: runtime-tokio, postgres, macros, migrate

Web Framework

axum - HTTP server framework
tower - Middleware
tower-http - HTTP middleware (tracing, CORS)

Utilities

tokio - Async runtime
serde / serde_json - Serialization
dotenvy - Environment variable loading
tracing / tracing-subscriber - Logging
anyhow / thiserror - Error handling
chrono - Date/time
uuid - Unique identifiers

Database Schema

`instances` table

CREATE TYPE instance_state AS ENUM ('creating', 'running', 'deleting', 'deleted', 'failed');
CREATE TYPE health_status AS ENUM ('healthy', 'unhealthy', 'unknown');

CREATE TABLE instances (
    id UUID PRIMARY KEY,
    name VARCHAR(255) UNIQUE NOT NULL,
    state instance_state NOT NULL,
    health_status health_status NOT NULL DEFAULT 'unknown',
    connection_string TEXT,
    health_check_orchestration_id TEXT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_instances_state ON instances(state);
CREATE INDEX idx_instances_health_status ON instances(health_status);
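
On the Rust side, the two enum types above could be mirrored as follows. This is a standard-library sketch of what toygres-models might contain; the type and method names are assumptions. The string forms match the SQL labels exactly, and `HealthStatus::default()` matches the column default.

```rust
use std::fmt;
use std::str::FromStr;

/// Mirrors the `instance_state` SQL enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InstanceState {
    Creating,
    Running,
    Deleting,
    Deleted,
    Failed,
}

/// Mirrors the `health_status` SQL enum; `Unknown` matches the column default.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum HealthStatus {
    Healthy,
    Unhealthy,
    #[default]
    Unknown,
}

impl InstanceState {
    /// The label used in the database.
    pub fn as_str(self) -> &'static str {
        match self {
            InstanceState::Creating => "creating",
            InstanceState::Running => "running",
            InstanceState::Deleting => "deleting",
            InstanceState::Deleted => "deleted",
            InstanceState::Failed => "failed",
        }
    }
}

impl HealthStatus {
    /// The label used in the database.
    pub fn as_str(self) -> &'static str {
        match self {
            HealthStatus::Healthy => "healthy",
            HealthStatus::Unhealthy => "unhealthy",
            HealthStatus::Unknown => "unknown",
        }
    }
}

impl fmt::Display for InstanceState {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_str())
    }
}

impl FromStr for InstanceState {
    type Err = String;

    /// Parses a database label; unknown labels are rejected rather than guessed.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "creating" => Ok(InstanceState::Creating),
            "running" => Ok(InstanceState::Running),
            "deleting" => Ok(InstanceState::Deleting),
            "deleted" => Ok(InstanceState::Deleted),
            "failed" => Ok(InstanceState::Failed),
            other => Err(format!("unknown instance_state: {other}")),
        }
    }
}
```

If sqlx's `Type` derive is used instead of manual conversion, the same enums can likely be mapped directly to the PostgreSQL types; that option is worth evaluating in Phase 2.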

Testing Strategy

Unit Tests

Test individual activities with mocked K8s/Azure clients
Test data models and serialization
Test configuration loading

Integration Tests

Test orchestrations using Duroxide in-memory provider
Test activity coordination and error handling
Test API endpoints with test server

End-to-End Tests

Test against local Kubernetes cluster (kind/minikube)
Test full instance lifecycle (create → health checks → delete)
Test failure scenarios and recovery

Implementation Phases - Incremental Approach

Philosophy: Build working code first, then add Duroxide complexity. Validate each layer before adding the next.

Phase 0: Proof of Concept ⭐ START HERE

Goal: Get basic Kubernetes/PostgreSQL deployment working without Duroxide

Tasks:

✅ Initialize Cargo workspace with four crates
✅ Define data models in toygres-models
✅ Create database schema and migration scripts
✅ Create infrastructure bootstrap scripts
Create toygres-server/examples/manual_deploy.rs that:
- Connects to AKS cluster using kube-rs
- Creates PostgreSQL StatefulSet, PVC, and Service
- Waits for pod to be ready
- Extracts connection string from Service
- Tests connection to PostgreSQL
- Cleans up resources

Success Criteria: Can deploy and connect to a PostgreSQL instance in AKS using a simple Rust binary.

Why first? Proves core functionality works without framework complexity. Fast iteration on K8s configurations.

Phase 1: Extract Core Logic into Modules

Goal: Refactor POC into reusable, testable functions

Tasks:

Create toygres-server/src/k8s.rs module:
- create_postgres_resources(config) -> Result<()>
- delete_postgres_resources(name) -> Result<()>
- get_pod_status(name) -> Result<PodStatus>
- get_service_endpoint(name) -> Result<String>
Create toygres-server/src/postgres.rs module:
- test_connection(connection_string) -> Result<bool>
- check_health(connection_string) -> Result<HealthStatus>
Write unit tests with mocked K8s clients
Write integration tests against real cluster
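
One way to make the k8s module mockable is to put its functions behind a trait and give the tests an in-memory implementation. The sketch below is synchronous and dependency-free for brevity, whereas the real module would be async and backed by kube-rs. `PostgresCluster`, `FakeCluster`, and `deploy_and_check` are hypothetical names, the functions take a name instead of the full config, and `String` stands in for the real error type.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PodStatus {
    Pending,
    Ready,
}

/// The seam between callers and Kubernetes: production code would implement
/// this with a kube-rs client, tests with the fake below.
pub trait PostgresCluster {
    fn create_postgres_resources(&mut self, name: &str) -> Result<(), String>;
    fn delete_postgres_resources(&mut self, name: &str) -> Result<(), String>;
    fn get_pod_status(&self, name: &str) -> Result<PodStatus, String>;
}

/// In-memory stand-in that treats every created instance as immediately ready.
#[derive(Default)]
pub struct FakeCluster {
    pods: HashMap<String, PodStatus>,
}

impl PostgresCluster for FakeCluster {
    fn create_postgres_resources(&mut self, name: &str) -> Result<(), String> {
        if self.pods.contains_key(name) {
            return Err(format!("instance {name} already exists"));
        }
        self.pods.insert(name.to_string(), PodStatus::Ready);
        Ok(())
    }

    fn delete_postgres_resources(&mut self, name: &str) -> Result<(), String> {
        self.pods
            .remove(name)
            .map(|_| ())
            .ok_or_else(|| format!("instance {name} not found"))
    }

    fn get_pod_status(&self, name: &str) -> Result<PodStatus, String> {
        self.pods
            .get(name)
            .copied()
            .ok_or_else(|| format!("instance {name} not found"))
    }
}

/// Example caller written against the trait, so it runs unchanged against the
/// fake in unit tests and a real cluster in integration tests.
pub fn deploy_and_check<C: PostgresCluster>(
    cluster: &mut C,
    name: &str,
) -> Result<PodStatus, String> {
    cluster.create_postgres_resources(name)?;
    cluster.get_pod_status(name)
}
```

The same trait boundary carries over to Phase 4, where each activity can wrap one trait method.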

Success Criteria: Clean, testable modules that can be called independently.

Why next? Separates concerns, enables testing, and creates reusable code for activities.

Phase 2: Add Metadata Database

Goal: Store instance state before adding workflows

Tasks:

Run scripts/db-init.sh (and scripts/db-migrate.sh) to create schema
Create toygres-server/src/db.rs module:
- insert_instance(metadata) -> Result<Uuid>
- update_instance_state(id, state) -> Result<()>
- update_health_status(id, status) -> Result<()>
- get_instance(id) -> Result<InstanceMetadata>
- list_instances() -> Result<Vec<InstanceMetadata>>
Create test binary that:
- Creates instance in K8s
- Stores metadata in database
- Queries and updates state
- Cleans up

Success Criteria: Can track instance lifecycle in database while managing K8s resources.

Why next? Database logic separated from workflow logic. Foundation for Duroxide state tracking.

Phase 3: Simple REST API (No Duroxide Yet)

Goal: Working API for basic operations

Tasks:

Implement synchronous API endpoints in toygres-server/src/api.rs:
- POST /instances - Blocks until instance ready, returns connection string
- DELETE /instances/{id} - Blocks until deletion complete
- GET /instances - Lists all from database
- GET /instances/{id} - Gets instance details
Wire up modules: API → K8s module → Database module
Add full error handling and logging
Manual testing with curl

Success Criteria: Can create/delete instances via REST API. Everything stored in database.

Why next? Validates entire flow works end-to-end. Understand timing/latency requirements.

Phase 4: Wrap in Duroxide Activities

Goal: Convert modules to Duroxide activities

Tasks:

Implement activities one by one in toygres-activities/:
- DeployPostgresActivity (wraps k8s::create_postgres_resources)
- GetInstanceStatusActivity (wraps k8s::get_pod_status)
- DeletePostgresActivity (wraps k8s::delete_postgres_resources)
- HealthCheckActivity (wraps postgres::check_health)
- UpdateMetadataActivity (wraps db::update_*)
- GenerateConnectionStringActivity (wraps k8s::get_service_endpoint)
Test each activity independently
Verify serialization/deserialization
Confirm error handling and retries

Success Criteria: All activities work independently with Duroxide.

Why next? We know underlying code works. Just adding Duroxide wrapper. Can test in isolation.

Phase 5: Create Simple Orchestrations

Goal: Build durable workflows

Tasks:

Implement CreateInstanceOrchestration in toygres-orchestrations/:
- Call activities in sequence
- Poll for readiness
- Return connection string
- DON'T start health check yet (that's Phase 7)
Test with Duroxide in-memory provider
Verify orchestration completes, test retry behavior
Implement DeleteInstanceOrchestration:
- Update state, call delete activity
- DON'T worry about canceling health checks yet

Success Criteria: Can create/delete instances using durable workflows.

Why next? Start simple with linear workflows. Learn Duroxide patterns. Validate durability/retry.

Phase 6: Add Duroxide to API

Goal: Make API asynchronous with durable workflows

Tasks:

Initialize Duroxide worker in toygres-server/src/worker.rs:
- Connect to duroxide-pg
- Register activities and orchestrations
- Start worker loop
Update API to start orchestrations:
- POST /instances → Start CreateInstanceOrchestration, return orchestration ID
- GET /operations/{id} → Query orchestration status
- Keep synchronous endpoints for comparison/testing
Test async operations and resumption after worker restart

Success Criteria: API starts durable workflows. Can query status. Workflows survive restarts.

Why next? Everything else working. Just changing API semantics. Can compare with sync version.

Phase 7: Health Check Orchestrations

Goal: Add continuous monitoring

Tasks:

Implement HealthCheckOrchestration in toygres-orchestrations/:
- Infinite loop with Duroxide timer
- Call HealthCheckActivity
- Update database with health status
Update CreateInstanceOrchestration:
- Start detached health check orchestration
- Store orchestration ID in metadata
Update DeleteInstanceOrchestration:
- Retrieve the health check orchestration ID and cancel that orchestration
- Then proceed with deletion
Test full lifecycle:
- Create instance → Health checks start
- Monitor database updates every 30s
- Delete instance → Health checks stop

Success Criteria: Continuous health monitoring with automatic start/stop on create/delete.

Why last of the feature phases? Most complex feature. Depends on everything else. Involves orchestration cancellation.

Phase 8: Polish & Production Readiness

Goal: Make it production-grade

Tasks:

Comprehensive error handling and recovery
Add metrics and monitoring (Prometheus?)
Security hardening (RBAC, secrets management)
Performance optimization
Complete documentation
Deployment guides and Helm charts
End-to-end tests against real AKS cluster

Success Criteria: Production-ready control plane with monitoring, docs, and deployment automation.

Current Status

✅ Phase 0: Scaffolding complete (workspace, models, scripts, docs)
🔄 Phase 0: Need to implement manual_deploy.rs POC
⏳ Phases 1-8: Not started

Next Immediate Steps

Implement toygres-server/examples/manual_deploy.rs
Test against real AKS cluster
Document learnings and K8s resource configurations
Move to Phase 1 when POC works reliably
