diff --git a/docs/toolbox/toolbox-command-guide.md b/docs/toolbox/toolbox-command-guide.md
new file mode 100644
index 00000000..3a6f688e
--- /dev/null
+++ b/docs/toolbox/toolbox-command-guide.md
@@ -0,0 +1,773 @@
+# FORGE Toolbox Command Writing Guide
+
+**Version**: 1.0
+**Target Audience**: Developers, AI Agents
+**Last Updated**: 2026-05-06
+
+## Table of Contents
+
+1. [Architecture Overview](#architecture-overview)
+2. [Design Principles](#design-principles)
+3. [Command Structure](#command-structure)
+4. [DSL Usage Guide](#dsl-usage-guide)
+5. [Input/Output Guidelines](#inputoutput-guidelines)
+6. [Artifact Management](#artifact-management)
+7. [Security Best Practices](#security-best-practices)
+8. [Templating and Kubernetes Objects](#templating-and-kubernetes-objects)
+9. [Troubleshooting and Post-Mortem Support](#troubleshooting-and-post-mortem-support)
+10. [Examples](#examples)
+11. [Common Antipatterns](#common-antipatterns)
+
+---
+
+## Architecture Overview
+
+FORGE follows a **three-layer architecture** inspired by TOPSAIL but using a Python DSL instead of Ansible:
+
+```text
+┌─────────────────────┐
+│   ORCHESTRATION     │ ← CI entrypoints, config-driven workflows
+│   (pre_cleanup,     │   Uses config files to coordinate testing
+│   prepare, test,    │
+│   post_cleanup)     │
+├─────────────────────┤
+│      TOOLBOX        │ ← Focused, standalone commands
+│  (deploy operator,  │   One task = one command
+│   scale cluster,    │   CLI reproducible, no external deps
+│   deploy service)   │
+├─────────────────────┤
+│   POST-PROCESSING   │ ← Visualization, KPIs, regression analysis
+│  (parsing, plots,   │   Report generation and historical tracking
+│   reporting, KPIs)  │
+└─────────────────────┘
+```
+
+**Key Insight**: The toolbox layer is **NOT plain Python**. It's a Python **DSL** (Domain Specific Language) that enforces structure, logging, and troubleshooting capabilities.
+
+---
+
+## Design Principles
+
+### 1. Single Responsibility
+- **One toolbox command = one high-level task**
+- Think cookbook/algorithm: "install operator", "scale cluster", "create LLMISVC", "wait for pod"
+- If you find yourself doing multiple unrelated things, split into separate commands
+
+### 2. Standalone and Reproducible
+- **No external dependencies** outside the command directory (except global platform conventions)
+- **CLI reproducible**: you should be able to copy-paste the command from logs and run it manually
+- **Simple parameters**: flat scalars, avoid complex nested structures unless absolutely necessary
+
+### 3. Easy Troubleshooting
+- **One directory per command execution** in logs - you can see at a glance which steps succeeded/failed (see the sketch below)
+- **Clear task names** that explain what's happening
+- **Comprehensive artifact capture** for post-mortem analysis
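+
+For orientation, a single execution's log directory might look like the tree below. This is a hypothetical layout - the exact naming comes from the DSL's logging conventions, and the task names are illustrative:
+
+```text
+deploy_service/                        # one directory per command execution
+├── 000__validate_inputs/              # one subdirectory per task, in order
+├── 001__create_deployment_manifest/
+│   └── src/my-service-deployment.yaml
+└── 002__wait_for_deployment_ready/
+    └── artifacts/deployment-describe.txt
+```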
+
+### 4. Reusability Across Projects
+- **Scalar parameters instead of single config blob**
+- Libraries for common rendering functions (like K8s object generation)
+- Think about what other projects will need
+
+---
+
+## Command Structure
+
+### Directory Layout Example
+```text
+projects/your_project/toolbox/your_command/
+├── main.py          # Entry point with @entrypoint
+├── templates/       # Jinja2 templates for K8s objects (optional)
+│   ├── pod.yaml.j2
+│   └── service.yaml.j2
+└── scripts/         # Static scripts (optional)
+    └── setup.sh
+```
+
+### Entry Point Pattern
+```python
+@entrypoint
+def run(
+    # Required positional parameters
+    cluster_name: str,
+    namespace: str,
+    # Optional keyword parameters with defaults
+    *,
+    replicas: int = 1,
+    timeout_minutes: int = 30,
+    gpu_type: str | None = None,
+):
+    """
+    Brief description of what this command does
+
+    Args:
+        cluster_name: Description of this parameter
+        namespace: Kubernetes namespace to use
+        replicas: Number of replicas to deploy (default: 1)
+        timeout_minutes: How long to wait for completion (default: 30)
+        gpu_type: GPU type required, if any (default: None)
+    """
+    return execute_tasks(locals())
+```
+
+---
+
+## DSL Usage Guide
+
+### Task Definition
+```python
+@task
+def your_task_name(args, ctx):
+    """Clear description of what this task does"""
+
+    # Access parameters
+    cluster_name = args.cluster_name
+    namespace = args.namespace
+    artifact_dir = args.artifact_dir
+
+    # Use context for data sharing between tasks
+    # (see the sketch at the end of this section)
+    ctx.some_value = "computed result"
+
+    # Execute commands
+    shell.run("oc get pods")
+
+    # Return success message or truthy value
+    return "Task completed successfully"
+```
+
+### Task Decorators
+
+#### Conditional Execution
+```python
+@when(lambda: previous_task.status.return_value is True)
+@task
+def conditional_task(args, ctx):
+    """Only run if previous_task succeeded"""
+    return "Only ran because condition was met"
+```
+
+#### Retry Logic
+```python
+@retry(attempts=5, delay=10, backoff=1.0)
+@task
+def wait_for_resource(args, ctx):
+    """Wait for a resource to be ready"""
+
+    result = shell.run("oc get pod mypod -o jsonpath='{.status.phase}'", check=False)
+
+    if not result.success:
+        return False  # Retry
+
+    if result.stdout.strip() == "Running":
+        return "Pod is ready"
+    else:
+        return False  # Retry
+```
+
+#### Always Execute (Cleanup and K8s Status Capture)
+```python
+@always
+@task
+def capture_resources(args, ctx):
+    """Capture resource status even if previous tasks failed"""
+
+    # This always runs, even after failures
+    shell.run("oc get pods -o yaml",
+              stdout_dest=args.artifact_dir / "artifacts" / "pods-status.yaml",
+              check=False)
+    return "Status captured"
+
+@always
+@task
+def cleanup_resources(args, ctx):
+    """Clean up resources even if previous tasks failed"""
+
+    # This always runs, even after failures
+    shell.run("oc delete pod mypod", check=False)
+    return "Cleanup completed"
+```
+
+### Command Execution
+```python
+from projects.core.dsl import shell
+
+# Basic command
+shell.run("oc get pods")
+
+# With output capture
+result = shell.run("oc get pods -o json")
+if result.success:
+    import json
+    pods = json.loads(result.stdout)
+
+# Save output to file
+shell.run("oc describe pod mypod", stdout_dest=args.artifact_dir / "artifacts" / "pod-description.txt")
+
+# Don't fail on error
+shell.run("oc delete pod optional-pod", check=False)
+```
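+
+The `ctx` object shown in the Task Definition above carries values between tasks in the same run. A minimal sketch of that hand-off, using only the `shell.run` and `ctx` patterns already shown (task, route, and attribute names are illustrative):
+
+```python
+@task
+def discover_route(args, ctx):
+    """Look up the service route once and share it with later tasks"""
+
+    result = shell.run("oc get route myservice -o jsonpath='{.spec.host}'")
+    ctx.route_host = result.stdout.strip()  # stash for later tasks
+    return f"Route host: {ctx.route_host}"
+
+@task
+def probe_route(args, ctx):
+    """Reuse the value computed by discover_route"""
+
+    shell.run(f"curl -sf https://{ctx.route_host}/healthz")
+    return "Route is responding"
+```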
+
+---
+
+## Input/Output Guidelines
+
+### ✅ GOOD: Scalar Parameters
+```python
+def run(
+    cluster_name: str,
+    namespace: str,
+    benchmark: str,
+    *,
+    platform: str = "aws",
+    replicas: int = 1,
+):
+```
+
+### ❌ BAD: Config Blob
+```python
+def run(inputs_file: str):  # Don't do this!
+    with open(inputs_file) as f:
+        config = yaml.safe_load(f)
+    # Now you have to guess what's in config...
+```
+
+### File Parameters
+Files are acceptable for:
+- **Secrets** (mandatory for security)
+- **Complex YAML manifests** that shouldn't be parameters
+- **Large data** that doesn't fit in command line
+
+```python
+def run(
+    secret_file: str,                  # For passwords, tokens
+    manifest_file: str | None = None,  # For complex K8s YAML
+):
+```
+
+### Environment Variables
+- **KUBECONFIG** is OK (platform convention)
+- **NO other environment variables** should be needed
+- **NO passwords in environment variables** (they appear in logs!)
+
+---
+
+## Artifact Management
+
+### Directory Structure
+```text
+{artifact_dir}/
+├── artifacts/       # Data FROM the cluster (oc get, oc describe)
+│   ├── pod-status.yaml
+│   ├── service-describe.txt
+│   └── logs/
+│       └── mypod.log
+└── src/             # Data TO the cluster (oc create, oc apply)
+    ├── pod-manifest.yaml
+    └── service-manifest.yaml
+```
+
+### Usage Guidelines
+
+#### Artifacts Directory (`artifacts/`)
+- **Capture cluster state**: `oc get`, `oc describe`, `oc logs`
+- **For human review**: prefer YAML format
+- **For large datasets**: capture both JSON (fast parsing) and YAML (human readable)
+- **For post-processing**: capture JSON
+
+```python
+@task
+def capture_pod_status(args, ctx):
+    """Capture pod information for debugging"""
+
+    # Human readable
+    shell.run(
+        "oc describe pod mypod",
+        stdout_dest=args.artifact_dir / "artifacts" / "pod-describe.txt"
+    )
+
+    # Machine readable
+    shell.run(
+        "oc get pod mypod -o yaml",
+        stdout_dest=args.artifact_dir / "artifacts" / "pod-status.yaml"
+    )
+
+    # For large scale tests, also capture JSON
+    shell.run(
+        "oc get pods -o json",
+        stdout_dest=args.artifact_dir / "artifacts" / "all-pods.json"
+    )
+```
+
+#### Source Directory (`src/`)
+- **Generated manifests**: anything you `oc apply`
+- **Configuration files**: derived from templates
+
+```python
+@task
+def create_pod_manifest(args, ctx):
+    """Generate pod manifest from template"""
+
+    manifest_file = args.artifact_dir / "src" / "pod-manifest.yaml"
+    shell.mkdir(manifest_file.parent)
+
+    template.render_template_to_file("pod.yaml.j2", manifest_file)
+
+    shell.run(f"oc apply -f {manifest_file}")
+
+    return f"Applied manifest: {manifest_file}"
+```
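+
+The directory tree above reserves `artifacts/logs/` for container logs, which none of the examples populate yet. A minimal log-collection sketch following the same conventions (the pod name is illustrative):
+
+```python
+@task
+def capture_pod_logs(args, ctx):
+    """Collect container logs for post-mortem analysis"""
+
+    logs_dir = args.artifact_dir / "artifacts" / "logs"
+    shell.mkdir(logs_dir)
+
+    # Logs are data FROM the cluster, so they land under artifacts/
+    shell.run(
+        "oc logs mypod --all-containers",
+        stdout_dest=logs_dir / "mypod.log",
+        check=False  # best effort: the pod may already be gone
+    )
+
+    return "Pod logs captured"
+```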
+
+---
+
+## Security Best Practices
+
+### Secret Handling
+
+#### ✅ GOOD: File-based secrets
+```python
+@task
+def create_secret(args, ctx):
+    """Create secret from file"""
+
+    # Create secret directly from file (avoids shell interpolation)
+    shell.run(["oc", "create", "secret", "generic", "mysecret", "--from-file", args.secret_file])
+
+    return "Secret created"
+```
+
+#### ✅ GOOD: No logging for sensitive operations
+```python
+@task
+def handle_sensitive_data(args, ctx):
+    """Process sensitive information"""
+
+    # Use log_command=False to prevent parameter logging
+    with open(args.secret_file) as f:
+        secret_data = f.read()
+
+    # Process secret_data...
+    shell.run("oc apply -f -", stdin=secret_data, log_command=False)
+    # Sensitive command content is not logged
+```
+
+#### ❌ BAD: Secrets in parameters
+```python
+def run(password: str):  # DON'T DO THIS!
+    # password will appear in logs!
+    ...
+```
+
+#### ❌ BAD: Secrets in environment variables
+```python
+os.environ['SECRET_TOKEN'] = token  # DON'T DO THIS!
+# Environment variables appear in logs and process lists
+```
+
+### Secure Command Execution
+```python
+# For commands that might contain secrets
+shell.run("oc apply -f -", stdin=secret_yaml_content, log_command=False)
+```
+
+---
+
+## Templating and Kubernetes Objects
+
+### ✅ GOOD: Use Templates
+Store K8s objects as Jinja2 templates, not inline Python:
+
+```yaml
+# templates/pod.yaml.j2
+apiVersion: v1
+kind: Pod
+metadata:
+  name: {{ args.pod_name }}
+  namespace: {{ args.namespace }}
+  labels:
+    app: {{ args.app_name }}
+spec:
+  containers:
+  - name: main
+    image: {{ args.image }}
+    resources:
+      requests:
+        cpu: {{ args.cpu }}
+        memory: {{ args.memory }}
+```
+
+```python
+@task
+def create_pod(args, ctx):
+    """Create pod from template"""
+
+    pod_file = args.artifact_dir / "src" / f"{args.pod_name}-pod.yaml"
+    template.render_template_to_file("pod.yaml.j2", pod_file)
+
+    shell.run(f"oc apply -f {pod_file}")
+    return f"Pod {args.pod_name} created"
+```
+
+### Static Scripts with Environment Variables
+Scripts should be static files that use environment variables (set in pod specs):
+
+```bash
+#!/bin/bash
+# scripts/setup.sh (static file - no template!)
+set -euo pipefail
+
+echo "Setting up ${SERVICE_NAME}..."
+echo "Namespace: ${NAMESPACE}"
+echo "Replicas: ${REPLICAS}"
+
+# Your script logic using environment variables
+curl -X POST "${API_ENDPOINT}/setup" \
+    -H "Content-Type: application/json" \
+    -d "{\"service\": \"${SERVICE_NAME}\", \"replicas\": ${REPLICAS}}"
+```
+
+Environment variables are defined in the pod template:
+```yaml
+# templates/pod.yaml.j2
+apiVersion: v1
+kind: Pod
+metadata:
+  name: {{ args.pod_name }}
+spec:
+  containers:
+  - name: setup
+    image: {{ args.image }}
+    env:
+    - name: SERVICE_NAME
+      value: "{{ args.service_name }}"
+    - name: NAMESPACE
+      value: "{{ args.namespace }}"
+    - name: REPLICAS
+      value: "{{ args.replicas }}"
+    - name: API_ENDPOINT
+      value: "{{ args.api_endpoint }}"
+    command: ["/scripts/setup.sh"]
+```
+
+```python
+@task
+def create_setup_configmap(args, ctx):
+    """Create ConfigMap with static script"""
+
+    # Copy static script to src for reference
+    script_source = Path(__file__).parent / "scripts" / "setup.sh"
+    script_dest = args.artifact_dir / "src" / "setup.sh"
+    shell.run(f"cp {script_source} {script_dest}")
+
+    # Create ConfigMap with the script
+    shell.run(f"oc create configmap setup-script --from-file={script_source}")
+
+    return "Setup script ConfigMap created"
+```
+
+### ❌ BAD: Inline K8s Objects
+```python
+# Don't do this!
+@task
+def create_pod_inline(args, ctx):
+    pod_yaml = f"""
+apiVersion: v1
+kind: Pod
+metadata:
+  name: {args.pod_name}
+...
+    """
+    # IDEs can't syntax highlight this properly
+    # Hard to maintain and read
+```
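+
+The render-then-apply pair in `create_pod` above recurs in nearly every command; per the reusability principle, it is worth extracting into a shared library. A minimal sketch built only from the `template` and `shell` calls shown above - the helper name and its location are hypothetical:
+
+```python
+# e.g. projects/core/library/k8s_render.py (hypothetical location)
+from projects.core.dsl import shell, template
+
+def render_and_apply(args, template_name, manifest_name):
+    """Render a Jinja2 template into src/ and apply it to the cluster"""
+
+    manifest_file = args.artifact_dir / "src" / manifest_name
+    shell.mkdir(manifest_file.parent)
+
+    template.render_template_to_file(template_name, manifest_file)
+    shell.run(f"oc apply -f {manifest_file}")
+    return manifest_file
+```
+
+A task body then reduces to a single `render_and_apply(args, "pod.yaml.j2", f"{args.pod_name}-pod.yaml")` call, and other projects can import the same helper instead of re-implementing the render/apply steps.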
+
+---
+
+## Troubleshooting and Post-Mortem Support
+
+### Task Naming Strategy
+Use descriptive task names that explain the **purpose**, not just the action:
+
+#### ✅ GOOD
+```python
+@task
+def wait_for_operator_to_be_ready(args, ctx):
+    """Wait for operator deployment to reach ready state"""
+
+@task
+def capture_failed_pod_logs(args, ctx):
+    """Collect logs from pods that failed to start"""
+
+@task
+def verify_service_endpoints_are_available(args, ctx):
+    """Check that service endpoints are responding to requests"""
+```
+
+#### ❌ BAD
+```python
+@task
+def check_stuff(args, ctx):  # Too vague
+    ...
+
+@task
+def run_commands(args, ctx):  # Too generic
+    ...
+```
+
+### Error Context
+When tasks fail, provide context for debugging:
+
+```python
+@task
+def wait_for_resource_ready(args, ctx):
+    """Wait for custom resource to be ready"""
+
+    result = shell.run(
+        f"oc get myresource {args.resource_name} -o jsonpath='{{.status.ready}}'",
+        check=False
+    )
+
+    if not result.success:
+        # Capture debugging info before failing
+        shell.run(
+            f"oc describe myresource {args.resource_name}",
+            stdout_dest=args.artifact_dir / "artifacts" / "failed-resource-describe.txt",
+            check=False
+        )
+        raise RuntimeError(f"Failed to query resource {args.resource_name}: {result.stderr}")
+
+    if result.stdout.strip() != "true":
+        return False  # Retry
+
+    return f"Resource {args.resource_name} is ready"
+```
+
+### Always Tasks for Cleanup and State Capture
+```python
+@always
+@task
+def capture_cluster_state(args, ctx):
+    """Capture cluster state for post-mortem analysis"""
+
+    # Capture even if tests failed
+    shell.run("oc get pods --all-namespaces",
+              stdout_dest=args.artifact_dir / "artifacts" / "all-pods.txt",
+              check=False)
+
+    shell.run("oc get events --sort-by='.lastTimestamp'",
+              stdout_dest=args.artifact_dir / "artifacts" / "events.txt",
+              check=False)
+
+    return "Cluster state captured"
+```
+
+---
+
+## Examples
+
+### Simple Example: Deploy a Service
+```python
+#!/usr/bin/env python3
+
+from projects.core.dsl import (
+    always, entrypoint, execute_tasks, retry, shell, task, template
+)
+
+@entrypoint
+def run(
+    service_name: str,
+    namespace: str,
+    *,
+    image: str = "nginx:latest",
+    replicas: int = 1,
+    port: int = 80,
+):
+    """
+    Deploy a simple service to Kubernetes
+
+    Args:
+        service_name: Name of the service to create
+        namespace: Target namespace
+        image: Container image to use (default: nginx:latest)
+        replicas: Number of replicas (default: 1)
+        port: Service port (default: 80)
+    """
+    return execute_tasks(locals())
+
+@task
+def validate_inputs(args, ctx):
+    """Validate input parameters"""
+
+    if not args.service_name:
+        raise ValueError("service_name is required")
+    if not args.namespace:
+        raise ValueError("namespace is required")
+    if args.replicas < 1:
+        raise ValueError("replicas must be >= 1")
+
+    return "Inputs validated"
+
+@task
+def setup_directories(args, ctx):
+    """Create artifact directories"""
+
+    shell.mkdir(args.artifact_dir / "artifacts")
+    shell.mkdir(args.artifact_dir / "src")
+    return "Directories created"
+
+@task
+def verify_namespace_exists(args, ctx):
+    """Ensure target namespace exists"""
+
+    result = shell.run(f"oc get namespace {args.namespace}", check=False)
+    if not result.success:
+        raise RuntimeError(f"Namespace {args.namespace} does not exist")
+
+    return f"Namespace {args.namespace} verified"
f"{args.service_name}-deployment.yaml" + template.render_template_to_file("deployment.yaml.j2", manifest_file) + + return f"Deployment manifest created: {manifest_file}" + +@task +def apply_deployment(args, ctx): + """Apply the deployment to cluster""" + + manifest_file = args.artifact_dir / "src" / f"{args.service_name}-deployment.yaml" + shell.run(f"oc apply -f {manifest_file}") + + return f"Deployment {args.service_name} applied" + +@retry(attempts=10, delay=5) +@task +def wait_for_deployment_ready(args, ctx): + """Wait for deployment to be ready""" + + result = shell.run( + f"oc get deployment {args.service_name} -n {args.namespace} " + f"-o jsonpath='{{.status.readyReplicas}}'", + check=False + ) + + if not result.success: + return False # Retry + + ready_replicas = result.stdout.strip() + if ready_replicas == str(args.replicas): + return f"Deployment {args.service_name} is ready ({ready_replicas}/{args.replicas})" + + return False # Retry + +@always +@task +def capture_deployment_status(args, ctx): + """Capture final deployment status""" + + # Capture deployment details + shell.run( + f"oc describe deployment {args.service_name} -n {args.namespace}", + stdout_dest=args.artifact_dir / "artifacts" / "deployment-describe.txt", + check=False + ) + + # Capture pod status + shell.run( + f"oc get pods -l app={args.service_name} -n {args.namespace} -o yaml", + stdout_dest=args.artifact_dir / "artifacts" / "pods-status.yaml", + check=False + ) + + return "Deployment status captured" + +if __name__ == "__main__": + run.main() +``` + +--- + +## Common Antipatterns + +### ❌ DON'T: Long, Complex Commands +```python +# BAD: Too many responsibilities in one command +@entrypoint +def run(inputs_file: str): + # This command: + # - Installs 3 different operators + # - Scales the cluster + # - Deploys 5 services + # - Runs performance tests + # - Generates reports + + # Split this into separate commands! +``` + +**Fix**: Split into focused commands: +- `install_operators` +- `scale_cluster` +- `deploy_services` +- `run_performance_tests` +- `generate_reports` + +### ❌ DON'T: Build K8s Objects in Python +```python +# BAD: Hard to read and maintain +@task +def create_complex_deployment(args, ctx): + deployment = { + "apiVersion": "apps/v1", + "kind": "Deployment", + "metadata": {"name": args.name}, + # 50 more lines of nested dictionaries... + } +``` + +**Fix**: Use templates in `templates/deployment.yaml.j2` + +### ❌ DON'T: Generic Task Names +```python +# BAD: What does this actually do? 
+
+---
+
+## Common Antipatterns
+
+### ❌ DON'T: Long, Complex Commands
+```python
+# BAD: Too many responsibilities in one command
+@entrypoint
+def run(inputs_file: str):
+    # This command:
+    # - Installs 3 different operators
+    # - Scales the cluster
+    # - Deploys 5 services
+    # - Runs performance tests
+    # - Generates reports
+
+    # Split this into separate commands!
+```
+
+**Fix**: Split into focused commands:
+- `install_operators`
+- `scale_cluster`
+- `deploy_services`
+- `run_performance_tests`
+- `generate_reports`
+
+### ❌ DON'T: Build K8s Objects in Python
+```python
+# BAD: Hard to read and maintain
+@task
+def create_complex_deployment(args, ctx):
+    deployment = {
+        "apiVersion": "apps/v1",
+        "kind": "Deployment",
+        "metadata": {"name": args.name},
+        # 50 more lines of nested dictionaries...
+    }
+```
+
+**Fix**: Use templates in `templates/deployment.yaml.j2`
+
+### ❌ DON'T: Generic Task Names
+```python
+# BAD: What does this actually do?
+@task
+def step1(args, ctx):
+    ...
+
+@task
+def do_stuff(args, ctx):
+    ...
+
+@task
+def handle_things(args, ctx):
+    ...
+```
+
+**Fix**: Descriptive names that explain the purpose
+
+### ❌ DON'T: Secret Leakage
+```python
+# BAD: Secrets in parameters or environment
+def run(password: str, api_key: str):
+    ...
+
+# BAD: Secrets in command lines
+shell.run(f"curl -H 'Authorization: Bearer {token}' https://api.example.com")
+```
+
+**Fix**: Use file-based secrets with `log_command=False`
+
+---
+
+## Quick Checklist
+
+When writing a toolbox command, ensure:
+
+- [ ] **Single responsibility**: Command does one focused task
+- [ ] **Scalar parameters**: No complex config blobs as input
+- [ ] **CLI reproducible**: Can copy command from logs and run manually
+- [ ] **Descriptive task names**: Clear what each task does
+- [ ] **Templates for K8s objects**: Not inline Python dictionaries
+- [ ] **Proper artifact organization**: `src/` for generated, `artifacts/` for captured
+- [ ] **Security**: Secrets via files, never in parameters/environment
+- [ ] **Always tasks**: Capture debugging info even on failure
+- [ ] **Error context**: Helpful error messages with debugging info
+- [ ] **Library functions**: Extract reusable rendering functions
+
+---
+
+## Getting Help
+
+- **Specs**: Check `specs/008-toolbox-dsl/` for detailed DSL documentation
+- **Examples**: Look at existing commands in `projects/*/toolbox/`
+- **TOPSAIL Reference**: Check [openshift-psap/topsail](https://github.com/openshift-psap/topsail) for mature patterns
+- **Code Review**: Have colleagues review for adherence to these principles
+
+Remember: The toolbox is a **DSL**, not plain Python. Embrace the constraints - they make your commands more reliable, debuggable, and reusable!