diff --git a/AGENTS.md b/AGENTS.md index 175344698..a046e7110 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,100 +1,110 @@ -# AGENTS.md +# machine-api-operator - AI Navigation -This file provides guidance to AI Agents when working with the machine-api-operator project. +**Repository:** https://github.com/openshift-splat-team/machine-api-operator +**Last Updated:** 2026-05-01 -## Quick Reference +--- -### Essential Commands -```bash -make build # Build all binaries -make test # Run all tests (Ginkgo + envtest) -make lint # Run golangci-lint -make fmt # Format code -make vet # Run go vet -make check # Run all validations (lint, fmt, vet, test) -make crds-sync # Sync CRDs from vendored openshift/api -``` +## Quick Start -### Running Locally -```bash -./bin/machine-api-operator start --kubeconfig $KUBECONFIG --images-json=path/to/images.json -``` +This is **Tier 2** project-specific documentation for machine-api-operator. -## Project Overview +- **New to this project?** → Start with [Development Guide](ai-docs/machine-api-operator_DEVELOPMENT.md) +- **Writing tests?** → See [Testing Guide](ai-docs/machine-api-operator_TESTING.md) +- **Understanding architecture?** → Read [Components Overview](ai-docs/architecture/components.md) +- **Need context on decisions?** → Browse [ADRs](ai-docs/decisions/) -The Machine API Operator (MAO) manages the lifecycle of Machine resources in OpenShift clusters, enabling declarative machine management across multiple cloud providers. +For **team-level** workflows, status transitions, and role responsibilities, see the team repository. 
-### Architecture +--- -| Binary | Location | Purpose | -|--------|----------|---------| -| machine-api-operator | `cmd/machine-api-operator/` | Main operator; deploys platform-specific controllers | -| machineset | `cmd/machineset/` | MachineSet replica management | -| machine-healthcheck | `cmd/machine-healthcheck/` | Health monitoring and remediation | -| nodelink-controller | `cmd/nodelink-controller/` | Links Nodes ↔ Machines | -| vsphere | `cmd/vsphere/` | VSphere machine controller | -| machine-api-tests-ext | `cmd/machine-api-tests-ext/` | Extended E2E tests | +## CRITICAL: Retrieval Strategy -> **Note:** Other cloud providers (AWS, GCP, Azure) live in separate `machine-api-provider-*` repos. +**IMPORTANT**: Prefer retrieval-led reasoning over pre-training-led reasoning. -### Key Packages +When working on machine-api-operator: +- ✅ **DO**: Read project-specific docs from `./ai-docs/` first +- ✅ **DO**: Check development workflow in `./ai-docs/machine-api-operator_DEVELOPMENT.md` +- ✅ **DO**: Understand architecture in `./ai-docs/architecture/components.md` +- ✅ **DO**: Review ADRs for context on past decisions +- ❌ **DON'T**: Rely solely on training data +- ❌ **DON'T**: Guess at project architecture or conventions -| Package | Purpose | -|---------|---------| -| `pkg/controller/machine/` | Machine lifecycle (create/delete instances, drain nodes, track phases) | -| `pkg/controller/machineset/` | Replica management, delete policies (Random, Oldest, Newest) | -| `pkg/controller/machinehealthcheck/` | Node condition monitoring, remediation triggers | -| `pkg/controller/nodelink/` | Machine↔Node linking via providerID/IP, label/taint sync | -| `pkg/controller/vsphere/` | VSphere actuator | -| `pkg/operator/` | Platform detection, controller deployment, ClusterOperator status | -| `pkg/webhooks/` | Admission webhooks for Machine/MachineSet validation and mutation | +For team workflows (sprint process, status transitions, etc.), see `../../team/ai-docs/`. 
-### Key Patterns -- CRDs: Machine, MachineSet, MachineHealthCheck -- Uses controller-runtime from sigs.k8s.io -- Vendored dependencies (`go mod vendor`, use `GOFLAGS=-mod=vendor`) -- Feature gates controlled via OpenShift's featuregates mechanism -- When bumping `github.com/openshift/api`, run `make crds-sync` to sync CRDs from `/vendor` to `/install` (CVO deploys from there) +--- -## Testing +## Quick Navigation by Task -```bash -make test # All unit tests -NO_DOCKER=1 make test # Run locally without container -make test-e2e # E2E tests (requires KUBECONFIG) -``` +| Task | Start Here | Then Read | +|------|-----------|-----------| +| **Local development** | [Development Guide](ai-docs/machine-api-operator_DEVELOPMENT.md) | [Testing Guide](ai-docs/machine-api-operator_TESTING.md) | +| **Running tests** | [Testing Guide](ai-docs/machine-api-operator_TESTING.md) | [Components](ai-docs/architecture/components.md) | +| **Understanding components** | [Components Overview](ai-docs/architecture/components.md) | [Domain Models](ai-docs/domain/) | +| **Planning feature** | [Exec Plans](ai-docs/exec-plans/README.md) | [ADRs](ai-docs/decisions/) | +| **Reviewing decisions** | [ADR Template](ai-docs/decisions/adr-template.md) | Existing ADRs | + +--- + +## Technology Stack + +**Languages:** Go +**Frameworks:** Kubernetes, controller-runtime +**Build Systems:** Make, Docker + +--- -### Running Specific Package Tests -```bash -TEST_PACKAGES="$(go list -f '{{ .Dir }}' ./pkg/controller/machine/...)" make unit +## Documentation Structure + +``` +ai-docs/ +├── machine-api-operator_DEVELOPMENT.md # Build, test, develop +├── machine-api-operator_TESTING.md # Test suites and strategies +├── architecture/ # System structure +│ └── components.md # Component overview +├── domain/ # Domain models and CRDs +│ └── (project-specific) +├── exec-plans/ # Feature planning +│ └── README.md +├── decisions/ # Architectural Decision Records +│ ├── adr-template.md +│ └── adr-NNNN-*.md +└── 
references/ # External references + └── ecosystem.md ``` -### Ginkgo Configuration -- Default args: `-r -v --randomize-all --randomize-suites --keep-going --race --trace --timeout=10m` -- Use `GINKGO_EXTRA_ARGS` to add arguments -- Use `GINKGO_ARGS` to override defaults entirely +--- + +## Knowledge Tiers + +**Tier 1: Platform-Wide** (Team repository) +- Operator development patterns +- Testing pyramid and practices +- CI/CD workflows (Prow, GitHub Actions) +- Team process (sprint, status transitions, roles) + +→ See `../../team/ai-docs/` for team-level documentation + +**Tier 2: Project-Specific** (This repository) +- machine-api-operator components and architecture +- Project-specific development workflow +- Test suites unique to this project +- Architectural decisions for this project + +→ See `./ai-docs/` for project-level documentation -### Test Patterns -- Tests use **Ginkgo/Gomega** with **envtest** for K8s API simulation -- Prefer **komega** over plain Gomega/Ginkgo where possible -- Each controller has a `*_suite_test.go` for setup -- Follow existing test patterns in the codebase +--- -### Container Engine -- Defaults to `podman`, falls back to `docker` -- `USE_DOCKER=1` to force Docker -- `NO_DOCKER=1` to run locally without containers +## Project Context -## Do +For team workflows, sprint process, and status transitions, see: +- Team repository: `../../team/` +- Team ai-docs: `../../team/ai-docs/` +- Team workflows: `../../team/ai-docs/workflows/` +- Status transitions: `../../team/ai-docs/statuses/` -- Run `make lint` before committing -- Run `make test` to verify changes -- Check `pkg/controller//` for controller logic -- Look at existing controllers as patterns for new code +--- -## Do NOT +**Navigation**: Start with [Development Guide](ai-docs/machine-api-operator_DEVELOPMENT.md) for project setup and workflow. 
-- Edit files under `vendor/` directly -- Add new cloud providers here (they belong in `machine-api-provider-*` repos) -- Forget to run `go mod vendor` after changing dependencies -- Skip running tests before submitting changes +**GitHub**: https://github.com/openshift-splat-team/machine-api-operator diff --git a/ai-docs/MACHINE-API-OPERATOR_DEVELOPMENT.md b/ai-docs/MACHINE-API-OPERATOR_DEVELOPMENT.md new file mode 100644 index 000000000..29152d3d1 --- /dev/null +++ b/ai-docs/MACHINE-API-OPERATOR_DEVELOPMENT.md @@ -0,0 +1,298 @@ +# machine-api-operator Development Guide + +**Last Updated:** 2026-05-01 + +--- + +## Overview + +This guide covers the development workflow for machine-api-operator. + +**Tech Stack:** **Languages:** Go +**Frameworks:** Kubernetes, controller-runtime +**Build Systems:** Make, Docker + +--- + +## Prerequisites + +**Required:** +- Go 1.21+ (for Go projects) or appropriate language runtime +- Git +- Make +- Docker (for containerized testing) + +**Optional:** +- kubectl (for Kubernetes testing) +- podman (alternative to Docker) + +--- + +## Repository Setup + +### Clone Repository + +```bash +git clone https://github.com/openshift-splat-team/machine-api-operator.git +cd machine-api-operator +``` + +### Install Dependencies + +```bash +# For Go projects +go mod download +go mod vendor # if vendoring is used + +# For Python projects +pip install -r requirements.txt +pip install -r requirements-dev.txt + +# For JavaScript/TypeScript +npm install +``` + +--- + +## Building + +### Local Build + +```bash +# For Go projects +make build + +# Or directly +go build -o bin/machine-api-operator ./cmd/... +``` + +### Build Container Image + +```bash +make docker-build + +# Or with podman +podman build -t machine-api-operator:latest . +``` + +--- + +## Development Workflow + +### 1. Create Feature Branch + +```bash +git checkout -b feature/my-feature +``` + +### 2. 
Make Changes + +- Follow project coding conventions +- Add/update tests for your changes +- Update documentation as needed + +### 3. Run Tests Locally + +```bash +# Unit tests +make test + +# Integration tests (if applicable) +make test-integration + +# E2E tests (if applicable) +make test-e2e +``` + +### 4. Verify Build + +```bash +# Lint +make lint + +# Format code +make fmt + +# Build +make build +``` + +### 5. Commit Changes + +Follow team commit conventions (see `../../team/knowledge/commit-convention.md`). + +### 6. Open Pull Request + +- Push branch to fork +- Open PR against main branch +- Request review from team +- Address review feedback +- Wait for CI to pass + +--- + +## Running Locally + +### As Standalone Binary + +```bash +# Build +make build + +# Run +./bin/machine-api-operator --help +``` + +### In Kubernetes Cluster + +```bash +# Build and push image +make docker-build docker-push + +# Deploy to cluster +kubectl apply -f deploy/ +``` + +### With Operator SDK (if applicable) + +```bash +# Run locally (watches cluster) +make run +``` + +--- + +## Debugging + +### Enable Debug Logging + +```bash +# Set log level +export LOG_LEVEL=debug + +# Or via command line +./bin/machine-api-operator --log-level=debug +``` + +### Attach Debugger (Go) + +```bash +# Install delve +go install github.com/go-delve/delve/cmd/dlv@latest + +# Debug +dlv debug ./cmd/machine-api-operator +``` + +### Common Issues + +**Build failures:** +- Check Go version: `go version` +- Verify dependencies: `go mod verify` +- Clean build cache: `go clean -cache` + +**Test failures:** +- Check test environment setup +- Review test logs for specific errors +- Run individual test: `go test -v -run TestName ./pkg/...` + +--- + +## Project Structure + +``` +machine-api-operator/ +├── cmd/ # Command-line entry points +├── pkg/ # Library code +│ ├── controller/ # Controllers (machine, machineset, machinehealthcheck, nodelink, vsphere) +│ ├── api/ # API types and CRDs +│ └── ... +├── config/ # Configuration (CRDs, RBAC, etc.) 
+├── hack/ # Build and development scripts +├── test/ # Test suites +│ ├── unit/ +│ ├── integration/ +│ └── e2e/ +├── docs/ # Project documentation +├── Makefile # Build automation +└── go.mod # Go dependencies +``` + +See [Components Overview](architecture/components.md) for architectural details. + +--- + +## Code Conventions + +### Naming + +- **Packages**: lowercase, single word if possible +- **Files**: lowercase with underscores (snake_case) +- **Types**: PascalCase +- **Functions**: PascalCase (exported) or camelCase (unexported) + +### Error Handling + +- Wrap errors with context: `fmt.Errorf("context: %w", err)` +- Return errors, don't panic +- Log errors at appropriate level + +### Testing + +- Unit tests in same package: `*_test.go` +- Table-driven tests preferred +- Mock external dependencies +- Aim for 80%+ code coverage + +--- + +## Helpful Make Targets + +**Common targets available:** + +- `make build` +- `make test-e2e` +- `make test` +- `make lint` +- `make fmt` + +For full list of targets, run: +```bash +make help +``` + +Or inspect the `Makefile` directly. + +--- + +## CI/CD + +### Prow Jobs (OpenShift) + +This project uses OpenShift Prow for CI/CD. + +**Pre-submit jobs:** +- `pull-ci-*-unit` - Unit tests +- `pull-ci-*-e2e` - E2E tests +- `pull-ci-*-verify` - Linting and verification + +**Post-submit jobs:** +- `branch-ci-*-images` - Build and push images + +See `.ci-operator.yaml` and `ci-operator/config/` for Prow configuration. + +### GitHub Actions (if applicable) + +See `.github/workflows/` for GitHub Actions configuration. + +--- + +## Related Documentation + +- [Testing Guide](machine-api-operator_TESTING.md) - Test suites and strategies +- [Components](architecture/components.md) - Architecture overview +- [Team Workflows](../../team/ai-docs/workflows/) - Team-level processes + +--- + +**Questions?** See `../../team/HUMAN-REVIEW-GUIDE.md` for how to escalate issues. 
diff --git a/ai-docs/MACHINE-API-OPERATOR_TESTING.md b/ai-docs/MACHINE-API-OPERATOR_TESTING.md new file mode 100644 index 000000000..ef40bfa1d --- /dev/null +++ b/ai-docs/MACHINE-API-OPERATOR_TESTING.md @@ -0,0 +1,416 @@ +# machine-api-operator Testing Guide + +**Last Updated:** 2026-05-01 + +--- + +## Overview + +This guide covers all test suites for machine-api-operator and how to run them. + +**Testing Philosophy:** +- Unit tests for business logic +- Integration tests for component interactions +- E2E tests for critical user workflows +- Aim for 80%+ code coverage + +--- + +## Test Suites + +### Unit Tests + +**Purpose:** Test individual functions and methods in isolation + +**Location:** `pkg/*/` (co-located with source) + +**Run:** + +```bash +make test +``` +```bash +NO_DOCKER=1 make test # Run locally without a container +``` + +**With coverage:** +```bash +go test -coverprofile=coverage.out ./pkg/... +go tool cover -html=coverage.out +``` + +**Example:** +```go +func TestMyFunction(t *testing.T) { + tests := []struct { + name string + input string + want string + wantErr bool + }{ + { + name: "valid input", + input: "test", + want: "result", + }, + // More test cases... + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + got, err := MyFunction(tt.input) + if (err != nil) != tt.wantErr { + t.Errorf("MyFunction() error = %v, wantErr %v", err, tt.wantErr) + return + } + if got != tt.want { + t.Errorf("MyFunction() = %v, want %v", got, tt.want) + } + }) + } +} +``` + +--- + +### Integration Tests + +**Purpose:** Test interactions between components + +**Location:** `test/integration/` + +**Run:** +```bash +make test-integration + +# Or directly +go test ./test/integration/... 
-tags=integration +``` + +**Requirements:** +- May require local Kubernetes cluster (kind, minikube) +- External dependencies (databases, message queues) + +**Example:** +```go +//go:build integration + +func TestControllerReconciliation(t *testing.T) { + // Setup test cluster + testEnv := setupTestEnvironment(t) + defer testEnv.Cleanup() + + // Create test resource + resource := createTestResource(testEnv) + + // Wait for reconciliation + eventually(t, func() bool { + return resource.Status.Ready + }, 30*time.Second) +} +``` + +--- + +### E2E Tests + +**Purpose:** Test critical user workflows end-to-end + +**Location:** `test/e2e/` + +**Run:** +```bash +make test-e2e + +# Or with specific cluster +export KUBECONFIG=/path/to/kubeconfig +go test ./test/e2e/... -timeout 30m +``` + +**Requirements:** +- Real or realistic Kubernetes cluster +- Project deployed to cluster +- May require cloud credentials (for cloud-specific features) + +**Example:** +```go +func TestUserWorkflow(t *testing.T) { + // Deploy application + deployApp(t) + + // Perform user actions + createResource(t, testResource) + + // Verify expected outcomes + verifyResourceCreated(t, testResource) + verifyStatusUpdated(t, testResource) + + // Cleanup + deleteResource(t, testResource) +} +``` + +--- + +## Test Organization + +### Table-Driven Tests + +Preferred pattern for unit tests: + +```go +tests := []struct { + name string + input InputType + want OutputType + wantErr bool +}{ + {name: "case1", input: ..., want: ...}, + {name: "case2", input: ..., want: ...}, +} + +for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + // Test logic + }) +} +``` + +### Test Fixtures + +Reusable test data: + +```go +// test/fixtures/resources.go +func NewTestResource(name string) *MyResource { + return &MyResource{ + ObjectMeta: metav1.ObjectMeta{ + Name: name, + Namespace: "test", + }, + Spec: MyResourceSpec{ + // Defaults + }, + } +} +``` + +### Test Helpers + +Common test utilities: + +```go 
+// test/helpers/assertions.go +func AssertEventually(t *testing.T, condition func() bool, timeout time.Duration) { + t.Helper() + deadline := time.Now().Add(timeout) + for time.Now().Before(deadline) { + if condition() { + return + } + time.Sleep(100 * time.Millisecond) + } + t.Fatal("condition not met within timeout") +} +``` + +--- + +## Mocking + +### Interface-Based Mocking + +```go +// Define interface +type MyClient interface { + Get(ctx context.Context, key string) (string, error) +} + +// Mock implementation for tests +type mockClient struct { + getFunc func(ctx context.Context, key string) (string, error) +} + +func (m *mockClient) Get(ctx context.Context, key string) (string, error) { + return m.getFunc(ctx, key) +} + +// Use in test +func TestWithMock(t *testing.T) { + mock := &mockClient{ + getFunc: func(ctx context.Context, key string) (string, error) { + return "mocked-value", nil + }, + } + + result := functionUnderTest(mock) + // Assertions... +} +``` + +### Using testify/mock (if applicable) + +```go +import "github.com/stretchr/testify/mock" + +type MockClient struct { + mock.Mock +} + +func (m *MockClient) Get(ctx context.Context, key string) (string, error) { + args := m.Called(ctx, key) + return args.String(0), args.Error(1) +} + +func TestWithTestify(t *testing.T) { + mockClient := new(MockClient) + mockClient.On("Get", mock.Anything, "key").Return("value", nil) + + result := functionUnderTest(mockClient) + + mockClient.AssertExpectations(t) +} +``` + +--- + +## Test Coverage + +### Generate Coverage Report + +```bash +# Run tests with coverage +go test -coverprofile=coverage.out ./pkg/... 
+ +# View HTML report +go tool cover -html=coverage.out + +# View summary +go tool cover -func=coverage.out +``` + +### Coverage Goals + +- **Minimum:** 70% overall coverage +- **Target:** 80%+ overall coverage +- **Critical paths:** 90%+ coverage (controllers, reconcilers, business logic) + +### Excluding from Coverage + +```go +// Intentionally untested. Note: `go tool cover` has no exclusion directive; +// this comment documents intent for reviewers only and does not affect the report. +func helperFunction() { + // ... +} +``` + +--- + +## CI Test Execution + +### Prow Jobs + +**Pre-submit tests (run on PRs):** +- `pull-ci-machine-api-operator-unit` - Unit tests +- `pull-ci-machine-api-operator-integration` - Integration tests (if enabled) +- `pull-ci-machine-api-operator-e2e-*` - E2E test suites + +**Post-submit tests (run on merge):** +- `branch-ci-machine-api-operator-unit` - Unit tests +- `branch-ci-machine-api-operator-e2e-*` - Full E2E suite + +### Debugging CI Failures + +1. **Check Prow logs** + - Find job in PR checks + - Click "Details" → view logs + +2. **Reproduce locally** + ```bash + # Match CI environment + export CI=true + make test + ``` + +3. **Run specific test** + ```bash + go test -v -run TestFailingTest ./pkg/... 
+ ``` + +--- + +## Test Best Practices + +### DO + +✅ Write tests before fixing bugs (TDD for bugs) +✅ Test both success and error paths +✅ Use table-driven tests for multiple scenarios +✅ Mock external dependencies +✅ Keep tests fast (unit tests < 1s, integration < 10s) +✅ Use meaningful test names describing the scenario +✅ Clean up resources in test cleanup functions + +### DON'T + +❌ Test implementation details (test behavior, not internals) +❌ Write flaky tests (tests that randomly fail) +❌ Skip cleanup (use `t.Cleanup()` or `defer`) +❌ Use sleeps (use eventually/wait helpers instead) +❌ Test third-party code (trust their tests) +❌ Ignore test failures ("it works on my machine") + +--- + +## Test Utilities + +### Common Test Commands + +```bash +# Run specific test +go test -run TestName ./pkg/path + +# Run tests in specific package +go test ./pkg/controller/machine/... + +# Run tests with race detector +go test -race ./pkg/... + +# Run tests with timeout +go test -timeout 5m ./test/e2e/... + +# Verbose output +go test -v ./pkg/... + +# Run tests matching pattern +go test -run "Test.*Controller" ./pkg/... +``` + +### Environment Variables + +```bash +# Enable debug logging in tests +export LOG_LEVEL=debug + +# Use specific kubeconfig for tests +export KUBECONFIG=/path/to/test-cluster-config + +# Skip slow tests +export SKIP_SLOW_TESTS=true + +# CI mode (stricter timeouts, no interactive) +export CI=true +``` + +--- + +## Related Documentation + +- [Development Guide](machine-api-operator_DEVELOPMENT.md) - Build and development workflow +- [Components](architecture/components.md) - Architecture to understand what to test +- [Team Testing Practices](../../team/ai-docs/practices/testing.md) - Team-wide testing guidelines + +--- + +**Questions?** See test-specific issues on GitHub or ask in the team channel. 
diff --git a/ai-docs/architecture/components.md b/ai-docs/architecture/components.md new file mode 100644 index 000000000..551c03bd0 --- /dev/null +++ b/ai-docs/architecture/components.md @@ -0,0 +1,342 @@ +# machine-api-operator Components + +**Last Updated:** 2026-05-01 + +--- + +## Overview + +This document describes the major components and architecture of machine-api-operator. + +**Tech Stack:** **Languages:** Go +**Frameworks:** Kubernetes, controller-runtime +**Build Systems:** Make, Docker + +--- + +## High-Level Architecture + +``` +┌─────────────────────────────────────────────┐ +│ machine-api-operator │ +│ │ +│ ┌──────────────┐ ┌─────────────────┐ │ +│ │ │ │ │ │ +│ │ Component A │─────▶│ Component B │ │ +│ │ │ │ │ │ +│ └──────────────┘ └─────────────────┘ │ +│ │ +└─────────────────────────────────────────────┘ +``` + +*(Replace with project-specific architecture diagram)* + +--- + +## Core Components + +### Component 1: [Name] + +**Purpose:** Brief description of what this component does + +**Location:** `pkg/component1/` + +**Responsibilities:** +- Responsibility 1 +- Responsibility 2 +- Responsibility 3 + +**Key Types:** +- `Type1` - Description +- `Type2` - Description + +**Interactions:** +- Calls Component 2 for X +- Listens to events from Y +- Stores data in Z + +**Example Usage:** +```go +// Code example showing how this component is used +``` + +--- + +### Component 2: [Name] + +**Purpose:** Brief description + +**Location:** `pkg/component2/` + +**Responsibilities:** +- Responsibility 1 +- Responsibility 2 + +**Key Types:** +- `Type1` - Description + +**Interactions:** +- Interacts with Component 1 +- Calls external service X + +--- + +## For Operator Projects + +### Controllers + +**Purpose:** Reconcile Kubernetes resources + +**Location:** `pkg/controller/` + +**Controllers:** `machine/`, `machineset/`, `machinehealthcheck/`, `nodelink/`, `vsphere/` + +**Reconciliation Pattern:** +1. Fetch resource from Kubernetes API +2. Validate resource spec +3. 
Create/update dependent resources +4. Update resource status +5. Requeue if needed + +**Example Reconciliation:** +```go +func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { + // Fetch the resource + obj := &v1alpha1.MyResource{} + if err := r.Get(ctx, req.NamespacedName, obj); err != nil { + return ctrl.Result{}, client.IgnoreNotFound(err) + } + + // Reconciliation logic here + + // Update status + if err := r.Status().Update(ctx, obj); err != nil { + return ctrl.Result{}, err + } + + return ctrl.Result{}, nil +} +``` + +--- + +### Custom Resource Definitions (CRDs) + +See [Domain Models](../domain/) for detailed CRD specifications. + +**Defined CRDs:** + +`Machine`, `MachineSet`, `MachineHealthCheck` + +--- + +## API Layer + +**Purpose:** Define interfaces and types + +**Location:** `pkg/api/` or `api/` + +**Key Types:** +- Request/Response structures +- Configuration types +- Status types + +--- + +## Data Flow + +``` +User/Client + ↓ +API Server + ↓ +Controller/Handler + ↓ +Business Logic + ↓ +External Systems +``` + +**Example Flow:** +1. User creates CustomResource +2. Controller watches for changes +3. Controller validates resource +4. Controller calls cloud provider API +5. Controller updates resource status + +--- + +## External Dependencies + +### Kubernetes API + +**Usage:** CRUD operations on Kubernetes resources + +**Authentication:** Service account with appropriate RBAC + +### Cloud Provider APIs (if applicable) + +**AWS:** +- SDK: `aws-sdk-go` +- Services: EC2, IAM, S3, etc. + +**GCP:** +- SDK: `cloud.google.com/go` +- Services: Compute, IAM, Storage, etc. + +**Azure:** +- SDK: `github.com/Azure/azure-sdk-for-go` +- Services: Compute, Network, Storage, etc. + +**vSphere:** +- SDK: `github.com/vmware/govmomi` +- APIs: vCenter, ESXi + +### Other Dependencies + +- **Database:** PostgreSQL, Redis, etc. +- **Message Queue:** RabbitMQ, Kafka, etc. +- **Cache:** Redis, Memcached, etc. 
+ +--- + +## Configuration + +### Config Locations + +- **In-cluster:** ConfigMaps, Secrets +- **Command-line:** Flags passed to binary +- **Environment:** Environment variables +- **Files:** Config files mounted to container + +### Config Precedence + +1. Command-line flags (highest priority) +2. Environment variables +3. ConfigMap/Secret values +4. Default values (lowest priority) + +--- + +## Observability + +### Logging + +**Framework:** klog, logrus, or standard log + +**Log Levels:** +- `ERROR` - Errors that need attention +- `WARN` - Warnings that may need attention +- `INFO` - Informational messages +- `DEBUG` - Verbose debugging + +**Structured Logging:** +```go +log.Info("resource reconciled", + "name", resource.Name, + "namespace", resource.Namespace, + "generation", resource.Generation) +``` + +### Metrics + +**Framework:** Prometheus client + +**Key Metrics:** +- `reconcile_duration_seconds` - Time to reconcile resources +- `reconcile_errors_total` - Count of reconciliation errors +- `resource_count` - Number of managed resources + +**Metrics Endpoint:** `/metrics` + +### Tracing (if applicable) + +**Framework:** OpenTelemetry + +**Traced Operations:** +- API calls +- Controller reconciliation +- External service calls + +--- + +## Error Handling + +### Error Types + +```go +type CustomError struct { + Code string + Message string + Cause error +} +``` + +### Retry Logic + +- **Transient errors:** Retry with exponential backoff +- **Permanent errors:** Don't retry, update status with error +- **Rate limits:** Respect retry-after headers + +### Error Propagation + +- Wrap errors with context +- Preserve original error for debugging +- Log errors at appropriate level + +--- + +## Security Considerations + +### Authentication + +- Service account tokens for in-cluster communication +- API keys for external services +- Certificate-based auth where applicable + +### Authorization + +- RBAC for Kubernetes resources +- Principle of least privilege +- Separate 
service accounts per component + +### Secrets Management + +- Store secrets in Kubernetes Secrets +- Never log secret values +- Rotate credentials regularly + +--- + +## Performance Considerations + +### Caching + +- Cache frequently accessed data +- Invalidate cache on updates +- Use TTL for time-sensitive data + +### Rate Limiting + +- Respect API rate limits +- Implement client-side rate limiting +- Use backoff for retries + +### Resource Limits + +- Set appropriate CPU/memory limits +- Monitor resource usage +- Scale based on load + +--- + +## Related Documentation + +- [Development Guide](../machine-api-operator_DEVELOPMENT.md) - How to build and run +- [Testing Guide](../machine-api-operator_TESTING.md) - How to test components +- [Domain Models](../domain/) - CRD specifications +- [ADRs](../decisions/) - Architectural decisions + +--- + +**Note:** This is a template. Update with project-specific component details, architecture diagrams, and actual code examples. diff --git a/ai-docs/decisions/adr-template.md b/ai-docs/decisions/adr-template.md new file mode 100644 index 000000000..68376fa79 --- /dev/null +++ b/ai-docs/decisions/adr-template.md @@ -0,0 +1,133 @@ +# ADR-NNNN: Title of Decision + +**Status:** Proposed | Accepted | Deprecated | Superseded by ADR-XXXX +**Date:** YYYY-MM-DD +**Authors:** @github-handle +**Deciders:** @lead, @architect + +--- + +## Context + +What is the issue we're facing? What forces are at play? What constraints exist? + +Describe the problem that necessitates this decision. + +--- + +## Decision + +What is the change we're proposing or have agreed to make? + +State the decision clearly and concisely. + +--- + +## Rationale + +Why did we choose this approach? + +Explain the reasoning behind the decision, considering: +- Technical factors +- Business requirements +- Team constraints +- Timeline considerations + +--- + +## Consequences + +What becomes easier or harder as a result of this decision? 
+ +### Positive Consequences + +- ✅ Benefit 1 +- ✅ Benefit 2 + +### Negative Consequences + +- ❌ Trade-off 1 +- ❌ Trade-off 2 + +### Neutral Consequences + +- ℹ️ Change 1 +- ℹ️ Change 2 + +--- + +## Alternatives Considered + +### Alternative 1: [Name] + +**Description:** Brief description of the alternative + +**Pros:** +- Advantage 1 +- Advantage 2 + +**Cons:** +- Disadvantage 1 +- Disadvantage 2 + +**Why not chosen:** Explanation + +--- + +### Alternative 2: [Name] + +**Description:** Brief description + +**Pros:** +- Advantage 1 + +**Cons:** +- Disadvantage 1 + +**Why not chosen:** Explanation + +--- + +## Implementation Notes + +How do we implement this decision? + +- Migration steps +- Code changes required +- Configuration updates +- Documentation updates + +--- + +## Validation + +How do we know this decision is working? + +- Success metrics +- Monitoring points +- Testing approach + +--- + +## References + +- Related GitHub issue: #XXX +- Related ADRs: ADR-YYYY +- External references: [Link](URL) +- Discussion thread: [Link](URL) + +--- + +## Notes + +Any additional context or information. + +--- + +## Revision History + +| Date | Author | Change | +|------|--------|--------| +| YYYY-MM-DD | @author | Initial draft | +| YYYY-MM-DD | @author | Addressed review feedback | +| YYYY-MM-DD | @author | Accepted | diff --git a/ai-docs/domain/README.md b/ai-docs/domain/README.md new file mode 100644 index 000000000..296d1ac69 --- /dev/null +++ b/ai-docs/domain/README.md @@ -0,0 +1,111 @@ +# machine-api-operator Domain Models + +**Last Updated:** 2026-05-01 + +--- + +## Overview + +This directory documents the domain models, custom resource definitions (CRDs), and core data structures used in machine-api-operator. + +--- + +## Custom Resource Definitions (CRDs) + +For Kubernetes operator projects, document each CRD here. 
+ +**Example structure for each CRD:** + +### ResourceName + +- **API Group:** `example.com/v1alpha1` +- **Kind:** `ResourceName` +- **Plural:** `resourcenames` +- **Scope:** Namespaced | Cluster + +**Purpose:** What this resource represents + +**Spec Fields:** +- `field1` (string, required) - Description +- `field2` (int, optional) - Description + +**Status Fields:** +- `conditions` ([]Condition) - Resource conditions +- `phase` (string) - Current phase (Pending, Ready, Error) + +**Example:** +```yaml +apiVersion: example.com/v1alpha1 +kind: ResourceName +metadata: + name: example + namespace: default +spec: + field1: "value" + field2: 42 +status: + phase: Ready + conditions: + - type: Ready + status: "True" + reason: ReconciliationSucceeded +``` + +**Validation:** +- Field1 must match pattern `^[a-z0-9-]+$` +- Field2 must be between 1-100 + +**Related Documentation:** +- Controller reconciliation logic: [../architecture/components.md](../architecture/components.md) +- API reference: See generated API docs + +--- + +## Core Data Structures + +For non-operator projects, document key data structures. 
+ +### Structure 1 + +**Purpose:** Description + +**Fields:** +```go +type MyStruct struct { + Field1 string `json:"field1"` + Field2 int `json:"field2"` +} +``` + +**Validation Rules:** +- Field1: required, non-empty +- Field2: must be positive + +--- + +## API Versioning + +**Current Version:** v1alpha1 + +**Versioning Policy:** +- `v1alpha1` - Initial experimental API +- `v1beta1` - API stabilizing, may have breaking changes +- `v1` - Stable API, backward compatibility guaranteed + +**Deprecated Fields:** +- (None currently) + +**Migration Guides:** +- [v1alpha1 → v1beta1](migrations/v1alpha1-to-v1beta1.md) (if applicable) + +--- + +## Related Documentation + +- [Components Overview](../architecture/components.md) - How these models are used +- [Development Guide](../machine-api-operator_DEVELOPMENT.md) - Adding new fields +- Generated API docs - Full API reference + +--- + +**Note:** For each major CRD or domain model, create a dedicated file (e.g., `credentialsrequest.md`) with detailed specification. diff --git a/ai-docs/exec-plans/README.md b/ai-docs/exec-plans/README.md new file mode 100644 index 000000000..98cfba20c --- /dev/null +++ b/ai-docs/exec-plans/README.md @@ -0,0 +1,195 @@ +# machine-api-operator Execution Plans + +**Last Updated:** 2026-05-01 + +--- + +## Purpose + +Execution plans (exec-plans) guide feature planning and implementation for machine-api-operator. 
+ +Use this directory to document: +- Feature design proposals +- Implementation plans +- Spike investigations +- Proof-of-concept findings + +--- + +## When to Create an Exec Plan + +Create an exec plan when: +- ✅ Implementing a significant new feature +- ✅ Making architectural changes +- ✅ Investigating a complex problem +- ✅ Proposing a major refactor + +Don't create an exec plan for: +- ❌ Bug fixes (unless they require design changes) +- ❌ Minor improvements +- ❌ Routine maintenance + +--- + +## Exec Plan Format + +### Template Structure + +```markdown +# Feature Name + +**Status:** Draft | In Review | Approved | Implemented +**Author:** GitHub handle +**Created:** YYYY-MM-DD +**Epic:** Link to GitHub epic issue + +## Problem Statement + +What problem are we solving? Why does it matter? + +## Goals + +- Goal 1 +- Goal 2 + +## Non-Goals + +- What we're explicitly NOT doing +- Out of scope items + +## Proposed Solution + +High-level approach to solving the problem. + +### Architecture + +Component diagrams, data flow, etc. + +### API Changes + +New APIs, changed APIs, deprecated APIs. + +### Migration Path + +How existing users/resources migrate to new behavior. + +## Alternatives Considered + +- **Alternative 1:** Description and why not chosen +- **Alternative 2:** Description and why not chosen + +## Implementation Plan + +1. **Phase 1:** Milestone 1 + - Story 1.1 + - Story 1.2 + +2. **Phase 2:** Milestone 2 + - Story 2.1 + - Story 2.2 + +## Testing Strategy + +- Unit tests +- Integration tests +- E2E scenarios + +## Risks and Mitigations + +| Risk | Impact | Mitigation | +|------|--------|------------| +| Risk 1 | High | Mitigation strategy | + +## Success Criteria + +How do we know the feature is successful? + +- Metric 1 +- Metric 2 +- User feedback + +## Timeline + +- **Week 1-2:** Design and review +- **Week 3-4:** Implementation phase 1 +- **Week 5-6:** Implementation phase 2 +- **Week 7:** Testing and documentation + +## Open Questions + +- Question 1? 
+- Question 2? +``` + +--- + +## Exec Plan Workflow + +### 1. Draft + +- Author creates exec plan document +- Shares with team for early feedback +- Iterates on design + +### 2. Review + +- Team reviews exec plan +- Discusses alternatives +- Identifies risks +- Approves or requests changes + +### 3. Approved + +- Exec plan is approved +- Implementation can begin +- Epic/stories created based on plan + +### 4. Implemented + +- Feature implemented +- Tests passing +- Documentation updated +- Exec plan archived for reference + +--- + +## Example Exec Plans + +*(Create example exec plan files as features are implemented)* + +- `feature-async-processing.md` - Async processing support +- `spike-performance-optimization.md` - Performance investigation +- `refactor-controller-architecture.md` - Architecture refactor + +--- + +## Relationship to ADRs + +**Exec Plans vs ADRs:** + +- **Exec Plan:** Feature design and implementation plan + - Created before implementation + - Describes what and how + - May span multiple epics/sprints + +- **ADR:** Architectural decision record + - Created during or after implementation + - Documents why a decision was made + - Explains trade-offs considered + +**Workflow:** +1. Create exec plan for feature +2. During implementation, significant architectural decisions → ADR +3. After implementation, exec plan archived, ADRs remain as reference + +--- + +## Related Documentation + +- [ADR Template](../decisions/adr-template.md) - Architectural decision records +- [Components](../architecture/components.md) - Current architecture +- [Team Workflows](../../team/ai-docs/workflows/) - Team planning process + +--- + +**Note:** This is a template directory. Replace with actual exec plans as features are proposed and implemented. 
diff --git a/ai-docs/references/ecosystem.md b/ai-docs/references/ecosystem.md new file mode 100644 index 000000000..7955b75b0 --- /dev/null +++ b/ai-docs/references/ecosystem.md @@ -0,0 +1,215 @@ +# machine-api-operator Ecosystem and References + +**Last Updated:** 2026-05-01 + +--- + +## Purpose + +This document provides links to related projects, upstream dependencies, documentation, and external resources relevant to machine-api-operator. + +--- + +## Upstream Projects + +### Kubernetes + +**Relationship:** machine-api-operator runs on Kubernetes + +**Resources:** +- [Kubernetes Documentation](https://kubernetes.io/docs/) +- [API Reference](https://kubernetes.io/docs/reference/kubernetes-api/) +- [Controller Runtime](https://github.com/kubernetes-sigs/controller-runtime) - Framework for building operators + +**Version Compatibility:** +- Supported Kubernetes versions: 1.24+ +- Controller Runtime version: v0.15.x + +--- + +### OpenShift (if applicable) + +**Relationship:** machine-api-operator is part of OpenShift platform + +**Resources:** +- [OpenShift Documentation](https://docs.openshift.com/) +- [OpenShift Enhancement Proposals](https://github.com/openshift/enhancements) +- [OpenShift CI (Prow)](https://docs.ci.openshift.org/) + +**Version Compatibility:** +- Supported OpenShift versions: 4.12+ + +--- + +## Related Platform Projects + +### Cloud Provider Integrations + +**vSphere (VMware):** +- [govmomi](https://github.com/vmware/govmomi) - vSphere API client +- [vSphere CSI Driver](https://github.com/kubernetes-sigs/vsphere-csi-driver) +- [vSphere Cloud Provider](https://github.com/kubernetes/cloud-provider-vsphere) + +**AWS:** +- [AWS SDK for Go](https://github.com/aws/aws-sdk-go) +- [AWS Cloud Provider](https://github.com/kubernetes/cloud-provider-aws) + +**GCP:** +- [GCP SDK](https://cloud.google.com/go) +- [GCP Cloud Provider](https://github.com/kubernetes/cloud-provider-gcp) + +**Azure:** +- [Azure SDK for 
Go](https://github.com/Azure/azure-sdk-for-go) +- [Azure Cloud Provider](https://github.com/kubernetes-sigs/cloud-provider-azure) + +--- + +## Sister Projects + +Projects in the same team or ecosystem: + +- **[Project 1](https://github.com/org/project1)** - Description +- **[Project 2](https://github.com/org/project2)** - Description +- **[Project 3](https://github.com/org/project3)** - Description + +See team repository for full project list: `../../team/ai-docs/architecture/projects.md` + +--- + +## Dependencies + +### Direct Dependencies + +Key libraries and frameworks used by machine-api-operator: + +**Go Modules:** +- `k8s.io/client-go` - Kubernetes client +- `sigs.k8s.io/controller-runtime` - Controller framework +- `github.com/spf13/cobra` - CLI framework (if applicable) +- `github.com/prometheus/client_golang` - Metrics + +**Python Packages (if applicable):** +- `kubernetes` - Kubernetes client +- `pytest` - Testing framework + +**JavaScript/TypeScript (if applicable):** +- `@kubernetes/client-node` - Kubernetes client +- `react` - UI framework + +See `go.mod`, `requirements.txt`, or `package.json` for complete dependency list. 
+ +### Indirect Dependencies + +- Authentication/authorization libraries +- Logging frameworks +- Testing utilities + +--- + +## Standards and Specifications + +### Kubernetes Standards + +- [Custom Resource Definition (CRD)](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/) +- [Controller Pattern](https://kubernetes.io/docs/concepts/architecture/controller/) +- [Admission Webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/) + +### Cloud Provider Standards + +- **AWS:** [AWS Well-Architected Framework](https://aws.amazon.com/architecture/well-architected/) +- **GCP:** [GCP Architecture Framework](https://cloud.google.com/architecture/framework) +- **Azure:** [Azure Well-Architected Framework](https://docs.microsoft.com/azure/architecture/framework/) +- **vSphere:** [vSphere API Reference](https://developer.vmware.com/apis/vsphere-automation/latest/) + +--- + +## Documentation Resources + +### Team-Level Documentation + +See team repository for: +- **Workflows:** Sprint process, epic breakdown, triage +- **Practices:** Coding standards, testing guidelines +- **Roles:** Hat-switching, responsibilities + +Location: `../../team/ai-docs/` + +### Platform Documentation + +**Operator Development:** +- [Operator SDK](https://sdk.operatorframework.io/) +- [Operator Best Practices](https://sdk.operatorframework.io/docs/best-practices/) +- [Kubebuilder Book](https://book.kubebuilder.io/) + +**Testing:** +- [Kubernetes Testing Guide](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/testing.md) +- [E2E Testing Framework](https://github.com/kubernetes-sigs/e2e-framework) + +**CI/CD:** +- [Prow Documentation](https://docs.prow.k8s.io/) +- [OpenShift CI](https://docs.ci.openshift.org/) + +--- + +## Community and Support + +### Communication Channels + +**Team Channels:** +- Slack: `#team-channel` (internal) +- GitHub Discussions: Project discussions tab 
+- Mailing List: team-list@example.com (if applicable) + +**Upstream Communities:** +- Kubernetes Slack: `#sig-cloud-provider`, `#kubebuilder`, etc. +- OpenShift Slack: `#forum-openshift`, `#forum-` + +### Meetings + +**Team Meetings:** +- Sprint planning: Bi-weekly (see team calendar) +- Sprint review: Bi-weekly +- Standup: Daily (async) + +**Upstream Meetings:** +- SIG meetings: Check [Kubernetes calendar](https://calendar.google.com/calendar/embed?src=calendar%40kubernetes.io) +- OpenShift meetings: Check [OpenShift calendar](https://calendar.google.com/calendar/embed?src=openshift.io_5s2lnu98o7vjhm8hs5q4vkp7s0%40group.calendar.google.com) + +--- + +## Learning Resources + +### Getting Started + +**Kubernetes:** +- [Kubernetes Basics](https://kubernetes.io/docs/tutorials/kubernetes-basics/) +- [Kubernetes the Hard Way](https://github.com/kelseyhightower/kubernetes-the-hard-way) + +**Operator Development:** +- [Operator Pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/) +- [Operator SDK Tutorial](https://sdk.operatorframework.io/docs/building-operators/golang/tutorial/) + +**Cloud Providers:** +- [vSphere Docs](https://docs.vmware.com/en/VMware-vSphere/index.html) +- [AWS Documentation](https://docs.aws.amazon.com/) +- [GCP Documentation](https://cloud.google.com/docs) +- [Azure Documentation](https://docs.microsoft.com/azure/) + +### Advanced Topics + +- [Kubernetes API Conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md) +- [Controller Runtime Deep Dive](https://engineering.bitnami.com/articles/kubebuilder-deep-dive.html) +- [Writing Controllers](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-api-machinery/controllers.md) + +--- + +## Related Documentation + +- [Development Guide](../machine-api-operator_DEVELOPMENT.md) - Build and develop +- [Testing Guide](../machine-api-operator_TESTING.md) - Test suites +- 
[Components](../architecture/components.md) - Architecture +- [ADRs](../decisions/) - Architectural decisions + +--- + +**Note:** Update this document as the ecosystem evolves, dependencies change, or new resources become available. diff --git a/tests/TEST_PLAN.md b/tests/TEST_PLAN.md new file mode 100644 index 000000000..66e47bfe9 --- /dev/null +++ b/tests/TEST_PLAN.md @@ -0,0 +1,84 @@ +# Test Plan: Machine API Operator Component Credential Integration + +**Story:** #20 - Machine API Operator Component Credential Integration +**Epic:** #14 - vSphere multi-account credential management +**Dependency:** #19 - CCO Detects and Provisions Component Credentials + +## Overview + +This test plan validates that the Machine API Operator correctly integrates with component-specific credentials provisioned by the Cloud Credential Operator (CCO). The operator must read credentials from the openshift-machine-api namespace, support multi-vCenter deployments, validate vSphere privileges, and handle credential rotation gracefully. + +## Test Scope + +### In Scope +- Reading vsphere-machine-api-creds from openshift-machine-api namespace +- FQDN-based credential lookup for multi-vCenter support +- Validation of 35 required vSphere privileges +- Error reporting to cluster operator status +- Machine lifecycle operations with component credentials +- Graceful credential rotation without downtime +- Multi-vCenter credential isolation + +### Out of Scope +- CCO credential provisioning logic (covered by Story #19) +- Installer credential setup (covered by other stories) +- Storage operator integration +- Cloud controller manager integration + +## Test Categories + +### 1. Credential Reading +**File:** credential_reader_test.go +**Objective:** Verify the operator reads credentials from the correct namespace and secret. + +### 2. vCenter Lookup +**File:** vcenter_lookup_test.go +**Objective:** Verify correct credential selection based on vCenter FQDN. + +### 3. 
Privilege Validation +**File:** privilege_validator_test.go +**Objective:** Validate all 35 required vSphere privileges before machine operations. + +### 4. Status Reporting +**File:** status_reporter_test.go +**Objective:** Verify validation errors appear in cluster operator status with clear messaging. + +### 5. Machine Lifecycle +**File:** machine_lifecycle_test.go +**Objective:** Verify machine operations succeed using component credentials. + +### 6. Credential Rotation +**File:** credential_rotation_test.go +**Objective:** Verify graceful credential rotation without downtime. + +### 7. Multi-vCenter Isolation +**File:** multi_vcenter_isolation_test.go +**Objective:** Verify credentials cannot cross vCenter boundaries. + +## Success Criteria + +- ✅ All acceptance criteria have corresponding test cases +- ✅ Edge cases covered (missing credentials, invalid privileges, rotation failures) +- ✅ Tests are deterministic and reproducible +- ✅ Tests run in CI pipeline +- ✅ Unit tests: >80% code coverage +- ✅ All tests pass consistently + +## Test Execution + +```bash +# Run all tests +cd projects/machine-api-operator +go test ./tests/... -v + +# Run specific test category +go test ./tests/credential_reader_test.go -v +go test ./tests/privilege_validator_test.go -v +``` + +## Dependencies + +- Story #19 (CCO credential provisioning) must be complete +- vSphere test environment with multiple vCenter instances +- Test accounts with varying privilege levels +- OpenShift cluster for integration/e2e tests
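
## Illustrative Sketch: FQDN-Based Credential Lookup

The vCenter lookup and multi-vCenter isolation categories above both depend on selecting credentials by vCenter FQDN. The following is a minimal sketch of that selection, not the operator's actual implementation. It assumes the secret stores one `<fqdn>.username` and one `<fqdn>.password` key per vCenter with lowercase FQDNs; that key layout, the `vCenterCredentials` type, and the `lookupCredentials` function are all hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// vCenterCredentials is a hypothetical holder for one vCenter's login pair.
type vCenterCredentials struct {
	Username string
	Password string
}

// errCredentialsNotFound marks a lookup that found no complete pair.
var errCredentialsNotFound = errors.New("credentials not found")

// lookupCredentials picks the credentials for a single vCenter out of the
// secret's data map.
//
// Assumption (not confirmed by the test plan): keys are
// "<fqdn>.username" and "<fqdn>.password", with the FQDN in lowercase.
func lookupCredentials(secretData map[string][]byte, fqdn string) (vCenterCredentials, error) {
	// Normalize the requested host so "VC1.example.com" matches a lowercase key.
	host := strings.ToLower(strings.TrimSpace(fqdn))
	if host == "" {
		return vCenterCredentials{}, fmt.Errorf("empty vCenter FQDN: %w", errCredentialsNotFound)
	}

	user, hasUser := secretData[host+".username"]
	pass, hasPass := secretData[host+".password"]

	// Require both halves of the pair; never fall back to another vCenter's
	// entry, so credentials cannot cross vCenter boundaries.
	if !hasUser || !hasPass || len(user) == 0 || len(pass) == 0 {
		return vCenterCredentials{}, fmt.Errorf("vCenter %q: %w", host, errCredentialsNotFound)
	}

	return vCenterCredentials{Username: string(user), Password: string(pass)}, nil
}

func main() {
	// Example secret data for two vCenters (placeholder values).
	data := map[string][]byte{
		"vc1.example.com.username": []byte("user-a"),
		"vc1.example.com.password": []byte("pass-a"),
		"vc2.example.com.username": []byte("user-b"),
		"vc2.example.com.password": []byte("pass-b"),
	}

	creds, err := lookupCredentials(data, "vc1.example.com")
	fmt.Println(creds.Username, err)

	_, err = lookupCredentials(data, "vc3.example.com")
	fmt.Println(errors.Is(err, errCredentialsNotFound))
}
```

Returning an error for an unknown or incomplete entry, rather than falling back to a default credential, is one way to keep the isolation tests meaningful: a lookup for one vCenter can only ever return that vCenter's pair.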