Overview
Analysis of the client VM configuration-passing strategy and whether it should adopt the Terraform locals pattern used by the allocator infrastructure deployment or keep the current runtime tfvars approach.
Priority: Low (documentation/validation issue)
Scope: Architecture discussion, configuration management
Background
The LabLink system uses two different patterns for passing configuration to Terraform:
- Client VM deployment (inside the allocator package): Python generates terraform.runtime.tfvars from the cfg object
- Allocator infrastructure deployment (lablink-template repo): Terraform reads config.yaml directly and uses locals
Current Client VM Pattern
Flow
config.yaml (mounted at /config)
↓
get_config() loads into cfg object (Python)
↓
/api/launch endpoint extracts values:
- allocator_url (calculated via get_allocator_url())
- gpu_support (calculated via check_support_nvidia())
- machine_type, image_name, repository, etc.
↓
Python writes terraform.runtime.tfvars:
```hcl
allocator_ip = "1.2.3.4"
allocator_url = "https://test.lablink.sleap.ai"
machine_type = "g4dn.xlarge"
image_name = "ghcr.io/talmolab/lablink-client-base-image:latest"
repository = "https://github.com/talmolab/sleap-tutorial-data.git"
client_ami_id = "ami-0601752c11b394251"
subject_software = "sleap"
resource_suffix = "prod"
gpu_support = "true"
cloud_init_output_log_group = "lablink-cloud-init-logs"
region = "us-west-2"
```
↓
terraform apply -var-file=terraform.runtime.tfvars
↓
Terraform reads variables.tf defaults + runtime overrides
↓
templatefile("user_data.sh", {allocator_url = var.allocator_url, ...})
↓
Cloud-init sets environment variables in client VM
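The serialization step in this flow can be sketched as follows. This is a hypothetical simplification (the function and parameter names are illustrative, not the allocator package's actual API): it renders a flat dict of runtime values into HCL assignments like those in the example tfvars above.

```python
def render_tfvars(values: dict) -> str:
    """Render a flat dict of runtime values into HCL variable assignments.

    Booleans are emitted as quoted lowercase strings to match string-typed
    Terraform variables (e.g. gpu_support = "true" in the example above).
    """
    lines = []
    for key, value in values.items():
        if isinstance(value, bool):
            lines.append(f'{key} = "{str(value).lower()}"')
        else:
            lines.append(f'{key} = "{value}"')
    return "\n".join(lines) + "\n"
```

The Flask endpoint would write this string to terraform.runtime.tfvars before invoking `terraform apply -var-file=terraform.runtime.tfvars`.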
Code locations:
Characteristics
- Dynamic calculation: allocator_url calculated based on DNS/SSL config
- Python orchestration: Flask endpoint controls when/how VMs are launched
- Runtime flexibility: Can modify config without Terraform redeployment
- Audit trail: runtime.tfvars uploaded to S3 for debugging
- Separation of concerns: Python handles business logic, Terraform handles provisioning
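To illustrate why allocator_url stays in Python, here is a hypothetical simplification of what get_allocator_url()-style logic looks like. The real function also handles DNS pattern resolution, SSL staging detection, and sanitization; the name and signature here are illustrative only.

```python
def build_allocator_url(dns_enabled: bool, domain: str,
                        ssl_enabled: bool, public_ip: str) -> str:
    """Prefer the DNS name over HTTPS when DNS and SSL are both configured;
    otherwise fall back to the raw public IP over plain HTTP."""
    if dns_enabled and domain:
        scheme = "https" if ssl_enabled else "http"
        return f"{scheme}://{domain}"
    return f"http://{public_ip}"
```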
Allocator Infrastructure Pattern
Flow (from lablink-template repo)
config/config.yaml
↓
Terraform reads file("config/config.yaml")
↓
yamldecode() converts to Terraform data structure
↓
locals block extracts values:
```hcl
locals {
  config              = yamldecode(file("${path.module}/config/config.yaml"))
  dns_enabled         = try(local.config.dns.enabled, false)
  dns_domain          = try(local.config.dns.domain, "")
  allocator_image_tag = try(local.config.allocator.image, "latest")
  # ... many more locals
}
```
↓
Resources use locals directly:
```hcl
resource "aws_instance" "allocator" {
  user_data = templatefile("user_data.sh", {
    CONFIG_CONTENT = file("${path.module}/config/config.yaml")
    DOMAIN_NAME    = local.dns_domain
  })
}
```
Code locations:
Characteristics
- Pure Terraform: No Python orchestration needed
- Direct YAML reading: Terraform has native yamldecode()
- Static deployment: Config changes require terraform apply
- Single Terraform run: All infrastructure in one deployment
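The try() calls in the locals block behave like dictionary lookups with fallbacks. A Python analogue, for illustration only (Terraform does this natively with yamldecode() and try()):

```python
def extract_locals(config: dict) -> dict:
    """Mirror the try(local.config..., default) pattern: missing keys fall
    back to safe defaults instead of failing the plan."""
    dns = config.get("dns", {})
    allocator = config.get("allocator", {})
    return {
        "dns_enabled": dns.get("enabled", False),
        "dns_domain": dns.get("domain", ""),
        "allocator_image_tag": allocator.get("image", "latest"),
    }
```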
Question
Should client VM deployment adopt the allocator infrastructure pattern (Terraform locals reading config.yaml directly)?
Analysis
Option A: Switch to Terraform Locals (Like Allocator Infrastructure)
Implementation:
```hcl
# In client VM main.tf
locals {
  config       = yamldecode(file("${path.module}/../../conf/config.yaml"))
  machine_type = local.config.machine.machine_type
  image_name   = local.config.machine.image
  repository   = local.config.machine.repository
  # ...
}

resource "aws_instance" "lablink_vm" {
  instance_type = local.machine_type
  user_data = templatefile("user_data.sh", {
    allocator_url = local.allocator_url
    # ...
  })
}
```
Pros:
- Single source of truth (config.yaml)
- No intermediate runtime file
- Consistent with allocator infrastructure pattern
- Terraform validates config at plan time
Cons:
- ❌ Cannot calculate allocator_url: get_allocator_url() is complex Python logic (DNS patterns, SSL, sanitization)
- ❌ Cannot detect GPU support: check_support_nvidia() queries AWS API via Python
- ❌ Loses runtime flexibility: Every config change requires terraform plan/apply
- ❌ No audit trail: No runtime.tfvars file to debug what values were used
- ❌ Config location issue: Terraform would need to read /config/config.yaml (mounted), but that file is not available at plan time
- ❌ No Python orchestration: Flask endpoint controls VM launch timing and count
Option B: Keep Runtime Tfvars (Current Approach)
Pros:
- ✅ Python orchestration works: Flask endpoint controls launch process
- ✅ Dynamic calculations: allocator_url, gpu_support calculated at runtime
- ✅ Audit trail: runtime.tfvars uploaded to S3, traceable
- ✅ Separation of concerns: Business logic in Python, provisioning in Terraform
- ✅ Runtime flexibility: Can modify without Terraform redeployment
- ✅ Config still single source: Python reads cfg from config.yaml
Cons:
- Requires Python bridge between config and Terraform
- Two-step process (generate tfvars, then apply)
- Different pattern from allocator infrastructure
Recommendation
Keep the current runtime tfvars approach (Option B).
Why This Is the Right Pattern
1. Different use cases:
   - Allocator infrastructure: one-time deployment of the allocator EC2, DNS, SSL (pure Terraform)
   - Client VMs: dynamic, repeated launches orchestrated by the Python Flask app
2. Config IS still the single source of truth:
   - Python reads cfg from config.yaml (via get_config())
   - runtime.tfvars is a serialization of cfg values, not a separate source
   - It's a bridge format, not a competing configuration
3. Python orchestration is necessary:
   - allocator_url requires complex logic: DNS pattern resolution, SSL staging detection, URL sanitization
   - GPU support requires AWS API queries
   - Launch timing is controlled by the Flask /api/launch endpoint
   - VM count is calculated dynamically (num_vms + database.get_row_count())
4. Pattern consistency within domain:
   - Client VMs are orchestrated by the allocator service
   - Allocator infrastructure is self-deploying
   - Different domains warrant different patterns
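The dynamic VM count mentioned above can be sketched like this (hypothetical names; the real allocator reads the current row count from its database module):

```python
def target_vm_count(requested: int, existing_rows: int) -> int:
    """New VMs are appended after existing ones, so the Terraform count
    becomes the existing row count plus the newly requested VMs."""
    if requested < 0:
        raise ValueError("requested VM count must be non-negative")
    return existing_rows + requested
```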
What the Allocator Infrastructure Pattern Achieves
The Terraform locals pattern in lablink-template is appropriate because:
- Pure infrastructure deployment (EC2, DNS, EIP, Lambda)
- No runtime orchestration needed
- Config changes are infrequent (DNS, SSL settings)
- Terraform plan can show full diff before apply
What the Client VM Pattern Achieves
The runtime tfvars pattern is appropriate because:
- Dynamic orchestration (Flask controls timing, count, validation)
- Complex calculations (URL generation, GPU detection)
- Frequent operations (launch/destroy VMs on demand)
- Audit trail (runtime.tfvars uploaded to S3)
- Python has already loaded and validated the config
Potential Improvements (Optional)
While the pattern is appropriate, small enhancements could help:
1. Extract Port from allocator_url
```python
from urllib.parse import urlparse

parsed = urlparse(allocator_url)
port = parsed.port or (443 if parsed.scheme == "https" else 80)

# Add to runtime.tfvars (f is the open tfvars file handle):
f.write(f"allocator_port = {port}\n")
```
Benefit: Explicit port variable for client services fallback logic.
2. Validate Runtime Values Match Config
```python
# Before terraform apply, verify:
assert runtime_values["machine_type"] == cfg.machine.machine_type
assert runtime_values["region"] == cfg.app.region
```
Benefit: Catch bugs where runtime.tfvars diverges from cfg.
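A slightly fuller version of that check, as a hypothetical helper that collects every mismatch before failing rather than stopping at the first assertion:

```python
def validate_runtime_values(runtime_values: dict, expected: dict) -> None:
    """Raise if any serialized runtime value diverges from the loaded config.

    Collecting all mismatches first makes the error message actionable:
    it shows each divergent key with its (actual, expected) pair.
    """
    mismatches = {
        key: (runtime_values.get(key), value)
        for key, value in expected.items()
        if runtime_values.get(key) != value
    }
    if mismatches:
        raise ValueError(f"runtime.tfvars diverges from cfg: {mismatches}")
```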
3. Generate variables.tf Defaults from structured_config.py
```python
# Tool to sync variable defaults with dataclass defaults.
# Ensures Terraform defaults match Python config defaults.
```
Benefit: Reduces duplication between structured_config.py and variables.tf.
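One way such a sync tool could start, as a rough sketch: extract string defaults from a variables.tf body with a regex that handles flat variable blocks, then compare them against the dataclass defaults. The regex and function name are illustrative; a robust tool would use a real HCL parser.

```python
import re

def parse_tf_string_defaults(tf_text: str) -> dict:
    """Extract string defaults from flat variable blocks in a variables.tf
    body. Not a full HCL parser; nested blocks would need real tooling."""
    pattern = re.compile(
        r'variable\s+"(?P<name>\w+)"\s*\{[^}]*?default\s*=\s*"(?P<default>[^"]*)"',
        re.DOTALL,
    )
    return {m.group("name"): m.group("default") for m in pattern.finditer(tf_text)}
```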
4. Document the Pattern
Add to CLAUDE.md:
```markdown
## Client VM Configuration Pattern

Client VMs use **runtime tfvars** generated by Python (not Terraform locals).

Why: Python orchestration is needed for:

- Dynamic allocator URL calculation (DNS/SSL logic)
- GPU support detection (AWS API)
- Launch timing and VM count control (Flask endpoint)

The pattern:

1. Flask endpoint reads cfg (from config.yaml)
2. Python calculates derived values
3. Writes terraform.runtime.tfvars
4. Terraform applies with runtime overrides
5. runtime.tfvars uploaded to S3 for audit

This differs from allocator infrastructure (lablink-template), which uses
Terraform locals because it is pure infrastructure deployment without
runtime orchestration.
```
Non-Goals
- This does NOT propose changing the client VM pattern to Terraform locals
- This does NOT propose merging allocator infrastructure and client VM Terraform
- This does NOT question the single source of truth (config.yaml still is)
Conclusion
The current runtime tfvars pattern is appropriate and should be maintained.
The allocator infrastructure and client VM deployments have different requirements:
- Allocator infrastructure: Pure Terraform, static config, infrequent deployment
- Client VMs: Python-orchestrated, dynamic config, frequent launch/destroy
Both patterns are valid for their respective use cases. The config.yaml remains the single source of truth in both cases; the difference is in how that truth is propagated to Terraform.
Acceptance Criteria