Overview
Analysis of the client VM configuration-passing strategy and whether it should adopt the Terraform locals pattern used by the allocator infrastructure deployment or keep the current runtime tfvars approach.
Priority: Low (documentation/validation issue)
Scope: Architecture discussion, configuration management
Background
The LabLink system uses two different patterns for passing configuration to Terraform:
- Client VM deployment (inside the allocator package): Python generates terraform.runtime.tfvars from the cfg object
- Allocator infrastructure deployment (lablink-template repo): Terraform reads config.yaml directly and uses locals
Current Client VM Pattern
Flow
config.yaml (mounted at /config)
↓
get_config() loads into cfg object (Python)
↓
/api/launch endpoint extracts values:
- allocator_url (calculated via get_allocator_url())
- gpu_support (calculated via check_support_nvidia())
- machine_type, image_name, repository, etc.
↓
Python writes terraform.runtime.tfvars:
```hcl
allocator_ip = "1.2.3.4"
allocator_url = "https://test.lablink.sleap.ai"
machine_type = "g4dn.xlarge"
image_name = "ghcr.io/talmolab/lablink-client-base-image:latest"
repository = "https://github.com/talmolab/sleap-tutorial-data.git"
client_ami_id = "ami-0601752c11b394251"
subject_software = "sleap"
resource_suffix = "prod"
gpu_support = "true"
cloud_init_output_log_group = "lablink-cloud-init-logs"
region = "us-west-2"
```
↓
terraform apply -var-file=terraform.runtime.tfvars
↓
Terraform reads variables.tf defaults + runtime overrides
↓
templatefile("user_data.sh", {allocator_url = var.allocator_url, ...})
↓
Cloud-init sets environment variables in client VM
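The serialization step in this flow can be sketched as follows. This is a hypothetical simplification (the function and parameter names are illustrative, not the allocator package's actual API): it renders a flat dict of runtime values into HCL assignments like those in the example tfvars above.

```python
def render_tfvars(values: dict) -> str:
    """Render a flat dict of runtime values into HCL variable assignments.

    Booleans are emitted as quoted lowercase strings to match string-typed
    Terraform variables (e.g. gpu_support = "true" in the example above).
    """
    lines = []
    for key, value in values.items():
        if isinstance(value, bool):
            lines.append(f'{key} = "{str(value).lower()}"')
        else:
            lines.append(f'{key} = "{value}"')
    return "\n".join(lines) + "\n"
```

The Flask endpoint would write this string to terraform.runtime.tfvars before invoking `terraform apply -var-file=terraform.runtime.tfvars`.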
Code locations:
Characteristics
- Dynamic calculation: allocator_url calculated based on DNS/SSL config
- Python orchestration: Flask endpoint controls when/how VMs are launched
- Runtime flexibility: Can modify config without Terraform redeployment
- Audit trail: runtime.tfvars uploaded to S3 for debugging
- Separation of concerns: Python handles business logic, Terraform handles provisioning
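To illustrate why allocator_url stays in Python, here is a hypothetical simplification of what get_allocator_url()-style logic looks like. The real function also handles DNS pattern resolution, SSL staging detection, and sanitization; the name and signature here are illustrative only.

```python
def build_allocator_url(dns_enabled: bool, domain: str,
                        ssl_enabled: bool, public_ip: str) -> str:
    """Prefer the DNS name over HTTPS when DNS and SSL are both configured;
    otherwise fall back to the raw public IP over plain HTTP."""
    if dns_enabled and domain:
        scheme = "https" if ssl_enabled else "http"
        return f"{scheme}://{domain}"
    return f"http://{public_ip}"
```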
Allocator Infrastructure Pattern
Flow (from lablink-template repo)
config/config.yaml
↓
Terraform reads file("config/config.yaml")
↓
yamldecode() converts to Terraform data structure
↓
locals block extracts values:
```hcl
locals {
  config              = yamldecode(file("${path.module}/config/config.yaml"))
  dns_enabled         = try(local.config.dns.enabled, false)
  dns_domain          = try(local.config.dns.domain, "")
  allocator_image_tag = try(local.config.allocator.image, "latest")
  # ... many more locals
}
```
↓
Resources use locals directly:
```hcl
resource "aws_instance" "allocator" {
  user_data = templatefile("user_data.sh", {
    CONFIG_CONTENT = file("${path.module}/config/config.yaml")
    DOMAIN_NAME    = local.dns_domain
  })
}
```
Code locations:
Characteristics
- Pure Terraform: No Python orchestration needed
- Direct YAML reading: Terraform has native yamldecode()
- Static deployment: Config changes require terraform apply
- Single Terraform run: All infrastructure in one deployment
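The try() calls in the locals block behave like dictionary lookups with fallbacks. A Python analogue, for illustration only (Terraform does this natively with yamldecode() and try()):

```python
def extract_locals(config: dict) -> dict:
    """Mirror the try(local.config..., default) pattern: missing keys fall
    back to safe defaults instead of failing the plan."""
    dns = config.get("dns", {})
    allocator = config.get("allocator", {})
    return {
        "dns_enabled": dns.get("enabled", False),
        "dns_domain": dns.get("domain", ""),
        "allocator_image_tag": allocator.get("image", "latest"),
    }
```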
Question
Should client VM deployment adopt the allocator infrastructure pattern (Terraform locals reading config.yaml directly)?
Analysis
Option A: Switch to Terraform Locals (Like Allocator Infrastructure)
Implementation:
```hcl
# In client VM main.tf
locals {
  config       = yamldecode(file("${path.module}/../../conf/config.yaml"))
  machine_type = local.config.machine.machine_type
  image_name   = local.config.machine.image
  repository   = local.config.machine.repository
  # ...
}

resource "aws_instance" "lablink_vm" {
  instance_type = local.machine_type
  user_data = templatefile("user_data.sh", {
    allocator_url = local.allocator_url
    # ...
  })
}
```
Pros:
- Single source of truth (config.yaml)
- No intermediate runtime file
- Consistent with allocator infrastructure pattern
- Terraform validates config at plan time
Cons:
- ❌ Cannot calculate allocator_url: get_allocator_url() is complex Python logic (DNS patterns, SSL, sanitization)
- ❌ Cannot detect GPU support: check_support_nvidia() queries AWS API via Python
- ❌ Loses runtime flexibility: Every config change requires terraform plan/apply
- ❌ No audit trail: No runtime.tfvars file to debug what values were used
- ❌ Config location issue: Terraform would need to read /config/config.yaml (mounted), but that file is not available at plan time
- ❌ No Python orchestration: Flask endpoint controls VM launch timing and count
Option B: Keep Runtime Tfvars (Current Approach)
Pros:
- ✅ Python orchestration works: Flask endpoint controls launch process
- ✅ Dynamic calculations: allocator_url, gpu_support calculated at runtime
- ✅ Audit trail: runtime.tfvars uploaded to S3, traceable
- ✅ Separation of concerns: Business logic in Python, provisioning in Terraform
- ✅ Runtime flexibility: Can modify without Terraform redeployment
- ✅ Config still single source: Python reads cfg from config.yaml
Cons:
- Requires Python bridge between config and Terraform
- Two-step process (generate tfvars, then apply)
- Different pattern from allocator infrastructure
Recommendation
Keep the current runtime tfvars approach (Option B).
Why This Is the Right Pattern
1. Different use cases:
   - Allocator infrastructure: one-time deployment of the allocator EC2, DNS, SSL (pure Terraform)
   - Client VMs: dynamic, repeated launches orchestrated by the Python Flask app
2. Config IS still the single source of truth:
   - Python reads cfg from config.yaml (via get_config())
   - runtime.tfvars is a serialization of cfg values, not a separate source
   - It's a bridge format, not a competing configuration
3. Python orchestration is necessary:
   - allocator_url requires complex logic: DNS pattern resolution, SSL staging detection, URL sanitization
   - GPU support requires AWS API queries
   - Launch timing is controlled by the Flask /api/launch endpoint
   - VM count is calculated dynamically (num_vms + database.get_row_count())
4. Pattern consistency within domain:
   - Client VMs are orchestrated by the allocator service
   - Allocator infrastructure is self-deploying
   - Different domains warrant different patterns
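The dynamic VM count mentioned above can be sketched like this (hypothetical names; the real allocator reads the current row count from its database module):

```python
def target_vm_count(requested: int, existing_rows: int) -> int:
    """New VMs are appended after existing ones, so the Terraform count
    becomes the existing row count plus the newly requested VMs."""
    if requested < 0:
        raise ValueError("requested VM count must be non-negative")
    return existing_rows + requested
```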
What the Allocator Infrastructure Pattern Achieves
The Terraform locals pattern in lablink-template is appropriate because:
- Pure infrastructure deployment (EC2, DNS, EIP, Lambda)
- No runtime orchestration needed
- Config changes are infrequent (DNS, SSL settings)
- Terraform plan can show full diff before apply
What the Client VM Pattern Achieves
The runtime tfvars pattern is appropriate because:
- Dynamic orchestration (Flask controls timing, count, validation)
- Complex calculations (URL generation, GPU detection)
- Frequent operations (launch/destroy VMs on demand)
- Audit trail (runtime.tfvars uploaded to S3)
- Python has already loaded and validated the config
Potential Improvements (Optional)
While the pattern is appropriate, small enhancements could help:
1. Extract Port from allocator_url
```python
from urllib.parse import urlparse

parsed = urlparse(allocator_url)
port = parsed.port or (443 if parsed.scheme == "https" else 80)

# Add to runtime.tfvars (f is the open tfvars file handle):
f.write(f"allocator_port = {port}\n")
```
Benefit: Explicit port variable for client services fallback logic.
2. Validate Runtime Values Match Config
```python
# Before terraform apply, verify:
assert runtime_values["machine_type"] == cfg.machine.machine_type
assert runtime_values["region"] == cfg.app.region
```
Benefit: Catch bugs where runtime.tfvars diverges from cfg.
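A slightly fuller version of that check, as a hypothetical helper that collects every mismatch before failing rather than stopping at the first assertion:

```python
def validate_runtime_values(runtime_values: dict, expected: dict) -> None:
    """Raise if any serialized runtime value diverges from the loaded config.

    Collecting all mismatches first makes the error message actionable:
    it shows each divergent key with its (actual, expected) pair.
    """
    mismatches = {
        key: (runtime_values.get(key), value)
        for key, value in expected.items()
        if runtime_values.get(key) != value
    }
    if mismatches:
        raise ValueError(f"runtime.tfvars diverges from cfg: {mismatches}")
```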
3. Generate variables.tf Defaults from structured_config.py
```python
# Tool to sync variable defaults with dataclass defaults.
# Ensures Terraform defaults match Python config defaults.
```
Benefit: Reduces duplication between structured_config.py and variables.tf.
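One way such a sync tool could start, as a rough sketch: extract string defaults from a variables.tf body with a regex that handles flat variable blocks, then compare them against the dataclass defaults. The regex and function name are illustrative; a robust tool would use a real HCL parser.

```python
import re

def parse_tf_string_defaults(tf_text: str) -> dict:
    """Extract string defaults from flat variable blocks in a variables.tf
    body. Not a full HCL parser; nested blocks would need real tooling."""
    pattern = re.compile(
        r'variable\s+"(?P<name>\w+)"\s*\{[^}]*?default\s*=\s*"(?P<default>[^"]*)"',
        re.DOTALL,
    )
    return {m.group("name"): m.group("default") for m in pattern.finditer(tf_text)}
```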
4. Document the Pattern
Add to CLAUDE.md:
```markdown
## Client VM Configuration Pattern

Client VMs use **runtime tfvars** generated by Python (not Terraform locals).

Why: Python orchestration is needed for:

- Dynamic allocator URL calculation (DNS/SSL logic)
- GPU support detection (AWS API)
- Launch timing and VM count control (Flask endpoint)

The pattern:

1. Flask endpoint reads cfg (from config.yaml)
2. Python calculates derived values
3. Writes terraform.runtime.tfvars
4. Terraform applies with runtime overrides
5. runtime.tfvars uploaded to S3 for audit

This differs from allocator infrastructure (lablink-template), which uses
Terraform locals because it is pure infrastructure deployment without
runtime orchestration.
```
Non-Goals
- This does NOT propose changing the client VM pattern to Terraform locals
- This does NOT propose merging allocator infrastructure and client VM Terraform
- This does NOT question the single source of truth (config.yaml still is)
Conclusion
The current runtime tfvars pattern is appropriate and should be maintained.
The allocator infrastructure and client VM deployments have different requirements:
- Allocator infrastructure: Pure Terraform, static config, infrequent deployment
- Client VMs: Python-orchestrated, dynamic config, frequent launch/destroy
Both patterns are valid for their respective use cases. The config.yaml remains the single source of truth in both cases; the difference is in how that truth is propagated to Terraform.
Acceptance Criteria