Skip to content

feat(tfc-job-agent): Init#790

Open
zacharyblasczyk wants to merge 21 commits intomainfrom
tfc-workspace-job-agent
Open

feat(tfc-job-agent): Init#790
zacharyblasczyk wants to merge 21 commits intomainfrom
tfc-workspace-job-agent

Conversation

@zacharyblasczyk
Copy link
Member

@zacharyblasczyk zacharyblasczyk commented Feb 5, 2026

Summary

Refactor the TFE job agent to own the run lifecycle directly instead of delegating to a verification provider.

TFE has definitive run states and doesn't need continuous reconciliation — the dispatcher now polls TFC every 15s, maps ~25 run states to ctrlplane job statuses, and publishes events on each state transition.

Changes

  • Remove verification provider indirection — delete tfe_verifications.go entirely; the TFE dispatcher handles its own polling and state mapping
  • Engine-restart resilience — add Restorable interface so jobs with an ExternalId (TFC run ID) resume polling after restart; orphaned jobs are marked as failed
  • Accurate resource change counts — include the plan relation in the poll response
  • Status mapping fix — map RunPending to pending instead of inProgress
  • Job status tooltips — add Message field to JobSummary response so the release targets page can show status context
  • 23 unit tests for the TFE dispatcher covering state transitions, restore behavior, and edge cases

Test plan

  • 23 new unit tests in tfe_test.go (go test ./pkg/workspace/jobagents/terraformcloud/...)
  • Verify pnpm build passes for @ctrlplane/workspace-engine
  • Manually trigger a TFC run and confirm status transitions appear correctly
  • Kill and restart the workspace engine mid-run; confirm the job resumes polling

Summary by CodeRabbit

  • New Features

    • Introduced Terraform Cloud integration for workspace job agents, enabling workspace management, variable synchronization, and run monitoring with asynchronous status updates.
    • Added support for job agent configuration merging with proper precedence handling across agent, deployment, and version levels.
    • Implemented job restoration to resume monitoring in-flight runs after service restart.
  • Tests

    • Added end-to-end test validating Terraform Cloud job agent configuration merge precedence and functionality.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 5, 2026

📝 Walkthrough

Walkthrough

Adds Terraform Cloud job agent implementation and wiring: new templating types and full dispatch flow (workspace upsert, variable sync, run creation/monitoring, event emission), registry change to pass a verification.Manager, an end-to-end test for config merge precedence, and small build and lint tweaks.

Changes

Cohort / File(s) Summary
Terraform Cloud implementation
apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go, apps/workspace-engine/pkg/workspace/jobagents/registry.go
Adds TFC templating types (VCSRepoTemplate, WorkspaceTemplate, VariableTemplate), implements Dispatch flow (parse config, render workspace, upsert workspace, sync variables, create & poll runs, send events), adds RestoreJobs to resume in-flight runs, and changes NewTFE signature to accept verifications *verification.Manager. Registry now passes verifications to constructor.
End-to-end test
apps/workspace-engine/test/e2e/engine_job_agent_tfe_test.go
New E2E test TestEngine_TerraformCloudJobAgentConfigMerge validating JobAgentConfig merge precedence across agent, deployment, and version and asserting pending job creation.
Build script
apps/workspace-engine/package.json
Build script updated to build current module (.) instead of a specific main.go path.
Reconcile memory logic
apps/workspace-engine/pkg/reconcile/memory/memory.go
Minor conditional rewrite in Enqueue to an equivalent logical form (no behavior change).
Controller lint workaround
apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go
Adds TODO and assigns unused resourceMatchedIds to blank identifier to satisfy linters (no functional change).

Sequence Diagram(s)

sequenceDiagram
    participant JobSystem as Job System
    participant TFE as TFE Dispatcher
    participant Store as Store
    participant TFC as Terraform Cloud API
    participant Verifier as Verification Manager
    participant Kafka as Kafka Producer

    JobSystem->>TFE: Dispatch(job)
    TFE->>Store: getTemplatableJob(job)
    Store-->>TFE: templatable job + templates
    TFE->>TFE: parseJobAgentConfig & render workspace template
    TFE->>TFC: upsertWorkspace(workspace)
    TFC-->>TFE: workspace (id/url)
    TFE->>TFC: syncVariables(workspace, variables)
    TFC-->>TFE: variables synced
    TFE->>TFC: createRun(workspace)
    TFC-->>TFE: run (externalId)
    TFE->>Verifier: verify(run)
    Verifier-->>TFE: verification result
    TFE->>Kafka: sendJobEvent(status, metadata)
    Kafka-->>TFE: ack
    TFE->>JobSystem: async completion / status updates
Loading

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly Related PRs

Suggested Reviewers

  • jsbroks

Poem

🐰 In clouds we plant a templated bed,
Workspaces hop where variables spread,
Runs take root and verifications hum,
Kafka flutters with each status drum,
A bunny cheers — the infra's fed!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(tfc-job-agent): Init' accurately reflects the main objective of the PR, which is to initialize the Terraform Cloud job agent implementation with core dispatcher functionality, templating, and polling logic.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch tfc-workspace-job-agent

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go`:
- Around line 95-129: The goroutine launching the TFE workflow ignores errors
from sendJobFailureEvent/sendJobUpdateEvent and has no panic recovery; modify
the anonymous goroutine to (1) add a deferred recover handler (defer func() { if
r := recover() { log the panic and call sendJobFailureEvent with a descriptive
message/error } }) so panics are caught and reported, and (2) check and handle
the error return values of every sendJobFailureEvent and sendJobUpdateEvent
invocation (e.g., after calling sendJobFailureEvent(...) capture err and if
non-nil attempt a fallback log and/or retry/report) so failures to publish
events are not silently dropped; apply these changes around calls in the
goroutine
(getClient/upsertWorkspace/syncVariables/createRun/verification.StartVerification/sendJobUpdateEvent).
- Around line 453-463: The code converts VariableTemplate.Category directly to
tfe.CategoryType in VariableTemplate.toCreateOptions which allows invalid
categories to reach the API; validate the Category string against the allowed
values ("terraform" and "env") and fail fast: either add validation in
parseJobAgentConfig (preferred) to return an error when an invalid category is
encountered, or change VariableTemplate.toCreateOptions to return
(tfe.VariableCreateOptions, error) and verify v.Category is one of the allowed
values before converting (use constants for "terraform"/"env" and return a clear
error mentioning the invalid value and the VariableTemplate.Key).
- Around line 199-201: The code currently checks the error by comparing
err.Error() to a string; replace that fragile comparison with a sentinel error
check using errors.Is(err, tfe.ErrResourceNotFound) (or err ==
tfe.ErrResourceNotFound) so wrapped errors are handled correctly: change the
condition to if err != nil && !errors.Is(err, tfe.ErrResourceNotFound) { return
nil, fmt.Errorf("failed to read workspace: %w", err) } and add the "errors"
import if missing; reference symbols: err, tfe.ErrResourceNotFound in tfe.go.

In `@apps/workspace-engine/test/e2e/engine_argocd_template_test.go`:
- Around line 255-263: The code block parsing jobAgentConfigJSON has multiple
statements concatenated on single lines (e.g., json.Unmarshal into
jobAgentConfig, require.NoError, t.Logf, template extraction and checks) causing
gofmt failures; split each statement onto its own line and ensure proper
statement separators so that json.Unmarshal(..., &jobAgentConfig) is followed by
require.NoError(t, err), then t.Logf(...) on its own line, then the template
lookup (template, ok := jobAgentConfig["template"]) and subsequent
require.True/require.NotEmpty/templateStr conversion and t.Logf or
assert.Contains calls are each on separate lines; adjust the sequence around
jobAgentConfig, template, templateStr, require/ assert and t.Logf to restore
correct formatting and compile cleanly.
🧹 Nitpick comments (6)
apps/workspace-engine/test/e2e/engine_argocd_template_test.go (2)

147-151: Consider simplifying map iteration for single job retrieval.

The loop iterates over the map just to get the first (and only expected) job. A more direct approach would improve readability.

♻️ Suggested simplification
-	var job *oapi.Job
-	for _, j := range pendingJobs {
-		job = j
-		break
-	}
+	// Get the first (and only) pending job
+	var job *oapi.Job
+	for _, j := range pendingJobs {
+		job = j
+		break
+	}

Alternatively, if there's a helper to get jobs as a slice, that would be cleaner. The current approach works but the break after assignment is a common Go pattern for "get any one element from map."


242-244: Unchecked type assertion could panic.

The type assertion template.(string) on line 244 could panic if the value is not a string. While this would just fail the test, using a checked assertion is safer.

♻️ Safer type assertion pattern
 	template, ok := dv.JobAgentConfig["template"]
 	require.True(t, ok, "template field should be present in JobAgentConfig")
-	assert.Contains(t, template.(string), "{{.Resource.Name}}", "template should contain resource name placeholder")
+	templateStr, ok := template.(string)
+	require.True(t, ok, "template field should be a string")
+	assert.Contains(t, templateStr, "{{.Resource.Name}}", "template should contain resource name placeholder")

This same pattern applies to lines 261 and 294 as well.

apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe_verifications.go (2)

10-16: Consider adding documentation for the exported type.

TFERunVerification is an exported type but lacks a documentation comment explaining its purpose and usage.

📝 Suggested documentation
+// TFERunVerification manages the verification workflow for Terraform Cloud runs,
+// polling run status and reporting success/failure based on the run outcome.
 type TFERunVerification struct {
 	verifications *verification.Manager
 	job           *oapi.Job
 	address       string
 	token         string
 	runID         string
 }

As per coding guidelines: "Write comments that explain why, document complex logic/algorithms, provide non-obvious context, include TODO/FIXME, and document exported functions/types/methods."


48-61: Consider extracting magic numbers as constants for clarity.

The polling configuration uses hardcoded values (100 iterations, 60-second intervals) that could benefit from being named constants for self-documentation and easier tuning.

♻️ Suggested refactor with named constants
+const (
+	// TFE run verification polls up to 100 times at 60-second intervals (~100 min max)
+	defaultVerificationCount    = 100
+	defaultVerificationInterval = 60
+	defaultSuccessThreshold     = 1
+	defaultFailureThreshold     = 1
+)
+
 func (v *TFERunVerification) buildMetricSpec(provider oapi.MetricProvider) oapi.VerificationMetricSpec {
 	failureCondition := "result.status == 'canceled' || result.status == 'discarded' || result.status == 'errored'"
-	successThreshold := 1
-	failureThreshold := 1
+	successThreshold := defaultSuccessThreshold
+	failureThreshold := defaultFailureThreshold
 	return oapi.VerificationMetricSpec{
-		Count:            100,
-		IntervalSeconds:  60,
+		Count:            defaultVerificationCount,
+		IntervalSeconds:  defaultVerificationInterval,
apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go (2)

366-374: Kafka producer created per event is inefficient.

getKafkaProducer() creates a new producer for each event, which is immediately closed after publishing. This adds connection overhead per message. Consider caching the producer at the TFE struct level or using a shared producer pool.

This pattern would be problematic at scale. For now, it works but consider initializing the producer once during TFE construction and reusing it across events.


69-71: Consider adding documentation for exported constructor and types.

NewTFE and the exported template types (WorkspaceTemplate, VariableTemplate, VCSRepoTemplate) lack documentation comments. While not blocking, these would help consumers understand the API.

As per coding guidelines: "document exported functions/types/methods."

Comment on lines 95 to 129
go func() {
ctx := context.WithoutCancel(ctx)
client, err := t.getClient(address, token)
if err != nil {
t.sendJobFailureEvent(dispatchCtx, fmt.Sprintf("failed to create Terraform Cloud client: %s", err.Error()))
return
}

targetWorkspace, err := t.upsertWorkspace(ctx, client, organization, workspace)
if err != nil {
t.sendJobFailureEvent(dispatchCtx, fmt.Sprintf("failed to upsert workspace: %s", err.Error()))
return
}

if len(workspace.Variables) > 0 {
if err := t.syncVariables(ctx, client, targetWorkspace.ID, workspace.Variables); err != nil {
t.sendJobFailureEvent(dispatchCtx, fmt.Sprintf("failed to sync variables: %s", err.Error()))
return
}
}

run, err := t.createRun(ctx, client, targetWorkspace.ID, dispatchCtx.Job.Id)
if err != nil {
t.sendJobFailureEvent(dispatchCtx, fmt.Sprintf("failed to create run: %s", err.Error()))
return
}

verification := newTFERunVerification(t.verifications, dispatchCtx.Job, address, token, run.ID)
if err := verification.StartVerification(ctx); err != nil {
t.sendJobFailureEvent(dispatchCtx, fmt.Sprintf("failed to start verification: %s", err.Error()))
return
}

t.sendJobUpdateEvent(address, organization, targetWorkspace.Name, run, dispatchCtx)
}()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Goroutine ignores errors from sendJobFailureEvent and lacks panic recovery.

The goroutine calls sendJobFailureEvent but ignores its return value. If event publishing fails, the job failure is silently lost with no logging or alerting. Additionally, any panic in this goroutine would crash silently.

🛡️ Proposed fix to handle errors and add recovery
 	go func() {
+		defer func() {
+			if r := recover(); r != nil {
+				// Log panic - consider using structured logging
+				fmt.Printf("panic in TFE dispatch goroutine: %v\n", r)
+			}
+		}()
+
 		ctx := context.WithoutCancel(ctx)
 		client, err := t.getClient(address, token)
 		if err != nil {
-			t.sendJobFailureEvent(dispatchCtx, fmt.Sprintf("failed to create Terraform Cloud client: %s", err.Error()))
+			if sendErr := t.sendJobFailureEvent(dispatchCtx, fmt.Sprintf("failed to create Terraform Cloud client: %s", err.Error())); sendErr != nil {
+				// Log the failure to send the event
+				fmt.Printf("failed to send job failure event: %v (original error: %v)\n", sendErr, err)
+			}
 			return
 		}

Apply similar error handling to all sendJobFailureEvent and sendJobUpdateEvent calls within the goroutine.

🤖 Prompt for AI Agents
In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go` around
lines 95 - 129, The goroutine launching the TFE workflow ignores errors from
sendJobFailureEvent/sendJobUpdateEvent and has no panic recovery; modify the
anonymous goroutine to (1) add a deferred recover handler (defer func() { if r
:= recover() { log the panic and call sendJobFailureEvent with a descriptive
message/error } }) so panics are caught and reported, and (2) check and handle
the error return values of every sendJobFailureEvent and sendJobUpdateEvent
invocation (e.g., after calling sendJobFailureEvent(...) capture err and if
non-nil attempt a fallback log and/or retry/report) so failures to publish
events are not silently dropped; apply these changes around calls in the
goroutine
(getClient/upsertWorkspace/syncVariables/createRun/verification.StartVerification/sendJobUpdateEvent).

Comment on lines 255 to 263
}` // Parse it like the CLI does
var jobAgentConfig map[string]interface{}
err := json.Unmarshal([]byte(jobAgentConfigJSON), &jobAgentConfig)
require.NoError(t, err) t.Logf("Parsed jobAgentConfig from CLI: %+v", jobAgentConfig) // Check template is present
template, ok := jobAgentConfig["template"]
require.True(t, ok, "template field should be present")
require.NotEmpty(t, template, "template should not be empty") templateStr := template.(string)
t.Logf("Template length: %d", len(templateStr))
assert.Contains(t, templateStr, "{{.Resource.Name}}-console", "template should contain the expected placeholder")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Formatting issues: missing newlines between statements.

Lines 255-263 have multiple statements concatenated on the same line, which violates gofmt compliance. This appears to be a copy-paste or merge artifact.

🔧 Proposed fix to add proper newlines
-}`	// Parse it like the CLI does
-	var jobAgentConfig map[string]interface{}
-	err := json.Unmarshal([]byte(jobAgentConfigJSON), &jobAgentConfig)
-	require.NoError(t, err)	t.Logf("Parsed jobAgentConfig from CLI: %+v", jobAgentConfig)	// Check template is present
-	template, ok := jobAgentConfig["template"]
-	require.True(t, ok, "template field should be present")
-	require.NotEmpty(t, template, "template should not be empty")	templateStr := template.(string)
+}
+
+	// Parse it like the CLI does
+	var jobAgentConfig map[string]interface{}
+	err := json.Unmarshal([]byte(jobAgentConfigJSON), &jobAgentConfig)
+	require.NoError(t, err)
+
+	t.Logf("Parsed jobAgentConfig from CLI: %+v", jobAgentConfig)
+
+	// Check template is present
+	template, ok := jobAgentConfig["template"]
+	require.True(t, ok, "template field should be present")
+	require.NotEmpty(t, template, "template should not be empty")
+
+	templateStr := template.(string)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
}` // Parse it like the CLI does
var jobAgentConfig map[string]interface{}
err := json.Unmarshal([]byte(jobAgentConfigJSON), &jobAgentConfig)
require.NoError(t, err) t.Logf("Parsed jobAgentConfig from CLI: %+v", jobAgentConfig) // Check template is present
template, ok := jobAgentConfig["template"]
require.True(t, ok, "template field should be present")
require.NotEmpty(t, template, "template should not be empty") templateStr := template.(string)
t.Logf("Template length: %d", len(templateStr))
assert.Contains(t, templateStr, "{{.Resource.Name}}-console", "template should contain the expected placeholder")
}
// Parse it like the CLI does
var jobAgentConfig map[string]interface{}
err := json.Unmarshal([]byte(jobAgentConfigJSON), &jobAgentConfig)
require.NoError(t, err)
t.Logf("Parsed jobAgentConfig from CLI: %+v", jobAgentConfig)
// Check template is present
template, ok := jobAgentConfig["template"]
require.True(t, ok, "template field should be present")
require.NotEmpty(t, template, "template should not be empty")
templateStr := template.(string)
t.Logf("Template length: %d", len(templateStr))
assert.Contains(t, templateStr, "{{.Resource.Name}}-console", "template should contain the expected placeholder")
🤖 Prompt for AI Agents
In `@apps/workspace-engine/test/e2e/engine_argocd_template_test.go` around lines
255 - 263, The code block parsing jobAgentConfigJSON has multiple statements
concatenated on single lines (e.g., json.Unmarshal into jobAgentConfig,
require.NoError, t.Logf, template extraction and checks) causing gofmt failures;
split each statement onto its own line and ensure proper statement separators so
that json.Unmarshal(..., &jobAgentConfig) is followed by require.NoError(t,
err), then t.Logf(...) on its own line, then the template lookup (template, ok
:= jobAgentConfig["template"]) and subsequent
require.True/require.NotEmpty/templateStr conversion and t.Logf or
assert.Contains calls are each on separate lines; adjust the sequence around
jobAgentConfig, template, templateStr, require/ assert and t.Logf to restore
correct formatting and compile cleanly.

@jsbroks
Copy link
Member

jsbroks commented Feb 6, 2026

@adityachoudhari26 can you review this.

Also I dont think we need a post verification

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (1)
apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go (1)

87-116: Goroutine still lacks panic recovery and ignores sendJobEvent errors.

The fire-and-forget goroutine discards sendJobEvent return values at lines 91, 97, 103, 110, and 114. A panic anywhere in this goroutine will crash silently. This was flagged in a prior review.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go` around
lines 87 - 116, The goroutine that calls t.getClient, t.upsertWorkspace,
t.syncVariables, t.createRun, t.sendJobEvent and t.pollRunStatus must be made
resilient: wrap the goroutine body with a defer that recovers from panics and
sends a failure job event if a panic occurs, and check/handle the returned error
from every t.sendJobEvent call (at the spots after client creation,
upsertWorkspace, syncVariables, createRun and before pollRunStatus) instead of
discarding it; ensure sendJobEvent errors are logged/propagated (e.g., call
t.sendJobEvent and if it returns an error, log/send another failure event and
return), and keep the rest of the control flow (calling t.pollRunStatus)
unchanged.
🧹 Nitpick comments (2)
apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go (1)

196-202: RunPending in runStatusMap is dead code.

mapRunStatus handles tfe.RunPending in its switch (line 240) before consulting the map, so the entry at line 202 is never reached.

♻️ Remove the dead entry
 var runStatusMap = map[tfe.RunStatus]struct {
 	status  oapi.JobStatus
 	message string
 }{
-	// Pending
-	tfe.RunPending: {oapi.JobStatusPending, "Run pending in queue"},
-
 	// In-progress phases
 	tfe.RunFetching:                 {oapi.JobStatusInProgress, "Fetching configuration..."},

Also applies to: 239-241

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go` around
lines 196 - 202, The runStatusMap contains a dead entry for tfe.RunPending
because mapRunStatus already handles tfe.RunPending before consulting the map;
remove the tfe.RunPending element from the runStatusMap declaration (and any
other duplicated entries in the same map) so the mapping only contains statuses
that mapRunStatus does not already short-circuit; locate the map by the
runStatusMap variable and the mapRunStatus function to make this change.
apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go (1)

93-94: Non-trivial computation is discarded — consider short-circuiting until the TODO is addressed.

The function queries all workspace resources (line 68), compiles a CEL expression (line 61), and evaluates it per resource (lines 77-91) — all to produce resourceMatchedIds which is then thrown away. This work runs on every reconciliation cycle. Consider returning early or skipping the CEL evaluation loop until the association update is implemented, to avoid wasted DB and CPU overhead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go`
around lines 93 - 94, The computed resourceMatchedIds and the CEL evaluation
loop are doing non-trivial DB/CPU work only to be discarded; short-circuit that
work until the TODO (“use resourceMatchedIds to update deployment-resource
associations”) is implemented. Modify the reconciliation code so that before
compiling/evaluating the CEL expression and before calling the workspace
resources query, you check a guard (e.g., a boolean flag or feature check) and
return/skip early if the association update is not yet implemented; remove the
placeholder "_ = resourceMatchedIds" and either wire resourceMatchedIds into the
actual update logic or skip the entire CEL/evaluation loop and DB query until
the update logic (the TODO) is added. Ensure you reference the same symbols
(resourceMatchedIds, the CEL compile/evaluate block) so the change avoids
running that loop unnecessarily.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go`:
- Around line 329-333: The code sets eventJob.StartedAt and appends
oapi.JobUpdateEventFieldsToUpdateStartedAt every time mapRunStatus returns
InProgress, overwriting the original start time; change the logic in the block
that handles status == oapi.JobStatusInProgress (the snippet that sets
eventJob.StartedAt and appends to fields) to only set StartedAt and append the
field when eventJob.StartedAt is nil (i.e., it has not been set yet) so the
timestamp is recorded only on the first transition to InProgress.
- Around line 300-308: The URL-prefix logic in the block that builds runUrl and
workspaceUrl incorrectly prepends "https://" even when address already contains
a scheme (e.g., "http://"), so update the check in the code that constructs
runUrl and workspaceUrl (using variables address, organization, wsName and
run.ID in tfe.go) to detect an existing scheme first — for example, call
url.Parse(address) and if parsedURL.Scheme is empty then prefix "https://",
otherwise use address as-is (or, alternatively, use strings.Contains(address,
"://") to decide); ensure both runUrl and workspaceUrl apply the same
scheme-aware logic to avoid producing malformed URLs like "https://http://...".
- Around line 153-184: The pollRunStatus loop in pollRunStatus currently logs
all ReadWithOptions errors and continues forever, leaking a goroutine on
persistent API failures; modify the loop to detect terminal error types
(specifically check for tfe.ErrResourceNotFound returned by Runs.ReadWithOptions
and treat it as a terminal condition by mapping to an ExternalRunNotFound job
status via mapRunStatus/sendJobEvent and then return) and also add a
consecutive-error counter (e.g., const maxConsecutiveErrors = 10) that
increments on non-terminal errors and resets on a successful run read; when the
counter exceeds the threshold, send a final failed job event with an explanatory
message via sendJobEvent and exit the goroutine, while preserving existing
behavior for successful reads and normal terminal statuses determined by
isTerminalJobStatus.
- Around line 441-450: The syncVariables function currently calls
client.Variables.List(ctx, workspaceID, nil) which only returns the first page;
update syncVariables to paginate using tfe.VariableListOptions (incrementing
PageNumber) and repeatedly call client.Variables.List until CurrentPage >=
TotalPages, appending each page's Items into existingVars (and building
existingByKey) so all workspace variables are collected before
comparing/creating; keep using the Variables.List endpoint (not ListAll) to
avoid inherited varsets.

---

Duplicate comments:
In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go`:
- Around line 87-116: The goroutine that calls t.getClient, t.upsertWorkspace,
t.syncVariables, t.createRun, t.sendJobEvent and t.pollRunStatus must be made
resilient: wrap the goroutine body with a defer that recovers from panics and
sends a failure job event if a panic occurs, and check/handle the returned error
from every t.sendJobEvent call (at the spots after client creation,
upsertWorkspace, syncVariables, createRun and before pollRunStatus) instead of
discarding it; ensure sendJobEvent errors are logged/propagated (e.g., call
t.sendJobEvent and if it returns an error, log/send another failure event and
return), and keep the rest of the control flow (calling t.pollRunStatus)
unchanged.

---

Nitpick comments:
In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go`:
- Around line 196-202: The runStatusMap contains a dead entry for tfe.RunPending
because mapRunStatus already handles tfe.RunPending before consulting the map;
remove the tfe.RunPending element from the runStatusMap declaration (and any
other duplicated entries in the same map) so the mapping only contains statuses
that mapRunStatus does not already short-circuit; locate the map by the
runStatusMap variable and the mapRunStatus function to make this change.

In
`@apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go`:
- Around line 93-94: The computed resourceMatchedIds and the CEL evaluation loop
are doing non-trivial DB/CPU work only to be discarded; short-circuit that work
until the TODO (“use resourceMatchedIds to update deployment-resource
associations”) is implemented. Modify the reconciliation code so that before
compiling/evaluating the CEL expression and before calling the workspace
resources query, you check a guard (e.g., a boolean flag or feature check) and
return/skip early if the association update is not yet implemented; remove the
placeholder "_ = resourceMatchedIds" and either wire resourceMatchedIds into the
actual update logic or skip the entire CEL/evaluation loop and DB query until
the update logic (the TODO) is added. Ensure you reference the same symbols
(resourceMatchedIds, the CEL compile/evaluate block) so the change avoids
running that loop unnecessarily.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 80900fd and 80309dc.

📒 Files selected for processing (3)
  • apps/workspace-engine/pkg/reconcile/memory/memory.go
  • apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go
  • apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go

Comment on lines +153 to +184
func (t *TFE) pollRunStatus(ctx context.Context, client *tfe.Client, runID string, job *oapi.Job, address, organization, wsName string) {
ticker := time.NewTicker(15 * time.Second)
defer ticker.Stop()

var lastStatus tfe.RunStatus
for {
select {
case <-ctx.Done():
return
case <-ticker.C:
run, err := client.Runs.ReadWithOptions(ctx, runID, &tfe.RunReadOptions{
Include: []tfe.RunIncludeOpt{tfe.RunPlan},
})
if err != nil {
log.Error("Failed to poll TFC run status", "runId", runID, "error", err)
continue
}
if run.Status == lastStatus {
continue
}
lastStatus = run.Status

jobStatus, message := mapRunStatus(run)
if err := t.sendJobEvent(job, jobStatus, message, run, address, organization, wsName); err != nil {
log.Error("Failed to send job event", "runId", runID, "error", err)
}

if isTerminalJobStatus(jobStatus) {
return
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Unbounded polling on persistent API errors — leaked goroutine for unreachable/deleted runs.

If the TFC run is deleted or the token becomes invalid, Runs.ReadWithOptions will return an error on every tick forever. The goroutine never exits because errors are only logged and there's no terminal-error detection or retry cap.

Consider:

  1. Checking for tfe.ErrResourceNotFound and treating it as terminal (map to ExternalRunNotFound).
  2. Adding a consecutive-error counter and failing the job after a threshold (e.g., 10 consecutive failures).
🐛 Proposed sketch for error bounding
 func (t *TFE) pollRunStatus(ctx context.Context, client *tfe.Client, runID string, job *oapi.Job, address, organization, wsName string) {
 	ticker := time.NewTicker(15 * time.Second)
 	defer ticker.Stop()
 
+	const maxConsecutiveErrors = 10
+	consecutiveErrors := 0
 	var lastStatus tfe.RunStatus
 	for {
 		select {
 		case <-ctx.Done():
 			return
 		case <-ticker.C:
 			run, err := client.Runs.ReadWithOptions(ctx, runID, &tfe.RunReadOptions{
 				Include: []tfe.RunIncludeOpt{tfe.RunPlan},
 			})
 			if err != nil {
+				if errors.Is(err, tfe.ErrResourceNotFound) {
+					t.sendJobEvent(job, oapi.JobStatusExternalRunNotFound, "TFC run not found", nil, address, organization, wsName)
+					return
+				}
+				consecutiveErrors++
+				if consecutiveErrors >= maxConsecutiveErrors {
+					t.sendJobEvent(job, oapi.JobStatusFailure, fmt.Sprintf("Giving up after %d consecutive poll errors: %s", consecutiveErrors, err), nil, address, organization, wsName)
+					return
+				}
 				log.Error("Failed to poll TFC run status", "runId", runID, "error", err)
 				continue
 			}
+			consecutiveErrors = 0
 			if run.Status == lastStatus {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (t *TFE) pollRunStatus(ctx context.Context, client *tfe.Client, runID string, job *oapi.Job, address, organization, wsName string) {
ticker := time.NewTicker(15 * time.Second)
defer ticker.Stop()
var lastStatus tfe.RunStatus
for {
select {
case <-ctx.Done():
return
case <-ticker.C:
run, err := client.Runs.ReadWithOptions(ctx, runID, &tfe.RunReadOptions{
Include: []tfe.RunIncludeOpt{tfe.RunPlan},
})
if err != nil {
log.Error("Failed to poll TFC run status", "runId", runID, "error", err)
continue
}
if run.Status == lastStatus {
continue
}
lastStatus = run.Status
jobStatus, message := mapRunStatus(run)
if err := t.sendJobEvent(job, jobStatus, message, run, address, organization, wsName); err != nil {
log.Error("Failed to send job event", "runId", runID, "error", err)
}
if isTerminalJobStatus(jobStatus) {
return
}
}
}
func (t *TFE) pollRunStatus(ctx context.Context, client *tfe.Client, runID string, job *oapi.Job, address, organization, wsName string) {
ticker := time.NewTicker(15 * time.Second)
defer ticker.Stop()
const maxConsecutiveErrors = 10
consecutiveErrors := 0
var lastStatus tfe.RunStatus
for {
select {
case <-ctx.Done():
return
case <-ticker.C:
run, err := client.Runs.ReadWithOptions(ctx, runID, &tfe.RunReadOptions{
Include: []tfe.RunIncludeOpt{tfe.RunPlan},
})
if err != nil {
if errors.Is(err, tfe.ErrResourceNotFound) {
t.sendJobEvent(job, oapi.JobStatusExternalRunNotFound, "TFC run not found", nil, address, organization, wsName)
return
}
consecutiveErrors++
if consecutiveErrors >= maxConsecutiveErrors {
t.sendJobEvent(job, oapi.JobStatusFailure, fmt.Sprintf("Giving up after %d consecutive poll errors: %s", consecutiveErrors, err), nil, address, organization, wsName)
return
}
log.Error("Failed to poll TFC run status", "runId", runID, "error", err)
continue
}
consecutiveErrors = 0
if run.Status == lastStatus {
continue
}
lastStatus = run.Status
jobStatus, message := mapRunStatus(run)
if err := t.sendJobEvent(job, jobStatus, message, run, address, organization, wsName); err != nil {
log.Error("Failed to send job event", "runId", runID, "error", err)
}
if isTerminalJobStatus(jobStatus) {
return
}
}
}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go` around
lines 153 - 184, The pollRunStatus loop in pollRunStatus currently logs all
ReadWithOptions errors and continues forever, leaking a goroutine on persistent
API failures; modify the loop to detect terminal error types (specifically check
for tfe.ErrResourceNotFound returned by Runs.ReadWithOptions and treat it as a
terminal condition by mapping to an ExternalRunNotFound job status via
mapRunStatus/sendJobEvent and then return) and also add a consecutive-error
counter (e.g., const maxConsecutiveErrors = 10) that increments on non-terminal
errors and resets on a successful run read; when the counter exceeds the
threshold, send a final failed job event with an explanatory message via
sendJobEvent and exit the goroutine, while preserving existing behavior for
successful reads and normal terminal statuses determined by isTerminalJobStatus.

Comment on lines +300 to +308
if run != nil && address != "" && organization != "" && wsName != "" {
runUrl := fmt.Sprintf("%s/app/%s/workspaces/%s/runs/%s", address, organization, wsName, run.ID)
if !strings.HasPrefix(runUrl, "https://") {
runUrl = "https://" + runUrl
}
workspaceUrl := fmt.Sprintf("%s/app/%s/workspaces/%s", address, organization, wsName)
if !strings.HasPrefix(workspaceUrl, "https://") {
workspaceUrl = "https://" + workspaceUrl
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

URL prefix logic breaks for http:// addresses.

If address is "http://my-tfe.internal", the fmt.Sprintf produces "http://my-tfe.internal/app/..." which doesn't start with "https://", so it gets prefixed to "https://http://my-tfe.internal/app/..." — a malformed URL. Private TFE instances may use HTTP.

Use strings.Contains(address, "://") or url.Parse to check for an existing scheme instead.

🐛 Proposed fix
-	runUrl := fmt.Sprintf("%s/app/%s/workspaces/%s/runs/%s", address, organization, wsName, run.ID)
-	if !strings.HasPrefix(runUrl, "https://") {
-		runUrl = "https://" + runUrl
-	}
-	workspaceUrl := fmt.Sprintf("%s/app/%s/workspaces/%s", address, organization, wsName)
-	if !strings.HasPrefix(workspaceUrl, "https://") {
-		workspaceUrl = "https://" + workspaceUrl
-	}
+	normalizedAddr := address
+	if !strings.Contains(normalizedAddr, "://") {
+		normalizedAddr = "https://" + normalizedAddr
+	}
+	runUrl := fmt.Sprintf("%s/app/%s/workspaces/%s/runs/%s", normalizedAddr, organization, wsName, run.ID)
+	workspaceUrl := fmt.Sprintf("%s/app/%s/workspaces/%s", normalizedAddr, organization, wsName)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if run != nil && address != "" && organization != "" && wsName != "" {
runUrl := fmt.Sprintf("%s/app/%s/workspaces/%s/runs/%s", address, organization, wsName, run.ID)
if !strings.HasPrefix(runUrl, "https://") {
runUrl = "https://" + runUrl
}
workspaceUrl := fmt.Sprintf("%s/app/%s/workspaces/%s", address, organization, wsName)
if !strings.HasPrefix(workspaceUrl, "https://") {
workspaceUrl = "https://" + workspaceUrl
}
if run != nil && address != "" && organization != "" && wsName != "" {
normalizedAddr := address
if !strings.Contains(normalizedAddr, "://") {
normalizedAddr = "https://" + normalizedAddr
}
runUrl := fmt.Sprintf("%s/app/%s/workspaces/%s/runs/%s", normalizedAddr, organization, wsName, run.ID)
workspaceUrl := fmt.Sprintf("%s/app/%s/workspaces/%s", normalizedAddr, organization, wsName)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go` around
lines 300 - 308, The URL-prefix logic in the block that builds runUrl and
workspaceUrl incorrectly prepends "https://" even when address already contains
a scheme (e.g., "http://"), so update the check in the code that constructs
runUrl and workspaceUrl (using variables address, organization, wsName and
run.ID in tfe.go) to detect an existing scheme first — for example, call
url.Parse(address) and if parsedURL.Scheme is empty then prefix "https://",
otherwise use address as-is (or, alternatively, use strings.Contains(address,
"://") to decide); ensure both runUrl and workspaceUrl apply the same
scheme-aware logic to avoid producing malformed URLs like "https://http://...".

Comment on lines +329 to +333
// Set startedAt for first non-pending transition
if status == oapi.JobStatusInProgress {
eventJob.StartedAt = &now
fields = append(fields, oapi.JobUpdateEventFieldsToUpdateStartedAt)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

startedAt is overwritten on every InProgress transition, not just the first.

Each time mapRunStatus returns InProgress (which happens for ~20 different TFC states), startedAt is reset to now and included in fieldsToUpdate. The job's startedAt should reflect when it first became in-progress, not the latest state change.

Guard this behind a check on the job's existing StartedAt:

🔧 Proposed fix
-	if status == oapi.JobStatusInProgress {
+	if status == oapi.JobStatusInProgress && job.StartedAt == nil {
 		eventJob.StartedAt = &now
 		fields = append(fields, oapi.JobUpdateEventFieldsToUpdateStartedAt)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go` around
lines 329 - 333, The code sets eventJob.StartedAt and appends
oapi.JobUpdateEventFieldsToUpdateStartedAt every time mapRunStatus returns
InProgress, overwriting the original start time; change the logic in the block
that handles status == oapi.JobStatusInProgress (the snippet that sets
eventJob.StartedAt and appends to fields) to only set StartedAt and append the
field when eventJob.StartedAt is nil (i.e., it has not been set yet) so the
timestamp is recorded only on the first transition to InProgress.

Comment on lines +441 to +450
func (t *TFE) syncVariables(ctx context.Context, client *tfe.Client, workspaceID string, desiredVars []VariableTemplate) error {
existingVars, err := client.Variables.List(ctx, workspaceID, nil)
if err != nil {
return fmt.Errorf("failed to list variables: %w", err)
}

existingByKey := make(map[string]*tfe.Variable)
for _, v := range existingVars.Items {
existingByKey[v.Key] = v
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check go-tfe version in go.mod
cat apps/workspace-engine/go.mod | grep "go-tfe"

Repository: ctrlplanedev/ctrlplane

Length of output: 102


🏁 Script executed:

# Read the actual code around lines 441-450 with more context
cat -n apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go | sed -n '430,470p'

Repository: ctrlplanedev/ctrlplane

Length of output: 1550


🏁 Script executed:

# Check if there are other usages of Variables.List or pagination patterns in the codebase
rg "Variables\.List|ListAll" apps/workspace-engine/ -A 3 -B 3

Repository: ctrlplanedev/ctrlplane

Length of output: 4349


🌐 Web query:

go-tfe github Variables.List pagination ListAll method

💡 Result:

In hashicorp/go-tfe, Variables.ListAll is not a “fetch all pages” helper. It’s a different endpoint that returns workspace vars + vars inherited from variable sets:

  • List(...)GET workspaces/<id>/vars
  • ListAll(...)GET workspaces/<id>/all-vars (includes varset-inherited vars) [1]

Pagination

Both List and ListAll accept *VariableListOptions, which embeds ListOptions with:

  • page[number] (PageNumber)
  • page[size] (PageSize) [2]

So you paginate by looping and incrementing PageNumber until Pagination.CurrentPage == Pagination.TotalPages (or NextPage == 0, depending on what the API returns), accumulating VariableList.Items [2].

opts := &tfe.VariableListOptions{
  ListOptions: tfe.ListOptions{PageNumber: 1, PageSize: 100},
}

var all []*tfe.Variable
for {
  vl, err := client.Variables.ListAll(ctx, workspaceID, opts)
  if err != nil { return err }

  all = append(all, vl.Items...)

  if vl.Pagination == nil || vl.CurrentPage >= vl.TotalPages {
    break
  }
  opts.PageNumber++
}

Key point: “ListAll” = include varsets, not auto-pagination. [1]

Sources: [1] [2]


Implement pagination for Variables.List() to fetch all variables.

The call to client.Variables.List(ctx, workspaceID, nil) only retrieves the first page of results (default ~20 items). Workspaces with more variables will have missing entries in existingByKey, causing spurious Create calls for already-existing variables that fail with API conflict errors.

Paginate through all pages by looping with VariableListOptions and incrementing PageNumber until CurrentPage >= TotalPages. Note: ListAll() is not a pagination helper—it returns a different endpoint that includes varset-inherited variables. Use the appropriate method for your use case and paginate accordingly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go` around
lines 441 - 450, The syncVariables function currently calls
client.Variables.List(ctx, workspaceID, nil) which only returns the first page;
update syncVariables to paginate using tfe.VariableListOptions (incrementing
PageNumber) and repeatedly call client.Variables.List until CurrentPage >=
TotalPages, appending each page's Items into existingVars (and building
existingByKey) so all workspace variables are collected before
comparing/creating; keep using the Variables.List endpoint (not ListAll) to
avoid inherited varsets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants