📝 Walkthrough

Adds Terraform Cloud job agent implementation and wiring: new templating types and full dispatch flow (workspace upsert, variable sync, run creation/monitoring, event emission), a registry change to pass a verification.Manager, an end-to-end test for config merge precedence, and small build and lint tweaks.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant JobSystem as Job System
    participant TFE as TFE Dispatcher
    participant Store as Store
    participant TFC as Terraform Cloud API
    participant Verifier as Verification Manager
    participant Kafka as Kafka Producer
    JobSystem->>TFE: Dispatch(job)
    TFE->>Store: getTemplatableJob(job)
    Store-->>TFE: templatable job + templates
    TFE->>TFE: parseJobAgentConfig & render workspace template
    TFE->>TFC: upsertWorkspace(workspace)
    TFC-->>TFE: workspace (id/url)
    TFE->>TFC: syncVariables(workspace, variables)
    TFC-->>TFE: variables synced
    TFE->>TFC: createRun(workspace)
    TFC-->>TFE: run (externalId)
    TFE->>Verifier: verify(run)
    Verifier-->>TFE: verification result
    TFE->>Kafka: sendJobEvent(status, metadata)
    Kafka-->>TFE: ack
    TFE->>JobSystem: async completion / status updates
```
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go`:
- Around line 95-129: The goroutine launching the TFE workflow ignores errors
from sendJobFailureEvent/sendJobUpdateEvent and has no panic recovery; modify
the anonymous goroutine to (1) add a deferred recover handler (defer func() { if
r := recover() { log the panic and call sendJobFailureEvent with a descriptive
message/error } }) so panics are caught and reported, and (2) check and handle
the error return values of every sendJobFailureEvent and sendJobUpdateEvent
invocation (e.g., after calling sendJobFailureEvent(...) capture err and if
non-nil attempt a fallback log and/or retry/report) so failures to publish
events are not silently dropped; apply these changes around calls in the
goroutine
(getClient/upsertWorkspace/syncVariables/createRun/verification.StartVerification/sendJobUpdateEvent).
- Around line 453-463: The code converts VariableTemplate.Category directly to
tfe.CategoryType in VariableTemplate.toCreateOptions which allows invalid
categories to reach the API; validate the Category string against the allowed
values ("terraform" and "env") and fail fast: either add validation in
parseJobAgentConfig (preferred) to return an error when an invalid category is
encountered, or change VariableTemplate.toCreateOptions to return
(tfe.VariableCreateOptions, error) and verify v.Category is one of the allowed
values before converting (use constants for "terraform"/"env" and return a clear
error mentioning the invalid value and the VariableTemplate.Key).
- Around line 199-201: The code currently checks the error by comparing
err.Error() to a string; replace that fragile comparison with a sentinel error
check using errors.Is(err, tfe.ErrResourceNotFound) (or err ==
tfe.ErrResourceNotFound) so wrapped errors are handled correctly: change the
condition to if err != nil && !errors.Is(err, tfe.ErrResourceNotFound) { return
nil, fmt.Errorf("failed to read workspace: %w", err) } and add the "errors"
import if missing; reference symbols: err, tfe.ErrResourceNotFound in tfe.go.
In `@apps/workspace-engine/test/e2e/engine_argocd_template_test.go`:
- Around line 255-263: The code block parsing jobAgentConfigJSON has multiple
statements concatenated on single lines (e.g., json.Unmarshal into
jobAgentConfig, require.NoError, t.Logf, template extraction and checks) causing
gofmt failures; split each statement onto its own line and ensure proper
statement separators so that json.Unmarshal(..., &jobAgentConfig) is followed by
require.NoError(t, err), then t.Logf(...) on its own line, then the template
lookup (template, ok := jobAgentConfig["template"]) and subsequent
require.True/require.NotEmpty/templateStr conversion and t.Logf or
assert.Contains calls are each on separate lines; adjust the sequence around
jobAgentConfig, template, templateStr, require/assert and t.Logf to restore
correct formatting and compile cleanly.
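For the variable-category item in the list above, a minimal sketch of the fail-fast validation could look like this; the constants and the `validateCategory` helper are illustrative names, not code from the PR:

```go
package main

import "fmt"

// Allowed TFC variable categories, mirroring go-tfe's terraform/env categories.
const (
	categoryTerraform = "terraform"
	categoryEnv       = "env"
)

// validateCategory rejects unknown categories before they reach the
// Terraform Cloud API; the key is included for diagnosability.
func validateCategory(key, category string) error {
	switch category {
	case categoryTerraform, categoryEnv:
		return nil
	default:
		return fmt.Errorf("variable %q has invalid category %q (want %q or %q)",
			key, category, categoryTerraform, categoryEnv)
	}
}

func main() {
	fmt.Println(validateCategory("region", "terraform"))
	fmt.Println(validateCategory("region", "ENV"))
}
```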
🧹 Nitpick comments (6)
apps/workspace-engine/test/e2e/engine_argocd_template_test.go (2)
147-151: Consider simplifying map iteration for single job retrieval.

The loop iterates over the map just to get the first (and only expected) job. A more direct approach would improve readability.
♻️ Suggested simplification
```diff
-	var job *oapi.Job
-	for _, j := range pendingJobs {
-		job = j
-		break
-	}
+	// Get the first (and only) pending job
+	var job *oapi.Job
+	for _, j := range pendingJobs {
+		job = j
+		break
+	}
```

Alternatively, if there's a helper to get jobs as a slice, that would be cleaner. The current approach works, and `break` after assignment is a common Go pattern for "get any one element from a map."
242-244: Unchecked type assertion could panic.

The type assertion `template.(string)` on line 244 could panic if the value is not a string. While this would just fail the test, using a checked assertion is safer.

♻️ Safer type assertion pattern

```diff
 template, ok := dv.JobAgentConfig["template"]
 require.True(t, ok, "template field should be present in JobAgentConfig")
-assert.Contains(t, template.(string), "{{.Resource.Name}}", "template should contain resource name placeholder")
+templateStr, ok := template.(string)
+require.True(t, ok, "template field should be a string")
+assert.Contains(t, templateStr, "{{.Resource.Name}}", "template should contain resource name placeholder")
```

The same pattern applies to lines 261 and 294 as well.
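A standalone illustration of the comma-ok pattern recommended above; the `asString` helper is hypothetical, not part of the test file:

```go
package main

import "fmt"

// asString safely extracts a string from a decoded JSON map; the comma-ok
// form returns false instead of panicking when the value has another type.
func asString(m map[string]interface{}, key string) (string, bool) {
	v, ok := m[key]
	if !ok {
		return "", false
	}
	s, ok := v.(string)
	return s, ok
}

func main() {
	cfg := map[string]interface{}{"template": "{{.Resource.Name}}", "count": 3}
	s, ok := asString(cfg, "template")
	fmt.Println(s, ok)
	// "count" holds an int, so the checked assertion reports false, no panic.
	_, ok = asString(cfg, "count")
	fmt.Println(ok)
}
```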
apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe_verifications.go (2)
10-16: Consider adding documentation for the exported type.

`TFERunVerification` is an exported type but lacks a documentation comment explaining its purpose and usage.

📝 Suggested documentation

```diff
+// TFERunVerification manages the verification workflow for Terraform Cloud runs,
+// polling run status and reporting success/failure based on the run outcome.
 type TFERunVerification struct {
 	verifications *verification.Manager
 	job           *oapi.Job
 	address       string
 	token         string
 	runID         string
 }
```

As per coding guidelines: "Write comments that explain why, document complex logic/algorithms, provide non-obvious context, include TODO/FIXME, and document exported functions/types/methods."
48-61: Consider extracting magic numbers as constants for clarity.

The polling configuration uses hardcoded values (100 iterations, 60-second intervals) that could benefit from being named constants for self-documentation and easier tuning.

♻️ Suggested refactor with named constants

```diff
+const (
+	// TFE run verification polls up to 100 times at 60-second intervals (~100 min max)
+	defaultVerificationCount    = 100
+	defaultVerificationInterval = 60
+	defaultSuccessThreshold     = 1
+	defaultFailureThreshold     = 1
+)
+
 func (v *TFERunVerification) buildMetricSpec(provider oapi.MetricProvider) oapi.VerificationMetricSpec {
 	failureCondition := "result.status == 'canceled' || result.status == 'discarded' || result.status == 'errored'"
-	successThreshold := 1
-	failureThreshold := 1
+	successThreshold := defaultSuccessThreshold
+	failureThreshold := defaultFailureThreshold
 	return oapi.VerificationMetricSpec{
-		Count:           100,
-		IntervalSeconds: 60,
+		Count:           defaultVerificationCount,
+		IntervalSeconds: defaultVerificationInterval,
```

apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go (2)
366-374: Kafka producer created per event is inefficient.

`getKafkaProducer()` creates a new producer for each event, which is immediately closed after publishing. This adds connection overhead per message. Consider caching the producer at the `TFE` struct level or using a shared producer pool.

This pattern would be problematic at scale. For now it works, but consider initializing the producer once during `TFE` construction and reusing it across events.
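One way to cache the producer, as suggested, is `sync.Once` on the struct. This is a sketch only: the `producer` type below is a stand-in counter, not a real Kafka client, and the `TFE` struct here is a simplified mock of the dispatcher:

```go
package main

import (
	"fmt"
	"sync"
)

// producer stands in for a Kafka producer; dialCount tracks how often
// the expensive constructor actually runs.
type producer struct{ id int }

var dialCount int

func newProducer() *producer {
	dialCount++
	return &producer{id: dialCount}
}

// TFE caches the producer so every event reuses one connection
// instead of dialing per message.
type TFE struct {
	once     sync.Once
	producer *producer
}

func (t *TFE) getKafkaProducer() *producer {
	t.once.Do(func() { t.producer = newProducer() })
	return t.producer
}

func main() {
	t := &TFE{}
	for i := 0; i < 5; i++ {
		t.getKafkaProducer()
	}
	// The constructor ran once despite five events.
	fmt.Println(dialCount)
}
```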
69-71: Consider adding documentation for exported constructor and types.

`NewTFE` and the exported template types (`WorkspaceTemplate`, `VariableTemplate`, `VCSRepoTemplate`) lack documentation comments. While not blocking, these would help consumers understand the API.

As per coding guidelines: "document exported functions/types/methods."
```go
go func() {
	ctx := context.WithoutCancel(ctx)
	client, err := t.getClient(address, token)
	if err != nil {
		t.sendJobFailureEvent(dispatchCtx, fmt.Sprintf("failed to create Terraform Cloud client: %s", err.Error()))
		return
	}

	targetWorkspace, err := t.upsertWorkspace(ctx, client, organization, workspace)
	if err != nil {
		t.sendJobFailureEvent(dispatchCtx, fmt.Sprintf("failed to upsert workspace: %s", err.Error()))
		return
	}

	if len(workspace.Variables) > 0 {
		if err := t.syncVariables(ctx, client, targetWorkspace.ID, workspace.Variables); err != nil {
			t.sendJobFailureEvent(dispatchCtx, fmt.Sprintf("failed to sync variables: %s", err.Error()))
			return
		}
	}

	run, err := t.createRun(ctx, client, targetWorkspace.ID, dispatchCtx.Job.Id)
	if err != nil {
		t.sendJobFailureEvent(dispatchCtx, fmt.Sprintf("failed to create run: %s", err.Error()))
		return
	}

	verification := newTFERunVerification(t.verifications, dispatchCtx.Job, address, token, run.ID)
	if err := verification.StartVerification(ctx); err != nil {
		t.sendJobFailureEvent(dispatchCtx, fmt.Sprintf("failed to start verification: %s", err.Error()))
		return
	}

	t.sendJobUpdateEvent(address, organization, targetWorkspace.Name, run, dispatchCtx)
}()
```
Goroutine ignores errors from sendJobFailureEvent and lacks panic recovery.
The goroutine calls sendJobFailureEvent but ignores its return value. If event publishing fails, the job failure is silently lost with no logging or alerting. Additionally, any panic in this goroutine would crash silently.
🛡️ Proposed fix to handle errors and add recovery
```diff
 go func() {
+	defer func() {
+		if r := recover(); r != nil {
+			// Log panic - consider using structured logging
+			fmt.Printf("panic in TFE dispatch goroutine: %v\n", r)
+		}
+	}()
+
 	ctx := context.WithoutCancel(ctx)
 	client, err := t.getClient(address, token)
 	if err != nil {
-		t.sendJobFailureEvent(dispatchCtx, fmt.Sprintf("failed to create Terraform Cloud client: %s", err.Error()))
+		if sendErr := t.sendJobFailureEvent(dispatchCtx, fmt.Sprintf("failed to create Terraform Cloud client: %s", err.Error())); sendErr != nil {
+			// Log the failure to send the event
+			fmt.Printf("failed to send job failure event: %v (original error: %v)\n", sendErr, err)
+		}
 		return
 	}
```

Apply similar error handling to all sendJobFailureEvent and sendJobUpdateEvent calls within the goroutine.
🤖 Prompt for AI Agents
In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go` around
lines 95 - 129, The goroutine launching the TFE workflow ignores errors from
sendJobFailureEvent/sendJobUpdateEvent and has no panic recovery; modify the
anonymous goroutine to (1) add a deferred recover handler (defer func() { if r
:= recover() { log the panic and call sendJobFailureEvent with a descriptive
message/error } }) so panics are caught and reported, and (2) check and handle
the error return values of every sendJobFailureEvent and sendJobUpdateEvent
invocation (e.g., after calling sendJobFailureEvent(...) capture err and if
non-nil attempt a fallback log and/or retry/report) so failures to publish
events are not silently dropped; apply these changes around calls in the
goroutine
(getClient/upsertWorkspace/syncVariables/createRun/verification.StartVerification/sendJobUpdateEvent).
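The recovery pattern proposed above can be demonstrated in isolation. This sketch uses a hypothetical `dispatch` helper and a plain callback in place of `sendJobFailureEvent`; the WaitGroup exists only so the demo can observe the result:

```go
package main

import (
	"fmt"
	"sync"
)

// dispatch runs work in a goroutine with a deferred recover, reporting any
// panic through report instead of crashing the process silently.
func dispatch(work func(), report func(msg string)) *sync.WaitGroup {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		defer func() {
			if r := recover(); r != nil {
				report(fmt.Sprintf("panic in dispatch goroutine: %v", r))
			}
		}()
		work()
	}()
	return &wg
}

func main() {
	var got string
	wg := dispatch(func() { panic("boom") }, func(msg string) { got = msg })
	wg.Wait()
	// The panic was caught and surfaced through the callback.
	fmt.Println(got)
}
```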
```go
}` // Parse it like the CLI does
	var jobAgentConfig map[string]interface{}
	err := json.Unmarshal([]byte(jobAgentConfigJSON), &jobAgentConfig)
	require.NoError(t, err) t.Logf("Parsed jobAgentConfig from CLI: %+v", jobAgentConfig) // Check template is present
	template, ok := jobAgentConfig["template"]
	require.True(t, ok, "template field should be present")
	require.NotEmpty(t, template, "template should not be empty") templateStr := template.(string)
	t.Logf("Template length: %d", len(templateStr))
	assert.Contains(t, templateStr, "{{.Resource.Name}}-console", "template should contain the expected placeholder")
```
Formatting issues: missing newlines between statements.
Lines 255-263 have multiple statements concatenated on the same line, which violates gofmt compliance. This appears to be a copy-paste or merge artifact.
🔧 Proposed fix to add proper newlines
```diff
-}` // Parse it like the CLI does
-	var jobAgentConfig map[string]interface{}
-	err := json.Unmarshal([]byte(jobAgentConfigJSON), &jobAgentConfig)
-	require.NoError(t, err) t.Logf("Parsed jobAgentConfig from CLI: %+v", jobAgentConfig) // Check template is present
-	template, ok := jobAgentConfig["template"]
-	require.True(t, ok, "template field should be present")
-	require.NotEmpty(t, template, "template should not be empty") templateStr := template.(string)
+}`
+
+	// Parse it like the CLI does
+	var jobAgentConfig map[string]interface{}
+	err := json.Unmarshal([]byte(jobAgentConfigJSON), &jobAgentConfig)
+	require.NoError(t, err)
+
+	t.Logf("Parsed jobAgentConfig from CLI: %+v", jobAgentConfig)
+
+	// Check template is present
+	template, ok := jobAgentConfig["template"]
+	require.True(t, ok, "template field should be present")
+	require.NotEmpty(t, template, "template should not be empty")
+
+	templateStr := template.(string)
```

📝 Committable suggestion
```go
	// Parse it like the CLI does
	var jobAgentConfig map[string]interface{}
	err := json.Unmarshal([]byte(jobAgentConfigJSON), &jobAgentConfig)
	require.NoError(t, err)

	t.Logf("Parsed jobAgentConfig from CLI: %+v", jobAgentConfig)

	// Check template is present
	template, ok := jobAgentConfig["template"]
	require.True(t, ok, "template field should be present")
	require.NotEmpty(t, template, "template should not be empty")

	templateStr := template.(string)
	t.Logf("Template length: %d", len(templateStr))
	assert.Contains(t, templateStr, "{{.Resource.Name}}-console", "template should contain the expected placeholder")
```
🤖 Prompt for AI Agents
In `@apps/workspace-engine/test/e2e/engine_argocd_template_test.go` around lines
255 - 263, The code block parsing jobAgentConfigJSON has multiple statements
concatenated on single lines (e.g., json.Unmarshal into jobAgentConfig,
require.NoError, t.Logf, template extraction and checks) causing gofmt failures;
split each statement onto its own line and ensure proper statement separators so
that json.Unmarshal(..., &jobAgentConfig) is followed by require.NoError(t,
err), then t.Logf(...) on its own line, then the template lookup (template, ok
:= jobAgentConfig["template"]) and subsequent
require.True/require.NotEmpty/templateStr conversion and t.Logf or
assert.Contains calls are each on separate lines; adjust the sequence around
jobAgentConfig, template, templateStr, require/ assert and t.Logf to restore
correct formatting and compile cleanly.
@adityachoudhari26 can you review this? Also I don't think we need a post-verification.
Actionable comments posted: 4
♻️ Duplicate comments (1)
apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go (1)
87-116: Goroutine still lacks panic recovery and ignores `sendJobEvent` errors.

The fire-and-forget goroutine discards `sendJobEvent` return values at lines 91, 97, 103, 110, and 114. A panic anywhere in this goroutine will crash silently. This was flagged in a prior review.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go` around lines 87 - 116, The goroutine that calls t.getClient, t.upsertWorkspace, t.syncVariables, t.createRun, t.sendJobEvent and t.pollRunStatus must be made resilient: wrap the goroutine body with a defer that recovers from panics and sends a failure job event if a panic occurs, and check/handle the returned error from every t.sendJobEvent call (at the spots after client creation, upsertWorkspace, syncVariables, createRun and before pollRunStatus) instead of discarding it; ensure sendJobEvent errors are logged/propagated (e.g., call t.sendJobEvent and if it returns an error, log/send another failure event and return), and keep the rest of the control flow (calling t.pollRunStatus) unchanged.
🧹 Nitpick comments (2)
apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go (1)
196-202: `RunPending` in `runStatusMap` is dead code.

`mapRunStatus` handles `tfe.RunPending` in its `switch` (line 240) before consulting the map, so the entry at line 202 is never reached.

♻️ Remove the dead entry

```diff
 var runStatusMap = map[tfe.RunStatus]struct {
 	status  oapi.JobStatus
 	message string
 }{
-	// Pending
-	tfe.RunPending: {oapi.JobStatusPending, "Run pending in queue"},
-
 	// In-progress phases
 	tfe.RunFetching: {oapi.JobStatusInProgress, "Fetching configuration..."},
```

Also applies to: 239-241
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go` around lines 196 - 202, The runStatusMap contains a dead entry for tfe.RunPending because mapRunStatus already handles tfe.RunPending before consulting the map; remove the tfe.RunPending element from the runStatusMap declaration (and any other duplicated entries in the same map) so the mapping only contains statuses that mapRunStatus does not already short-circuit; locate the map by the runStatusMap variable and the mapRunStatus function to make this change.

apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go (1)

93-94: Non-trivial computation is discarded — consider short-circuiting until the TODO is addressed.

The function queries all workspace resources (line 68), compiles a CEL expression (line 61), and evaluates it per resource (lines 77-91), all to produce `resourceMatchedIds`, which is then thrown away. This work runs on every reconciliation cycle. Consider returning early or skipping the CEL evaluation loop until the association update is implemented, to avoid wasted DB and CPU overhead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go` around lines 93 - 94, The computed resourceMatchedIds and the CEL evaluation loop are doing non-trivial DB/CPU work only to be discarded; short-circuit that work until the TODO (“use resourceMatchedIds to update deployment-resource associations”) is implemented. Modify the reconciliation code so that before compiling/evaluating the CEL expression and before calling the workspace resources query, you check a guard (e.g., a boolean flag or feature check) and return/skip early if the association update is not yet implemented; remove the placeholder "_ = resourceMatchedIds" and either wire resourceMatchedIds into the actual update logic or skip the entire CEL/evaluation loop and DB query until the update logic (the TODO) is added. Ensure you reference the same symbols (resourceMatchedIds, the CEL compile/evaluate block) so the change avoids running that loop unnecessarily.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go`:
- Around line 329-333: The code sets eventJob.StartedAt and appends
oapi.JobUpdateEventFieldsToUpdateStartedAt every time mapRunStatus returns
InProgress, overwriting the original start time; change the logic in the block
that handles status == oapi.JobStatusInProgress (the snippet that sets
eventJob.StartedAt and appends to fields) to only set StartedAt and append the
field when eventJob.StartedAt is nil (i.e., it has not been set yet) so the
timestamp is recorded only on the first transition to InProgress.
- Around line 300-308: The URL-prefix logic in the block that builds runUrl and
workspaceUrl incorrectly prepends "https://" even when address already contains
a scheme (e.g., "http://"), so update the check in the code that constructs
runUrl and workspaceUrl (using variables address, organization, wsName and
run.ID in tfe.go) to detect an existing scheme first — for example, call
url.Parse(address) and if parsedURL.Scheme is empty then prefix "https://",
otherwise use address as-is (or, alternatively, use strings.Contains(address,
"://") to decide); ensure both runUrl and workspaceUrl apply the same
scheme-aware logic to avoid producing malformed URLs like "https://http://...".
- Around line 153-184: The pollRunStatus loop in pollRunStatus currently logs
all ReadWithOptions errors and continues forever, leaking a goroutine on
persistent API failures; modify the loop to detect terminal error types
(specifically check for tfe.ErrResourceNotFound returned by Runs.ReadWithOptions
and treat it as a terminal condition by mapping to an ExternalRunNotFound job
status via mapRunStatus/sendJobEvent and then return) and also add a
consecutive-error counter (e.g., const maxConsecutiveErrors = 10) that
increments on non-terminal errors and resets on a successful run read; when the
counter exceeds the threshold, send a final failed job event with an explanatory
message via sendJobEvent and exit the goroutine, while preserving existing
behavior for successful reads and normal terminal statuses determined by
isTerminalJobStatus.
- Around line 441-450: The syncVariables function currently calls
client.Variables.List(ctx, workspaceID, nil) which only returns the first page;
update syncVariables to paginate using tfe.VariableListOptions (incrementing
PageNumber) and repeatedly call client.Variables.List until CurrentPage >=
TotalPages, appending each page's Items into existingVars (and building
existingByKey) so all workspace variables are collected before
comparing/creating; keep using the Variables.List endpoint (not ListAll) to
avoid inherited varsets.
---
Duplicate comments:
In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go`:
- Around line 87-116: The goroutine that calls t.getClient, t.upsertWorkspace,
t.syncVariables, t.createRun, t.sendJobEvent and t.pollRunStatus must be made
resilient: wrap the goroutine body with a defer that recovers from panics and
sends a failure job event if a panic occurs, and check/handle the returned error
from every t.sendJobEvent call (at the spots after client creation,
upsertWorkspace, syncVariables, createRun and before pollRunStatus) instead of
discarding it; ensure sendJobEvent errors are logged/propagated (e.g., call
t.sendJobEvent and if it returns an error, log/send another failure event and
return), and keep the rest of the control flow (calling t.pollRunStatus)
unchanged.
---
Nitpick comments:
In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go`:
- Around line 196-202: The runStatusMap contains a dead entry for tfe.RunPending
because mapRunStatus already handles tfe.RunPending before consulting the map;
remove the tfe.RunPending element from the runStatusMap declaration (and any
other duplicated entries in the same map) so the mapping only contains statuses
that mapRunStatus does not already short-circuit; locate the map by the
runStatusMap variable and the mapRunStatus function to make this change.
In
`@apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go`:
- Around line 93-94: The computed resourceMatchedIds and the CEL evaluation loop
are doing non-trivial DB/CPU work only to be discarded; short-circuit that work
until the TODO (“use resourceMatchedIds to update deployment-resource
associations”) is implemented. Modify the reconciliation code so that before
compiling/evaluating the CEL expression and before calling the workspace
resources query, you check a guard (e.g., a boolean flag or feature check) and
return/skip early if the association update is not yet implemented; remove the
placeholder "_ = resourceMatchedIds" and either wire resourceMatchedIds into the
actual update logic or skip the entire CEL/evaluation loop and DB query until
the update logic (the TODO) is added. Ensure you reference the same symbols
(resourceMatchedIds, the CEL compile/evaluate block) so the change avoids
running that loop unnecessarily.
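The pagination loop requested for `syncVariables` can be sketched against a fake client. The `variable`/`variableList` types below only mimic the shape of go-tfe's list response (`Items`, `CurrentPage`, `TotalPages`); the `fakeClient` and `listAllVariables` names are illustrative:

```go
package main

import "fmt"

// variable and variableList mimic the shape of a go-tfe Variables.List
// response (Items plus pagination metadata); the client here is a fake.
type variable struct{ Key string }

type variableList struct {
	Items       []*variable
	CurrentPage int
	TotalPages  int
}

type fakeClient struct{ pages [][]*variable }

func (c *fakeClient) list(page int) *variableList {
	return &variableList{
		Items:       c.pages[page-1],
		CurrentPage: page,
		TotalPages:  len(c.pages),
	}
}

// listAllVariables keeps requesting pages until CurrentPage reaches
// TotalPages, collecting every item along the way.
func listAllVariables(c *fakeClient) []*variable {
	var all []*variable
	for page := 1; ; page++ {
		resp := c.list(page)
		all = append(all, resp.Items...)
		if resp.CurrentPage >= resp.TotalPages {
			return all
		}
	}
}

func main() {
	c := &fakeClient{pages: [][]*variable{
		{{Key: "a"}, {Key: "b"}},
		{{Key: "c"}},
	}}
	fmt.Println(len(listAllVariables(c)))
}
```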
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- apps/workspace-engine/pkg/reconcile/memory/memory.go
- apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go
- apps/workspace-engine/svc/controllers/deploymentresourceselectoreval/controller.go
```go
func (t *TFE) pollRunStatus(ctx context.Context, client *tfe.Client, runID string, job *oapi.Job, address, organization, wsName string) {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()

	var lastStatus tfe.RunStatus
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			run, err := client.Runs.ReadWithOptions(ctx, runID, &tfe.RunReadOptions{
				Include: []tfe.RunIncludeOpt{tfe.RunPlan},
			})
			if err != nil {
				log.Error("Failed to poll TFC run status", "runId", runID, "error", err)
				continue
			}
			if run.Status == lastStatus {
				continue
			}
			lastStatus = run.Status

			jobStatus, message := mapRunStatus(run)
			if err := t.sendJobEvent(job, jobStatus, message, run, address, organization, wsName); err != nil {
				log.Error("Failed to send job event", "runId", runID, "error", err)
			}

			if isTerminalJobStatus(jobStatus) {
				return
			}
		}
	}
}
```
Unbounded polling on persistent API errors — leaked goroutine for unreachable/deleted runs.
If the TFC run is deleted or the token becomes invalid, Runs.ReadWithOptions will return an error on every tick forever. The goroutine never exits because errors are only logged and there's no terminal-error detection or retry cap.
Consider:

- Checking for `tfe.ErrResourceNotFound` and treating it as terminal (map to `ExternalRunNotFound`).
- Adding a consecutive-error counter and failing the job after a threshold (e.g., 10 consecutive failures).
🐛 Proposed sketch for error bounding
```diff
 func (t *TFE) pollRunStatus(ctx context.Context, client *tfe.Client, runID string, job *oapi.Job, address, organization, wsName string) {
 	ticker := time.NewTicker(15 * time.Second)
 	defer ticker.Stop()
+	const maxConsecutiveErrors = 10
+	consecutiveErrors := 0
 	var lastStatus tfe.RunStatus
 	for {
 		select {
 		case <-ctx.Done():
 			return
 		case <-ticker.C:
 			run, err := client.Runs.ReadWithOptions(ctx, runID, &tfe.RunReadOptions{
 				Include: []tfe.RunIncludeOpt{tfe.RunPlan},
 			})
 			if err != nil {
+				if errors.Is(err, tfe.ErrResourceNotFound) {
+					t.sendJobEvent(job, oapi.JobStatusExternalRunNotFound, "TFC run not found", nil, address, organization, wsName)
+					return
+				}
+				consecutiveErrors++
+				if consecutiveErrors >= maxConsecutiveErrors {
+					t.sendJobEvent(job, oapi.JobStatusFailure, fmt.Sprintf("Giving up after %d consecutive poll errors: %s", consecutiveErrors, err), nil, address, organization, wsName)
+					return
+				}
 				log.Error("Failed to poll TFC run status", "runId", runID, "error", err)
 				continue
 			}
+			consecutiveErrors = 0
 			if run.Status == lastStatus {
```

📝 Committable suggestion
```go
func (t *TFE) pollRunStatus(ctx context.Context, client *tfe.Client, runID string, job *oapi.Job, address, organization, wsName string) {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	const maxConsecutiveErrors = 10
	consecutiveErrors := 0
	var lastStatus tfe.RunStatus
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			run, err := client.Runs.ReadWithOptions(ctx, runID, &tfe.RunReadOptions{
				Include: []tfe.RunIncludeOpt{tfe.RunPlan},
			})
			if err != nil {
				if errors.Is(err, tfe.ErrResourceNotFound) {
					t.sendJobEvent(job, oapi.JobStatusExternalRunNotFound, "TFC run not found", nil, address, organization, wsName)
					return
				}
				consecutiveErrors++
				if consecutiveErrors >= maxConsecutiveErrors {
					t.sendJobEvent(job, oapi.JobStatusFailure, fmt.Sprintf("Giving up after %d consecutive poll errors: %s", consecutiveErrors, err), nil, address, organization, wsName)
					return
				}
				log.Error("Failed to poll TFC run status", "runId", runID, "error", err)
				continue
			}
			consecutiveErrors = 0
			if run.Status == lastStatus {
				continue
			}
			lastStatus = run.Status
			jobStatus, message := mapRunStatus(run)
			if err := t.sendJobEvent(job, jobStatus, message, run, address, organization, wsName); err != nil {
				log.Error("Failed to send job event", "runId", runID, "error", err)
			}
			if isTerminalJobStatus(jobStatus) {
				return
			}
		}
	}
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go` around
lines 153 - 184, The pollRunStatus loop in pollRunStatus currently logs all
ReadWithOptions errors and continues forever, leaking a goroutine on persistent
API failures; modify the loop to detect terminal error types (specifically check
for tfe.ErrResourceNotFound returned by Runs.ReadWithOptions and treat it as a
terminal condition by mapping to an ExternalRunNotFound job status via
mapRunStatus/sendJobEvent and then return) and also add a consecutive-error
counter (e.g., const maxConsecutiveErrors = 10) that increments on non-terminal
errors and resets on a successful run read; when the counter exceeds the
threshold, send a final failed job event with an explanatory message via
sendJobEvent and exit the goroutine, while preserving existing behavior for
successful reads and normal terminal statuses determined by isTerminalJobStatus.
```go
	if run != nil && address != "" && organization != "" && wsName != "" {
		runUrl := fmt.Sprintf("%s/app/%s/workspaces/%s/runs/%s", address, organization, wsName, run.ID)
		if !strings.HasPrefix(runUrl, "https://") {
			runUrl = "https://" + runUrl
		}
		workspaceUrl := fmt.Sprintf("%s/app/%s/workspaces/%s", address, organization, wsName)
		if !strings.HasPrefix(workspaceUrl, "https://") {
			workspaceUrl = "https://" + workspaceUrl
		}
```
URL prefix logic breaks for http:// addresses.
If address is "http://my-tfe.internal", the fmt.Sprintf produces "http://my-tfe.internal/app/..." which doesn't start with "https://", so it gets prefixed to "https://http://my-tfe.internal/app/..." — a malformed URL. Private TFE instances may use HTTP.
Use strings.Contains(address, "://") or url.Parse to check for an existing scheme instead.
🐛 Proposed fix

```diff
-    runUrl := fmt.Sprintf("%s/app/%s/workspaces/%s/runs/%s", address, organization, wsName, run.ID)
-    if !strings.HasPrefix(runUrl, "https://") {
-        runUrl = "https://" + runUrl
-    }
-    workspaceUrl := fmt.Sprintf("%s/app/%s/workspaces/%s", address, organization, wsName)
-    if !strings.HasPrefix(workspaceUrl, "https://") {
-        workspaceUrl = "https://" + workspaceUrl
-    }
+    normalizedAddr := address
+    if !strings.Contains(normalizedAddr, "://") {
+        normalizedAddr = "https://" + normalizedAddr
+    }
+    runUrl := fmt.Sprintf("%s/app/%s/workspaces/%s/runs/%s", normalizedAddr, organization, wsName, run.ID)
+    workspaceUrl := fmt.Sprintf("%s/app/%s/workspaces/%s", normalizedAddr, organization, wsName)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
	if run != nil && address != "" && organization != "" && wsName != "" {
		normalizedAddr := address
		if !strings.Contains(normalizedAddr, "://") {
			normalizedAddr = "https://" + normalizedAddr
		}
		runUrl := fmt.Sprintf("%s/app/%s/workspaces/%s/runs/%s", normalizedAddr, organization, wsName, run.ID)
		workspaceUrl := fmt.Sprintf("%s/app/%s/workspaces/%s", normalizedAddr, organization, wsName)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go` around
lines 300 - 308, The URL-prefix logic in the block that builds runUrl and
workspaceUrl incorrectly prepends "https://" even when address already contains
a scheme (e.g., "http://"), so update the check in the code that constructs
runUrl and workspaceUrl (using variables address, organization, wsName and
run.ID in tfe.go) to detect an existing scheme first — for example, call
url.Parse(address) and if parsedURL.Scheme is empty then prefix "https://",
otherwise use address as-is (or, alternatively, use strings.Contains(address,
"://") to decide); ensure both runUrl and workspaceUrl apply the same
scheme-aware logic to avoid producing malformed URLs like "https://http://...".
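For illustration, the `url.Parse` variant of that scheme check can be sketched as below. The helper name `normalizeAddress` is not from the PR; it is a self-contained stand-in for the inline logic:

```go
package main

import (
	"fmt"
	"net/url"
)

// normalizeAddress prepends "https://" only when addr carries no scheme,
// so "http://" addresses for private TFE instances pass through untouched.
func normalizeAddress(addr string) string {
	if u, err := url.Parse(addr); err == nil && u.Scheme != "" {
		return addr
	}
	return "https://" + addr
}

func main() {
	fmt.Println(normalizeAddress("app.terraform.io"))       // https://app.terraform.io
	fmt.Println(normalizeAddress("http://my-tfe.internal")) // http://my-tfe.internal
}
```

One caveat: a bare `host:port` string can parse its host as a scheme, so the simpler `strings.Contains(address, "://")` check in the proposed fix is a reasonable alternative when ports without schemes are expected.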
```go
	// Set startedAt for first non-pending transition
	if status == oapi.JobStatusInProgress {
		eventJob.StartedAt = &now
		fields = append(fields, oapi.JobUpdateEventFieldsToUpdateStartedAt)
	}
```
startedAt is overwritten on every InProgress transition, not just the first.
Each time mapRunStatus returns InProgress (which happens for ~20 different TFC states), startedAt is reset to now and included in fieldsToUpdate. The job's startedAt should reflect when it first became in-progress, not the latest state change.
Guard this behind a check on the job's existing StartedAt:
🔧 Proposed fix

```diff
-	if status == oapi.JobStatusInProgress {
+	if status == oapi.JobStatusInProgress && job.StartedAt == nil {
 		eventJob.StartedAt = &now
 		fields = append(fields, oapi.JobUpdateEventFieldsToUpdateStartedAt)
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go` around
lines 329 - 333, The code sets eventJob.StartedAt and appends
oapi.JobUpdateEventFieldsToUpdateStartedAt every time mapRunStatus returns
InProgress, overwriting the original start time; change the logic in the block
that handles status == oapi.JobStatusInProgress (the snippet that sets
eventJob.StartedAt and appends to fields) to only set StartedAt and append the
field when eventJob.StartedAt is nil (i.e., it has not been set yet) so the
timestamp is recorded only on the first transition to InProgress.
```go
func (t *TFE) syncVariables(ctx context.Context, client *tfe.Client, workspaceID string, desiredVars []VariableTemplate) error {
	existingVars, err := client.Variables.List(ctx, workspaceID, nil)
	if err != nil {
		return fmt.Errorf("failed to list variables: %w", err)
	}

	existingByKey := make(map[string]*tfe.Variable)
	for _, v := range existingVars.Items {
		existingByKey[v.Key] = v
	}
```
🧩 Analysis chain

🏁 Script executed:

```shell
# Check go-tfe version in go.mod
cat apps/workspace-engine/go.mod | grep "go-tfe"
```

Repository: ctrlplanedev/ctrlplane
Length of output: 102

🏁 Script executed:

```shell
# Read the actual code around lines 441-450 with more context
cat -n apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go | sed -n '430,470p'
```

Length of output: 1550

🏁 Script executed:

```shell
# Check if there are other usages of Variables.List or pagination patterns in the codebase
rg "Variables\.List|ListAll" apps/workspace-engine/ -A 3 -B 3
```

Length of output: 4349
🌐 Web query:

> go-tfe github Variables.List pagination ListAll method

💡 Result:

In hashicorp/go-tfe, `Variables.ListAll` is not a "fetch all pages" helper. It's a different endpoint that returns workspace vars plus vars inherited from variable sets:

- `List(...)` → `GET workspaces/<id>/vars`
- `ListAll(...)` → `GET workspaces/<id>/all-vars` (includes varset-inherited vars) [1]

Pagination

Both `List` and `ListAll` accept `*VariableListOptions`, which embeds `ListOptions` with `page[number]` (`PageNumber`) and `page[size]` (`PageSize`) [2].

So you paginate by looping and incrementing `PageNumber` until `Pagination.CurrentPage == Pagination.TotalPages` (or `NextPage == 0`, depending on what the API returns), accumulating `VariableList.Items` [2].

```go
opts := &tfe.VariableListOptions{
	ListOptions: tfe.ListOptions{PageNumber: 1, PageSize: 100},
}
var all []*tfe.Variable
for {
	vl, err := client.Variables.ListAll(ctx, workspaceID, opts)
	if err != nil {
		return err
	}
	all = append(all, vl.Items...)
	if vl.Pagination == nil || vl.CurrentPage >= vl.TotalPages {
		break
	}
	opts.PageNumber++
}
```

Key point: "ListAll" = include varsets, not auto-pagination. [1]

Sources: [1] [2]
Implement pagination for Variables.List() to fetch all variables.
The call to client.Variables.List(ctx, workspaceID, nil) only retrieves the first page of results (default ~20 items). Workspaces with more variables will have missing entries in existingByKey, causing spurious Create calls for already-existing variables that fail with API conflict errors.
Paginate through all pages by looping with VariableListOptions and incrementing PageNumber until CurrentPage >= TotalPages. Note: ListAll() is not a pagination helper—it returns a different endpoint that includes varset-inherited variables. Use the appropriate method for your use case and paginate accordingly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/workspace-engine/pkg/workspace/jobagents/terraformcloud/tfe.go` around
lines 441 - 450, The syncVariables function currently calls
client.Variables.List(ctx, workspaceID, nil) which only returns the first page;
update syncVariables to paginate using tfe.VariableListOptions (incrementing
PageNumber) and repeatedly call client.Variables.List until CurrentPage >=
TotalPages, appending each page's Items into existingVars (and building
existingByKey) so all workspace variables are collected before
comparing/creating; keep using the Variables.List endpoint (not ListAll) to
avoid inherited varsets.
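The accumulate-until-last-page loop the review asks for can be exercised without a live TFC client. The sketch below uses a stubbed page lister; the `Page` type and both helpers are illustrative stand-ins, not go-tfe API:

```go
package main

import "fmt"

// Page is a stand-in for go-tfe's paginated list envelope.
type Page struct {
	Items       []string
	CurrentPage int
	TotalPages  int
}

// listPage simulates one client.Variables.List call for a given page number.
func listPage(all []string, pageSize, pageNumber int) Page {
	total := (len(all) + pageSize - 1) / pageSize
	start := (pageNumber - 1) * pageSize
	end := start + pageSize
	if end > len(all) {
		end = len(all)
	}
	return Page{Items: all[start:end], CurrentPage: pageNumber, TotalPages: total}
}

// listAllPages accumulates items across pages with the suggested loop shape:
// request the next page until CurrentPage reaches TotalPages.
func listAllPages(all []string, pageSize int) []string {
	var out []string
	for page := 1; ; page++ {
		p := listPage(all, pageSize, page)
		out = append(out, p.Items...)
		if p.CurrentPage >= p.TotalPages {
			break
		}
	}
	return out
}

func main() {
	vars := make([]string, 45)
	for i := range vars {
		vars[i] = fmt.Sprintf("var-%d", i)
	}
	fmt.Println(len(listAllPages(vars, 20))) // 45
}
```

With a default page size of ~20, a workspace with 45 variables would otherwise lose 25 entries, which is exactly the spurious-Create failure mode described above.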
Summary
Refactor the TFE job agent to own the run lifecycle directly instead of delegating to a verification provider.
TFE has definitive run states and doesn't need continuous reconciliation — the dispatcher now polls TFC every 15s, maps ~25 run states to ctrlplane job statuses, and publishes events on each state transition.
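The state collapse described above can be sketched as a switch over run states. The groupings here are illustrative only (the real dispatcher covers roughly 25 go-tfe constants, not the plain strings used below):

```go
package main

import "fmt"

// mapRunStatus collapses a TFC run state into a coarse job status plus a
// human-readable message. Groupings are a sketch, not the PR's exact table.
func mapRunStatus(status string) (string, string) {
	switch status {
	case "pending", "queuing", "fetching":
		return "pending", "Run is queued in Terraform Cloud"
	case "applied", "planned_and_finished":
		return "successful", "Run completed"
	case "errored":
		return "failure", "Run errored"
	case "canceled", "force_canceled", "discarded":
		return "cancelled", "Run was canceled or discarded"
	default:
		return "inProgress", fmt.Sprintf("Run is %s", status)
	}
}

func main() {
	status, msg := mapRunStatus("pending")
	fmt.Println(status, "-", msg) // pending - Run is queued in Terraform Cloud
}
```

Note the first case reflects the PR's behavioral change: a pending TFC run maps to `pending` rather than `inProgress`.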
Changes

- Deletes `tfe_verifications.go` entirely; the TFE dispatcher handles its own polling and state mapping
- Implements the `Restorable` interface so jobs with an `ExternalId` (TFC run ID) resume polling after restart; orphaned jobs are marked as failed
- Maps `RunPending` to `pending` instead of `inProgress`
- Adds a `Message` field to the `JobSummary` response so the release targets page can show status context

Test plan

- `tfe_test.go` (`go test ./pkg/workspace/jobagents/terraformcloud/...`)
- `pnpm build` passes for `@ctrlplane/workspace-engine`

Summary by CodeRabbit
New Features
Tests