fix(e2e): Fix 409 on label, dispatch ref, cleanup gaps for nightly/push/release runs #88
yasserfaraazkhan merged 7 commits into master
Conversation
Note: Reviews paused. This branch is under active development, so CodeRabbit has automatically paused this review to avoid overwhelming you with review comments due to an influx of new commits.
📝 Walkthrough

Adds tracking-key propagation for E2E workflows, per-PR+platform in-memory deduplication and generation-based cancellation for in-flight provisioning, a configurable non-PR instance max-age with periodic stale-instance cleanup, and a label-triggered E2E reset path.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Webhook as "Webhook (Push/PR)"
    participant Server as "Matterwick Server"
    participant GH as "GitHub API / Actions"
    participant Workflow as "Workflow Run"
    participant Cleanup as "Cleanup Handler"
    Webhook->>Server: Receive push or PR (repo, branch, sha)
    Server->>Server: Compute trackingKey (repo-push-branch-sha)
    Server->>Server: Check e2eInProgress (dedupe per PR+platform)
    Server->>GH: Dispatch workflow (ref=branch) with input mw_tracking_key
    GH->>Workflow: Start workflow run
    Workflow->>Server: Send workflow_run completion (includes mw_tracking_key)
    Server->>Server: Lookup & remove trackingKey under lock
    Server->>Cleanup: Destroy associated E2E instances
    Server->>Cleanup: Fallback: scan & destroy by SHA if mw_tracking_key missing
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/e2e_tests.go`:
- Around line 100-115: The current in-progress guard uses s.e2eInProgress keyed
only by repo/pr (variable key), causing label-specific runs to be dropped;
update the locking logic to distinguish platforms by including the effective
test platform in the key (e.g., build a composite key like repo/pr/platform or
repo/pr/labelVariant) and use that composite when checking/setting
s.e2eInProgress, s.e2eInProgressLock, and when deleting in the deferred cleanup;
alternatively, if you prefer one PR-wide slot, change the map to track a set of
pending platforms per PR (e.g., map[string]map[string]bool) and enqueue the
incoming platform when it differs so it will run after the current provisioning
completes.
- Around line 37-42: The function e2eUniqueSuffix currently truncates
time.Now().UnixNano() to uint32 which wraps every ~4.3s; replace that with a
non-truncating random source: import crypto/rand, read 4 bytes into a slice,
convert to a uint32 (e.g. binary.BigEndian.Uint32) and return
fmt.Sprintf("%08x", value) so the suffix is derived from
cryptographically-random bytes instead of the low 32 bits of the timestamp;
update e2eUniqueSuffix to use this random-read approach and handle any read
error (propagate or panic/log) as appropriate.
In `@server/push_events.go`:
- Around line 125-127: Add a fail-fast check for an empty commit SHA before
generating tracking keys or calling s.triggerE2EWorkflowForPushEvent: if sha is
empty (zero-length), log an error/warning and return without creating the
tracking key or dispatching (so MOBILE_VERSION won't be sent empty) instead of
proceeding to call triggerE2EWorkflowForPushEvent(repoName, instanceType,
branch, sha, key, instances); ensure the guard is placed immediately before the
tracking/dispatch logic that builds the key and invokes
triggerE2EWorkflowForPushEvent.
In `@server/server.go`:
- Around line 123-139: The periodic cleanup goroutine uses a ticker but never
observes shutdown; change it to listen to a server-scoped stop signal (e.g., an
existing s.stopCh or a context.Context on the server) so it exits on Stop().
Specifically, run the ticker loop with a select that waits on ticker.C and the
server stop signal, call ticker.Stop() when the stop signal is received, and
ensure Stop() closes/signals the same stop channel or cancels the context so the
goroutine ends; the relevant symbols are s.cleanupStaleNonPRE2EInstances(),
s.e2eInstanceMaxAge(), and Stop().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d7ee7b83-0ebe-448d-9395-54511cdf8d62
📒 Files selected for processing (5)
- server/config.go
- server/e2e_tests.go
- server/push_events.go
- server/server.go
- server/workflow_run.go
…r reliable instance lifecycle

- Fix 409 on PR E2E label: per-PR in-progress guard + nanosecond uid prevents duplicate webhook deliveries from racing to create conflicting installations
- Fix nightly/push/release E2E not running: dispatch ref was the commit SHA; GitHub workflow_dispatch requires a branch name — all non-PR dispatch paths updated
- Fix cleanup gaps: embed mw_tracking_key in dispatch inputs for direct key-based cleanup on workflow_run completion, eliminating SHA-mismatch failures
- Add E2EResetServersLabel: a single label to destroy existing PR servers so a developer can add E2E/Run to get a clean-state environment
- Add E2EInstanceMaxAge config (default 3h) and a periodic orphan scan (every maxAge/2), replacing the startup-only hardcoded 8h threshold

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- e2eUniqueSuffix: replace uint32(UnixNano) with cloudModel.NewID()[:8] (crypto/rand-based) to eliminate the ~4.3s wrap-around collision risk
- in-progress guard: include testPlatform in the key so different mobile platform labels (ios vs android) are not incorrectly dropped as duplicates
- push_events: fail fast and destroy instances when the push event SHA is empty, avoiding malformed tracking keys and an empty MOBILE_VERSION dispatch
- server: add stopCh to the Server struct; the periodic cleanup goroutine now observes Stop() via a select on stopCh, exiting cleanly on shutdown

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
♻️ Duplicate comments (2)
server/e2e_tests.go (2)
Lines 98-115: ⚠️ Potential issue | 🟠 Major

Don't use one in-progress slot for every mobile label on the PR.

The guard is still keyed only by `repo/pr`, but mobile runs are label-specific (ios, android, both). If a second mobile label arrives while the first request is provisioning, it is dropped as a duplicate and that platform never runs. Key the guard by `repo/pr/testPlatform`, or track pending platforms separately while keeping `s.e2eInstances` PR-scoped for reuse.

Suggested fix:

```diff
 key := fmt.Sprintf("%s-pr-%d", pr.RepoName, pr.Number)
+inProgressKey := key
+if instanceType == "mobile" {
+	inProgressKey = fmt.Sprintf("%s-%s", key, testPlatform)
+}

 // Guard against duplicate webhook deliveries: if another goroutine is already
 // handling this PR key (check, cloud lookup, or 30-min creation), drop the
 // duplicate rather than racing to create conflicting cloud installations.
 s.e2eInProgressLock.Lock()
-if s.e2eInProgress[key] {
+if s.e2eInProgress[inProgressKey] {
 	s.e2eInProgressLock.Unlock()
 	logger.Warn("E2E instance creation already in progress for this PR, skipping duplicate request")
 	return
 }
-s.e2eInProgress[key] = true
+s.e2eInProgress[inProgressKey] = true
 s.e2eInProgressLock.Unlock()
 defer func() {
 	s.e2eInProgressLock.Lock()
-	delete(s.e2eInProgress, key)
+	delete(s.e2eInProgress, inProgressKey)
 	s.e2eInProgressLock.Unlock()
 }()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/e2e_tests.go` around lines 98 - 115, The current in-progress guard uses a PR-scoped key (key := fmt.Sprintf("%s-pr-%d", pr.RepoName, pr.Number)) so s.e2eInProgress blocks all platforms for that PR; change the guard to be platform-aware by including the test platform (e.g., ios/android/both) in the key (e.g., fmt.Sprintf("%s-pr-%d-%s", pr.RepoName, pr.Number, testPlatform)) or alternatively change s.e2eInProgress to map PR -> set of pending platforms and check/add the specific platform there; update all references to key and the defer cleanup to remove only the platform entry while keeping s.e2eInstances PR-scoped for reuse (symbols: key, s.e2eInProgress, s.e2eInstances).
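The composite-key idea above can be illustrated with a minimal, self-contained sketch. The helpers `inProgressKey` and `tryAcquire` are hypothetical names for this illustration, not symbols from the codebase:

```go
package main

import (
	"fmt"
	"sync"
)

// inProgressKey builds the dedupe key. Mobile runs are label-specific
// (ios/android/both), so the platform must be part of the key or a
// second platform's request is dropped as a duplicate.
func inProgressKey(repo string, pr int, instanceType, platform string) string {
	key := fmt.Sprintf("%s-pr-%d", repo, pr)
	if instanceType == "mobile" {
		key = fmt.Sprintf("%s-%s", key, platform)
	}
	return key
}

// tryAcquire marks the key as in progress; returns false if already held.
func tryAcquire(mu *sync.Mutex, inProgress map[string]bool, key string) bool {
	mu.Lock()
	defer mu.Unlock()
	if inProgress[key] {
		return false
	}
	inProgress[key] = true
	return true
}

func main() {
	var mu sync.Mutex
	inProgress := map[string]bool{}
	ios := inProgressKey("mattermost-mobile", 88, "mobile", "ios")
	android := inProgressKey("mattermost-mobile", 88, "mobile", "android")
	fmt.Println(tryAcquire(&mu, inProgress, ios))     // true: first ios request
	fmt.Println(tryAcquire(&mu, inProgress, android)) // true: different platform, not a duplicate
	fmt.Println(tryAcquire(&mu, inProgress, ios))     // false: duplicate webhook delivery dropped
}
```

With a platform-aware key, only true redeliveries of the same platform are dropped; an android request arriving mid-ios-provisioning proceeds independently.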
Lines 37-42: ⚠️ Potential issue | 🟠 Major

Use a non-wrapping unique source for the suffix.

Casting `UnixNano()` to `uint32` still repeats the suffix space every ~4.29 seconds, so identical PR/platform name patterns can collide again on delayed redeliveries and reintroduce the 409s this change is meant to avoid.

Suggested fix:

```diff
-// e2eUniqueSuffix returns an 8-character hex suffix for instance name uniqueness.
-// Uses the lower 32 bits of the nanosecond clock so that two calls within the
-// same second produce different values, preventing 409 conflicts when duplicate
-// webhook deliveries trigger concurrent instance creation.
+// e2eUniqueSuffix returns a short random suffix for instance name uniqueness.
 func e2eUniqueSuffix() string {
-	return fmt.Sprintf("%08x", uint32(time.Now().UnixNano()))
+	return cloudModel.NewID()[:8]
 }
```

Verification of the wrap-around claim:

```bash
#!/bin/bash
python - <<'PY'
base = 1_700_000_000_000_000_000
a = base & 0xffffffff
b = (base + 2**32) & 0xffffffff
print({"same_low_32_bits": a == b, "wrap_seconds": (2**32) / 1e9})
PY
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/e2e_tests.go` around lines 37 - 42, The suffix currently uses uint32(time.Now().UnixNano()) in e2eUniqueSuffix which wraps every ~4.29s and can reintroduce collisions; change e2eUniqueSuffix to produce 4 random bytes instead (use crypto/rand to Read 4 bytes and encoding/binary to convert to uint32) and fmt.Sprintf("%08x", ...) that value so the 8-char hex is drawn from a non-wrapping random source rather than the low 32 bits of the clock. Ensure you add the crypto/rand and encoding/binary imports and handle/read errors appropriately.
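A standalone sketch of the crypto/rand approach this prompt describes (per the commit message, the actual fix in this PR used `cloudModel.NewID()[:8]` instead, which is also crypto/rand-based):

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// e2eUniqueSuffix returns an 8-character hex suffix drawn from
// crypto/rand, so the value space never wraps with the clock.
func e2eUniqueSuffix() string {
	var b [4]byte
	if _, err := rand.Read(b[:]); err != nil {
		// A failing rand.Read means the OS entropy source is broken;
		// there is no safe fallback, so fail loudly.
		panic(err)
	}
	return fmt.Sprintf("%08x", binary.BigEndian.Uint32(b[:]))
}

func main() {
	fmt.Println(e2eUniqueSuffix())
}
```

Unlike the low 32 bits of `UnixNano()`, two delayed redeliveries seconds apart have only a 1-in-2^32 chance of colliding, regardless of timing.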
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/pull_request.go`:
- Around line 49-52: The reset path races with in-flight provisioning:
handleE2ECleanup can destroy instances while handleE2ETestRequest later writes
new ones into s.e2eInstances, undoing the reset. Fix by serializing per-PR E2E
state (e.g., introduce or use a per-PR struct like an E2EState with a mutex and
a cancel/reset flag) and update handleE2ECleanup to acquire that lock and set
the reset/canceled flag before destroying instances; likewise modify
handleE2ETestRequest to acquire the same lock (or check the flag under the lock)
before storing into s.e2eInstances or dispatching workflows so provisioning
aborts when reset is active. Ensure both paths reference the same per-PR key (PR
identifier) in s.e2eInstances and that handleE2ECleanup waits or short-circuits
provisioning based on the flag.
---
Duplicate comments:
In `@server/e2e_tests.go`:
- Around line 98-115: The current in-progress guard uses a PR-scoped key (key :=
fmt.Sprintf("%s-pr-%d", pr.RepoName, pr.Number)) so s.e2eInProgress blocks all
platforms for that PR; change the guard to be platform-aware by including the
test platform (e.g., ios/android/both) in the key (e.g.,
fmt.Sprintf("%s-pr-%d-%s", pr.RepoName, pr.Number, testPlatform)) or
alternatively change s.e2eInProgress to map PR -> set of pending platforms and
check/add the specific platform there; update all references to key and the
defer cleanup to remove only the platform entry while keeping s.e2eInstances
PR-scoped for reuse (symbols: key, s.e2eInProgress, s.e2eInstances).
- Around line 37-42: The suffix currently uses uint32(time.Now().UnixNano()) in
e2eUniqueSuffix which wraps every ~4.29s and can reintroduce collisions; change
e2eUniqueSuffix to produce 4 random bytes instead (use crypto/rand to Read 4
bytes and encoding/binary to convert to uint32) and fmt.Sprintf("%08x", ...)
that value so the 8-char hex is drawn from a non-wrapping random source rather
than the low 32 bits of the clock. Ensure you add the crypto/rand and
encoding/binary imports and handle/read errors appropriately.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 31e6afb1-002e-406a-a48e-c95c90665bfc
📒 Files selected for processing (3)
- server/config.go
- server/e2e_tests.go
- server/pull_request.go
🚧 Files skipped from review as they are similar to previous changes (1)
- server/config.go
🧹 Nitpick comments (1)
server/push_events.go (1)
Lines 118-125: Move the empty SHA check before instance creation to avoid wasted cloud resources.

The fail-fast guard correctly prevents dispatch with an empty SHA and cleans up instances. However, the check occurs after `createMultipleE2EInstancesForPushEvent` (line 105), meaning cloud instances are provisioned and then immediately destroyed when the SHA is empty. Consider moving this validation earlier in the function, before line 102.

♻️ Suggested refactor: move the check to just before instance creation:

```diff
 if !isDesktop && !isMobile {
 	logger.Warn("Repository is neither desktop nor mobile, skipping E2E tests")
 	return
 }

+// Fail fast if the push event carried no commit SHA — an empty SHA would
+// produce a malformed tracking key and send an empty MOBILE_VERSION to the
+// test workflow.
+if sha == "" {
+	logger.Error("Push event has no commit SHA, skipping E2E dispatch")
+	return
+}
+
 // Create E2E instances for testing
 instanceType := "desktop"
```

Then remove the check from lines 118-125.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/push_events.go` around lines 118 - 125, The empty-commit SHA guard should be moved so we validate sha before provisioning instances: check the sha variable (the current if sha == "" branch) before calling createMultipleE2EInstancesForPushEvent, log the error with logger and return early, and then remove the duplicate check and destroyE2EInstances call that currently exists after instance creation; this ensures we never call createMultipleE2EInstancesForPushEvent when sha is empty and avoids unnecessary calls to destroyE2EInstances.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@server/push_events.go`:
- Around line 118-125: The empty-commit SHA guard should be moved so we validate
sha before provisioning instances: check the sha variable (the current if sha ==
"" branch) before calling createMultipleE2EInstancesForPushEvent, log the error
with logger and return early, and then remove the duplicate check and
destroyE2EInstances call that currently exists after instance creation; this
ensures we never call createMultipleE2EInstancesForPushEvent when sha is empty
and avoids unnecessary calls to destroyE2EInstances.
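The validate-before-provision ordering can be sketched as follows. `dispatchPushE2E` and its `provision` callback are illustrative stand-ins for the actual matterwick functions, not real symbols from the codebase:

```go
package main

import (
	"errors"
	"fmt"
)

// dispatchPushE2E validates the SHA before provisioning, so no cloud
// instances are ever created for a malformed push event.
func dispatchPushE2E(sha, branch string, provision func() error) error {
	if sha == "" {
		// Fail fast: an empty SHA would yield a malformed tracking key
		// and an empty MOBILE_VERSION in the dispatch inputs.
		return errors.New("push event has no commit SHA, skipping E2E dispatch")
	}
	if err := provision(); err != nil {
		return err
	}
	fmt.Printf("dispatch ref=%s mw_tracking_key=repo-push-%s-%s\n", branch, branch, sha)
	return nil
}

func main() {
	provisioned := false
	err := dispatchPushE2E("", "main", func() error { provisioned = true; return nil })
	fmt.Println(err != nil, provisioned) // true false: guard fired, nothing was provisioned
}
```

Putting the guard first means the failure path is free: there are no instances to tear down, so the `destroyE2EInstances` call becomes unnecessary.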
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 14e7fa98-2627-439e-b95f-bd66f7cbb691
📒 Files selected for processing (3)
- server/e2e_tests.go
- server/push_events.go
- server/server.go
🚧 Files skipped from review as they are similar to previous changes (2)
- server/e2e_tests.go
- server/server.go
Adds a per-PR cleanup generation counter to prevent a race where handleE2ECleanup (triggered by E2EResetServersLabel) destroys cloud installations while handleE2ETestRequest is still provisioning, causing the provisioning goroutine to later store deleted instances in the tracking map and dispatch a workflow against non-existent servers. handleE2ECleanup increments e2ePRCleanupGeneration[key] before destroying. handleE2ETestRequest snapshots the counter before provisioning starts and checks it after creation completes — if the counter advanced, the fresh instances are discarded instead of being stored and dispatched. Also skips duplicate-comment findings (uid and in-progress guard) which were already fixed in the previous commit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
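The snapshot-and-compare scheme in this commit can be sketched with a minimal tracker. `genTracker` is a hypothetical name for this illustration; the real code keeps the counter in `e2ePRCleanupGeneration` under a lock:

```go
package main

import (
	"fmt"
	"sync"
)

// genTracker: cleanup bumps the counter; provisioning snapshots it first
// and discards its result if the counter moved while it was working.
type genTracker struct {
	mu  sync.Mutex
	gen map[string]int64
}

func (t *genTracker) snapshot(key string) int64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.gen[key]
}

func (t *genTracker) bump(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.gen[key]++
}

// resetRanDuring reports whether a cleanup happened since start.
func (t *genTracker) resetRanDuring(key string, start int64) bool {
	return t.snapshot(key) != start
}

func main() {
	t := &genTracker{gen: map[string]int64{}}
	start := t.snapshot("repo-pr-88")                  // provisioning begins
	t.bump("repo-pr-88")                               // reset label triggers cleanup mid-flight
	fmt.Println(t.resetRanDuring("repo-pr-88", start)) // true: discard the fresh instances
}
```

The counter never needs to match exact values; any advance between snapshot and check means a reset overlapped the provisioning window.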
…ventE2E

Previously the check fired after createMultipleE2EInstancesForPushEvent, meaning 3 cloud installations were provisioned and then immediately destroyed when the push event carried no commit SHA. Move the guard to before the provisioning call so no instances are ever created for empty-SHA events, and remove the now-unnecessary destroyE2EInstances call.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
🧹 Nitpick comments (2)
server/server.go (1)
Lines 173-177: Consider guarding against double-close if `Stop()` could be called multiple times.

Closing an already-closed channel panics. If `Stop()` is guaranteed to be called exactly once, this is fine. If there's any possibility of multiple calls (e.g., signal handlers, test teardown), consider using `sync.Once` to guard the close.

Optional defensive pattern:

```diff
 type Server struct {
+	stopOnce sync.Once
 	stopCh   chan struct{}
 }

 func (s *Server) Stop() {
 	s.Logger.Info("Stopping MatterWick")
-	close(s.stopCh)
+	s.stopOnce.Do(func() { close(s.stopCh) })
 	manners.Close()
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/server.go` around lines 173 - 177, The Stop method can panic if called more than once because it closes s.stopCh unconditionally; protect the close by adding a guard (e.g., a sync.Once field on Server or an isStopped bool protected by a mutex) and use that guard when invoking close(s.stopCh) inside Server.Stop, leaving s.Logger.Info("Stopping MatterWick") and manners.Close() intact; update the Server struct to include the guard (e.g., stopOnce sync.Once) and call stopOnce.Do(func(){ close(s.stopCh) }) from Stop.

server/e2e_tests.go (1)

Lines 58-63: Consider periodic cleanup of `e2ePRCleanupGeneration` entries.

The `e2ePRCleanupGeneration` map accumulates entries (one per PR key) that are never removed. While each entry is small (~40 bytes), over very long server uptimes with many PRs this could grow unboundedly. Consider clearing entries in `handleE2ECleanup` after the cleanup completes, or implementing periodic pruning of keys no longer in `e2eInstances`.

Optional cleanup in handleE2ECleanup:

```diff
 func (s *Server) handleE2ECleanup(pr *model.PullRequest) {
 	// ... existing cleanup logic ...

 	// Fallback: catch orphans from restarts, map overwrites, or failed goroutines
 	s.cleanupOrphanedE2EInstances(pr, logger)
+
+	// Remove the generation entry now that cleanup is complete.
+	// Any in-flight provisioning will have already captured its snapshot.
+	s.e2ePRCleanupGenerationLock.Lock()
+	delete(s.e2ePRCleanupGeneration, key)
+	s.e2ePRCleanupGenerationLock.Unlock()
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/e2e_tests.go` around lines 58 - 63, The e2ePRCleanupGeneration map grows unbounded because entries are never removed; update handleE2ECleanup to remove the PR key from e2ePRCleanupGeneration after a successful cleanup (or implement a periodic pruning pass that iterates e2ePRCleanupGeneration and deletes keys that no longer exist in e2eInstances), ensuring any necessary synchronization around e2ePRCleanupGeneration/e2eInstances is respected (use the existing mutex or add one if needed) and keeping the cleanup logic bounded.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@server/e2e_tests.go`:
- Around line 58-63: The e2ePRCleanupGeneration map grows unbounded because
entries are never removed; update handleE2ECleanup to remove the PR key from
e2ePRCleanupGeneration after a successful cleanup (or implement a periodic
pruning pass that iterates e2ePRCleanupGeneration and deletes keys that no
longer exist in e2eInstances), ensuring any necessary synchronization around
e2ePRCleanupGeneration/e2eInstances is respected (use the existing mutex or add
one if needed) and keeping the cleanup logic bounded.
In `@server/server.go`:
- Around line 173-177: The Stop method can panic if called more than once
because it closes s.stopCh unconditionally; protect the close by adding a guard
(e.g., a sync.Once field on Server or an isStopped bool protected by a mutex)
and use that guard when invoking close(s.stopCh) inside Server.Stop, leaving
s.Logger.Info("Stopping MatterWick") and manners.Close() intact; update the
Server struct to include the guard (e.g., stopOnce sync.Once) and call
stopOnce.Do(func(){ close(s.stopCh) }) from Stop.
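A minimal runnable version of the sync.Once pattern this prompt asks for, using a stripped-down server struct without the logging and manners calls:

```go
package main

import (
	"fmt"
	"sync"
)

// server with an idempotent Stop: sync.Once routes the close so a second
// call is a no-op instead of panicking on "close of closed channel".
type server struct {
	stopOnce sync.Once
	stopCh   chan struct{}
}

func (s *server) Stop() {
	s.stopOnce.Do(func() { close(s.stopCh) })
}

func main() {
	s := &server{stopCh: make(chan struct{})}
	s.Stop()
	s.Stop() // safe: Once skips the second close
	<-s.stopCh
	fmt.Println("stopped once, called twice")
}
```

`sync.Once` also publishes the close with proper memory ordering, so concurrent callers of `Stop()` are safe without an extra mutex.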
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a9782ff5-cde1-48a4-9795-11ba3b5d015d
📒 Files selected for processing (3)
- server/e2e_tests.go
- server/push_events.go
- server/server.go
✅ Files skipped from review due to trivial changes (1)
- server/push_events.go
…ouble-close

Delete the PR key from e2ePRCleanupGeneration at the end of handleE2ECleanup to prevent the map growing unbounded (one entry per cleaned-up PR forever). Guard Server.Stop against a second call by routing close(s.stopCh) through a sync.Once field, eliminating the panic on "close of closed channel".

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/server.go`:
- Around line 58-63: The cleanup-generation map e2ePRCleanupGeneration suffers
from zero-value ambiguity when handleE2ECleanup increments then deletes a PR
key: reads in handleE2ETestRequest see 0 and cannot distinguish "never ran" from
"was deleted", breaking resetDuringProvisioning (which compares
e2ePRCleanupGeneration[key] != startGeneration). Fix by switching to a
non-ambiguous scheme: either (1) use a tombstone sentinel value (e.g., -1 or a
dedicated constant) and have handleE2ECleanup write that sentinel instead of
deleting the key until all in-flight provisioning (startGeneration checks) can
observe the change, or (2) make e2ePRCleanupGeneration a monotonic counter that
is only incremented (never deleted) during the provisioning window so
startGeneration comparisons remain valid; update handleE2ECleanup and
handleE2ETestRequest logic to set/interpret the new sentinel or monotonic
semantics accordingly.
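The zero-value ambiguity is easy to demonstrate concretely, assuming a plain map-backed counter like e2ePRCleanupGeneration:

```go
package main

import "fmt"

// Demonstrates why increment-then-delete breaks the generation check:
// after the delete, a reader sees the zero value again and cannot tell
// that a cleanup ever ran during its provisioning window.
func main() {
	gen := map[string]int64{}
	start := gen["pr-88"] // provisioning snapshots generation 0
	gen["pr-88"]++        // reset label: cleanup bumps to 1
	delete(gen, "pr-88")  // cleanup deletes the key; reads return 0 again
	resetDetected := gen["pr-88"] != start
	fmt.Println(resetDetected) // false: the bump is invisible, stale instances get stored
}
```

Keeping the counter monotonic (never deleting the key) makes the `!= start` comparison always fire when a cleanup overlapped provisioning, which is the fix the follow-up commit applies.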
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bb4b0708-96de-4502-a452-eb3390edfe02
📒 Files selected for processing (2)
- server/e2e_tests.go
- server/server.go
✅ Files skipped from review due to trivial changes (1)
- server/e2e_tests.go
…antics

Deleting the key after cleanup reintroduces the zero-value ambiguity: a provisioning run that started before any cleanup captures startGeneration=0, then cleanup increments to 1 and deletes the key (value silently back to 0), so the final check 0 != 0 is false and stale deleted instances get stored. Keep the counter monotonic (never delete) so the != comparison always fires when cleanup ran during provisioning. The map holds at most one int64 per PR key ever cleaned up, which is negligible.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Fixes E2E server creation affecting desktop and mobile repos.

Adds an `E2E/Reset-Servers` label. PRs that are up for longer tend to accumulate test data. This label will destroy the servers, and we can add the `E2E/Run` label again to run tests on fresh servers.