test(proxy): retry on lost freePort/freeAddr bind race to de-flake lifecycle tests#142
Open
iamclaude697 wants to merge 1 commit into
Open
test(proxy): retry on lost freePort/freeAddr bind race to de-flake lifecycle tests#142iamclaude697 wants to merge 1 commit into
iamclaude697 wants to merge 1 commit into
Conversation
…fecycle tests (#140) freePort and freeAddr hand out a port number, not a reservation: the probe listener is closed before the consumer re-binds the port, so any concurrent binder can steal it in between (TOCTOU). Under ephemeral-port contention this intermittently killed the helper subprocess with 'bind: address already in use', flaking TestMain_HelperProcess_Lifecycle. - TestMain_HelperProcess_Lifecycle: reap the subprocess via a Wait goroutine so an early exit is observed immediately (instead of eating the 10s log timeout), classify EADDRINUSE exits as a lost race, and relaunch on fresh ports (bounded by maxBindAttempts). A stolen management port can also kill the process around the SIGTERM window; those paths retry the same way. - Port-conflict fatal tests: share testHelperPortConflict, which retries only when the freePort-allocated (non-pinned) port was the one stolen (wrong subsystem tag + EADDRINUSE); genuine mismatches still fail. - TestRunServiceLifecycle_StopGracefully / _InterrogateReEmits: bind "127.0.0.1:0" directly — these tests never dial the server, so the close-then-rebind window is eliminated outright. - TestRunServiceLifecycle_ShutdownTimeoutLogged: must dial the server, so it keeps freeAddr but confirms ownership via a GET /ready == 204 probe (a foreign response or an early lifecycle exit means the bind race was lost) and retries on a fresh port. The probe also removes a silent-coverage hazard where the /hold request could race ahead of ListenAndServe and the timeout branch went unexercised. Fixes #140 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_016NJWfHPx1kThRwHqAmPnTs
Benchmark results
Budget: regex/cache < 2.5ms (0.5% of 500ms baseline), streaming < 20ms (4% of 500ms baseline) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Pull Request
What
Fixes the flaky
TestMain_HelperProcess_Lifecycle(#140).freePort(and its in-process siblingfreeAddr) hand out a port number, not a reservation: the probe listener is closed before the consumer re-binds the port, so any concurrent binder can steal it in between (TOCTOU). Under ephemeral-port contention the helper subprocess intermittently exited withbind: address already in useand the readiness wait timed out. The fix eliminates the close-then-rebind window where possible (bind"127.0.0.1:0"directly) and, where the port must be known up front (subprocess env vars, dialable server), detects a lost race deterministically and relaunches on fresh ports a bounded number of times (maxBindAttempts = 5). Test-only change; no production code is touched.Changes
cmd/proxy/main_test.goTestMain_HelperProcess_Lifecycle→ bounded retry loop around a newlifecycleAttempthelper. The subprocess is now reaped via aWaitgoroutine so an early exit is observed immediately (previously a lost race silently ate the full 10s log timeout). An EADDRINUSE exit at any point — before readiness, or around the SIGTERM window when the concurrently-binding management listener loses its port — is classified as a lost race and relaunched on fresh ports; every other failure still fails in place with the subprocess log.TestMain_HelperProcess_ProxyPortConflict_Fatal/_MgmtPortConflict_Fatal→ sharedtestHelperPortConflicthelper. These tests deliberately conflict one port but take the other fromfreePort; they retry only when the output shows the wrong subsystem's fatal tag plus EADDRINUSE (the non-pinned port was stolen). A genuine wrong-tag failure still fails.maxBindAttempts/bindConflictconstants; removed the now-unusedwaitForLog.cmd/proxy/service_lifecycle_test.goTestRunServiceLifecycle_StopGracefully/_InterrogateReEmitsbind"127.0.0.1:0"directly — they never dial the server, so the close-then-rebind window is eliminated, not retried (this matches the[SERVICE] HTTP server exited: … address already in useline in the issue's failure log).TestRunServiceLifecycle_ShutdownTimeoutLoggedmust dial the server, so it keepsfreeAddrbut now confirms ownership via a newwaitServiceReadyprobe (GET /ready→ 204 proves the port is served by our mux; an early lifecycle exit means the bind race was lost) and retries on a fresh port. The probe also removes a silent-coverage hazard: previously the/holdrequest could race ahead ofListenAndServe, be refused, and leave the Shutdown-timeout branch unexercised while the test still passed.cmd/proxy/startup_test.go—freePortdoc comment now states the TOCTOU contract and the retry expectation for consumers.TestStartManagementAPI_ServesRequestscannot retry (a lost race hitsrunManagementAPI'slog.Fatalf), but its window is in-process (microseconds, no fork/exec) rather than the subprocess window this issue is about; making it fully race-free would need a listener-injection seam ininternal/management, which is out of scope for a test-only fix.Verification of the retry paths themselves (mutation-style, not committed):
freePort/freeAddrwere temporarily instrumented to return ports kept bound by the parent process, simulating a thief winning the race. All three retry paths engaged (attempt 1/5 … relaunchinglogged) and recovered to PASS: lifecycle (both ports stolen), proxy-port-conflict (mgmt port stolen, 6/6 runs), shutdown-timeout (addr stolen).Contended reproduction from the issue: 9 full
go test -race -count=1 ./...runs in 3 concurrent batches of 3 — 0 failures (the issue measured ~8% failure per contended run onmain, so ~0.5 failures were expected from 9 runs if unfixed).Quality Gates
Check each only after the gate is satisfied with concrete evidence. See
CLAUDE.mdfor the full spec.make checkpassed locally (lint, test, security, vulncheck — all four sub-gates exit 0). Last line of output:All checks passed.(golangci-lint v2.10.1, gosec, govulncheck rebuilt with go1.26.4 — same toolchain note as chore: replace nolint:errcheck with explicit error handling (#102) #139;golangci-lint run ./...reports0 issues., govulncheck: 0 vulnerabilities affecting this code)go test -race ./...passed (plus-count=5on the affected tests and the 9 contended full-suite runs above)internal/anonymizer97.2%,internal/config100%,internal/anonymizer/packs97.6% (≥95% gate); overall 98.3% (≥85% gate)//nolint— the two pre-existing annotated ones in touched files are unchanged; not.Skip(); no// coverage-ignore; no hardcoded values whose only purpose is to satisfy a specific test assertion; no newgosecexclusions)6c3a4eb) — 21/21 checks completed: Test, Lint, Security Scan, Benchmark, Build (amd64/arm64/macOS), Build MSI, CodeQL, Analyze (go/actions), and 6 install round-trips all ✅; Sign + publish skipped (release-only)Delta Coverage Report
Per
.github/scripts/delta-coverage.sh— every function in changed/new.gosource files (excluding_test.go,_generated.go,mock_*) must be ≥90%.Command:
go test -race -count=1 -coverprofile=coverage.out -covermode=atomic ./... bash .github/scripts/delta-coverage.sh coverage.out 90.0 origin/mainRaw script output:
Per-function table — one row per function in every changed
.gosource file:_test.gofiles, which the script excludes by design)§6 Test Inventory — baseline vs head
Method: checkout
main, rungo test -race -count=1 -v <pkg>and count--- PASS/--- FAILlines (top-level + subtests). Repeat on the PR head SHA. Executed onmain(aa9ae3f) in a separate worktree and on head6c3a4eb.aa9ae3f) PASS / FAIL6c3a4eb) PASS / FAIL./cmd/proxy/./internal/anonymizer/./internal/anonymizer/packs/./internal/config/./internal/domainmatch/./internal/envfile/./internal/logger/./internal/management/./internal/metrics/./internal/mitm/./internal/proxy/Zero failures on either side; test count is unchanged because the fix restructures existing tests around retry helpers without adding or removing top-level tests.
Test Plan
This PR is test changes; each was verified to still pin its original branch, and the new retry logic was itself exercised:
TestMain_HelperProcess_Lifecycle— still pins main()'s full startup→SIGTERM→clean-exit lifecycle (readiness log, exit code 0). Retry path proven by temporarily instrumentingfreePortto hand out parent-held ports: attempt 1 dies with EADDRINUSE, attempt 2 passes,attempt 1/5 … relaunchinglogged.TestMain_HelperProcess_ProxyPortConflict_Fatal/_MgmtPortConflict_Fatal— still pinrunHTTPServer/runManagementAPI'slog.Fatalfbranches via the pinned-port conflict; stolen-port retry proven the same way (6/6 instrumented runs relaunched then passed).TestRunServiceLifecycle_StopGracefully/_InterrogateReEmits— unchanged assertions (status transitions, exit code); now bind-race-free by construction (":0").TestRunServiceLifecycle_ShutdownTimeoutLogged— still pins the Shutdown-timeout branch (exit code 0 with a held request pastshutdownDeadline), now deterministically:waitServiceReadyguarantees the server is up before/holdis sent, so the held-connection precondition can no longer be silently skipped. Lost-race retry proven by binding a thief listener on attempt 1's address.go test -race ./...green; 9/9 concurrent contended full-suite runs green (the issue's reproduction).Linked Issues
Closes #140
Generated by Claude Code