Skip to content

test(proxy): retry on lost freePort/freeAddr bind race to de-flake lifecycle tests#142

Open
iamclaude697 wants to merge 1 commit into
mainfrom
claude/issue-140-fix-u5lyj7
Open

test(proxy): retry on lost freePort/freeAddr bind race to de-flake lifecycle tests#142
iamclaude697 wants to merge 1 commit into
mainfrom
claude/issue-140-fix-u5lyj7

Conversation

@iamclaude697

@iamclaude697 iamclaude697 commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Pull Request

What

Fixes the flaky TestMain_HelperProcess_Lifecycle (#140). freePort (and its in-process sibling freeAddr) hand out a port number, not a reservation: the probe listener is closed before the consumer re-binds the port, so any concurrent binder can steal it in between (TOCTOU). Under ephemeral-port contention the helper subprocess intermittently exited with bind: address already in use and the readiness wait timed out. The fix eliminates the close-then-rebind window where possible (bind "127.0.0.1:0" directly) and, where the port must be known up front (subprocess env vars, dialable server), detects a lost race deterministically and relaunches on fresh ports a bounded number of times (maxBindAttempts = 5). Test-only change; no production code is touched.

Changes

  • cmd/proxy/main_test.go
    • TestMain_HelperProcess_Lifecycle → bounded retry loop around a new lifecycleAttempt helper. The subprocess is now reaped via a Wait goroutine so an early exit is observed immediately (previously a lost race silently ate the full 10s log timeout). An EADDRINUSE exit at any point — before readiness, or around the SIGTERM window when the concurrently-binding management listener loses its port — is classified as a lost race and relaunched on fresh ports; every other failure still fails in place with the subprocess log.
    • TestMain_HelperProcess_ProxyPortConflict_Fatal / _MgmtPortConflict_Fatal → shared testHelperPortConflict helper. These tests deliberately conflict one port but take the other from freePort; they retry only when the output shows the wrong subsystem's fatal tag plus EADDRINUSE (the non-pinned port was stolen). A genuine wrong-tag failure still fails.
    • New shared maxBindAttempts / bindConflict constants; removed the now-unused waitForLog.
  • cmd/proxy/service_lifecycle_test.go
    • TestRunServiceLifecycle_StopGracefully / _InterrogateReEmits bind "127.0.0.1:0" directly — they never dial the server, so the close-then-rebind window is eliminated, not retried (this matches the [SERVICE] HTTP server exited: … address already in use line in the issue's failure log).
    • TestRunServiceLifecycle_ShutdownTimeoutLogged must dial the server, so it keeps freeAddr but now confirms ownership via a new waitServiceReady probe (GET /ready → 204 proves the port is served by our mux; an early lifecycle exit means the bind race was lost) and retries on a fresh port. The probe also removes a silent-coverage hazard: previously the /hold request could race ahead of ListenAndServe, be refused, and leave the Shutdown-timeout branch unexercised while the test still passed.
  • cmd/proxy/startup_test.gofreePort doc comment now states the TOCTOU contract and the retry expectation for consumers. TestStartManagementAPI_ServesRequests cannot retry (a lost race hits runManagementAPI's log.Fatalf), but its window is in-process (microseconds, no fork/exec) rather than the subprocess window this issue is about; making it fully race-free would need a listener-injection seam in internal/management, which is out of scope for a test-only fix.

Verification of the retry paths themselves (mutation-style, not committed): freePort/freeAddr were temporarily instrumented to return ports kept bound by the parent process, simulating a thief winning the race. All three retry paths engaged (attempt 1/5 … relaunching logged) and recovered to PASS: lifecycle (both ports stolen), proxy-port-conflict (mgmt port stolen, 6/6 runs), shutdown-timeout (addr stolen).

Contended reproduction from the issue: 9 full go test -race -count=1 ./... runs in 3 concurrent batches of 3 — 0 failures (the issue measured ~8% failure per contended run on main, so ~0.5 failures were expected from 9 runs if unfixed).

Quality Gates

Check each only after the gate is satisfied with concrete evidence. See CLAUDE.md for the full spec.

  • make check passed locally (lint, test, security, vulncheck — all four sub-gates exit 0). Last line of output:
    All checks passed. (golangci-lint v2.10.1, gosec, govulncheck rebuilt with go1.26.4 — same toolchain note as chore: replace nolint:errcheck with explicit error handling (#102) #139; golangci-lint run ./... reports 0 issues., govulncheck: 0 vulnerabilities affecting this code)
  • go test -race ./... passed (plus -count=5 on the affected tests and the 9 contended full-suite runs above)
  • Coverage minimums met: internal/anonymizer 97.2%, internal/config 100%, internal/anonymizer/packs 97.6% (≥95% gate); overall 98.3% (≥85% gate)
  • Delta coverage ≥90% on all changed/new files — see Delta Coverage Report below
  • §6 Test Inventory baseline-vs-head completed below
  • No hacks introduced (no new //nolint — the two pre-existing annotated ones in touched files are unchanged; no t.Skip(); no // coverage-ignore; no hardcoded values whose only purpose is to satisfy a specific test assertion; no new gosec exclusions)
  • All CI jobs green at the PR head SHA (6c3a4eb) — 21/21 checks completed: Test, Lint, Security Scan, Benchmark, Build (amd64/arm64/macOS), Build MSI, CodeQL, Analyze (go/actions), and 6 install round-trips all ✅; Sign + publish skipped (release-only)

Delta Coverage Report

Per .github/scripts/delta-coverage.sh — every function in changed/new .go source files (excluding _test.go, _generated.go, mock_*) must be ≥90%.

Command:

go test -race -count=1 -coverprofile=coverage.out -covermode=atomic ./...
bash .github/scripts/delta-coverage.sh coverage.out 90.0 origin/main

Raw script output:

No changed Go source files — delta coverage check skipped.

Per-function table — one row per function in every changed .go source file:

File Function Coverage % ≥90%
(none — this PR changes only _test.go files, which the script excludes by design)

§6 Test Inventory — baseline vs head

Method: checkout main, run go test -race -count=1 -v <pkg> and count --- PASS / --- FAIL lines (top-level + subtests). Repeat on the PR head SHA. Executed on main (aa9ae3f) in a separate worktree and on head 6c3a4eb.

Package main (aa9ae3f) PASS / FAIL head (6c3a4eb) PASS / FAIL Delta
./cmd/proxy/ 34 / 0 34 / 0 0
./internal/anonymizer/ 393 / 0 393 / 0 0
./internal/anonymizer/packs/ 177 / 0 177 / 0 0
./internal/config/ 39 / 0 39 / 0 0
./internal/domainmatch/ 44 / 0 44 / 0 0
./internal/envfile/ 12 / 0 12 / 0 0
./internal/logger/ 15 / 0 15 / 0 0
./internal/management/ 61 / 0 61 / 0 0
./internal/metrics/ 17 / 0 17 / 0 0
./internal/mitm/ 43 / 0 43 / 0 0
./internal/proxy/ 122 / 0 122 / 0 0
TOTAL 957 / 0 957 / 0 0

Zero failures on either side; test count is unchanged because the fix restructures existing tests around retry helpers without adding or removing top-level tests.

Test Plan

This PR is test changes; each was verified to still pin its original branch, and the new retry logic was itself exercised:

  • TestMain_HelperProcess_Lifecycle — still pins main()'s full startup→SIGTERM→clean-exit lifecycle (readiness log, exit code 0). Retry path proven by temporarily instrumenting freePort to hand out parent-held ports: attempt 1 dies with EADDRINUSE, attempt 2 passes, attempt 1/5 … relaunching logged.
  • TestMain_HelperProcess_ProxyPortConflict_Fatal / _MgmtPortConflict_Fatal — still pin runHTTPServer / runManagementAPI's log.Fatalf branches via the pinned-port conflict; stolen-port retry proven the same way (6/6 instrumented runs relaunched then passed).
  • TestRunServiceLifecycle_StopGracefully / _InterrogateReEmits — unchanged assertions (status transitions, exit code); now bind-race-free by construction (":0").
  • TestRunServiceLifecycle_ShutdownTimeoutLogged — still pins the Shutdown-timeout branch (exit code 0 with a held request past shutdownDeadline), now deterministically: waitServiceReady guarantees the server is up before /hold is sent, so the held-connection precondition can no longer be silently skipped. Lost-race retry proven by binding a thief listener on attempt 1's address.
  • Full suite: go test -race ./... green; 9/9 concurrent contended full-suite runs green (the issue's reproduction).

Linked Issues

Closes #140


Generated by Claude Code

…fecycle tests (#140)

freePort and freeAddr hand out a port number, not a reservation: the
probe listener is closed before the consumer re-binds the port, so any
concurrent binder can steal it in between (TOCTOU). Under ephemeral-port
contention this intermittently killed the helper subprocess with
'bind: address already in use', flaking TestMain_HelperProcess_Lifecycle.

- TestMain_HelperProcess_Lifecycle: reap the subprocess via a Wait
  goroutine so an early exit is observed immediately (instead of eating
  the 10s log timeout), classify EADDRINUSE exits as a lost race, and
  relaunch on fresh ports (bounded by maxBindAttempts). A stolen
  management port can also kill the process around the SIGTERM window;
  those paths retry the same way.
- Port-conflict fatal tests: share testHelperPortConflict, which retries
  only when the freePort-allocated (non-pinned) port was the one stolen
  (wrong subsystem tag + EADDRINUSE); genuine mismatches still fail.
- TestRunServiceLifecycle_StopGracefully / _InterrogateReEmits: bind
  "127.0.0.1:0" directly — these tests never dial the server, so the
  close-then-rebind window is eliminated outright.
- TestRunServiceLifecycle_ShutdownTimeoutLogged: must dial the server,
  so it keeps freeAddr but confirms ownership via a GET /ready == 204
  probe (a foreign response or an early lifecycle exit means the bind
  race was lost) and retries on a fresh port. The probe also removes a
  silent-coverage hazard where the /hold request could race ahead of
  ListenAndServe and the timeout branch went unexercised.

Fixes #140

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_016NJWfHPx1kThRwHqAmPnTs
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown

Benchmark results

Benchmark ns/op Budget Headroom Status
BenchmarkRegexPassEmail 1.6µs 2.5ms 99% ✅ PASS
BenchmarkRegexPassMultiple 2.4µs 2.5ms 99% ✅ PASS

Budget: regex/cache < 2.5ms (0.5% of 500ms baseline), streaming < 20ms (4% of 500ms baseline)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

flaky test: TOCTOU port race in freePort causes intermittent TestMain_HelperProcess_Lifecycle failures

2 participants