Fix 8 bugs: TOCTOU races, panics, and missed cases from PR #150#151
Merged
Conversation
- fix: use CAS in openCircuit/close to emit Opened()/Closed() exactly once - fix: copy config in ConsecutiveErrOpenerFactory closure (race hygiene, matches PR #150) - fix: read Fallback.MaxConcurrentRequests once in throttle check (matches PR #150) - fix: use nil-safe props.now() in Opener.SetConfigNotThreadSafe - fix: guard NumBuckets==0 in RollingCounter.GetBuckets (fixes String() panic) - fix: nil-check pointer fields in RollingCounter.UnmarshalJSON - fix: MockClock.AfterFunc fires immediately for d<=0 (matches real time.AfterFunc) All fixes are backwards-compatible (panic->works, race->clean, or strictly additive guards). Regression tests added for each. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…g inconsistent state
The previous fix (nil-check each pointer field and skip) made zero-value
receivers safe but created a new deferred-panic vector: on a pre-initialized
counter, UnmarshalJSON({}) would set buckets=nil while leaving NumBuckets
unchanged, causing GetBuckets to panic later with index out of range.
UnmarshalJSON is a round-trip API — MarshalJSON always emits all fields.
Incomplete JSON is corrupt/truncated input. Return an error (receiver
unmodified) rather than silently partial-apply.
Backwards-compatible vs released master: panic -> error (strictly better).
Tests added:
- IncompleteInput: 5 partial-JSON variants x 2 receiver states (all must error, receiver preserved)
- RoundTrip: 3 counter states marshal+unmarshal cleanly (the BC guarantee)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Systematic audit for timing-sensitive tests that can flake on loaded CI runners. - TestCircuitAttemptsToReopenOnlyOnce: convert to MockClock for deterministic timing. The previous real-clock version with SleepWindow: 1ms flaked when CI scheduling delays exceeded 1ms (same pattern PR #141 fixed in the sister test TestCircuitAttemptsToReopen). This is the flake that caused the CI failure. - TestThrottled, TestFallbackCircuitConcurrency: replace time.Sleep-based concurrency with a barrier pattern. Wait until ALL goroutines have either entered the throttled section or been rejected before releasing — guarantees the over-limit goroutine attempts while slots are occupied, regardless of scheduling. Also fixed: errCount++ was not atomic. - TestRollingCounter_Inc: use StringAt(now)/RollingSumAt(now) instead of String()/RollingSum(). The latter use real time.Now() which can advance past the 10ms rolling window on a slow runner, rolling out buckets before later Inc(now) calls. - TestTickUntil: register AfterFunc callbacks BEFORE starting the TickUntil goroutine. If TickUntil advances mock time first, callbacks get scheduled relative to the advanced time, breaking the final m.Now() == now+3h assertion. - TestManagerConcurrentFactoryConfiguration: drop require.NoError on Execute. With 300 concurrent executions under -race and only a 2x sleep/timeout margin, CI jitter can cause spurious timeouts. The test is about concurrent factory creation, not execution timing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cep21
added a commit
that referenced
this pull request
Feb 28, 2026
…efile ## Test infrastructure - circuit_property_test.go: invariant-asserting stress tests + deterministic state-machine property test (testing/quick + MockClock + reference model). Verified catches the TOCTOU bugs from PR #151 by temporarily reverting CAS. - TestCircuit_RunMetricsExactlyOnce_Concurrent: verifies the exactly-once RunMetrics contract under 100K concurrent Execute calls - TestCircuit_TransitionAlternation_Concurrent: verifies Opened/Closed alternate correctly (closed ≤ opened ≤ closed+1) under contention - TestCircuit_FallbackMetricsExactlyOnce_Concurrent: same for fallback path - TestCircuit_StateMachineProperty: 500 random op sequences replayed against real circuit + trivial reference model, 6 invariants checked per step, fully deterministic via MockClock - faststats/fuzz_test.go: 5 Go-native fuzz targets + naiveRollingCounter differential oracle. Seeds run as unit tests in ~5ms. - FuzzRollingCounterOps: Inc/RollingSumAt/GetBuckets vs naive oracle - FuzzRollingCounterJSON: round-trip + hostile-input safety - FuzzSortedDurationsPercentile: bounds + monotonicity - FuzzRollingBucketAdvance: index bounds + LastAbsIndex monotonicity - FuzzTimedCheckJSON: round-trip + hostile-input safety - Tightened existing stress tests (were assertion-free, just t.Logf): - circuit_stress_test.go: TestCircuitStateTransitionRace now verifies Opened/Closed alternation + ConcurrentCommands balance - faststats/rolling_stress_test.go: deleted 2 permanent-skip tests, added real TestTimedCheckConcurrency, added RollingSum≤TotalSum check ## Bug fixes (found by fuzzers in ~10s) - faststats/rolling_percentile.go: Percentile(NaN) panicked — NaN falls through both p<=0 and p>=100 guards, int(math.Floor(NaN)) is INT64_MIN on amd64, index-out-of-range. Now returns -1. - faststats/rolling_counter.go: UnmarshalJSON accepted hostile JSON with NumBuckets > len(Buckets), then GetBuckets panicked. Now rejected at unmarshal time with a descriptive error. ## Build tooling - Makefile: restored after Jan 2024 deletion. make (fast dev loop), make ci (pre-PR gate mirroring GitHub Actions), make fuzz, make bench, make cover, make fix, make help. - .github/workflows/build.yml: Build/Test steps now call make build / make test-race (same commands, now DRY). - .github/workflows/fuzz.yml: daily scheduled fuzzing, 5min per target, uploads crash inputs as artifacts on failure. Manual dispatch enabled. - README.md: Development section points to make targets. Fixed dead Makefile link. - .gitignore: added /coverage.html (from make cover).
cep21
added a commit
that referenced
this pull request
Feb 28, 2026
…efile (#152) (#152) * Add property/fuzz/invariant tests; fix 2 fuzz-found bugs; restore Makefile ## Test infrastructure - circuit_property_test.go: invariant-asserting stress tests + deterministic state-machine property test (testing/quick + MockClock + reference model). Verified catches the TOCTOU bugs from PR #151 by temporarily reverting CAS. - TestCircuit_RunMetricsExactlyOnce_Concurrent: verifies the exactly-once RunMetrics contract under 100K concurrent Execute calls - TestCircuit_TransitionAlternation_Concurrent: verifies Opened/Closed alternate correctly (closed ≤ opened ≤ closed+1) under contention - TestCircuit_FallbackMetricsExactlyOnce_Concurrent: same for fallback path - TestCircuit_StateMachineProperty: 500 random op sequences replayed against real circuit + trivial reference model, 6 invariants checked per step, fully deterministic via MockClock - faststats/fuzz_test.go: 5 Go-native fuzz targets + naiveRollingCounter differential oracle. Seeds run as unit tests in ~5ms. - FuzzRollingCounterOps: Inc/RollingSumAt/GetBuckets vs naive oracle - FuzzRollingCounterJSON: round-trip + hostile-input safety - FuzzSortedDurationsPercentile: bounds + monotonicity - FuzzRollingBucketAdvance: index bounds + LastAbsIndex monotonicity - FuzzTimedCheckJSON: round-trip + hostile-input safety - Tightened existing stress tests (were assertion-free, just t.Logf): - circuit_stress_test.go: TestCircuitStateTransitionRace now verifies Opened/Closed alternation + ConcurrentCommands balance - faststats/rolling_stress_test.go: deleted 2 permanent-skip tests, added real TestTimedCheckConcurrency, added RollingSum≤TotalSum check ## Bug fixes (found by fuzzers in ~10s) - faststats/rolling_percentile.go: Percentile(NaN) panicked — NaN falls through both p<=0 and p>=100 guards, int(math.Floor(NaN)) is INT64_MIN on amd64, index-out-of-range. Now returns -1. - faststats/rolling_counter.go: UnmarshalJSON accepted hostile JSON with NumBuckets > len(Buckets), then GetBuckets panicked. Now rejected at unmarshal time with a descriptive error. ## Build tooling - Makefile: restored after Jan 2024 deletion. make (fast dev loop), make ci (pre-PR gate mirroring GitHub Actions), make fuzz, make bench, make cover, make fix, make help. - .github/workflows/build.yml: Build/Test steps now call make build / make test-race (same commands, now DRY). - .github/workflows/fuzz.yml: daily scheduled fuzzing, 5min per target, uploads crash inputs as artifacts on failure. Manual dispatch enabled. - README.md: Development section points to make targets. Fixed dead Makefile link. - .gitignore: added /coverage.html (from make cover). * Fix pre-existing flake: RollingSumAt < 0 check must be post-quiescence TestRollingCounterBucketRolloverRace had a rollingSum < 0 check inside the hot concurrent loop since PR #139/#150. This check was always wrong for the lock-free RollingCounter design (documented at rolling_bucket.go): Thread A (Inc): buckets[idx].Add(1) Thread B (rollover): Swap(0) returns the incremented value, rollingSum.Add(-toDec) — briefly over-decrements Thread A (Inc cont): rollingSum.Add(1) — corrects A reader between B and A's rollingSum write observes a transient negative. The value is correct once writers quiesce. Moved the < 0 check to after wg.Wait(). The rollingSum <= TotalSum check stays in the hot loop (safe: reads are correctly ordered).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up to #150 — 8 additional bugs found across the same areas. All fixes are backwards-compatible (panic→works or panic→error; no behavior changes for existing valid usage).
Fixes
ConsecutiveErrOpenerFactoryconfig raceclosers/simplelogic/closers.goMerge(same pattern as #150 for hystrix factories)openCircuitcan emitOpened()2-3× under concurrent callscircuit.goCompareAndSwapbefore emitting metricclosecan emitClosed()multiple timescircuit.goCompareAndSwapbefore emitting metricOpener.SetConfigNotThreadSafenil-deref onprops.Nowclosers/hystrix/opener.goprops.now()helper (already existed at line 53)RollingCounter.GetBuckets/.String()divide-by-zero on zero valuefaststats/rolling_counter.goNumBuckets == 0RollingCounter.UnmarshalJSONnil-deref on incomplete JSONfaststats/rolling_counter.goMaxConcurrentRequestsdouble-readcircuit.goMockClock.AfterFunc— negative duration doesn't fire immediatelyinternal/clock/clock.god == 0→d <= 0(matches realtime.AfterFunc)Backwards compatibility
Opened()/Closed()now fire afterisOpenis set (previously before). Audited all in-repoMetricsimplementations — none callIsOpen()from inside the callback. The old ordering was non-deterministic under the TOCTOU anyway.UnmarshalJSON— incomplete JSON (e.g.{}) now returns an error instead of panicking. Round-trip (Marshal→Unmarshal) is unaffected and fully covered by new tests.GetBuckets— returnsnilfor zero-value counter instead of panicking.range nilis a no-op so callers are safe.Test coverage
Every fix has a regression test:
TestOpenCircuit_EmitsOpenedExactlyOnce/TestCloseCircuit_EmitsClosedExactlyOnce— 100 concurrent goroutinesTestConsecutiveErrOpenerFactory_ConcurrentCreation— 100 concurrent factory calls under-raceTestOpener_SetConfigNotThreadSafe_NilNow— direct call withoutNowsetTestRollingCounter_Empty— extended for zero-valueGetBuckets/StringTestRollingCounter_UnmarshalJSON_IncompleteInput— 5 partial-JSON variants × 2 receiver statesTestRollingCounter_UnmarshalJSON_RoundTrip— key BC test: 3 counter states marshal→unmarshal cleanlyTestMockClock_AfterFunc— extended for negative durationVerification