Skip to content

Fix 8 bugs: TOCTOU races, panics, and missed cases from PR #150#151

Merged
cep21 merged 3 commits into
masterfrom
fix-toctou-panics-and-missed-cases
Feb 27, 2026
Merged

Fix 8 bugs: TOCTOU races, panics, and missed cases from PR #150#151
cep21 merged 3 commits into
masterfrom
fix-toctou-panics-and-missed-cases

Conversation

@cep21
Copy link
Copy Markdown
Owner

@cep21 cep21 commented Feb 26, 2026

Summary

Follow-up to #150 — 8 additional bugs found across the same areas. All fixes are backwards-compatible (panic→works or panic→error; no behavior changes for existing valid usage).

Fixes

# Bug File Fix
1 ConsecutiveErrOpenerFactory config race closers/simplelogic/closers.go Copy config before Merge (same pattern as #150 for hystrix factories)
2 openCircuit can emit Opened() 2-3× under concurrent calls circuit.go CompareAndSwap before emitting metric
3 close can emit Closed() multiple times circuit.go CompareAndSwap before emitting metric
4 Opener.SetConfigNotThreadSafe nil-deref on props.Now closers/hystrix/opener.go Use nil-safe props.now() helper (already existed at line 53)
5 RollingCounter.GetBuckets/.String() divide-by-zero on zero value faststats/rolling_counter.go Guard NumBuckets == 0
6 RollingCounter.UnmarshalJSON nil-deref on incomplete JSON faststats/rolling_counter.go Return error on missing required fields (panic→error)
7 Fallback MaxConcurrentRequests double-read circuit.go Read once into local (same pattern as #150 for main throttle)
8 MockClock.AfterFunc — negative duration doesn't fire immediately internal/clock/clock.go d == 0d <= 0 (matches real time.AfterFunc)

Backwards compatibility

  • CAS in open/closeOpened()/Closed() now fire after isOpen is set (previously before). Audited all in-repo Metrics implementations — none call IsOpen() from inside the callback. The old ordering was non-deterministic under the TOCTOU anyway.
  • UnmarshalJSON — incomplete JSON (e.g. {}) now returns an error instead of panicking. Round-trip (MarshalUnmarshal) is unaffected and fully covered by new tests.
  • GetBuckets — returns nil for zero-value counter instead of panicking. range nil is a no-op so callers are safe.

Test coverage

Every fix has a regression test:

  • TestOpenCircuit_EmitsOpenedExactlyOnce / TestCloseCircuit_EmitsClosedExactlyOnce — 100 concurrent goroutines
  • TestConsecutiveErrOpenerFactory_ConcurrentCreation — 100 concurrent factory calls under -race
  • TestOpener_SetConfigNotThreadSafe_NilNow — direct call without Now set
  • TestRollingCounter_Empty — extended for zero-value GetBuckets/String
  • TestRollingCounter_UnmarshalJSON_IncompleteInput — 5 partial-JSON variants × 2 receiver states
  • TestRollingCounter_UnmarshalJSON_RoundTripkey BC test: 3 counter states marshal→unmarshal cleanly
  • TestMockClock_AfterFunc — extended for negative duration

Verification

go test -race -count=10 ./...   # all pass
go vet ./...                    # clean
golangci-lint run ./...         # 0 issues

cep21 and others added 3 commits February 26, 2026 20:21
- fix: use CAS in openCircuit/close to emit Opened()/Closed() exactly once
- fix: copy config in ConsecutiveErrOpenerFactory closure (race hygiene, matches PR #150)
- fix: read Fallback.MaxConcurrentRequests once in throttle check (matches PR #150)
- fix: use nil-safe props.now() in Opener.SetConfigNotThreadSafe
- fix: guard NumBuckets==0 in RollingCounter.GetBuckets (fixes String() panic)
- fix: nil-check pointer fields in RollingCounter.UnmarshalJSON
- fix: MockClock.AfterFunc fires immediately for d<=0 (matches real time.AfterFunc)

All fixes are backwards-compatible (panic->works, race->clean, or strictly
additive guards). Regression tests added for each.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…g inconsistent state

The previous fix (nil-check each pointer field and skip) made zero-value
receivers safe but created a new deferred-panic vector: on a pre-initialized
counter, UnmarshalJSON({}) would set buckets=nil while leaving NumBuckets
unchanged, causing GetBuckets to panic later with index out of range.

UnmarshalJSON is a round-trip API — MarshalJSON always emits all fields.
Incomplete JSON is corrupt/truncated input. Return an error (receiver
unmodified) rather than silently partial-apply.

Backwards-compatible vs released master: panic -> error (strictly better).

Tests added:
- IncompleteInput: 5 partial-JSON variants x 2 receiver states (all must error, receiver preserved)
- RoundTrip: 3 counter states marshal+unmarshal cleanly (the BC guarantee)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Systematic audit for timing-sensitive tests that can flake on loaded CI runners.

- TestCircuitAttemptsToReopenOnlyOnce: convert to MockClock for deterministic
  timing. The previous real-clock version with SleepWindow: 1ms flaked when CI
  scheduling delays exceeded 1ms (same pattern PR #141 fixed in the sister test
  TestCircuitAttemptsToReopen). This is the flake that caused the CI failure.

- TestThrottled, TestFallbackCircuitConcurrency: replace time.Sleep-based
  concurrency with a barrier pattern. Wait until ALL goroutines have either
  entered the throttled section or been rejected before releasing — guarantees
  the over-limit goroutine attempts while slots are occupied, regardless of
  scheduling. Also fixed: errCount++ was not atomic.

- TestRollingCounter_Inc: use StringAt(now)/RollingSumAt(now) instead of
  String()/RollingSum(). The latter use real time.Now() which can advance past
  the 10ms rolling window on a slow runner, rolling out buckets before later
  Inc(now) calls.

- TestTickUntil: register AfterFunc callbacks BEFORE starting the TickUntil
  goroutine. If TickUntil advances mock time first, callbacks get scheduled
  relative to the advanced time, breaking the final m.Now() == now+3h assertion.

- TestManagerConcurrentFactoryConfiguration: drop require.NoError on Execute.
  With 300 concurrent executions under -race and only a 2x sleep/timeout margin,
  CI jitter can cause spurious timeouts. The test is about concurrent factory
  creation, not execution timing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cep21 cep21 marked this pull request as ready for review February 27, 2026 19:37
@cep21 cep21 merged commit 1224f2f into master Feb 27, 2026
6 checks passed
@cep21 cep21 deleted the fix-toctou-panics-and-missed-cases branch February 27, 2026 19:37
cep21 added a commit that referenced this pull request Feb 28, 2026
…efile

## Test infrastructure

- circuit_property_test.go: invariant-asserting stress tests + deterministic
  state-machine property test (testing/quick + MockClock + reference model).
  Verified catches the TOCTOU bugs from PR #151 by temporarily reverting CAS.
    - TestCircuit_RunMetricsExactlyOnce_Concurrent: verifies the exactly-once
      RunMetrics contract under 100K concurrent Execute calls
    - TestCircuit_TransitionAlternation_Concurrent: verifies Opened/Closed
      alternate correctly (closed ≤ opened ≤ closed+1) under contention
    - TestCircuit_FallbackMetricsExactlyOnce_Concurrent: same for fallback path
    - TestCircuit_StateMachineProperty: 500 random op sequences replayed
      against real circuit + trivial reference model, 6 invariants checked
      per step, fully deterministic via MockClock

- faststats/fuzz_test.go: 5 Go-native fuzz targets + naiveRollingCounter
  differential oracle. Seeds run as unit tests in ~5ms.
    - FuzzRollingCounterOps: Inc/RollingSumAt/GetBuckets vs naive oracle
    - FuzzRollingCounterJSON: round-trip + hostile-input safety
    - FuzzSortedDurationsPercentile: bounds + monotonicity
    - FuzzRollingBucketAdvance: index bounds + LastAbsIndex monotonicity
    - FuzzTimedCheckJSON: round-trip + hostile-input safety

- Tightened existing stress tests (were assertion-free, just t.Logf):
    - circuit_stress_test.go: TestCircuitStateTransitionRace now verifies
      Opened/Closed alternation + ConcurrentCommands balance
    - faststats/rolling_stress_test.go: deleted 2 permanent-skip tests,
      added real TestTimedCheckConcurrency, added RollingSum≤TotalSum check

## Bug fixes (found by fuzzers in ~10s)

- faststats/rolling_percentile.go: Percentile(NaN) panicked — NaN falls
  through both p<=0 and p>=100 guards, int(math.Floor(NaN)) is INT64_MIN
  on amd64, index-out-of-range. Now returns -1.

- faststats/rolling_counter.go: UnmarshalJSON accepted hostile JSON with
  NumBuckets > len(Buckets), then GetBuckets panicked. Now rejected at
  unmarshal time with a descriptive error.

## Build tooling

- Makefile: restored after Jan 2024 deletion. make (fast dev loop),
  make ci (pre-PR gate mirroring GitHub Actions), make fuzz, make bench,
  make cover, make fix, make help.

- .github/workflows/build.yml: Build/Test steps now call make build /
  make test-race (same commands, now DRY).

- .github/workflows/fuzz.yml: daily scheduled fuzzing, 5min per target,
  uploads crash inputs as artifacts on failure. Manual dispatch enabled.

- README.md: Development section points to make targets. Fixed dead
  Makefile link.

- .gitignore: added /coverage.html (from make cover).
cep21 added a commit that referenced this pull request Feb 28, 2026
…efile (#152) (#152)

* Add property/fuzz/invariant tests; fix 2 fuzz-found bugs; restore Makefile

## Test infrastructure

- circuit_property_test.go: invariant-asserting stress tests + deterministic
  state-machine property test (testing/quick + MockClock + reference model).
  Verified catches the TOCTOU bugs from PR #151 by temporarily reverting CAS.
    - TestCircuit_RunMetricsExactlyOnce_Concurrent: verifies the exactly-once
      RunMetrics contract under 100K concurrent Execute calls
    - TestCircuit_TransitionAlternation_Concurrent: verifies Opened/Closed
      alternate correctly (closed ≤ opened ≤ closed+1) under contention
    - TestCircuit_FallbackMetricsExactlyOnce_Concurrent: same for fallback path
    - TestCircuit_StateMachineProperty: 500 random op sequences replayed
      against real circuit + trivial reference model, 6 invariants checked
      per step, fully deterministic via MockClock

- faststats/fuzz_test.go: 5 Go-native fuzz targets + naiveRollingCounter
  differential oracle. Seeds run as unit tests in ~5ms.
    - FuzzRollingCounterOps: Inc/RollingSumAt/GetBuckets vs naive oracle
    - FuzzRollingCounterJSON: round-trip + hostile-input safety
    - FuzzSortedDurationsPercentile: bounds + monotonicity
    - FuzzRollingBucketAdvance: index bounds + LastAbsIndex monotonicity
    - FuzzTimedCheckJSON: round-trip + hostile-input safety

- Tightened existing stress tests (were assertion-free, just t.Logf):
    - circuit_stress_test.go: TestCircuitStateTransitionRace now verifies
      Opened/Closed alternation + ConcurrentCommands balance
    - faststats/rolling_stress_test.go: deleted 2 permanent-skip tests,
      added real TestTimedCheckConcurrency, added RollingSum≤TotalSum check

## Bug fixes (found by fuzzers in ~10s)

- faststats/rolling_percentile.go: Percentile(NaN) panicked — NaN falls
  through both p<=0 and p>=100 guards, int(math.Floor(NaN)) is INT64_MIN
  on amd64, index-out-of-range. Now returns -1.

- faststats/rolling_counter.go: UnmarshalJSON accepted hostile JSON with
  NumBuckets > len(Buckets), then GetBuckets panicked. Now rejected at
  unmarshal time with a descriptive error.

## Build tooling

- Makefile: restored after Jan 2024 deletion. make (fast dev loop),
  make ci (pre-PR gate mirroring GitHub Actions), make fuzz, make bench,
  make cover, make fix, make help.

- .github/workflows/build.yml: Build/Test steps now call make build /
  make test-race (same commands, now DRY).

- .github/workflows/fuzz.yml: daily scheduled fuzzing, 5min per target,
  uploads crash inputs as artifacts on failure. Manual dispatch enabled.

- README.md: Development section points to make targets. Fixed dead
  Makefile link.

- .gitignore: added /coverage.html (from make cover).

* Fix pre-existing flake: RollingSumAt < 0 check must be post-quiescence

TestRollingCounterBucketRolloverRace had a rollingSum < 0 check inside
the hot concurrent loop since PR #139/#150. This check was always wrong
for the lock-free RollingCounter design (documented at rolling_bucket.go):

  Thread A (Inc):      buckets[idx].Add(1)
  Thread B (rollover): Swap(0) returns the incremented value,
                       rollingSum.Add(-toDec) — briefly over-decrements
  Thread A (Inc cont): rollingSum.Add(1) — corrects

A reader between B and A's rollingSum write observes a transient negative.
The value is correct once writers quiesce.

Moved the < 0 check to after wg.Wait(). The rollingSum <= TotalSum check
stays in the hot loop (safe: reads are correctly ordered).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant