feat(intergral/alerting): HA partitioning #57

Merged
jhawksley-intergral merged 11 commits into 12.3.x-intergral from feat/alerting-partitioner
Mar 29, 2026

@jhawksley-intergral jhawksley-intergral commented Feb 27, 2026

Summary

Adds HA (high-availability) partitioning for the unified alerting scheduler so that alert rule evaluation is distributed across cluster members by deterministic hash-based partitioning (hash modulo cluster size).

  • Rule partitioning: Each instance evaluates only its share of rules, determined by FNV-64a(orgID:ruleUID) % clusterSize. Partitioning is gated on a minimum cluster size and automatically falls back to full evaluation if the cluster is unhealthy.
  • Remote state sync: A background loop periodically loads alert state from the database for rules owned by other instances, so the API can return correct LastEvaluation / rule status regardless of which instance serves the request.
  • Topology-aware re-fetch: The fetcher detects cluster membership changes and forces an immediate rule re-fetch, redistributing rules without waiting for the next tick.
  • Configuration: New [unified_alerting.ha_scheduler] section in defaults.ini with enabled, min_cluster_size, and remote_state_sync_interval options.
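
The ownership rule in the first bullet can be illustrated with a minimal sketch. This is not the code from schedule/partitioner.go: the function names, the decimal encoding of orgID in the hash key, and the way health and position are passed in are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ownerIndex maps a rule to a partition in [0, clusterSize) by hashing
// "orgID:ruleUID" with FNV-64a. The key format is an assumption based on
// the PR summary. clusterSize must be greater than zero.
func ownerIndex(orgID int64, ruleUID string, clusterSize int) int {
	h := fnv.New64a()
	// Writes to an FNV hash never return an error.
	fmt.Fprintf(h, "%d:%s", orgID, ruleUID)
	return int(h.Sum64() % uint64(clusterSize))
}

// shouldEvaluate reports whether the instance at position selfPos evaluates
// the rule. It falls back to evaluating every rule when the cluster is
// unhealthy or smaller than minClusterSize (hypothetical parameters).
func shouldEvaluate(orgID int64, ruleUID string, selfPos, clusterSize, minClusterSize int, healthy bool) bool {
	if !healthy || clusterSize <= 0 || clusterSize < minClusterSize {
		return true
	}
	return ownerIndex(orgID, ruleUID, clusterSize) == selfPos
}

func main() {
	for _, uid := range []string{"rule-a", "rule-b", "rule-c"} {
		fmt.Printf("%s -> instance %d of 3\n", uid, ownerIndex(1, uid, 3))
	}
}
```

Because the assignment depends on clusterSize, a membership change can move many rules to a different owner at once, which is why the fetcher forces an immediate re-fetch when topology changes.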

Files changed

| Area | Files | What |
|------|-------|------|
| Core | schedule/partitioner.go | FNV-based rule partitioner with memberlist/Redis peer support |
| Scheduler | schedule/schedule.go, schedule/fetcher.go | Wire partitioner into fetch loop, topology-change detection |
| State | state/manager.go | Remote state sync loop, RuleFilter interface, stale cleanup |
| API | api/prometheus/api_prometheus.go | Fall back to cached state for remote rule status |
| Config | conf/defaults.ini, setting/setting_unified_alerting.go | HA scheduler settings |
| Wiring | ngalert.go, notifier/multiorg_alertmanager.go | Plumb partitioner through DI |
| Tests | partitioner_test.go, fetcher_test.go, manager_remote_state_test.go | Unit + integration tests for partitioning, topology changes, remote sync |
| Other | expr/mathexp/parse/lex_test.go | Fix unrelated flaky race condition in lexer test |
| CI | docker_build.yml | Sanitize branch names for Docker tags |

Test plan

  • Unit tests for partitioner: deterministic assignment, position-based filtering, unhealthy cluster fallback
  • Topology-change tests: member join/leave, healthy↔unhealthy transitions, rapid membership changes
  • Fetcher integration tests: topology change triggers re-fetch, stable topology skips
  • Remote state sync tests: periodic refresh, stale state cleanup, context cancellation
  • Manual verification: rule status API returns correct data from non-owning instance
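
The topology-change cases in the test plan depend on noticing when cluster membership differs from the last observation. One possible way to do that is sketched below; it is not the fetcher's implementation. The topologyTracker type, the fingerprint helper, and the use of peer names as identity are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// topologyTracker remembers a fingerprint of the last observed membership.
// Hypothetical type, for illustration only.
type topologyTracker struct {
	last string
	seen bool
}

// fingerprint builds an order-independent key from peer names.
// A NUL separator avoids collisions between names that contain commas.
func fingerprint(peers []string) string {
	sorted := append([]string(nil), peers...)
	sort.Strings(sorted)
	return strings.Join(sorted, "\x00")
}

// changed records the current membership and reports whether it differs
// from the previous observation. The first call only sets the baseline
// and returns false.
func (t *topologyTracker) changed(peers []string) bool {
	fp := fingerprint(peers)
	if !t.seen {
		t.last, t.seen = fp, true
		return false
	}
	if fp == t.last {
		return false
	}
	t.last = fp
	return true
}

func main() {
	tracker := &topologyTracker{}
	fmt.Println(tracker.changed([]string{"a", "b"}))      // baseline
	fmt.Println(tracker.changed([]string{"b", "a"}))      // same members, different order
	fmt.Println(tracker.changed([]string{"a", "b", "c"})) // member joined
}
```

A caller would force a rule re-fetch whenever changed returns true and skip it otherwise, which matches the "triggers re-fetch" and "stable topology skips" cases.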

🤖 Generated with Claude Code

jhawksley-intergral and others added 9 commits March 20, 2026 13:28
- Add dynamicMockPeer with mutable membership for topology change simulation
- Add unit tests: member joins, member leaves, healthy↔unhealthy transitions, rapid changes
- Add fetcher integration tests: topology change triggers re-fetch, stable topology skips
- Add withPartitioner option to setupScheduler for injecting partitioner in tests
- Fix docker_build.yml: sanitize ref_name for valid Docker tags (replace / with -)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove extra field alignment spaces in SchedulerCfg (goimports)
- Remove ineffectual assignment to peers in topology-change test (ineffassign)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ed rules

- Add RuleFilter interface to state package, satisfied by schedule.RulePartitioner
- Add refreshRemoteStates() to periodically load DB state for non-local rules into cache
- Start background sync loop in Manager.Run() when partitioning is enabled
- Add ha_scheduler_remote_state_sync_interval config option (default 30s)
- Wire partitioner as RuleFilter in ngalert.go
- Add tests for remote refresh, stale cleanup, topology change, and context cancellation
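
A background sync loop of the kind described in this commit is commonly built from a ticker and a cancellable context. The sketch below shows that general pattern only. It is not the code in state/manager.go, and runSyncLoop, its refresh callback, and the countRefreshes demo helper are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runSyncLoop calls refresh once per interval until ctx is cancelled.
// A failed refresh is reported and the loop keeps running, so one bad
// database read does not stop later refreshes.
func runSyncLoop(ctx context.Context, interval time.Duration, refresh func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := refresh(ctx); err != nil {
				fmt.Println("remote state refresh failed:", err)
			}
		}
	}
}

// countRefreshes is a demo helper: it runs the loop for totalMs
// milliseconds and returns how many times refresh was called.
func countRefreshes(totalMs, intervalMs int) int {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(totalMs)*time.Millisecond)
	defer cancel()
	calls := 0
	runSyncLoop(ctx, time.Duration(intervalMs)*time.Millisecond, func(context.Context) error {
		calls++
		return nil
	})
	return calls
}

func main() {
	fmt.Println("refreshes before cancellation:", countRefreshes(100, 10))
}
```

In a real manager the loop would run in its own goroutine with the configured sync interval, and the callback would load state for rules the local instance does not own.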

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add package doc comment to state manager

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Skip orgs with no alert rules in refreshRemoteStates loop
- Downgrade partitioner "HA partitioning applied" log from Info to Debug

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…te rules

When HA partitioning is enabled, the scheduler only knows about locally
assigned rules. API requests hitting a non-owning instance returned blank
LastEvaluation because the scheduler had no status for remote rules.

Now falls back to StatesToRuleStatus() using cached state from the remote
state sync, so the UI shows correct "Next Evaluation" regardless of which
instance serves the request.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drain in-flight items after Close() before asserting channel closure.
The emit select can non-deterministically deliver a pending item even
after done is closed, causing spurious test failures under -race.
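
The race described here can be reproduced with a small stand-alone producer. The sketch below is not the lexer from expr/mathexp/parse; the producer type and drainUntilClosed helper are hypothetical and only illustrate why draining before the closure assertion makes the check deterministic.

```go
package main

import "fmt"

// producer emits integers until it runs out or Close is called.
type producer struct {
	items chan int
	done  chan struct{}
}

func newProducer(n int) *producer {
	p := &producer{items: make(chan int), done: make(chan struct{})}
	go func() {
		defer close(p.items)
		for i := 0; i < n; i++ {
			// When a receiver is waiting and done is already closed,
			// both cases are ready and select picks one at random.
			select {
			case p.items <- i:
			case <-p.done:
				return
			}
		}
	}()
	return p
}

// Close signals the producer goroutine to stop.
func (p *producer) Close() { close(p.done) }

// drainUntilClosed discards in-flight items and returns how many it
// discarded once the channel has been closed.
func drainUntilClosed(items <-chan int) int {
	n := 0
	for range items {
		n++
	}
	return n
}

func main() {
	p := newProducer(1000)
	<-p.items // consume one item so the producer is mid-stream
	p.Close()
	extra := drainUntilClosed(p.items)
	_, ok := <-p.items
	fmt.Printf("drained %d in-flight item(s); channel open: %v\n", extra, ok)
}
```

Without the drain, a single receive right after Close() may still return an item, so an assertion that the channel is closed would fail intermittently.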

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jhawksley-intergral jhawksley-intergral marked this pull request as ready for review March 29, 2026 17:01
@jhawksley-intergral jhawksley-intergral merged commit 836d9d0 into 12.3.x-intergral Mar 29, 2026
18 of 20 checks passed
@jhawksley-intergral jhawksley-intergral deleted the feat/alerting-partitioner branch March 29, 2026 17:02