[ARO-25464] Genericise the Monitor bucketing to work for MIMO as well#4718
Conversation
Pull request overview
This PR replaces the monitor-specific master/worker bucket allocation with a generic “pool worker” leasing + bucket-balancing mechanism, then reuses it for MIMO (scheduler/actuator) so instances only act on buckets they own.
Changes:
- Introduces `PoolWorkerDocument` + `database.PoolWorkers` and a shared bucket balancer loop in `pkg/util/buckets`.
- Migrates `pkg/monitor` bucket coordination from the `Monitors` DB to the new `PoolWorkers` DB.
- Updates MIMO scheduler/actuator to use bucket coordination instead of hostname-based static partitioning, and adds Cosmos container + trigger deployments for PoolWorkers.
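The core idea behind the generic bucket balancer can be sketched roughly as follows. This is an illustrative sketch only, not the PR's actual code: the `balance` function, round-robin assignment, and string worker IDs are all assumptions.

```go
package main

import "fmt"

// balance assigns bucketCount buckets round-robin across the given live
// worker IDs, so each worker owns a near-equal share and only acts on the
// buckets it owns. A real balancer would run this in a loop, re-balancing
// as workers' leases appear and expire.
func balance(bucketCount int, workers []string) map[int]string {
	owners := make(map[int]string, bucketCount)
	if len(workers) == 0 {
		return owners
	}
	for b := 0; b < bucketCount; b++ {
		owners[b] = workers[b%len(workers)]
	}
	return owners
}

func main() {
	owners := balance(4, []string{"worker-a", "worker-b"})
	fmt.Println(owners[0], owners[1], owners[2], owners[3])
	// → worker-a worker-b worker-a worker-b
}
```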
Reviewed changes
Copilot reviewed 32 out of 32 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| pkg/util/buckets/balancer.go | New generic bucket coordination loop + balancing logic for pool workers. |
| pkg/util/buckets/cache.go / pkg/util/buckets/buckets.go | Bucket worker behavior tweaks (e.g., bucket -1 served by all) and bucket change handling. |
| pkg/database/poolworkers.go | New PoolWorkers DB wrapper and Cosmos queries for leasing/workers/buckets. |
| pkg/api/poolworker*.go | New API types for PoolWorker documents and worker types. |
| pkg/monitor/* | Migrates monitor bucket ownership to PoolWorkers and wires it into the generic loop. |
| pkg/mimo/scheduler/* | Switches scheduler to coordinated buckets; adds bucket selector data and filtering. |
| pkg/mimo/actuator/* | Switches actuator to coordinated buckets; removes hostname partitioning. |
| cmd/aro/* | Wires PoolWorkers DB into monitor / mimo services. |
| pkg/deploy/* | Adds PoolWorkers Cosmos container + renewLease trigger to generator and baked assets. |
| test/database/* | Adds fake PoolWorkers client wiring for tests. |
| pkg/util/buckets/balancer_test.go | Moves/updates balancing tests to validate PoolWorker bucket balancing. |
Pull request overview
Copilot reviewed 37 out of 37 changed files in this pull request and generated 1 comment.
Please rebase pull request.
Pull request overview
Copilot reviewed 42 out of 42 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (4)
test/database/poolworkers.go:1 - `q.Parameters[0].Value` is an `any`/`interface{}`; comparing it directly with `string(r.ID)` / `string(r.WorkerType)` will not compile. Extract the parameter as a `string` (e.g., type assert once at the top of the handler) and compare against that typed value.
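The extract-once pattern the review suggests can be sketched like this. The `Param` type and field names are stand-ins for the Cosmos query parameter type, not the repository's real definitions.

```go
package main

import "fmt"

// Param is a hypothetical stand-in for a Cosmos SQL query parameter,
// whose Value field is typed any (interface{}).
type Param struct {
	Name  string
	Value any
}

func main() {
	p := Param{Name: "@id", Value: "worker-1"}

	// Type assert once at the top, then compare typed strings; comparing
	// an any directly against a string expression does not compile.
	id, ok := p.Value.(string)
	if !ok {
		panic("parameter is not a string")
	}
	fmt.Println(id == "worker-1")
	// → true
}
```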
pkg/util/buckets/balancer.go:1 - Comparing `err.Error()` to `"lost lease"` is brittle and can break if wrapping or message text changes. Prefer a sentinel error (e.g., `var ErrLostLease = errors.New("lost lease")`) returned from the DB layer and check with `errors.Is(err, ErrLostLease)`.
pkg/monitor/worker.go:1 - This computes jitter via `Seconds()` and truncates to whole seconds (and relies on the global `math/rand` source, which is deterministic unless seeded elsewhere). Prefer computing from the `time.Duration` directly (avoids truncation) and consider using a locally-seeded RNG (or `math/rand/v2` if available in your Go version) to reduce correlated startup jitter across processes.
```go
const (
	SelectorDataKeyResourceID SelectorDataKey = "resourceID"
	SelectorDataBucketID      SelectorDataKey = "bucketID"
```
SelectorDataBucketID breaks the SelectorDataKey... naming pattern used by the other constants, which makes it harder to scan and grep consistently. Rename it to SelectorDataKeyBucketID to match the established convention.
```diff
-	SelectorDataBucketID SelectorDataKey = "bucketID"
+	SelectorDataKeyBucketID SelectorDataKey = "bucketID"
```
```go
	}...)
}

func TestSchedlerStopsIfBucketFailure(t *testing.T) {
```
Typo in test name: TestSchedlerStopsIfBucketFailure should be TestSchedulerStopsIfBucketFailure for clarity and consistency.
```diff
-func TestSchedlerStopsIfBucketFailure(t *testing.T) {
+func TestSchedulerStopsIfBucketFailure(t *testing.T) {
```
```go
	ID: pointerutils.ToPtr("PoolWorkers"),
	PartitionKey: &sdkcosmos.ContainerPartitionKey{
		Paths: []*string{
			pointerutils.ToPtr("/id"),
			pointerutils.ToPtr("/workerType"),
		},
```
This changes the Cosmos container name and partition key (Monitors /id → PoolWorkers /workerType). Partition keys can’t be changed in-place, so upgrades need a safe rollout plan: create the new container alongside the old one (for a transition window) and ensure the service deploy order prevents a version mismatch (new code expecting PoolWorkers before the container exists). If Monitors can be dropped without migration because it’s ephemeral, it’s still worth ensuring the templates keep backward compatibility long enough to avoid downtime during rolling deployments.
```go
func (c *poolWorkers) ListBuckets(ctx context.Context, poolWorkerType api.PoolWorkerType) (buckets []int, err error) {
	doc, err := c.get(ctx, poolWorkerType, string(poolWorkerType))
	if err != nil {
		return nil, err
	} else if doc == nil || doc.PoolWorker == nil {
		return nil, ErrPoolWorkersBucketAllocationNotInitialized
	}

	for i, poolworker := range doc.PoolWorker.Buckets {
		if poolworker == c.uuid {
			buckets = append(buckets, i)
		}
	}

	return buckets, nil
}
```
If doc.PoolWorker exists but doc.PoolWorker.Buckets is empty/nil (e.g., before the master has ever balanced), this returns (nil, nil), which can mask “not initialized” as success and leads to confusing downstream handling. Consider returning ErrPoolWorkersBucketAllocationNotInitialized when len(doc.PoolWorker.Buckets) == 0 as well, so callers can distinguish “no allocation yet” from a valid empty result.
Which issue this PR addresses:
https://redhat.atlassian.net/browse/ARO-25464
Fixes some of the problems we noticed in a recent incident where MIMO acted on clusters it didn't need to.
What this PR does / why we need it:
Genericises the Monitor's master and bucket allocation logic, moves it under a new name, and then reuses it in MIMO.
Test plan for issue:
CI, E2E
Is there any documentation that needs to be updated for this PR?
Possibly, with regard to the Monitor DBs?
How do you know this will function as expected in production?
E2E, hopefully :)