[ARO-25464] Genericise the Monitor bucketing to work for MIMO as well#4718
Conversation
Pull request overview
This PR replaces the monitor-specific master/worker bucket allocation with a generic “pool worker” leasing + bucket-balancing mechanism, then reuses it for MIMO (scheduler/actuator) so instances only act on buckets they own.
Changes:
- Introduces `PoolWorkerDocument` + `database.PoolWorkers` and a shared bucket balancer loop in `pkg/util/buckets`.
- Migrates `pkg/monitor` bucket coordination from the `Monitors` DB to the new `PoolWorkers` DB.
- Updates MIMO scheduler/actuator to use bucket coordination instead of hostname-based static partitioning, and adds Cosmos container + trigger deployments for PoolWorkers.
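The core idea behind the generic bucket balancer can be sketched roughly as follows. This is an illustrative sketch only, not the PR's actual code: the `balance` function, round-robin assignment, and string worker IDs are all assumptions.

```go
package main

import "fmt"

// balance assigns bucketCount buckets round-robin across the given live
// worker IDs, so each worker owns a near-equal share and only acts on the
// buckets it owns. A real balancer would run this in a loop, re-balancing
// as workers' leases appear and expire.
func balance(bucketCount int, workers []string) map[int]string {
	owners := make(map[int]string, bucketCount)
	if len(workers) == 0 {
		return owners
	}
	for b := 0; b < bucketCount; b++ {
		owners[b] = workers[b%len(workers)]
	}
	return owners
}

func main() {
	owners := balance(4, []string{"worker-a", "worker-b"})
	fmt.Println(owners[0], owners[1], owners[2], owners[3])
	// → worker-a worker-b worker-a worker-b
}
```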
Reviewed changes
Copilot reviewed 32 out of 32 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| pkg/util/buckets/balancer.go | New generic bucket coordination loop + balancing logic for pool workers. |
| pkg/util/buckets/cache.go / pkg/util/buckets/buckets.go | Bucket worker behavior tweaks (e.g., bucket -1 served by all) and bucket change handling. |
| pkg/database/poolworkers.go | New PoolWorkers DB wrapper and Cosmos queries for leasing/workers/buckets. |
| pkg/api/poolworker*.go | New API types for PoolWorker documents and worker types. |
| pkg/monitor/* | Migrates monitor bucket ownership to PoolWorkers and wires it into the generic loop. |
| pkg/mimo/scheduler/* | Switches scheduler to coordinated buckets; adds bucket selector data and filtering. |
| pkg/mimo/actuator/* | Switches actuator to coordinated buckets; removes hostname partitioning. |
| cmd/aro/* | Wires PoolWorkers DB into monitor / mimo services. |
| pkg/deploy/* | Adds PoolWorkers Cosmos container + renewLease trigger to generator and baked assets. |
| test/database/* | Adds fake PoolWorkers client wiring for tests. |
| pkg/util/buckets/balancer_test.go | Moves/updates balancing tests to validate PoolWorker bucket balancing. |
Pull request overview
Copilot reviewed 37 out of 37 changed files in this pull request and generated 1 comment.
Please rebase pull request.
Pull request overview
Copilot reviewed 42 out of 42 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (4)
test/database/poolworkers.go:1 - `q.Parameters[0].Value` is an `any`/`interface{}`; comparing it directly with `string(r.ID)` / `string(r.WorkerType)` will not compile. Extract the parameter as a `string` (e.g., type assert once at the top of the handler) and compare against that typed value.
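The extract-once pattern the review suggests can be sketched like this. The `Param` type and field names are stand-ins for the Cosmos query parameter type, not the repository's real definitions.

```go
package main

import "fmt"

// Param is a hypothetical stand-in for a Cosmos SQL query parameter,
// whose Value field is typed any (interface{}).
type Param struct {
	Name  string
	Value any
}

func main() {
	p := Param{Name: "@id", Value: "worker-1"}

	// Type assert once at the top, then compare typed strings; comparing
	// an any directly against a string expression does not compile.
	id, ok := p.Value.(string)
	if !ok {
		panic("parameter is not a string")
	}
	fmt.Println(id == "worker-1")
	// → true
}
```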
pkg/util/buckets/balancer.go:1 - Comparing `err.Error()` to `"lost lease"` is brittle and can break if wrapping or message text changes. Prefer a sentinel error (e.g., `var ErrLostLease = errors.New("lost lease")`) returned from the DB layer and check with `errors.Is(err, ErrLostLease)`.
pkg/monitor/worker.go:1 - This computes jitter via `Seconds()` and truncates to whole seconds (and relies on the global `math/rand` source, which is deterministic unless seeded elsewhere). Prefer computing from the `time.Duration` directly (avoids truncation) and consider using a locally-seeded RNG (or `math/rand/v2` if available in your Go version) to reduce correlated startup jitter across processes.
```go
const (
	SelectorDataKeyResourceID SelectorDataKey = "resourceID"
	SelectorDataBucketID      SelectorDataKey = "bucketID"
```
SelectorDataBucketID breaks the SelectorDataKey... naming pattern used by the other constants, which makes it harder to scan and grep consistently. Rename it to SelectorDataKeyBucketID to match the established convention.
```diff
-	SelectorDataBucketID SelectorDataKey = "bucketID"
+	SelectorDataKeyBucketID SelectorDataKey = "bucketID"
```
```go
	}...)
}

func TestSchedlerStopsIfBucketFailure(t *testing.T) {
```
Typo in test name: TestSchedlerStopsIfBucketFailure should be TestSchedulerStopsIfBucketFailure for clarity and consistency.
```diff
-func TestSchedlerStopsIfBucketFailure(t *testing.T) {
+func TestSchedulerStopsIfBucketFailure(t *testing.T) {
```
```go
	ID: pointerutils.ToPtr("PoolWorkers"),
	PartitionKey: &sdkcosmos.ContainerPartitionKey{
		Paths: []*string{
			pointerutils.ToPtr("/id"),
			pointerutils.ToPtr("/workerType"),
		},
```
This changes the Cosmos container name and partition key (Monitors /id → PoolWorkers /workerType). Partition keys can’t be changed in-place, so upgrades need a safe rollout plan: create the new container alongside the old one (for a transition window) and ensure the service deploy order prevents a version mismatch (new code expecting PoolWorkers before the container exists). If Monitors can be dropped without migration because it’s ephemeral, it’s still worth ensuring the templates keep backward compatibility long enough to avoid downtime during rolling deployments.
```go
func (c *poolWorkers) ListBuckets(ctx context.Context, poolWorkerType api.PoolWorkerType) (buckets []int, err error) {
	doc, err := c.get(ctx, poolWorkerType, string(poolWorkerType))
	if err != nil {
		return nil, err
	} else if doc == nil || doc.PoolWorker == nil {
		return nil, ErrPoolWorkersBucketAllocationNotInitialized
	}

	for i, poolworker := range doc.PoolWorker.Buckets {
		if poolworker == c.uuid {
			buckets = append(buckets, i)
		}
	}

	return buckets, nil
}
```
If doc.PoolWorker exists but doc.PoolWorker.Buckets is empty/nil (e.g., before the master has ever balanced), this returns (nil, nil), which can mask “not initialized” as success and leads to confusing downstream handling. Consider returning ErrPoolWorkersBucketAllocationNotInitialized when len(doc.PoolWorker.Buckets) == 0 as well, so callers can distinguish “no allocation yet” from a valid empty result.
Which issue this PR addresses:
https://redhat.atlassian.net/browse/ARO-25464
Fixes some of the problems we noticed in a recent incident where MIMO acted on clusters it didn't need to.
What this PR does / why we need it:
Genericises the Monitor's master and bucket allocation logic, moves it under a new name, and then reuses it in MIMO.
Test plan for issue:
CI, E2E
Is there any documentation that needs to be updated for this PR?
Possibly, with regard to the Monitor DBs?
How do you know this will function as expected in production?
E2E, hopefully :)