Skip to content

[ARO-25464] Genericise the Monitor bucketing to work for MIMO as well#4718

Open
hawkowl wants to merge 31 commits intomasterfrom
hawkowl/mimo-bucketing-from-monitor
Open

[ARO-25464] Genericise the Monitor bucketing to work for MIMO as well#4718
hawkowl wants to merge 31 commits intomasterfrom
hawkowl/mimo-bucketing-from-monitor

Conversation

@hawkowl
Copy link
Copy Markdown
Collaborator

@hawkowl hawkowl commented Mar 27, 2026

Which issue this PR addresses:

https://redhat.atlassian.net/browse/ARO-25464

Fixes some of the problems we've noticed in recent incident where MIMO works against clusters it doesn't have to.

What this PR does / why we need it:

Genericises the Monitor logic for master and bucket allocation, moves it to another name, and then reuses that in MIMO.

Test plan for issue:

CI, E2E

Is there any documentation that needs to be updated for this PR?

Possibly wrt the Monitor dbs?

How do you know this will function as expected in production?

E2E, hopefully :)

Copilot AI review requested due to automatic review settings March 27, 2026 05:56
@hawkowl hawkowl added next-release To be included in the next RP release rollout go Pull requests that update Go code skippy pull requests raised by member of Team Skippy labels Mar 27, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR replaces the monitor-specific master/worker bucket allocation with a generic “pool worker” leasing + bucket-balancing mechanism, then reuses it for MIMO (scheduler/actuator) so instances only act on buckets they own.

Changes:

  • Introduces PoolWorkerDocument + database.PoolWorkers and a shared bucket balancer loop in pkg/util/buckets.
  • Migrates pkg/monitor bucket coordination from Monitors DB to the new PoolWorkers DB.
  • Updates MIMO scheduler/actuator to use bucket coordination instead of hostname-based static partitioning, and adds Cosmos container + trigger deployments for PoolWorkers.

Reviewed changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
pkg/util/buckets/balancer.go New generic bucket coordination loop + balancing logic for pool workers.
pkg/util/buckets/cache.go / pkg/util/buckets/buckets.go Bucket worker behavior tweaks (e.g., bucket -1 served by all) and bucket change handling.
pkg/database/poolworkers.go New PoolWorkers DB wrapper and Cosmos queries for leasing/workers/buckets.
pkg/api/poolworker*.go New API types for PoolWorker documents and worker types.
pkg/monitor/* Migrates monitor bucket ownership to PoolWorkers and wires into generic loop.
pkg/mimo/scheduler/* Switches scheduler to coordinated buckets; adds bucket selector data and filtering.
pkg/mimo/actuator/* Switches actuator to coordinated buckets; removes hostname partitioning.
cmd/aro/* Wires PoolWorkers DB into monitor / mimo services.
pkg/deploy/* Adds PoolWorkers Cosmos container + renewLease trigger to generator and baked assets.
test/database/* Adds fake PoolWorkers client wiring for tests.
pkg/util/buckets/balancer_test.go Moves/updates balancing tests to validate PoolWorker bucket balancing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/mimo/scheduler/service.go
Comment thread pkg/mimo/scheduler/service.go Outdated
Comment thread pkg/mimo/actuator/service.go
Comment thread pkg/mimo/actuator/service.go Outdated
Comment thread pkg/monitor/monitor.go Outdated
Comment thread pkg/monitor/monitor.go Outdated
Comment thread test/database/poolworkers.go Outdated
Comment thread pkg/database/poolworkers.go Outdated
Comment thread pkg/mimo/scheduler/service.go Outdated
Copilot AI review requested due to automatic review settings March 27, 2026 06:10
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 37 out of 37 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/mimo/scheduler/service.go
Copilot AI review requested due to automatic review settings March 30, 2026 01:30
@hawkowl hawkowl force-pushed the hawkowl/mimo-bucketing-from-monitor branch from 27c9366 to 9db8036 Compare March 30, 2026 01:30
@github-actions github-actions bot added the needs-rebase branch needs a rebase label Apr 15, 2026
@github-actions
Copy link
Copy Markdown

Please rebase pull request.

Copilot AI review requested due to automatic review settings April 15, 2026 02:59
@hawkowl hawkowl force-pushed the hawkowl/mimo-bucketing-from-monitor branch from a7cc8d7 to 9bed190 Compare April 15, 2026 02:59
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 42 out of 42 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (4)

test/database/poolworkers.go:1

  • q.Parameters[0].Value is an any/interface{}; comparing it directly with string (r.ID / string(r.WorkerType)) will not compile. Extract the parameter as a string (e.g., type assert once at the top of the handler) and compare against that typed value.
    test/database/poolworkers.go:1
  • q.Parameters[0].Value is an any/interface{}; comparing it directly with string (r.ID / string(r.WorkerType)) will not compile. Extract the parameter as a string (e.g., type assert once at the top of the handler) and compare against that typed value.
    pkg/util/buckets/balancer.go:1
  • Comparing err.Error() to "lost lease" is brittle and can break if wrapping or message text changes. Prefer a sentinel error (e.g., var ErrLostLease = errors.New("lost lease")) returned from the DB layer and check with errors.Is(err, ErrLostLease).
    pkg/monitor/worker.go:1
  • This computes jitter via Seconds() and truncates to whole seconds (and relies on the global math/rand source, which is deterministic unless seeded elsewhere). Prefer computing from the time.Duration directly (avoids truncation) and consider using a locally-seeded RNG (or math/rand/v2 if available in your Go version) to reduce correlated startup jitter across processes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/mimo/scheduler/selectors.go Outdated

const (
SelectorDataKeyResourceID SelectorDataKey = "resourceID"
SelectorDataBucketID SelectorDataKey = "bucketID"
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SelectorDataBucketID breaks the SelectorDataKey... naming pattern used by the other constants, which makes it harder to scan and grep consistently. Rename it to SelectorDataKeyBucketID to match the established convention.

Suggested change
SelectorDataBucketID SelectorDataKey = "bucketID"
SelectorDataKeyBucketID SelectorDataKey = "bucketID"

Copilot uses AI. Check for mistakes.
Comment thread pkg/mimo/scheduler/service_test.go Outdated
}...)
}

func TestSchedlerStopsIfBucketFailure(t *testing.T) {
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in test name: TestSchedlerStopsIfBucketFailure should be TestSchedulerStopsIfBucketFailure for clarity and consistency.

Suggested change
func TestSchedlerStopsIfBucketFailure(t *testing.T) {
func TestSchedulerStopsIfBucketFailure(t *testing.T) {

Copilot uses AI. Check for mistakes.
Comment on lines +992 to 996
ID: pointerutils.ToPtr("PoolWorkers"),
PartitionKey: &sdkcosmos.ContainerPartitionKey{
Paths: []*string{
pointerutils.ToPtr("/id"),
pointerutils.ToPtr("/workerType"),
},
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the Cosmos container name and partition key (Monitors /id → PoolWorkers /workerType). Partition keys can’t be changed in-place, so upgrades need a safe rollout plan: create the new container alongside the old one (for a transition window) and ensure the service deploy order prevents a version mismatch (new code expecting PoolWorkers before the container exists). If Monitors can be dropped without migration because it’s ephemeral, it’s still worth ensuring the templates keep backward compatibility long enough to avoid downtime during rolling deployments.

Copilot uses AI. Check for mistakes.
Comment on lines +150 to +165
func (c *poolWorkers) ListBuckets(ctx context.Context, poolWorkerType api.PoolWorkerType) (buckets []int, err error) {
doc, err := c.get(ctx, poolWorkerType, string(poolWorkerType))
if err != nil {
return nil, err
} else if doc == nil || doc.PoolWorker == nil {
return nil, ErrPoolWorkersBucketAllocationNotInitialized
}

for i, poolworker := range doc.PoolWorker.Buckets {
if poolworker == c.uuid {
buckets = append(buckets, i)
}
}

return buckets, nil
}
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If doc.PoolWorker exists but doc.PoolWorker.Buckets is empty/nil (e.g., before the master has ever balanced), this returns (nil, nil), which can mask “not initialized” as success and leads to confusing downstream handling. Consider returning ErrPoolWorkersBucketAllocationNotInitialized when len(doc.PoolWorker.Buckets) == 0 as well, so callers can distinguish “no allocation yet” from a valid empty result.

Copilot uses AI. Check for mistakes.
@github-actions github-actions bot removed the needs-rebase branch needs a rebase label Apr 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go Pull requests that update Go code skippy pull requests raised by member of Team Skippy

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants