
Conversation

@guzalv
Contributor

@guzalv guzalv commented Jan 27, 2026

Description

While looking into raising the VM index reports rate limiter's default bucket size from 5 to a higher number (to improve burst handling), a test run with bucket size 500 and high load showed central crashing shortly after the load started. This was unexpected: a bucket capacity of 500 should mean at most 500 reports held in memory (~1.5-2 GiB), yet memory usage quickly reached the container's limit of 16 GiB.

I iterated with an AI agent on the problem and identified a likely root cause: reports are put into the queues before the rate limiter is checked, so drops due to an empty bucket only start after the pipeline has run $bucket_capacity times and consumed those tokens.

Since this pipeline is slow (the overall report processing rate is under 1 report per second in a default deployment), it takes ~500 seconds for all the tokens to get consumed. During this time all new reports accumulate in memory, causing the crash.

Whether or not the crash happens depends on:

  • bucket size: buckets with many tokens take longer to empty, allowing more reports to accumulate.
  • incoming report rate: more frequent reports result in faster memory growth, leading to crashes with smaller bucket sizes.
  • report size and vulnerability count: reports with more packages and especially with more vulnerabilities take significantly longer to scan (see VM scale test report for details), causing smaller bucket sizes to drain slower.
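The interaction of these factors can be put into a rough back-of-envelope model. All rates and sizes below are illustrative assumptions, not measurements from the test runs:

```go
package main

import "fmt"

// Rough model of the failure mode: tokens drain at the pipeline's processing
// rate, and every report that arrives while tokens remain gets enqueued.
// All numbers are illustrative assumptions, not measured values.
func main() {
	const (
		bucketSize     = 500.0 // token bucket capacity
		processingRate = 1.0   // reports/sec consumed by the pipeline
		incomingRate   = 60.0  // reports/sec arriving under load
		reportMiB      = 3.5   // approx in-memory size of one report
	)
	// Time until the pipeline has consumed all tokens and drops begin.
	drainSeconds := bucketSize / processingRate
	// Reports accumulated in memory during that window.
	backlog := (incomingRate - processingRate) * drainSeconds
	fmt.Printf("drops start after ~%.0fs; backlog ~%.0f reports (~%.0f GiB)\n",
		drainSeconds, backlog, backlog*reportMiB/1024)
	// With these numbers the backlog far exceeds a 16 GiB container limit,
	// so central would crash long before the bucket drains.
}
```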

This PR moves the limiter so that it acts before the reports are enqueued, which should prevent the issue.

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

We deployed the modified central to a cluster, set the bucket capacity to 500, and generated the same load that previously made central crash (see the load profile below); this time central showed a stable memory usage pattern, as expected:

[Screenshot SCR-20260127-qjyd: stable central memory usage under load]
deploymentWorkload:
- deploymentType: Deployment
  lifecycleDuration: 30m0s
  numDeployments: 500
  numLifecycles: 0
  numLabels: 10
  podWorkload:
    containerWorkload:
      numImages: 0
    lifecycleDuration: 10m0s
    numContainers: 3
    numPods: 5
    processWorkload:
      alertRate: 0.001
      processInterval: 1m0s
  updateInterval: 10m0s
- deploymentType: Deployment
  lifecycleDuration: 30m0s
  numDeployments: 500
  numLifecycles: 0
  podWorkload:
    containerWorkload:
      numImages: 0
    lifecycleDuration: 10m0s
    numContainers: 3
    numPods: 5
    processWorkload:
      alertRate: 0.001
      processInterval: 1m0s
  updateInterval: 10m0s
networkPolicyWorkload:
- lifecycleDuration: 30m0s
  numNetworkPolicies: 1000
  numLifecycles: 0
  numLabels: 5
  updateInterval: 10m0s
networkWorkload:
  batchSize: 100
  flowInterval: 1s
  generateUnclosedEndpoints: true
  openPortReuseProbability: 0.05
nodeWorkload:
  numNodes: 1000
rbacWorkload:
  numBindings: 1000
  numRoles: 1000
  numServiceAccounts: 1000
virtualMachineWorkload:
  poolSize: 20000
  updateInterval: 5m
  lifecycleDuration: 2h
  numLifecycles: 0
  reportInterval: 5m # ±5% max deviation applied
  numPackages: 700
  initialReportDelay: 30s # ±20% max deviation applied

Central metrics show that under very high load the overwhelming majority of requests are dropped:

# HELP rox_central_rate_limiter_active_clients Current number of active clients being rate limited
# TYPE rox_central_rate_limiter_active_clients gauge
rox_central_rate_limiter_active_clients{workload="vm_index_reports"} 1
# HELP rox_central_rate_limiter_per_client_bucket_max_tokens Current per-client token bucket capacity (max>
# TYPE rox_central_rate_limiter_per_client_bucket_max_tokens gauge
rox_central_rate_limiter_per_client_bucket_max_tokens{workload="vm_index_reports"} 10
# HELP rox_central_rate_limiter_per_client_bucket_refill_rate_per_second Current per-client rate limit in >
# TYPE rox_central_rate_limiter_per_client_bucket_refill_rate_per_second gauge
rox_central_rate_limiter_per_client_bucket_refill_rate_per_second{workload="vm_index_reports"} 0.3
# HELP rox_central_rate_limiter_per_client_bucket_tokens Current available tokens per client bucket
# TYPE rox_central_rate_limiter_per_client_bucket_tokens gauge
rox_central_rate_limiter_per_client_bucket_tokens{client="2c8bbbd6-e1de-4a90-a4e6-0aa1e7abf3fa",workload=">
# HELP rox_central_rate_limiter_requests_total Total requests received by the rate limiter
# TYPE rox_central_rate_limiter_requests_total counter
rox_central_rate_limiter_requests_total{outcome="accepted",workload="vm_index_reports"} 65
rox_central_rate_limiter_requests_total{outcome="rejected",workload="vm_index_reports"} 34475

We verified that the rate limiter distributes tokens fairly across several secured clusters and rebalances when sensors disconnect, by adding several clusters to a single central and testing different scenarios of sensors being killed or scaled down.
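For illustration, the fair-share rebalancing we tested can be sketched like this (hypothetical structure and names, not the actual manager code): the global refill rate is split evenly across connected clusters and recomputed whenever a sensor connects or disconnects.

```go
package main

import "fmt"

// perClient sketches fair distribution: the global rate is split evenly
// among active clients, and recomputed on connect/disconnect.
// Names and structure are illustrative, not the stackrox implementation.
type perClient struct {
	globalRate float64
	clients    map[string]float64 // client ID -> per-client refill rate
}

func (p *perClient) rebalance() {
	n := len(p.clients)
	if n == 0 {
		return
	}
	share := p.globalRate / float64(n)
	for id := range p.clients {
		p.clients[id] = share
	}
}

func (p *perClient) onConnect(id string) {
	p.clients[id] = 0
	p.rebalance()
}

func (p *perClient) onDisconnect(id string) {
	delete(p.clients, id)
	p.rebalance()
}

func main() {
	p := &perClient{globalRate: 0.6, clients: map[string]float64{}}
	p.onConnect("cluster-a")
	p.onConnect("cluster-b")
	fmt.Println("share with 2 clusters:", p.clients["cluster-a"])
	p.onDisconnect("cluster-b") // remaining cluster reclaims the full rate
	fmt.Println("share after disconnect:", p.clients["cluster-a"])
}
```

This matches the per-client refill rate of 0.3/s visible in the metrics above when a single sensor is connected against a shared budget.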

Admin events are shown when index reports are rate limited:
[Screenshot 2026-01-27 at 15 49 01: admin events for rate-limited index reports]

@openshift-ci

openshift-ci bot commented Jan 27, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@guzalv guzalv force-pushed the master-base/gualvare/fix-central-rate-limit-index-reports-before-queues-alt branch from fea3a38 to 3a4e55d Compare January 27, 2026 10:58
@guzalv guzalv force-pushed the master-base/gualvare/fix-central-rate-limit-index-reports-before-queues-alt branch from 3a4e55d to 667a4d1 Compare January 27, 2026 11:04
@rhacs-bot
Contributor

rhacs-bot commented Jan 27, 2026

Images are ready for the commit at 05e4bd4.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.10.x-921-g05e4bd4e9d.

@codecov

codecov bot commented Jan 27, 2026

Codecov Report

❌ Patch coverage is 45.79439% with 58 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.32%. Comparing base (1cecc4b) to head (05e4bd4).
⚠️ Report is 21 commits behind head on master.

Files with missing lines Patch % Lines
...ntral/sensor/service/connection/connection_impl.go 0.00% 35 Missing ⚠️
central/sensor/service/connection/manager_impl.go 0.00% 20 Missing ⚠️
pkg/rate/limiter.go 93.75% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #18692      +/-   ##
==========================================
+ Coverage   49.29%   49.32%   +0.02%     
==========================================
  Files        2663     2663              
  Lines      200433   200466      +33     
==========================================
+ Hits        98813    98885      +72     
+ Misses      94179    94137      -42     
- Partials     7441     7444       +3     
Flag Coverage Δ
go-unit-tests 49.32% <45.79%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

Contributor

@lvalerom lvalerom left a comment


I don't see the NACK logic anymore in this PR. Are you planning on adding it in a follow-up? Otherwise I'd recommend just dropping messages in the DedupingQueue.
Never mind, I just realized that the DedupingQueue does not have dropping-on-full implemented.

@vikin91
Contributor

vikin91 commented Jan 27, 2026

I don't see the NACK logic anymore in this PR. Are you planning on adding it in a follow-up?

Yeah, we think we need to skip that for now. We're not sure whether we can reply with a NACK from the new place. We want to get the rate limiting working first, and if time allows, we'll add the NACKs (they are not wired up in Sensor yet anyway, so there is no functionality loss in skipping them).

@vikin91 vikin91 marked this pull request as ready for review January 27, 2026 14:47
Contributor

@vikin91 vikin91 left a comment


Approving after a self-review, but I am the co-author, so I am biased.

I would like to see more unit tests at the level of sensorConnection, but I am okay merging without them, as we have tested this on clusters with various combinations of rate-limit parameters, workload intensities, numbers of clusters, etc.

Contributor

@lvalerom lvalerom left a comment


LGTM.

I still think the rate-limiter should be part of the DedupingQueue or wrapping up the individual queues. Unless I'm missing something, this is a global rate-limiter. If we start adding different types of messages, all the pipelines will share the same rate-limiter. I think we should definitely have different rate-limiters per queue.

@vikin91
Contributor

vikin91 commented Jan 27, 2026

I still think the rate-limiter should be part of the DedupingQueue or wrapping up the individual queues. Unless I'm missing something, this is a global rate-limiter. If we start adding different types of messages, all the pipelines will share the same rate-limiter. I think we should definitely have different rate-limiters per queue.

Yes, this is correct. The goal was to have []rateLimiter instead of rl in the manager, which would allow us to define multiple rate-limiters for different queues. However, we thought YAGNI and stayed with a single one for now.

The idea of having one per queue is valid; however, we also wanted the possibility of using one and the same limiter for node inventories, node indexes, and VM indexes, which is why we placed it before the queues. I hope we can refactor and improve the design in the coming days (next release).

@vikin91
Contributor

vikin91 commented Jan 27, 2026

/test gke-nongroovy-e2e-tests

(known flake TestPod)

@vikin91
Contributor

vikin91 commented Jan 27, 2026

/retest

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 2 issues, and left some high level feedback:

  • In sensorConnection.multiplexedPush, c.rl.TryConsume is called unconditionally on an interface-typed rateLimiter; if newVMIndexReportRateLimiter returns nil (e.g. config error), this will panic because the interface itself is nil, so either ensure a non-nil no-op limiter is always constructed on error or guard the call with a nil check.
  • The new TestOnClientDisconnect_Nil relies on calling methods on a nil *Limiter, but in real code you usually hold it via the rateLimiter interface where nil would not have methods; it might be clearer to test the nil-handling behavior via a rateLimiter-typed variable to match production usage.

## Individual Comments

### Comment 1
<location> `central/sensor/service/connection/connection_impl.go:88-89` </location>
<code_context>

 	hashDeduper hashManager.Deduper
+
+	rl                rateLimiter
+	adminEventsStream events.Stream
+}
+
</code_context>

<issue_to_address>
**issue (bug_risk):** Guard against nil rate limiter in `multiplexedPush` to avoid panics

Because `sensorConnection.rl` is an interface, calling `TryConsume` when it is nil will panic; `newVMIndexReportRateLimiter()` can already return nil (and tests may pass nil), so the code must guard against nil before invoking the limiter.
</issue_to_address>
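The nil-interface hazard this comment describes is a classic Go pitfall, sketched below with illustrative types (not the actual stackrox code): a typed-nil pointer stored in an interface can still dispatch to nil-safe methods, while an interface that is itself nil panics on any call, so a construction error should install a no-op limiter instead of leaving the field nil.

```go
package main

import "fmt"

// rateLimiter mirrors, illustratively, the interface discussed in the review.
type rateLimiter interface {
	TryConsume(client string) bool
}

// Limiter's methods tolerate a nil *receiver* (typed nil inside the interface).
type Limiter struct{}

func (l *Limiter) TryConsume(client string) bool {
	if l == nil {
		return true // nil limiter: allow everything
	}
	return false
}

// noopLimiter is a safe fallback to store when construction fails.
type noopLimiter struct{}

func (noopLimiter) TryConsume(string) bool { return true }

func main() {
	// Case 1: typed nil stored in the interface. The interface itself is
	// non-nil, so the call dispatches to the nil-safe method. No panic.
	var typedNil *Limiter
	var rl rateLimiter = typedNil
	fmt.Println("typed nil:", rl.TryConsume("c1"))

	// Case 2: the interface itself is nil. Calling TryConsume here would
	// panic, so guard and substitute a no-op limiter up front.
	var nilIface rateLimiter
	if nilIface == nil {
		nilIface = noopLimiter{}
	}
	fmt.Println("guarded:", nilIface.TryConsume("c1"))
}
```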

### Comment 2
<location> `pkg/rate/limiter.go:67` </location>
<code_context>

+// limiterOption is an intermediary type for creating a rate limiter.
+// It forces the caller to define the workload type to which the rate limiter would react.
+type limiterOption struct {
+	l   *Limiter
+	err error
</code_context>

<issue_to_address>
**issue (complexity):** Consider keeping NewLimiter/NewLimiterWithClock as direct constructors that accept optional functional options so callers retain the original single-step usage.

Replace the `limiterOption` builder with a direct constructor that still returns `(*Limiter, error)` and accepts an optional workload filter. This keeps the new filtering capability without forcing every caller through a two-step construction or nil checks. For example:

```go
type LimiterOption func(*Limiter)

func WithWorkloadFilter(fn func(*central.MsgFromSensor) bool) LimiterOption {
	return func(l *Limiter) { l.acceptsFn = fn }
}

func NewLimiterWithClock(workloadName string, globalRate float64, bucketCapacity int, clock Clock, opts ...LimiterOption) (*Limiter, error) {
	// existing validation...
	l := &Limiter{
		workloadName:   workloadName,
		globalRate:     globalRate,
		bucketCapacity: bucketCapacity,
		buckets:        make(map[string]*gorate.Limiter),
		clock:          clock,
	}
	for _, opt := range opts {
		opt(l)
	}
	return l, nil
}
```

Callers that need filtering pass `WithWorkloadFilter`, while others keep their existing single-call construction and error handling.
</issue_to_address>


@openshift-ci

openshift-ci bot commented Jan 28, 2026

@guzalv: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/ocp-4-20-ui-e2e-tests 05e4bd4 link false /test ocp-4-20-ui-e2e-tests
ci/prow/ocp-4-12-nongroovy-e2e-tests 05e4bd4 link false /test ocp-4-12-nongroovy-e2e-tests

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@vikin91 vikin91 changed the title ROX-32731: Rate limit VM index reports before being queued ROX-32316: Rate limit VM index reports before being queued Jan 28, 2026
@guzalv guzalv merged commit 93212ec into master Jan 28, 2026
114 of 117 checks passed
@guzalv guzalv deleted the master-base/gualvare/fix-central-rate-limit-index-reports-before-queues-alt branch January 28, 2026 08:42