
Conversation

@guzalv
Contributor

@guzalv guzalv commented Jan 27, 2026

Description

While looking into raising the VM index reports rate limiter's default bucket size from 5 to a higher number (to improve burst handling), a test run with bucket size 500 and high load showed central crashing shortly after the load started. This was unexpected: a bucket capacity of 500 should mean at most 500 reports held in memory (~1.5-2 GiB), yet memory usage quickly reached the container's limit of 16 GiB.

I iterated with an AI agent on the problem and identified a likely root cause: reports are put into the queues before the rate limiter is checked, so drops due to an empty bucket only start after the pipeline has run $bucket_capacity times and consumed those tokens.

Since this pipeline is slow (the overall report processing rate is under 1 report per second in a default deployment), it takes ~500 seconds for all the tokens to get consumed. During this time all new reports accumulate in memory, causing the crash.

Whether or not the crash happens depends on:

  • bucket size: buckets with many tokens take longer to empty, allowing more reports to accumulate.
  • incoming report rate: more frequent reports result in faster memory growth, leading to crashes with smaller bucket sizes.
  • report size and vulnerability count: reports with more packages and especially with more vulnerabilities take significantly longer to scan (see VM scale test report for details), causing smaller bucket sizes to drain slower.
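The interaction of these factors can be put into a rough back-of-envelope model. All rates and sizes below are illustrative assumptions, not measurements from the test runs:

```go
package main

import "fmt"

// Rough model of the failure mode: tokens drain at the pipeline's processing
// rate, and every report that arrives while tokens remain gets enqueued.
// All numbers are illustrative assumptions, not measured values.
func main() {
	const (
		bucketSize     = 500.0 // token bucket capacity
		processingRate = 1.0   // reports/sec consumed by the pipeline
		incomingRate   = 60.0  // reports/sec arriving under load
		reportMiB      = 3.5   // approx in-memory size of one report
	)
	// Time until the pipeline has consumed all tokens and drops begin.
	drainSeconds := bucketSize / processingRate
	// Reports accumulated in memory during that window.
	backlog := (incomingRate - processingRate) * drainSeconds
	fmt.Printf("drops start after ~%.0fs; backlog ~%.0f reports (~%.0f GiB)\n",
		drainSeconds, backlog, backlog*reportMiB/1024)
	// With these numbers the backlog far exceeds a 16 GiB container limit,
	// so central would crash long before the bucket drains.
}
```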

This PR moves the limiter so that it acts before the reports are enqueued, which should prevent the issue.

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

We deployed the modified central to a cluster, set the bucket capacity to 500, and generated the same load that previously made central crash (see the load profile below); this time central showed a stable memory usage pattern, as expected:

[Screenshot SCR-20260127-qjyd: stable central memory usage under load]
deploymentWorkload:
- deploymentType: Deployment
  lifecycleDuration: 30m0s
  numDeployments: 500
  numLifecycles: 0
  numLabels: 10
  podWorkload:
    containerWorkload:
      numImages: 0
    lifecycleDuration: 10m0s
    numContainers: 3
    numPods: 5
    processWorkload:
      alertRate: 0.001
      processInterval: 1m0s
  updateInterval: 10m0s
- deploymentType: Deployment
  lifecycleDuration: 30m0s
  numDeployments: 500
  numLifecycles: 0
  podWorkload:
    containerWorkload:
      numImages: 0
    lifecycleDuration: 10m0s
    numContainers: 3
    numPods: 5
    processWorkload:
      alertRate: 0.001
      processInterval: 1m0s
  updateInterval: 10m0s
networkPolicyWorkload:
- lifecycleDuration: 30m0s
  numNetworkPolicies: 1000
  numLifecycles: 0
  numLabels: 5
  updateInterval: 10m0s
networkWorkload:
  batchSize: 100
  flowInterval: 1s
  generateUnclosedEndpoints: true
  openPortReuseProbability: 0.05
nodeWorkload:
  numNodes: 1000
rbacWorkload:
  numBindings: 1000
  numRoles: 1000
  numServiceAccounts: 1000
virtualMachineWorkload:
  poolSize: 20000
  updateInterval: 5m
  lifecycleDuration: 2h
  numLifecycles: 0
  reportInterval: 5m # ±5% max deviation applied
  numPackages: 700
  initialReportDelay: 30s # ±20% max deviation applied

Central metrics show that under very high load the overwhelming majority of requests are dropped:

# HELP rox_central_rate_limiter_active_clients Current number of active clients being rate limited
# TYPE rox_central_rate_limiter_active_clients gauge
rox_central_rate_limiter_active_clients{workload="vm_index_reports"} 1
# HELP rox_central_rate_limiter_per_client_bucket_max_tokens Current per-client token bucket capacity (max>
# TYPE rox_central_rate_limiter_per_client_bucket_max_tokens gauge
rox_central_rate_limiter_per_client_bucket_max_tokens{workload="vm_index_reports"} 10
# HELP rox_central_rate_limiter_per_client_bucket_refill_rate_per_second Current per-client rate limit in >
# TYPE rox_central_rate_limiter_per_client_bucket_refill_rate_per_second gauge
rox_central_rate_limiter_per_client_bucket_refill_rate_per_second{workload="vm_index_reports"} 0.3
# HELP rox_central_rate_limiter_per_client_bucket_tokens Current available tokens per client bucket
# TYPE rox_central_rate_limiter_per_client_bucket_tokens gauge
rox_central_rate_limiter_per_client_bucket_tokens{client="2c8bbbd6-e1de-4a90-a4e6-0aa1e7abf3fa",workload=">
# HELP rox_central_rate_limiter_requests_total Total requests received by the rate limiter
# TYPE rox_central_rate_limiter_requests_total counter
rox_central_rate_limiter_requests_total{outcome="accepted",workload="vm_index_reports"} 65
rox_central_rate_limiter_requests_total{outcome="rejected",workload="vm_index_reports"} 34475

We verified that the rate limiter distributes tokens fairly across several secured clusters and rebalances when sensors disconnect, by adding several clusters to a single central and testing different scenarios of sensors being killed or scaled down.
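For illustration, the fair-share rebalancing we tested can be sketched like this (hypothetical structure and names, not the actual manager code): the global refill rate is split evenly across connected clusters and recomputed whenever a sensor connects or disconnects.

```go
package main

import "fmt"

// perClient sketches fair distribution: the global rate is split evenly
// among active clients, and recomputed on connect/disconnect.
// Names and structure are illustrative, not the stackrox implementation.
type perClient struct {
	globalRate float64
	clients    map[string]float64 // client ID -> per-client refill rate
}

func (p *perClient) rebalance() {
	n := len(p.clients)
	if n == 0 {
		return
	}
	share := p.globalRate / float64(n)
	for id := range p.clients {
		p.clients[id] = share
	}
}

func (p *perClient) onConnect(id string) {
	p.clients[id] = 0
	p.rebalance()
}

func (p *perClient) onDisconnect(id string) {
	delete(p.clients, id)
	p.rebalance()
}

func main() {
	p := &perClient{globalRate: 0.6, clients: map[string]float64{}}
	p.onConnect("cluster-a")
	p.onConnect("cluster-b")
	fmt.Println("share with 2 clusters:", p.clients["cluster-a"])
	p.onDisconnect("cluster-b") // remaining cluster reclaims the full rate
	fmt.Println("share after disconnect:", p.clients["cluster-a"])
}
```

This matches the per-client refill rate of 0.3/s visible in the metrics above when a single sensor is connected against a shared budget.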

Admin events are shown when index reports are rate limited:
[Screenshot 2026-01-27 at 15 49 01: admin events for rate-limited index reports]

@openshift-ci

openshift-ci bot commented Jan 27, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@guzalv guzalv force-pushed the master-base/gualvare/fix-central-rate-limit-index-reports-before-queues-alt branch from fea3a38 to 3a4e55d Compare January 27, 2026 10:58
@guzalv guzalv force-pushed the master-base/gualvare/fix-central-rate-limit-index-reports-before-queues-alt branch from 3a4e55d to 667a4d1 Compare January 27, 2026 11:04
@rhacs-bot
Contributor

rhacs-bot commented Jan 27, 2026

Images are ready for the commit at 05e4bd4.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.10.x-921-g05e4bd4e9d.

@codecov

codecov bot commented Jan 27, 2026

Codecov Report

❌ Patch coverage is 45.79439% with 58 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.32%. Comparing base (1cecc4b) to head (05e4bd4).
⚠️ Report is 21 commits behind head on master.

Files with missing lines Patch % Lines
...ntral/sensor/service/connection/connection_impl.go 0.00% 35 Missing ⚠️
central/sensor/service/connection/manager_impl.go 0.00% 20 Missing ⚠️
pkg/rate/limiter.go 93.75% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #18692      +/-   ##
==========================================
+ Coverage   49.29%   49.32%   +0.02%     
==========================================
  Files        2663     2663              
  Lines      200433   200466      +33     
==========================================
+ Hits        98813    98885      +72     
+ Misses      94179    94137      -42     
- Partials     7441     7444       +3     
Flag Coverage Δ
go-unit-tests 49.32% <45.79%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

Contributor

@lvalerom lvalerom left a comment


I don't see the NACK logic anymore in this PR. Are you planning on adding it in a follow-up? Otherwise I'd recommend just dropping messages in the DedupingQueue.
Never mind, I just realized that the DedupingQueue does not have dropping-on-full implemented.

@vikin91
Contributor

vikin91 commented Jan 27, 2026

I don't see the NACK logic anymore in this PR. Are you planning on adding it in a follow-up?

Yeah, we think we need to skip that for now. We're not sure whether we can reply with a NACK from the new place. We want to get the rate limiting working first, and if time allows, we'll add the NACKs (they are not wired up in Sensor yet anyway, so there is no functionality loss in skipping them).

@vikin91 vikin91 marked this pull request as ready for review January 27, 2026 14:47
Contributor

@vikin91 vikin91 left a comment


Approving after a self-review, but I am the co-author, so I am biased.

I would like to see more unit tests at the level of sensorConnection, but I am okay merging without them, as we have tested this on clusters with various combinations of rate-limit parameters, workload intensities, numbers of clusters, etc.

Contributor

@lvalerom lvalerom left a comment


LGTM.

I still think the rate-limiter should be part of the DedupingQueue or wrapping up the individual queues. Unless I'm missing something, this is a global rate-limiter. If we start adding different types of messages, all the pipelines will share the same rate-limiter. I think we should definitely have different rate-limiters per queue.

@vikin91
Contributor

vikin91 commented Jan 27, 2026

I still think the rate-limiter should be part of the DedupingQueue or wrapping up the individual queues. Unless I'm missing something, this is a global rate-limiter. If we start adding different types of messages, all the pipelines will share the same rate-limiter. I think we should definitely have different rate-limiters per queue.

Yes, this is correct. The goal was to have []rateLimiter instead of rl in the manager, which would allow us to define multiple rate-limiters for different queues. However, we thought YAGNI and stayed with a single one for now.

The idea of having one per queue is valid; however, we also wanted the possibility of using one and the same limiter for node inventories, node indexes, and VM indexes, which is why we placed it before the queues. I hope we can refactor and improve the design in the coming days (next release).

@vikin91
Contributor

vikin91 commented Jan 27, 2026

/test gke-nongroovy-e2e-tests

(known flake TestPod)

@vikin91
Contributor

vikin91 commented Jan 27, 2026

/retest

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 2 issues, and left some high level feedback:

  • In sensorConnection.multiplexedPush, c.rl.TryConsume is called unconditionally on an interface-typed rateLimiter; if newVMIndexReportRateLimiter returns nil (e.g. config error), this will panic because the interface itself is nil, so either ensure a non-nil no-op limiter is always constructed on error or guard the call with a nil check.
  • The new TestOnClientDisconnect_Nil relies on calling methods on a nil *Limiter, but in real code you usually hold it via the rateLimiter interface where nil would not have methods; it might be clearer to test the nil-handling behavior via a rateLimiter-typed variable to match production usage.

## Individual Comments

### Comment 1
<location> `central/sensor/service/connection/connection_impl.go:88-89` </location>
<code_context>

 	hashDeduper hashManager.Deduper
+
+	rl                rateLimiter
+	adminEventsStream events.Stream
+}
+
</code_context>

<issue_to_address>
**issue (bug_risk):** Guard against nil rate limiter in `multiplexedPush` to avoid panics

Because `sensorConnection.rl` is an interface, calling `TryConsume` when it is nil will panic; `newVMIndexReportRateLimiter()` can already return nil (and tests may pass nil), so the code must guard against nil before invoking the limiter.
</issue_to_address>
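The nil-interface hazard this comment describes is a classic Go pitfall, sketched below with illustrative types (not the actual stackrox code): a typed-nil pointer stored in an interface can still dispatch to nil-safe methods, while an interface that is itself nil panics on any call, so a construction error should install a no-op limiter instead of leaving the field nil.

```go
package main

import "fmt"

// rateLimiter mirrors, illustratively, the interface discussed in the review.
type rateLimiter interface {
	TryConsume(client string) bool
}

// Limiter's methods tolerate a nil *receiver* (typed nil inside the interface).
type Limiter struct{}

func (l *Limiter) TryConsume(client string) bool {
	if l == nil {
		return true // nil limiter: allow everything
	}
	return false
}

// noopLimiter is a safe fallback to store when construction fails.
type noopLimiter struct{}

func (noopLimiter) TryConsume(string) bool { return true }

func main() {
	// Case 1: typed nil stored in the interface. The interface itself is
	// non-nil, so the call dispatches to the nil-safe method. No panic.
	var typedNil *Limiter
	var rl rateLimiter = typedNil
	fmt.Println("typed nil:", rl.TryConsume("c1"))

	// Case 2: the interface itself is nil. Calling TryConsume here would
	// panic, so guard and substitute a no-op limiter up front.
	var nilIface rateLimiter
	if nilIface == nil {
		nilIface = noopLimiter{}
	}
	fmt.Println("guarded:", nilIface.TryConsume("c1"))
}
```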

### Comment 2
<location> `pkg/rate/limiter.go:67` </location>
<code_context>

+// limiterOption is an intermediary type for creating a rate limiter.
+// It forces the caller to define the workload type to which the rate limiter would react.
+type limiterOption struct {
+	l   *Limiter
+	err error
</code_context>

<issue_to_address>
**issue (complexity):** Consider keeping NewLimiter/NewLimiterWithClock as direct constructors that accept optional functional options so callers retain the original single-step usage.

Replace the `limiterOption` builder with a direct constructor that still returns `(*Limiter, error)` and accepts an optional workload filter. This keeps the new filtering capability without forcing every caller through a two-step construction or nil checks. For example:

```go
type LimiterOption func(*Limiter)

func WithWorkloadFilter(fn func(*central.MsgFromSensor) bool) LimiterOption {
	return func(l *Limiter) { l.acceptsFn = fn }
}

func NewLimiterWithClock(workloadName string, globalRate float64, bucketCapacity int, clock Clock, opts ...LimiterOption) (*Limiter, error) {
	// existing validation...
	l := &Limiter{
		workloadName:   workloadName,
		globalRate:     globalRate,
		bucketCapacity: bucketCapacity,
		buckets:        make(map[string]*gorate.Limiter),
		clock:          clock,
	}
	for _, opt := range opts {
		opt(l)
	}
	return l, nil
}
```

Callers that need filtering pass `WithWorkloadFilter`, while others keep their existing single-call construction and error handling.
</issue_to_address>


@openshift-ci

openshift-ci bot commented Jan 28, 2026

@guzalv: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/ocp-4-20-ui-e2e-tests 05e4bd4 link false /test ocp-4-20-ui-e2e-tests
ci/prow/ocp-4-12-nongroovy-e2e-tests 05e4bd4 link false /test ocp-4-12-nongroovy-e2e-tests

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@vikin91 vikin91 changed the title ROX-32731: Rate limit VM index reports before being queued ROX-32316: Rate limit VM index reports before being queued Jan 28, 2026
@guzalv guzalv merged commit 93212ec into master Jan 28, 2026
114 of 117 checks passed
@guzalv guzalv deleted the master-base/gualvare/fix-central-rate-limit-index-reports-before-queues-alt branch January 28, 2026 08:42