ROX-32316: Rate limit VM index reports before being queued #18692
Conversation
Skipping CI for Draft Pull Request.
(Force-pushed fea3a38 to 3a4e55d, then 3a4e55d to 667a4d1.)
Images are ready for the commit at 05e4bd4. To use with deploy scripts, first
Codecov Report ❌
Additional details and impacted files:
@@ Coverage Diff @@
## master #18692 +/- ##
==========================================
+ Coverage 49.29% 49.32% +0.02%
==========================================
Files 2663 2663
Lines 200433 200466 +33
==========================================
+ Hits 98813 98885 +72
+ Misses 94179 94137 -42
- Partials 7441 7444 +3
I don't see the NACK logic in this PR anymore. Are you planning to add it in a follow-up? Otherwise I'd recommend just dropping messages in the DedupingQueue.
Never mind, I just realized that the DedupingQueue does not have dropping-on-full implemented.
Yeah, we think we need to skip that for now. We're not sure whether we can reply with a NACK from the new place. We want to get the rate limiting working first, and if time allows, we will add the NACKs (they are not connected in Sensor yet anyway, so there is no functionality loss in skipping that).
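For context, "dropping on full" would look roughly like the hypothetical sketch below; this is not the actual DedupingQueue API, just an illustration of the behavior being discussed.

```go
package main

import "fmt"

// boundedQueue is a hypothetical bounded queue that drops items when full
// instead of replying with a NACK; it is not the real DedupingQueue.
type boundedQueue struct {
	items chan string
}

// push enqueues the item if there is room and reports whether it was accepted.
func (q *boundedQueue) push(item string) bool {
	select {
	case q.items <- item:
		return true
	default:
		// Queue is full: drop silently. A follow-up could emit a NACK here.
		return false
	}
}

func main() {
	q := &boundedQueue{items: make(chan string, 2)}
	for i := 0; i < 4; i++ {
		fmt.Printf("report-%d accepted: %v\n", i, q.push(fmt.Sprintf("report-%d", i)))
	}
}
```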
vikin91
left a comment
Approving after a self-review, but I am the co-author, so I am biased.
I would like to see more unit tests at the level of sensorConnection, but I am okay merging without them, as we have tested it on clusters with various combinations of rate-limit parameters, workload intensities, numbers of clusters, etc.
lvalerom
left a comment
LGTM.
I still think the rate limiter should be part of the DedupingQueue or should wrap the individual queues. Unless I'm missing something, this is a global rate limiter: if we start adding different types of messages, all the pipelines will share the same rate limiter. I think we should definitely have a separate rate limiter per queue.
Yes, this is correct. The idea of having that per queue is valid; however, we also wanted the possibility to use one and the same limiter for node inventories, node indexes, and VM indexes, which is why we placed it before the queues. I hope we can refactor and improve the design in the coming days (next release).
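To make that placement concrete, here is a rough sketch (hypothetical names and toy types, not the actual sensorConnection code) of one shared limiter being consulted before a message is routed to any of the per-type queues, so node inventories, node indexes, and VM indexes all draw from the same budget.

```go
package main

import "fmt"

// rateLimiter is the minimal interface assumed for this sketch; the real
// limiter keeps per-cluster token buckets behind TryConsume.
type rateLimiter interface {
	TryConsume(clusterID string) bool
}

// allowEveryOther is a toy limiter used only to make the example observable.
type allowEveryOther struct{ calls int }

func (l *allowEveryOther) TryConsume(string) bool {
	l.calls++
	return l.calls%2 == 1
}

// pushBeforeQueues checks the shared limiter once, before the message reaches
// any per-type queue. Dropped messages never occupy queue memory.
func pushBeforeQueues(rl rateLimiter, clusterID, msgType, msg string, queues map[string]chan string) bool {
	if !rl.TryConsume(clusterID) {
		return false // dropped before enqueueing
	}
	queues[msgType] <- msg
	return true
}

func main() {
	queues := map[string]chan string{
		"node-inventory": make(chan string, 10),
		"node-index":     make(chan string, 10),
		"vm-index":       make(chan string, 10),
	}
	rl := &allowEveryOther{}
	for _, t := range []string{"vm-index", "node-index", "vm-index", "node-inventory"} {
		fmt.Println(t, "accepted:", pushBeforeQueues(rl, "cluster-a", t, "report", queues))
	}
}
```

Swapping this for per-queue limiters would mean constructing one limiter next to each queue instead of one shared instance at the push site.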
/test gke-nongroovy-e2e-tests (known flake TestPod)
/retest
Hey - I've found 2 issues and left some high-level feedback:
- In `sensorConnection.multiplexedPush`, `c.rl.TryConsume` is called unconditionally on an interface-typed `rateLimiter`; if `newVMIndexReportRateLimiter` returns `nil` (e.g. a config error), this will panic because the interface itself is `nil`, so either ensure a non-nil no-op limiter is always constructed on error or guard the call with a `nil` check.
- The new `TestOnClientDisconnect_Nil` relies on calling methods on a `nil` `*Limiter`, but in real code you usually hold it via the `rateLimiter` interface, where `nil` would not have methods; it might be clearer to test the `nil`-handling behavior via a `rateLimiter`-typed variable to match production usage (see the sketch after this list).
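To illustrate the distinction behind both points, here is a standalone sketch with hypothetical types (not the PR's actual `rateLimiter` or `Limiter`): calling a method through a nil interface value panics, whereas a nil concrete pointer stored in an interface still dispatches to the pointer-receiver method, which can handle nil explicitly.

```go
package main

import "fmt"

// rateLimiter mirrors the shape of the interface discussed above;
// the name and method are assumptions for this sketch only.
type rateLimiter interface {
	TryConsume(clusterID string) bool
}

type Limiter struct{}

// TryConsume tolerates a nil receiver, which is the behavior a
// nil-handling test on *Limiter would exercise.
func (l *Limiter) TryConsume(clusterID string) bool {
	if l == nil {
		return true // treat "no limiter" as "always allow"
	}
	// Real token-bucket logic would live here.
	return true
}

func main() {
	// Non-nil interface holding a nil *Limiter: the call dispatches fine.
	var viaPointer rateLimiter = (*Limiter)(nil)
	fmt.Println(viaPointer.TryConsume("cluster-a"))

	// Nil interface value: calling TryConsume here would panic.
	var viaInterface rateLimiter
	if viaInterface != nil {
		viaInterface.TryConsume("cluster-a")
	} else {
		fmt.Println("nil interface: guarded, no call made")
	}
}
```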
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `sensorConnection.multiplexedPush`, `c.rl.TryConsume` is called unconditionally on an interface-typed `rateLimiter`; if `newVMIndexReportRateLimiter` returns `nil` (e.g. config error), this will panic because the interface itself is `nil`, so either ensure a non-nil no-op limiter is always constructed on error or guard the call with a `nil` check.
- The new `TestOnClientDisconnect_Nil` relies on calling methods on a `nil` `*Limiter`, but in real code you usually hold it via the `rateLimiter` interface where `nil` would not have methods; it might be clearer to test the `nil`-handling behavior via a `rateLimiter`-typed variable to match production usage.
## Individual Comments
### Comment 1
<location> `central/sensor/service/connection/connection_impl.go:88-89` </location>
<code_context>
hashDeduper hashManager.Deduper
+
+ rl rateLimiter
+ adminEventsStream events.Stream
+}
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Guard against nil rate limiter in `multiplexedPush` to avoid panics
Because `sensorConnection.rl` is an interface, calling `TryConsume` when it is nil will panic; `newVMIndexReportRateLimiter()` can already return nil (and tests may pass nil), so the code must guard against nil before invoking the limiter.
</issue_to_address>
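One way to address this, sketched below with hypothetical names (not the actual connection_impl.go wiring), is to fall back to a non-nil no-op limiter whenever construction fails, so the interface held by the connection is never nil and `TryConsume` can be called unconditionally.

```go
package main

import "fmt"

// rateLimiter is the assumed interface from this sketch's point of view.
type rateLimiter interface {
	TryConsume(clusterID string) bool
}

// noopLimiter always allows; it stands in when the real limiter cannot be built.
type noopLimiter struct{}

func (noopLimiter) TryConsume(string) bool { return true }

// limiterOrNoop is a hypothetical wrapper: on error it logs and returns a
// no-op limiter instead of nil, so callers never hold a nil interface.
func limiterOrNoop(construct func() (rateLimiter, error)) rateLimiter {
	rl, err := construct()
	if err != nil || rl == nil {
		fmt.Println("falling back to no-op rate limiter:", err)
		return noopLimiter{}
	}
	return rl
}

func main() {
	rl := limiterOrNoop(func() (rateLimiter, error) {
		return nil, fmt.Errorf("bad config")
	})
	// Safe to call unconditionally.
	fmt.Println(rl.TryConsume("cluster-a"))
}
```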
### Comment 2
<location> `pkg/rate/limiter.go:67` </location>
<code_context>
+// limiterOption is an intermediary type for creating a rate limiter.
+// It forces the caller to define the workload type to which the rate limiter would react.
+type limiterOption struct {
+ l *Limiter
+ err error
</code_context>
<issue_to_address>
**issue (complexity):** Consider keeping NewLimiter/NewLimiterWithClock as direct constructors that accept optional functional options so callers retain the original single-step usage.
Replace the `limiterOption` builder with a direct constructor that still returns `(*Limiter, error)` and accepts an optional workload filter. This keeps the new filtering capability without forcing every caller through a two-step construction or nil checks. For example:
```go
type LimiterOption func(*Limiter)

func WithWorkloadFilter(fn func(*central.MsgFromSensor) bool) LimiterOption {
	return func(l *Limiter) { l.acceptsFn = fn }
}

func NewLimiterWithClock(workloadName string, globalRate float64, bucketCapacity int, clock Clock, opts ...LimiterOption) (*Limiter, error) {
	// existing validation...
	l := &Limiter{
		workloadName:   workloadName,
		globalRate:     globalRate,
		bucketCapacity: bucketCapacity,
		buckets:        make(map[string]*gorate.Limiter),
		clock:          clock,
	}
	for _, opt := range opts {
		opt(l)
	}
	return l, nil
}
```
Callers that need filtering pass `WithWorkloadFilter`, while others keep their existing single-call construction and error handling.
</issue_to_address>
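If the functional-options variant above were adopted, a caller could still construct the limiter in a single step; the sketch below reuses the hypothetical names from the suggestion (`NewLimiterWithClock`, `WithWorkloadFilter`) plus an assumed `isVMIndexReport` helper, and is not the code that actually landed in this PR.

```go
// Usage sketch only: it assumes the declarations from the snippet above
// (Limiter, Clock, NewLimiterWithClock, WithWorkloadFilter) and uses
// illustrative rate and capacity values.
func newFilteringVMIndexLimiter(clock Clock) (*Limiter, error) {
	return NewLimiterWithClock(
		"vm-index-report", // workload name
		10.0,              // global rate in tokens per second (illustrative)
		500,               // bucket capacity (illustrative)
		clock,
		WithWorkloadFilter(func(msg *central.MsgFromSensor) bool {
			// isVMIndexReport is a hypothetical predicate; only matching
			// messages would consume tokens from this limiter.
			return isVMIndexReport(msg)
		}),
	)
}
```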
@guzalv: The following tests failed, say
Description
While looking into increasing the VM index report rate limiter's default bucket size from 5 to a higher number (to improve burst-handling capabilities), a test run with a bucket size of 500 and high load showed central crashing shortly after the load started. This was unexpected: a bucket capacity of 500 should lead to at most 500 reports being held in memory, which would mean ~1.5-2 GiB, yet memory usage quickly reached the container's limit of 16 GiB.
I iterated with an AI agent looking into the problem and identified a likely root cause: the reports are put into the queues before the rate limiter is checked, so rate-limiter drops due to an empty bucket only happen after the pipeline has run $bucket_capacity times and consumed those tokens.
Since this pipeline is slow (the overall report-processing rate is under 1 report per second in a default deployment), it takes ~500 seconds for all the tokens to be consumed. During this time, all new reports accumulate in memory, causing the crash.
Whether or not the crash happens depends on:
This PR moves the limiter to act before the reports are enqueued, which should prevent the issue.
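For intuition on the bucket arithmetic described above, the snippet below uses golang.org/x/time/rate (the package behind the `gorate.Limiter` buckets shown in the review's code context) to show that a bucket of capacity N admits N immediate requests and then throttles to the refill rate; the refill rate here is illustrative, not the PR's default.

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

func main() {
	// Illustrative values: refill 1 token per second, bucket capacity 5
	// (the old default bucket size mentioned in the description).
	limiter := rate.NewLimiter(rate.Limit(1), 5)

	accepted := 0
	for i := 0; i < 10; i++ {
		// Allow consumes a token if one is available, without blocking.
		if limiter.Allow() {
			accepted++
		}
	}
	// With a full bucket, roughly the first 5 calls are admitted; the rest
	// are rejected until tokens refill at the configured rate. If the
	// limiter is checked only after enqueueing, those rejected reports
	// would already be sitting in memory.
	fmt.Println("accepted:", accepted)
}
```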
User-facing documentation
Testing and quality
Automated testing
How I validated my change
We deployed the modified central to a cluster, set bucket capacity to 500 and generated the same load that previously made central crash (see the load profile below); this time central shows a stable memory usage pattern, as expected:
Central metrics prove that under very high load the overwhelming majority of requests are dropped:
We verified that the rate limiter uses a fair distribution across several secured clusters and updates it when sensors disconnect, by adding several clusters to a single central and testing different scenarios of sensors being killed or scaled down.
Admin events are shown when index reports are rate limited:
