fix: avoid memory leaks when soft-deleting tenant registries#845
Conversation
|
Hi! 👋 We recently merged #947 which changed the CI configuration for unit tests. To get CI passing on this PR, please rebase on the latest Thanks! |
350206f to
da597af
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Race condition between archived copy and registries snapshot
- BuildMetricFamiliesPerTenant now snapshots both r.regs and r.archived under the same lock acquisition so a soft-removal cannot be missed between copies.
Or push these changes by commenting:
@cursor push 3f59d19672
Preview (3f59d19672)
diff --git a/metrics/tenant_registries.go b/metrics/tenant_registries.go
--- a/metrics/tenant_registries.go
+++ b/metrics/tenant_registries.go
@@ -924,6 +924,11 @@
func (r *TenantRegistries) BuildMetricFamiliesPerTenant() MetricFamiliesPerTenant {
r.regsMu.Lock()
+ // Copy registries while holding the same lock as archived metrics to build
+ // a consistent snapshot for this scrape.
+ regsCopy := make([]TenantRegistry, 0, len(r.regs))
+ regsCopy = append(regsCopy, r.regs...)
+
// Shallow copy the archived metrics map to avoid concurrent map access.
// The MetricFamily/Metric objects are cloned on-write, so sharing pointers is safe.
archivedCopy := make(MetricFamilyMap, len(r.archived))
@@ -943,7 +948,7 @@
}
r.regsMu.Unlock()
- for _, entry := range r.Registries() {
+ for _, entry := range regsCopy {
m, err := entry.reg.Gather()
if err == nil {
var mfm MetricFamilyMap // := would shadow err from outer block, and single err check will not workYou can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit 9b4980f. Configure here.
| // archiveTenantRegistry gathers and stores counters/histograms/summaries from a tenant | ||
| // registry being removed. Gauges are excluded since they represent current state, not | ||
| // cumulative values. Archived metrics are aggregated to avoid counter resets. | ||
| func (r *TenantRegistries) archiveTenantRegistry(ur *TenantRegistry) { |
There was a problem hiding this comment.
The change here to make this essentially infallible might result in some strange behaviors if we fail to gather the metrics. I'm not sure that is completely safe. In the old model we'd remove if we failed to soft remove so perhaps this is still equivalent. Have you considered this?
There was a problem hiding this comment.
Yep, this is still the same behavior as before.
If we fail to gather the metrics in the current code, we return false. This causes the registry to be removed (same as hard-deleted), so it would cause counter restarts.


Description
This PR fixes a memory leak when soft-deleting tenant registries by creating a special
"archived"registry in theTenantRegistriesstruct.Instead of keeping user registries around indefinitely after soft deletion, the
TenantRegistriesstruct now copies counters, histograms, and summaries into thearchivedregistry. This way, memory growth is not unbounded if we need to delete and recreate the same user registry multiple times.Testing
I tested this PR with two Mimir Alertmanager instances:
maindskitbranchBoth Instances had strict initialization enabled with a grace period of 10 seconds to allow for fast cycles of creating and deleting tenant registries.
I created a script to send bursts of requests for 2.5K tenants to both instances and sleep for 60 seconds. This resulted in registries for these 2.5K tenants being created and deleted every minute.
Results:
mainaccumulated more memory on each soft deletionFixes #746
Note
Medium Risk
Changes how soft-deleted tenant registries are retained and aggregated, which can affect scrape-time metric totals and label-series merging if the deduplication/merge logic is incorrect. Concurrency and Prometheus metric-type semantics (histogram buckets/summary quantiles) make this moderately risky despite added tests.
Overview
Prevents memory growth from soft-deleted tenant registries by removing the old per-tenant registry objects and archiving only cumulative metric types (counters/histograms/summaries) into a shared
archivedmetric map (gauges are intentionally excluded).When archiving, metrics are deduplicated by label set and merged so values accumulate across tenant lifecycles, and
BuildMetricFamiliesPerTenant()now includes the archived metrics as an empty-tenant entry while snapshottingregs/archivedunder one lock to avoid scrape races.Adds end-to-end tests to ensure archived series don’t grow when tenants are repeatedly cycled and that aggregated totals never decrease across add/observe/soft-remove cycles.
Reviewed by Cursor Bugbot for commit 81951b2. Bugbot is set up for automated code reviews on this repo. Configure here.