Skip to content

fix: avoid memory leaks when soft-deleting tenant registries#845

Merged
santihernandezc merged 5 commits intomainfrom
santihernandezc/fix_memory_leak_soft_delete_tenant_registries
Apr 22, 2026
Merged

fix: avoid memory leaks when soft-deleting tenant registries#845
santihernandezc merged 5 commits intomainfrom
santihernandezc/fix_memory_leak_soft_delete_tenant_registries

Conversation

@santihernandezc
Copy link
Copy Markdown
Contributor

@santihernandezc santihernandezc commented Dec 4, 2025

Description

This PR fixes a memory leak when soft-deleting tenant registries by creating a special "archived" registry in the TenantRegistries struct.

Instead of keeping user registries around indefinitely after soft deletion, the TenantRegistries struct now copies counters, histograms, and summaries into the archived registry. This way, memory growth is not unbounded if we need to delete and recreate the same user registry multiple times.

Testing

I tested this PR with two Mimir Alertmanager instances:

  • One from the latest main
  • Another one importing this dskit branch

Both Instances had strict initialization enabled with a grace period of 10 seconds to allow for fast cycles of creating and deleting tenant registries.

I created a script to send bursts of requests for 2.5K tenants to both instances and sleep for 60 seconds. This resulted in registries for these 2.5K tenants being created and deleted every minute.

Results:

  • The instance vendoring this fix maintained a steady memory usage
  • The instance built from Mimir's latest main accumulated more memory on each soft deletion
fix

Fixes #746


Note

Medium Risk
Changes how soft-deleted tenant registries are retained and aggregated, which can affect scrape-time metric totals and label-series merging if the deduplication/merge logic is incorrect. Concurrency and Prometheus metric-type semantics (histogram buckets/summary quantiles) make this moderately risky despite added tests.

Overview
Prevents memory growth from soft-deleted tenant registries by removing the old per-tenant registry objects and archiving only cumulative metric types (counters/histograms/summaries) into a shared archived metric map (gauges are intentionally excluded).

When archiving, metrics are deduplicated by label set and merged so values accumulate across tenant lifecycles, and BuildMetricFamiliesPerTenant() now includes the archived metrics as an empty-tenant entry while snapshotting regs/archived under one lock to avoid scrape races.

Adds end-to-end tests to ensure archived series don’t grow when tenants are repeatedly cycled and that aggregated totals never decrease across add/observe/soft-remove cycles.

Reviewed by Cursor Bugbot for commit 81951b2. Bugbot is set up for automated code reviews on this repo. Configure here.

@santihernandezc santihernandezc marked this pull request as draft December 4, 2025 14:19
Comment thread metrics/tenant_registries.go
@pracucci
Copy link
Copy Markdown
Contributor

pracucci commented Apr 7, 2026

Hi! 👋 We recently merged #947 which changed the CI configuration for unit tests. To get CI passing on this PR, please rebase on the latest main branch:

git fetch origin main
git rebase origin/main
git push --force-with-lease

Thanks!

@santihernandezc santihernandezc force-pushed the santihernandezc/fix_memory_leak_soft_delete_tenant_registries branch from 350206f to da597af Compare April 13, 2026 09:56
@santihernandezc santihernandezc changed the title (WIP): Fix memory leak when soft-deleting tenant registries fix: avoid memory leaks when soft-deleting tenant registries Apr 13, 2026
@santihernandezc santihernandezc marked this pull request as ready for review April 13, 2026 15:05
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Race condition between archived copy and registries snapshot
    • BuildMetricFamiliesPerTenant now snapshots both r.regs and r.archived under the same lock acquisition so a soft-removal cannot be missed between copies.

Create PR

Or push these changes by commenting:

@cursor push 3f59d19672
Preview (3f59d19672)
diff --git a/metrics/tenant_registries.go b/metrics/tenant_registries.go
--- a/metrics/tenant_registries.go
+++ b/metrics/tenant_registries.go
@@ -924,6 +924,11 @@
 func (r *TenantRegistries) BuildMetricFamiliesPerTenant() MetricFamiliesPerTenant {
 	r.regsMu.Lock()
 
+	// Copy registries while holding the same lock as archived metrics to build
+	// a consistent snapshot for this scrape.
+	regsCopy := make([]TenantRegistry, 0, len(r.regs))
+	regsCopy = append(regsCopy, r.regs...)
+
 	// Shallow copy the archived metrics map to avoid concurrent map access.
 	// The MetricFamily/Metric objects are cloned on-write, so sharing pointers is safe.
 	archivedCopy := make(MetricFamilyMap, len(r.archived))
@@ -943,7 +948,7 @@
 	}
 	r.regsMu.Unlock()
 
-	for _, entry := range r.Registries() {
+	for _, entry := range regsCopy {
 		m, err := entry.reg.Gather()
 		if err == nil {
 			var mfm MetricFamilyMap // := would shadow err from outer block, and single err check will not work

You can send follow-ups to the cloud agent here.

Reviewed by Cursor Bugbot for commit 9b4980f. Configure here.

Comment thread metrics/tenant_registries.go Outdated
@moustafab moustafab self-requested a review April 15, 2026 15:13
Copy link
Copy Markdown

@moustafab moustafab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only really have one concern. This looks like a sane and useful change to me and if we can confirm the "fail to gather on archive" behavior then I think we're good to go.

// archiveTenantRegistry gathers and stores counters/histograms/summaries from a tenant
// registry being removed. Gauges are excluded since they represent current state, not
// cumulative values. Archived metrics are aggregated to avoid counter resets.
func (r *TenantRegistries) archiveTenantRegistry(ur *TenantRegistry) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change here to make this essentially infallible might result in some strange behaviors if we fail to gather the metrics. I'm not sure that is completely safe. In the old model we'd remove if we failed to soft remove so perhaps this is still equivalent. Have you considered this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this is still the same behavior as before.
If we fail to gather the metrics in the current code, we return false. This causes the registry to be removed (same as hard-deleted), so it would cause counter restarts.

@santihernandezc santihernandezc merged commit 5e372df into main Apr 22, 2026
19 checks passed
@santihernandezc santihernandezc deleted the santihernandezc/fix_memory_leak_soft_delete_tenant_registries branch April 22, 2026 15:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Memory leak when repeatedly creating and soft-removing tenant registries

3 participants