clean up stale cluster topic metrics on epoch transitions#8562

Merged
zhangchiqing merged 9 commits into master from leo/cleanup-stale-cluster-metrics
May 28, 2026

Conversation

@zhangchiqing
Member

@zhangchiqing zhangchiqing commented May 18, 2026

Fix #8555

Summary by CodeRabbit

  • New Features

    • Automatic cleanup of cluster-topic metrics when a node leaves, preventing unbounded per-topic metric growth across epoch/cluster transitions.
    • Tracing now triggers cluster-topic cleanup and emits a debug log when cleanup runs.
  • Tests

    • Added tests verifying cleanup removes only cluster-topic metrics, is idempotent, and safely handles nonexistent topics.
  • Chores

    • Added no-op handlers and mock hooks to support the new cleanup callback.

Review Change Stack

@coderabbitai
Contributor

coderabbitai Bot commented May 18, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds OnClusterTopicMetricsCleanup(topic string) to metrics interfaces and implementations; Tracer calls it when leaving cluster topics to delete per-topic Prometheus label values. Adds autogenerated mock methods and unit tests verifying cleanup and idempotency.

Changes

Cluster Topic Metrics Cleanup on Node Leave

| Layer / File(s) | Summary |
| --- | --- |
| Metrics cleanup interface and core implementation: module/metrics.go, module/metrics/gossipsub.go, module/metrics/noop.go, module/metrics/gossipsub_rpc_validation_inspector.go, module/metrics/network.go | Declare OnClusterTopicMetricsCleanup(topic string) on relevant metrics interfaces and implement it to delete per-topic label values from mesh size, graft/prune counters, received-ihave histograms, inbound/outbound size histograms and duplicate-messages counters. |
| Tracer detects cluster topic exit and triggers cleanup: network/p2p/tracer/gossipSubMeshTracer.go | GossipSubMeshTracer.Leave now detects cluster topics, clears tracked mesh peers under topicMeshMu, calls OnClusterTopicMetricsCleanup(topic), and logs the cleanup. |
| Test infra and autogenerated mocks: network/p2p/tracer/gossipSubMeshTracer_test.go, module/mock/gossip_sub_metrics.go, module/mock/lib_p2_p_metrics.go, module/mock/local_gossip_sub_router_metrics.go, module/mock/network_metrics.go, module/mock/gossip_sub_rpc_validation_inspector_metrics.go, module/metrics/network_test.go, go.mod | Forwarding helper added in test collector and TestGossipSubMeshTracer_ClusterTopicCleanup plus network collector tests added. Autogenerated mock stubs and expecter/call wrappers added across metric mock types to support testing of the new callback; go.mod dependency annotation adjusted. |
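
The tracer change summarized above can be sketched roughly as follows. This is a simplified stand-in, not the flow-go code: `meshTracer`, `isClusterTopic`, `recorder`, and the map layout are hypothetical. Only the `OnClusterTopicMetricsCleanup(topic string)` signature, the `topicMeshMu` name, and the `sync-cluster-` / `consensus-cluster-` prefixes come from this PR. Whether the real callback runs inside or outside the lock is not shown here, so releasing the lock first is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// clusterTopicPrefixes are the cluster topic prefixes named in this PR.
var clusterTopicPrefixes = []string{"sync-cluster-", "consensus-cluster-"}

// isClusterTopic reports whether topic is an epoch-scoped cluster topic.
// Hypothetical helper for illustration.
func isClusterTopic(topic string) bool {
	for _, p := range clusterTopicPrefixes {
		if strings.HasPrefix(topic, p) {
			return true
		}
	}
	return false
}

// cleanupMetrics is the one callback this sketch needs from the metrics layer.
type cleanupMetrics interface {
	OnClusterTopicMetricsCleanup(topic string)
}

// recorder is a stand-in collector that records which topics were cleaned.
type recorder struct{ cleaned []string }

func (r *recorder) OnClusterTopicMetricsCleanup(topic string) {
	r.cleaned = append(r.cleaned, topic)
}

// meshTracer is a simplified stand-in for GossipSubMeshTracer.
type meshTracer struct {
	topicMeshMu sync.Mutex
	topicMesh   map[string]map[string]struct{} // topic -> set of peer IDs
	metrics     cleanupMetrics
}

// Leave drops tracked peers for a cluster topic and triggers metric cleanup.
// In this sketch, non-cluster topics are left untouched.
func (t *meshTracer) Leave(topic string) {
	if !isClusterTopic(topic) {
		return
	}
	t.topicMeshMu.Lock()
	delete(t.topicMesh, topic)
	t.topicMeshMu.Unlock()

	// Assumption: the callback is invoked after the lock is released.
	t.metrics.OnClusterTopicMetricsCleanup(topic)
}

func main() {
	rec := &recorder{}
	tracer := &meshTracer{
		topicMesh: map[string]map[string]struct{}{
			"sync-cluster-cluster-232-abc": {"peer-1": {}},
			"push-blocks/abc":              {"peer-2": {}},
		},
		metrics: rec,
	}
	tracer.Leave("sync-cluster-cluster-232-abc")
	tracer.Leave("push-blocks/abc")
	fmt.Println("cleaned topics:", rec.cleaned)
}
```

Keeping the callback behind a one-method interface is what lets the no-op and mock implementations listed in the table satisfy it without touching Prometheus.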

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Breaking Change

Suggested reviewers

  • janezpodhostnik
  • m-Peter

Poem

🐇 I hop through topics, soft and quick,
I nibble labels stale and slick,
I purge old metrics, one by one,
So epochs pass and charts stay fun,
A tidy hop — the cleanup's done.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'clean up stale cluster topic metrics on epoch transitions' directly and specifically describes the main change: removing stale metrics from previous epochs when topics become inactive. |
| Linked Issues check | ✅ Passed | The PR successfully addresses all primary requirements from issue #8555: adding OnClusterTopicMetricsCleanup callbacks to clean up stale metrics from previous-epoch cluster topics, implementing cleanup logic in the mesh tracer and network collector, and preventing unbounded metric cardinality growth. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the metric cleanup objective: interface definitions, implementation in tracer and collectors, mock updates for testing, and comprehensive unit tests. The dependency version update is necessary for metric label deletion functionality. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

@github-actions
Contributor

github-actions Bot commented May 18, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@codecov-commenter

codecov-commenter commented May 18, 2026

@zhangchiqing zhangchiqing marked this pull request as ready for review May 19, 2026 00:41
@zhangchiqing zhangchiqing requested a review from a team as a code owner May 19, 2026 00:41
@zhangchiqing zhangchiqing force-pushed the leo/cleanup-stale-cluster-metrics branch from 1e54f41 to de0a722 Compare May 19, 2026 00:41
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@network/p2p/tracer/gossipSubMeshTracer_test.go`:
- Around line 300-301: Replace the fixed time.Sleep(100 * time.Millisecond) with
an eventual assertion that waits for the Leave callback to be processed: either
have the Leave callback signal a channel or set a boolean (e.g., callbackCalled)
and in the test use a select with time.After(timeout) or a polling loop that
checks the cleanup condition (for example the peer set size on the mesh tracer
or a callbackCalled flag) until it becomes true, failing the test on timeout;
update the test around the Leave callback invocation and assert the condition
instead of sleeping.
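
One way to follow this suggestion with only the standard library is to have the callback close a channel and wait on it with a timeout. The helper name `waitFor` and the goroutine standing in for the Leave callback are illustrative assumptions, not code from this PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitFor blocks until done is closed or the timeout elapses.
// It replaces a fixed sleep with a bounded wait on an explicit signal.
func waitFor(done <-chan struct{}, timeout time.Duration) error {
	select {
	case <-done:
		return nil
	case <-time.After(timeout):
		return errors.New("timed out waiting for cleanup callback")
	}
}

func main() {
	done := make(chan struct{})

	// Stand-in for the asynchronous Leave callback: it signals completion
	// by closing the channel once its work is finished.
	go func() {
		time.Sleep(10 * time.Millisecond)
		close(done)
	}()

	if err := waitFor(done, time.Second); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("cleanup callback observed")
}
```

The test then returns as soon as the callback fires and fails only if the timeout is reached, so it is neither slower than necessary nor dependent on a guessed delay.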

📥 Commits

Reviewing files that changed from the base of the PR and between 5b9fa9c and de0a722.

📒 Files selected for processing (9)
  • module/metrics.go
  • module/metrics/gossipsub.go
  • module/metrics/noop.go
  • module/mock/gossip_sub_metrics.go
  • module/mock/lib_p2_p_metrics.go
  • module/mock/local_gossip_sub_router_metrics.go
  • module/mock/network_metrics.go
  • network/p2p/tracer/gossipSubMeshTracer.go
  • network/p2p/tracer/gossipSubMeshTracer_test.go

Comment thread network/p2p/tracer/gossipSubMeshTracer_test.go Outdated
@janezpodhostnik
Contributor

The issue says that the top offenders are:

  • network_gossip_gossipsub_received_ihave_message_ids_bucket
  • network_gossip_inbound_message_size_bytes_bucket
  • network_gossip_outbound_message_size_bytes_bucket

but this seems to touch:

  • network_gossip_gossipsub_local_mesh_size
  • network_gossip_gossipsub_graft_topic_total
  • network_gossip_gossipsub_prune_topic_total

Is this ok?

@zhangchiqing zhangchiqing force-pushed the leo/cleanup-stale-cluster-metrics branch 2 times, most recently from ed54daa to a8f0a12 Compare May 19, 2026 16:35
@zhangchiqing zhangchiqing force-pushed the leo/cleanup-stale-cluster-metrics branch from a8f0a12 to a755560 Compare May 19, 2026 17:12
Comment thread module/metrics/gossipsub.go Outdated

// OnClusterTopicMetricsCleanup removes all metric label values associated with the given cluster topic.
// This prevents unbounded metric cardinality growth during epoch transitions when collection nodes
// join new clusters and leave old ones. Only call this for cluster topics (sync-cluster-*, consensus-cluster-*).
Contributor

Will we still see metrics for the old clusters, or will they no longer be reported? There is value in getting metrics even for old clusters, to see whether the node is being spammed on those topics, so I was thinking there could be a catch-all label for old clusters, e.g. cluster-old.

Member Author

The Prometheus client only supports adding or removing metric label values. There's no relabel or aggregation operation. Aggregating old cluster metrics into a cluster-old label would require reading internal histogram bucket values and re-recording them, which isn't supported by the API (and would be complex to implement).

What I have implemented is to remove metrics for the previous epoch when the node leaves that cluster topic (~600 blocks after epoch ends).

Note: Grafana charges for active series. Before this PR, entering a new epoch would add a new set of metrics to the active list, which is maintained in the node's memory by the Prometheus client. Since we recently restarted all collection nodes, metrics for past epochs are no longer in memory and therefore no longer active - we should already see a drop in cost from the restart alone.

However, without this PR, metrics would start accumulating again with each new epoch. With these changes, we actively remove old epoch metrics from the active list when epoch transitions occur, preventing unbounded growth going forward.

When metrics are removed from the active list, the historical data remains visible on Grafana dashboards until retention expires - so visibility is preserved, but we stop paying for those series.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@module/metrics/network_test.go`:
- Around line 40-50: Add cleanup assertions for all metric families touched by
the test: after invoking nc.OnLocalMeshSizeUpdated, nc.OnPeerGraftTopic,
nc.OnPeerPruneTopic, and nc.OnIHaveMessageIDsReceived, verify that the metrics
for clusterTopic are removed and metrics for otherTopic remain; specifically add
assertions that the LocalGossipSubRouter (local mesh size), PeerGraft/PeerPrune
counters, and GossipSubRpcValidationInspector (IHaveMessageIDsReceived) metrics
have been cleaned up for clusterTopic and still exist for otherTopic. Apply the
same pair of assertions for the similar block referenced around the other range
(lines 63-71) so both test sections validate cleanup for the same metric
families.

📥 Commits

Reviewing files that changed from the base of the PR and between a8f0a12 and a755560.

📒 Files selected for processing (6)
  • module/metrics.go
  • module/metrics/gossipsub_rpc_validation_inspector.go
  • module/metrics/network.go
  • module/metrics/network_test.go
  • module/mock/gossip_sub_rpc_validation_inspector_metrics.go
  • network/p2p/tracer/gossipSubMeshTracer_test.go
✅ Files skipped from review due to trivial changes (1)
  • module/mock/gossip_sub_rpc_validation_inspector_metrics.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • module/metrics.go
  • network/p2p/tracer/gossipSubMeshTracer_test.go
  • module/metrics/gossipsub_rpc_validation_inspector.go
  • module/metrics/network.go

Comment thread module/metrics/network_test.go Outdated
Comment on lines +40 to +50
// Record LocalGossipSubRouterMetrics
nc.OnLocalMeshSizeUpdated(clusterTopic, 5)
nc.OnLocalMeshSizeUpdated(otherTopic, 3)
nc.OnPeerGraftTopic(clusterTopic)
nc.OnPeerGraftTopic(otherTopic)
nc.OnPeerPruneTopic(clusterTopic)
nc.OnPeerPruneTopic(otherTopic)

// Record GossipSubRpcValidationInspectorMetrics
nc.OnIHaveMessageIDsReceived(clusterTopic, 10)
nc.OnIHaveMessageIDsReceived(otherTopic, 5)
Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add cleanup assertions for all metric families exercised in this test.

This test records cluster-topic samples for local-mesh/graft/prune and IHave metrics, but post-cleanup assertions only validate inbound/outbound/duplicate. A regression in those other collectors would pass unnoticed; please assert they are removed for clusterTopic and retained for otherTopic too.

Also applies to: 63-71
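
The assertion pattern requested here, checking every recorded family for both removal and retention, can be sketched with a toy model. The `family` type below is a map-backed stand-in, not the Prometheus client API, and `checkCleanup` is a hypothetical helper. The family names are the base names of metrics mentioned earlier in this thread.

```go
package main

import "fmt"

// family is a toy stand-in for one labeled metric family, keyed by topic.
// It only models "a series exists for this label value or it does not".
type family struct {
	name   string
	series map[string]float64
}

func newFamily(name string) *family {
	return &family{name: name, series: map[string]float64{}}
}

// Observe creates or updates the series for topic.
func (f *family) Observe(topic string, v float64) { f.series[topic] += v }

// Delete removes the series for topic and reports whether one existed,
// so a second call for the same topic is a harmless no-op.
func (f *family) Delete(topic string) bool {
	_, ok := f.series[topic]
	delete(f.series, topic)
	return ok
}

// Has reports whether a series exists for topic.
func (f *family) Has(topic string) bool {
	_, ok := f.series[topic]
	return ok
}

// checkCleanup returns an error for the first family that still has a series
// for removed, or that has lost its series for kept.
func checkCleanup(families []*family, removed, kept string) error {
	for _, f := range families {
		if f.Has(removed) {
			return fmt.Errorf("%s: series for %q still present", f.name, removed)
		}
		if !f.Has(kept) {
			return fmt.Errorf("%s: series for %q was removed", f.name, kept)
		}
	}
	return nil
}

func main() {
	const clusterTopic, otherTopic = "sync-cluster-example", "push-blocks/example"
	families := []*family{
		newFamily("network_gossip_gossipsub_local_mesh_size"),
		newFamily("network_gossip_gossipsub_graft_topic_total"),
		newFamily("network_gossip_gossipsub_prune_topic_total"),
		newFamily("network_gossip_gossipsub_received_ihave_message_ids"),
	}
	for _, f := range families {
		f.Observe(clusterTopic, 1)
		f.Observe(otherTopic, 1)
	}
	for _, f := range families {
		f.Delete(clusterTopic)
	}
	if err := checkCleanup(families, clusterTopic, otherTopic); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("cleanup verified for all families")
}
```

Looping over every family in one helper means a collector that is recorded but never cleaned up fails the test by name, which is the regression the review comment is worried about.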

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go.mod`:
- Line 310: The go.mod line for the dependency
"github.com/prometheus/client_model v0.6.2" was changed to a direct dependency
but is only used transitively; revert it to an indirect dependency by restoring
the "// indirect" marker on the "github.com/prometheus/client_model v0.6.2"
entry in go.mod and then run "go mod tidy" to ensure module metadata is
consistent; locate the exact module line ("github.com/prometheus/client_model
v0.6.2") in go.mod and add the "// indirect" comment back.

📥 Commits

Reviewing files that changed from the base of the PR and between a755560 and df9c16b.

📒 Files selected for processing (1)
  • go.mod

Comment thread go.mod
  github.com/polydawn/refmt v0.89.0 // indirect
  github.com/power-devops/perfstat v0.0.0-20210106213030-5aafc221ea8c // indirect
- github.com/prometheus/client_model v0.6.2 // indirect
+ github.com/prometheus/client_model v0.6.2
Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify that client_model is directly imported and check security

# Check if client_model is directly imported in any Go files
echo "=== Direct imports of prometheus/client_model ==="
rg -n --type=go 'import.*"github.com/prometheus/client_model' -C2

# Check for any usage of client_model types
echo -e "\n=== Usage of client_model types ==="
rg -n --type=go 'client_model\.' -C2

# Check GitHub for security advisories on this package
echo -e "\n=== Security advisories for prometheus/client_model ==="
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: GO, package: "github.com/prometheus/client_model") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'

Repository: onflow/flow-go

Length of output: 251


Revert prometheus/client_model to indirect dependency.

The // indirect marker was removed, marking github.com/prometheus/client_model v0.6.2 as a direct dependency. However, verification found no direct imports or usages of this package in the codebase—it's only used transitively through another dependency. Restore the // indirect comment to accurately reflect its role and maintain go.mod clarity.


Comment thread module/metrics/network.go Outdated
nc.GossipSubRpcValidationInspectorMetrics.OnClusterTopicMetricsCleanup(topic)

// Clean up inbound/outbound message size metrics using partial match on topic
nc.inboundMessageSize.DeletePartialMatch(prometheus.Labels{LabelChannel: topic})
Member Author

Good catch, @janezpodhostnik. More metrics have been added to the cleanup list here.

@vishalchangrani
Contributor

@zhangchiqing - will this change take care of all these metrics (same metric, but with different labels per cluster)?

This is from a collection node:

$ curl http://localhost:8080/metrics | grep network_gossip_gossipsub_received_ihave_message_ids_bucket
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="consensus-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="0.005"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="consensus-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="0.01"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="consensus-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="0.025"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="consensus-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="0.05"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="consensus-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="0.1"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="consensus-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="0.25"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="consensus-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="0.5"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="consensus-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="1"} 2
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="consensus-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="2.5"} 2
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="consensus-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="5"} 3
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="consensus-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="10"} 1.078611e+06
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="consensus-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="+Inf"} 1.147784e+06
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-blocks/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.005"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-blocks/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.01"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-blocks/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.025"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-blocks/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.05"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-blocks/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.1"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-blocks/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.25"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-blocks/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.5"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-blocks/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="1"} 136777
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-blocks/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="2.5"} 492018
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-blocks/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="5"} 8.580713e+06
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-blocks/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="10"} 8.804109e+06
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-blocks/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="+Inf"} 8.804248e+06
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-guarantees/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.005"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-guarantees/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.01"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-guarantees/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.025"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-guarantees/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.05"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-guarantees/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.1"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-guarantees/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.25"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-guarantees/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.5"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-guarantees/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="1"} 597380
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-guarantees/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="2.5"} 1.277746e+06
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-guarantees/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="5"} 2.499913e+06
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-guarantees/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="10"} 3.180214e+06
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-guarantees/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="+Inf"} 3.183469e+06
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-transactions/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.005"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-transactions/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.01"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-transactions/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.025"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-transactions/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.05"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-transactions/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.1"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-transactions/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.25"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-transactions/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.5"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-transactions/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="1"} 163036
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-transactions/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="2.5"} 339218
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-transactions/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="5"} 659125
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-transactions/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="10"} 804269
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="push-transactions/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="+Inf"} 994667
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="0.005"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="0.01"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="0.025"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="0.05"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="0.1"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="0.25"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="0.5"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="1"} 288
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="2.5"} 642
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="5"} 133895
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="10"} 537231
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-cluster-cluster-232-d9efe2a782b6967a8c0854f9c23c9d3e399b6338efd1871e1f6e3a13cc4d5033",le="+Inf"} 966484
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-committee/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.005"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-committee/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.01"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-committee/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.025"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-committee/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.05"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-committee/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.1"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-committee/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.25"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-committee/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="0.5"} 0
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-committee/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="1"} 2
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-committee/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="2.5"} 3
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-committee/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="5"} 8
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-committee/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="10"} 12
network_gossip_gossipsub_received_ihave_message_ids_bucket{topic="sync-committee/25c44a8f93f53074cc4ef043d381694b8b1f2d1a3a6ea844a9202cd0f9897733",le="+Inf"} 8.810882e+06

@zhangchiqing
Member Author

Yes, this PR will clean up metrics for the cluster topics (consensus-cluster-* and sync-cluster-*) when the node leaves those topics during epoch transitions.

From your output, these will be cleaned up:

  • consensus-cluster-cluster-232-*
  • sync-cluster-cluster-232-*

The other topics (push-blocks, push-guarantees, push-transactions, sync-committee) are not cluster topics - they're persistent network channels that exist across all epochs. These won't be cleaned up (and shouldn't be - they don't cause unbounded cardinality growth since they don't change per epoch).

The 232 in these topic names is the current epoch counter.

Without this PR:

  • When epoch 233 starts, the node leaves consensus-cluster-cluster-232-* topics
  • But the metrics remain in Prometheus client memory
  • They continue to be exposed on /metrics, so they remain active and are still charged

With this PR:

  • When epoch 233 starts, the node leaves consensus-cluster-cluster-232-* topics
  • Leave() triggers OnClusterTopicMetricsCleanup(), which deletes those metrics from memory
  • They are no longer exposed on /metrics, so they are neither active nor charged; the data already scraped stays accessible in Grafana until the retention expires
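
The cleanup flow described above can be modelled with a small sketch. This is not the code from the PR: `topicGauge`, `isClusterTopic`, and the prefix list are hypothetical names, a plain map stands in for a Prometheus metric vector, and placing the cluster-topic check inside the callback is an assumption. Only the callback name `OnClusterTopicMetricsCleanup(topic string)` comes from the PR.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// clusterTopicPrefixes lists the per-epoch topic prefixes named in this thread.
// Keeping them in a slice like this is an assumption made for illustration.
var clusterTopicPrefixes = []string{"consensus-cluster-", "sync-cluster-"}

// isClusterTopic (hypothetical helper) reports whether a topic is epoch-scoped.
func isClusterTopic(topic string) bool {
	for _, p := range clusterTopicPrefixes {
		if strings.HasPrefix(topic, p) {
			return true
		}
	}
	return false
}

// topicGauge is a stand-in for one metric family labeled by topic.
type topicGauge struct {
	mu     sync.Mutex
	values map[string]float64
}

func newTopicGauge() *topicGauge {
	return &topicGauge{values: make(map[string]float64)}
}

// Set records a value for the given topic label.
func (g *topicGauge) Set(topic string, v float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values[topic] = v
}

// OnClusterTopicMetricsCleanup drops the series of a cluster topic.
// Persistent topics are ignored, and repeated or unknown topics are no-ops.
func (g *topicGauge) OnClusterTopicMetricsCleanup(topic string) {
	if !isClusterTopic(topic) {
		return
	}
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.values, topic)
}

// Len returns the number of topic series currently held in memory.
func (g *topicGauge) Len() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return len(g.values)
}

func main() {
	g := newTopicGauge()
	g.Set("consensus-cluster-cluster-232-abc", 4) // shortened, made-up topic name
	g.Set("sync-committee/25c4", 7)               // persistent topic, shortened

	g.OnClusterTopicMetricsCleanup("consensus-cluster-cluster-232-abc")
	g.OnClusterTopicMetricsCleanup("consensus-cluster-cluster-232-abc") // second call is a no-op
	g.OnClusterTopicMetricsCleanup("sync-committee/25c4")               // not a cluster topic

	fmt.Println(g.Len()) // prints 1: only the persistent topic remains
}
```

With the real Prometheus Go client, the map deletion would presumably be replaced by a call such as `DeleteLabelValues` on each per-topic vector.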

…milies (#8564)

* extend cluster topic metrics cleanup to cover all per-topic metric families
* add public OnClusterTopicMetricsCleanup to AlspMetrics instead of accessing private field
Member

@Kay-Zee Kay-Zee left a comment

We can try this first.

There's a chance we'll STILL be charged for it, even if it's not exposed on /metrics, at least for some time. I forget what the retention is.

I think the "best" solve would be to not create the additional cardinality in the first place. Like, have the thing that's changing as a label, rather than adding cardinality. Is that possible?

…s-aggregate

normalize cluster topic labels in metrics to prevent unbounded cardin…
@zhangchiqing
Member Author

I think the "best" solve would be to not create the additional cardinality in the first place. Like, have the thing that's changing as a label, rather than adding cardinality. Is that possible?

Good point.

I've changed the metrics to normalize the topic labels. The difference is:

Before (master):
The topic label contains the epoch ID and a hash:

"network_gossip_direct_messages_in_progress{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_first_message_delivery_count_bucket{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_first_message_delivery_count_bucket{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_first_message_delivery_count_count{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_first_message_delivery_count_count{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_first_message_delivery_count_sum{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_first_message_delivery_count_sum{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_graft_topic_total{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_graft_topic_total{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_invalid_message_delivery_count_bucket{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_invalid_message_delivery_count_bucket{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_invalid_message_delivery_count_count{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_invalid_message_delivery_count_count{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_invalid_message_delivery_count_sum{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_invalid_message_delivery_count_sum{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_local_mesh_size{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_local_mesh_size{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_mesh_message_delivery_bucket{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_mesh_message_delivery_bucket{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_mesh_message_delivery_count{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_mesh_message_delivery_count{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_mesh_message_delivery_sum{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_mesh_message_delivery_sum{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_time_in_mesh_quantum_count_bucket{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_time_in_mesh_quantum_count_bucket{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_time_in_mesh_quantum_count_count{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_time_in_mesh_quantum_count_count{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_time_in_mesh_quantum_count_sum{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_gossipsub_time_in_mesh_quantum_count_sum{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_inbound_message_size_bytes_bucket{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_inbound_message_size_bytes_bucket{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_inbound_message_size_bytes_count{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_inbound_message_size_bytes_count{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_inbound_message_size_bytes_sum{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_inbound_message_size_bytes_sum{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_outbound_message_size_bytes_bucket{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_outbound_message_size_bytes_bucket{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_outbound_message_size_bytes_count{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_outbound_message_size_bytes_count{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_outbound_message_size_bytes_sum{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_gossip_outbound_message_size_bytes_sum{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_queue_current_messages_processing{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_queue_current_messages_processing{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_queue_engine_message_processing_time_seconds{topic=\"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"
"network_queue_engine_message_processing_time_seconds{topic=\"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a\"}"

After (leo/cleanup-stale-cluster-metrics-aggregate):

The topic labels are normalized: each one is just consensus-cluster or sync-cluster, without the epoch ID and hash.
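
A minimal sketch of such a normalization, assuming the label is derived from the topic name by prefix matching. `normalizeTopicLabel` and the two constants are hypothetical names, not identifiers from the PR, and the actual implementation may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Normalized label values, as they appear in the "After" output in this thread.
const (
	consensusClusterLabel = "consensus-cluster"
	syncClusterLabel      = "sync-cluster"
)

// normalizeTopicLabel (hypothetical name) collapses epoch-scoped cluster topics
// to a fixed label so that the label set does not grow with each epoch.
// Any other topic is returned unchanged.
func normalizeTopicLabel(topic string) string {
	for _, base := range []string{consensusClusterLabel, syncClusterLabel} {
		if topic == base || strings.HasPrefix(topic, base+"-") {
			return base
		}
	}
	return topic
}

func main() {
	topics := []string{
		"consensus-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a",
		"sync-cluster-cluster-0-3ecb05b373c5e1baa236bb55d4ec13cdcb0bce0faefb804296bc3168de76629a",
		"push-blocks",
	}
	for _, t := range topics {
		fmt.Printf("%s -> %s\n", t, normalizeTopicLabel(t))
	}
}
```

Because every epoch maps to the same two label values, the number of series per metric family stays constant across epoch transitions, at the cost of no longer distinguishing clusters by label.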

curl -s -G 'http://localhost:9090/api/v1/series' \
        --data-urlencode 'match[]={topic=~".*cluster.*"}' | \
        jq '.data[] | "\(.__name__){topic=\"\(.topic)\"}"' | sort -u

"network_gossip_direct_messages_in_progress{topic=\"consensus-cluster\"}"
"network_gossip_gossipsub_first_message_delivery_count_bucket{topic=\"consensus-cluster\"}"
"network_gossip_gossipsub_first_message_delivery_count_bucket{topic=\"sync-cluster\"}"
"network_gossip_gossipsub_first_message_delivery_count_count{topic=\"consensus-cluster\"}"
"network_gossip_gossipsub_first_message_delivery_count_count{topic=\"sync-cluster\"}"
"network_gossip_gossipsub_first_message_delivery_count_sum{topic=\"consensus-cluster\"}"
"network_gossip_gossipsub_first_message_delivery_count_sum{topic=\"sync-cluster\"}"
"network_gossip_gossipsub_graft_topic_total{topic=\"consensus-cluster\"}"
"network_gossip_gossipsub_graft_topic_total{topic=\"sync-cluster\"}"
"network_gossip_gossipsub_invalid_message_delivery_count_bucket{topic=\"consensus-cluster\"}"
"network_gossip_gossipsub_invalid_message_delivery_count_bucket{topic=\"sync-cluster\"}"
"network_gossip_gossipsub_invalid_message_delivery_count_count{topic=\"consensus-cluster\"}"
"network_gossip_gossipsub_invalid_message_delivery_count_count{topic=\"sync-cluster\"}"
"network_gossip_gossipsub_invalid_message_delivery_count_sum{topic=\"consensus-cluster\"}"
"network_gossip_gossipsub_invalid_message_delivery_count_sum{topic=\"sync-cluster\"}"
"network_gossip_gossipsub_local_mesh_size{topic=\"consensus-cluster\"}"
"network_gossip_gossipsub_local_mesh_size{topic=\"sync-cluster\"}"
"network_gossip_gossipsub_mesh_message_delivery_bucket{topic=\"consensus-cluster\"}"
"network_gossip_gossipsub_mesh_message_delivery_bucket{topic=\"sync-cluster\"}"
"network_gossip_gossipsub_mesh_message_delivery_count{topic=\"consensus-cluster\"}"
"network_gossip_gossipsub_mesh_message_delivery_count{topic=\"sync-cluster\"}"
"network_gossip_gossipsub_mesh_message_delivery_sum{topic=\"consensus-cluster\"}"
"network_gossip_gossipsub_mesh_message_delivery_sum{topic=\"sync-cluster\"}"
"network_gossip_gossipsub_time_in_mesh_quantum_count_bucket{topic=\"consensus-cluster\"}"
"network_gossip_gossipsub_time_in_mesh_quantum_count_bucket{topic=\"sync-cluster\"}"
"network_gossip_gossipsub_time_in_mesh_quantum_count_count{topic=\"consensus-cluster\"}"
"network_gossip_gossipsub_time_in_mesh_quantum_count_count{topic=\"sync-cluster\"}"
"network_gossip_gossipsub_time_in_mesh_quantum_count_sum{topic=\"consensus-cluster\"}"
"network_gossip_gossipsub_time_in_mesh_quantum_count_sum{topic=\"sync-cluster\"}"
"network_gossip_inbound_message_size_bytes_bucket{topic=\"consensus-cluster\"}"
"network_gossip_inbound_message_size_bytes_bucket{topic=\"sync-cluster\"}"
"network_gossip_inbound_message_size_bytes_count{topic=\"consensus-cluster\"}"
"network_gossip_inbound_message_size_bytes_count{topic=\"sync-cluster\"}"
"network_gossip_inbound_message_size_bytes_sum{topic=\"consensus-cluster\"}"
"network_gossip_inbound_message_size_bytes_sum{topic=\"sync-cluster\"}"
"network_gossip_outbound_message_size_bytes_bucket{topic=\"consensus-cluster\"}"
"network_gossip_outbound_message_size_bytes_bucket{topic=\"sync-cluster\"}"
"network_gossip_outbound_message_size_bytes_count{topic=\"consensus-cluster\"}"
"network_gossip_outbound_message_size_bytes_count{topic=\"sync-cluster\"}"
"network_gossip_outbound_message_size_bytes_sum{topic=\"consensus-cluster\"}"
"network_gossip_outbound_message_size_bytes_sum{topic=\"sync-cluster\"}"
"network_queue_current_messages_processing{topic=\"consensus-cluster\"}"
"network_queue_current_messages_processing{topic=\"sync-cluster\"}"
"network_queue_engine_message_processing_time_seconds{topic=\"consensus-cluster\"}"
"network_queue_engine_message_processing_time_seconds{topic=\"sync-cluster\"}"
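
For anyone who wants to run the same check from Go rather than `curl | jq`, the post-processing step can be sketched as below. It assumes the standard response shape of the Prometheus `/api/v1/series` endpoint (a `data` array of label maps); `uniqueTopicSeries` and `seriesResponse` are hypothetical names, and the sample body is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// seriesResponse mirrors the fields of the series API response used here.
type seriesResponse struct {
	Status string              `json:"status"`
	Data   []map[string]string `json:"data"`
}

// uniqueTopicSeries returns sorted, de-duplicated `name{topic="..."}` strings,
// mirroring the jq expression followed by `sort -u`.
func uniqueTopicSeries(body []byte) ([]string, error) {
	var resp seriesResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	seen := make(map[string]struct{})
	for _, labels := range resp.Data {
		key := fmt.Sprintf("%s{topic=%q}", labels["__name__"], labels["topic"])
		seen[key] = struct{}{}
	}
	out := make([]string, 0, len(seen))
	for s := range seen {
		out = append(out, s)
	}
	sort.Strings(out)
	return out, nil
}

func main() {
	// Made-up sample body; a real one would come from the HTTP response.
	body := []byte(`{"status":"success","data":[
		{"__name__":"network_gossip_gossipsub_local_mesh_size","topic":"sync-cluster","instance":"a"},
		{"__name__":"network_gossip_gossipsub_local_mesh_size","topic":"consensus-cluster","instance":"a"},
		{"__name__":"network_gossip_gossipsub_local_mesh_size","topic":"sync-cluster","instance":"b"}
	]}`)
	series, err := uniqueTopicSeries(body)
	if err != nil {
		panic(err)
	}
	for _, s := range series {
		fmt.Println(s)
	}
}
```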

Member

@Kay-Zee Kay-Zee left a comment

Yeah, I think this is the best approach for the network metrics. It will make it a little harder to group the metrics by cluster, but that will still be possible with post-analysis.

@zhangchiqing zhangchiqing added this pull request to the merge queue May 28, 2026
Merged via the queue into master with commit d78e8cf May 28, 2026
61 checks passed
@zhangchiqing zhangchiqing deleted the leo/cleanup-stale-cluster-metrics branch May 28, 2026 17:21

Development

Successfully merging this pull request may close these issues.

Clean up stale cluster topic metrics on collection nodes after epoch transitions

5 participants