clean up stale cluster topic metrics on epoch transitions#8562
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds OnClusterTopicMetricsCleanup(topic string) to metrics interfaces and implementations; Tracer calls it when leaving cluster topics to delete per-topic Prometheus label values. Adds autogenerated mock methods and unit tests verifying cleanup and idempotency. ChangesCluster Topic Metrics Cleanup on Node Leave
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
1e54f41 to
de0a722
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@network/p2p/tracer/gossipSubMeshTracer_test.go`:
- Around line 300-301: Replace the fixed time.Sleep(100 * time.Millisecond) with
an eventual assertion that waits for the Leave callback to be processed: either
have the Leave callback signal a channel or set a boolean (e.g., callbackCalled)
and in the test use a select with time.After(timeout) or a polling loop that
checks the cleanup condition (for example the peer set size on the mesh tracer
or a callbackCalled flag) until it becomes true, failing the test on timeout;
update the test around the Leave callback invocation and assert the condition
instead of sleeping.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7e19d6d3-cb3c-4dc3-994b-55fc5afcc193
📒 Files selected for processing (9)
module/metrics.gomodule/metrics/gossipsub.gomodule/metrics/noop.gomodule/mock/gossip_sub_metrics.gomodule/mock/lib_p2_p_metrics.gomodule/mock/local_gossip_sub_router_metrics.gomodule/mock/network_metrics.gonetwork/p2p/tracer/gossipSubMeshTracer.gonetwork/p2p/tracer/gossipSubMeshTracer_test.go
|
The issue says that the top offenders are:
but this seems to touch:
Is this ok? |
ed54daa to
a8f0a12
Compare
a8f0a12 to
a755560
Compare
|
|
||
| // OnClusterTopicMetricsCleanup removes all metric label values associated with the given cluster topic. | ||
| // This prevents unbounded metric cardinality growth during epoch transitions when collection nodes | ||
| // join new clusters and leave old ones. Only call this for cluster topics (sync-cluster-*, consensus-cluster-*). |
There was a problem hiding this comment.
will we still see metrics for the old clusters or they will no longer be reported? There is value in getting metrics even for the old cluster to see if the node is being spammed on those topics so I was thinking there could be catch-all label for old clusters e.g. cluster-old.
There was a problem hiding this comment.
Prometheus client only supports adding or removing metric label values. There's no relabel or aggregation operation. Aggregating old cluster metrics into a cluster-old label would require reading internal histogram bucket values and
re-recording them, which isn't supported by the API (and would be complex to implement).
What I have implemented is to remove metrics for the previous epoch when the node leaves that cluster topic (~600 blocks after epoch ends).
Note: Grafana charges for active series. Before this PR, entering a new epoch would add a new set of metrics to the active list, which is maintained in the node's memory by the Prometheus client. Since we recently restarted all collection
nodes, metrics for past epochs are no longer in memory and therefore no longer active - we should already see a drop in cost from the restart alone.
However, without this PR, metrics would start accumulating again with each new epoch. With these changes, we actively remove old epoch metrics from the active list when epoch transitions occur, preventing unbounded growth going forward.
When metrics are removed from the active list, the historical data remains visible on Grafana dashboards until retention expires - so visibility is preserved, but we stop paying for those series.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@module/metrics/network_test.go`:
- Around line 40-50: Add cleanup assertions for all metric families touched by
the test: after invoking nc.OnLocalMeshSizeUpdated, nc.OnPeerGraftTopic,
nc.OnPeerPruneTopic, and nc.OnIHaveMessageIDsReceived, verify that the metrics
for clusterTopic are removed and metrics for otherTopic remain; specifically add
assertions that the LocalGossipSubRouter (local mesh size), PeerGraft/PeerPrune
counters, and GossipSubRpcValidationInspector (IHaveMessageIDsReceived) metrics
have been cleaned up for clusterTopic and still exist for otherTopic. Apply the
same pair of assertions for the similar block referenced around the other range
(lines 63-71) so both test sections validate cleanup for the same metric
families.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 053f6036-65ae-4b8a-a0ea-32901bad90e1
📒 Files selected for processing (6)
module/metrics.gomodule/metrics/gossipsub_rpc_validation_inspector.gomodule/metrics/network.gomodule/metrics/network_test.gomodule/mock/gossip_sub_rpc_validation_inspector_metrics.gonetwork/p2p/tracer/gossipSubMeshTracer_test.go
✅ Files skipped from review due to trivial changes (1)
- module/mock/gossip_sub_rpc_validation_inspector_metrics.go
🚧 Files skipped from review as they are similar to previous changes (4)
- module/metrics.go
- network/p2p/tracer/gossipSubMeshTracer_test.go
- module/metrics/gossipsub_rpc_validation_inspector.go
- module/metrics/network.go
| // Record LocalGossipSubRouterMetrics | ||
| nc.OnLocalMeshSizeUpdated(clusterTopic, 5) | ||
| nc.OnLocalMeshSizeUpdated(otherTopic, 3) | ||
| nc.OnPeerGraftTopic(clusterTopic) | ||
| nc.OnPeerGraftTopic(otherTopic) | ||
| nc.OnPeerPruneTopic(clusterTopic) | ||
| nc.OnPeerPruneTopic(otherTopic) | ||
|
|
||
| // Record GossipSubRpcValidationInspectorMetrics | ||
| nc.OnIHaveMessageIDsReceived(clusterTopic, 10) | ||
| nc.OnIHaveMessageIDsReceived(otherTopic, 5) |
There was a problem hiding this comment.
Add cleanup assertions for all metric families exercised in this test.
This test records cluster-topic samples for local-mesh/graft/prune and IHave metrics, but post-cleanup assertions only validate inbound/outbound/duplicate. A regression in those other collectors would pass unnoticed; please assert they are removed for clusterTopic and retained for otherTopic too.
Also applies to: 63-71
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@module/metrics/network_test.go` around lines 40 - 50, Add cleanup assertions
for all metric families touched by the test: after invoking
nc.OnLocalMeshSizeUpdated, nc.OnPeerGraftTopic, nc.OnPeerPruneTopic, and
nc.OnIHaveMessageIDsReceived, verify that the metrics for clusterTopic are
removed and metrics for otherTopic remain; specifically add assertions that the
LocalGossipSubRouter (local mesh size), PeerGraft/PeerPrune counters, and
GossipSubRpcValidationInspector (IHaveMessageIDsReceived) metrics have been
cleaned up for clusterTopic and still exist for otherTopic. Apply the same pair
of assertions for the similar block referenced around the other range (lines
63-71) so both test sections validate cleanup for the same metric families.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@go.mod`:
- Line 310: The go.mod line for the dependency
"github.com/prometheus/client_model v0.6.2" was changed to a direct dependency
but is only used transitively; revert it to an indirect dependency by restoring
the "// indirect" marker on the "github.com/prometheus/client_model v0.6.2"
entry in go.mod and then run "go mod tidy" to ensure module metadata is
consistent; locate the exact module line ("github.com/prometheus/client_model
v0.6.2") in go.mod and add the "// indirect" comment back.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
| github.com/polydawn/refmt v0.89.0 // indirect | ||
| github.com/power-devops/perfstat v0.0.0-20210106213030-5aafc221ea8c // indirect | ||
| github.com/prometheus/client_model v0.6.2 // indirect | ||
| github.com/prometheus/client_model v0.6.2 |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Verify that client_model is directly imported and check security
# Check if client_model is directly imported in any Go files
echo "=== Direct imports of prometheus/client_model ==="
rg -n --type=go 'import.*"github.com/prometheus/client_model' -C2
# Check for any usage of client_model types
echo -e "\n=== Usage of client_model types ==="
rg -n --type=go 'client_model\.' -C2
# Check GitHub for security advisories on this package
echo -e "\n=== Security advisories for prometheus/client_model ==="
gh api graphql -f query='
{
securityVulnerabilities(first: 5, ecosystem: GO, package: "github.com/prometheus/client_model") {
nodes {
advisory {
summary
severity
publishedAt
}
vulnerableVersionRange
firstPatchedVersion {
identifier
}
}
}
}'Repository: onflow/flow-go
Length of output: 251
Revert prometheus/client_model to indirect dependency.
The // indirect marker was removed, marking github.com/prometheus/client_model v0.6.2 as a direct dependency. However, verification found no direct imports or usages of this package in the codebase—it's only used transitively through another dependency. Restore the // indirect comment to accurately reflect its role and maintain go.mod clarity.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@go.mod` at line 310, The go.mod line for the dependency
"github.com/prometheus/client_model v0.6.2" was changed to a direct dependency
but is only used transitively; revert it to an indirect dependency by restoring
the "// indirect" marker on the "github.com/prometheus/client_model v0.6.2"
entry in go.mod and then run "go mod tidy" to ensure module metadata is
consistent; locate the exact module line ("github.com/prometheus/client_model
v0.6.2") in go.mod and add the "// indirect" comment back.
| nc.GossipSubRpcValidationInspectorMetrics.OnClusterTopicMetricsCleanup(topic) | ||
|
|
||
| // Clean up inbound/outbound message size metrics using partial match on topic | ||
| nc.inboundMessageSize.DeletePartialMatch(prometheus.Labels{LabelChannel: topic}) |
There was a problem hiding this comment.
Good catch @janezpodhostnik , more metrics are added to the clean up list here.
|
@zhangchiqing - will this change take care of all these metrics (Same metric but different labels for cluster)? This is from a collection node: |
|
Yes, this PR will clean up metrics for the cluster topics (consensus-cluster-* and sync-cluster-*) when the node leaves those topics during epoch transitions. From your output, these will be cleaned up:
The other topics (push-blocks, push-guarantees, push-transactions, sync-committee) are not cluster topics - they're persistent network channels that exist across all epochs. These won't be cleaned up (and shouldn't be - they don't cause unbounded cardinality growth since they don't change per epoch). The 232 is the current epoch counter. Without this PR:
With this PR:
|
…milies (#8564) * extend cluster topic metrics cleanup to cover all per-topic metric families * add public OnClusterTopicMetricsCleanup to AlspMetrics instead of accessing private field
Kay-Zee
left a comment
There was a problem hiding this comment.
We can try this first.
There's a chance we'll STILL be charged for it, even if it's not exposed on /metrics, at least for some time. I forget what the retention is.
I think the "best" solve would be to not have the additional cardinality be created in the first place? like have the thing that's changing as a label, rather than add cadinality, is that possible?
…s-aggregate normalize cluster topic labels in metrics to prevent unbounded cardin…
…ality growth for more metrics
…s-aggregate normalize cluster topic labels in metrics to prevent unbounded cardin…
Good point. I've made the change to normalize the topics. The changes are: Before (master): After (
|
Kay-Zee
left a comment
There was a problem hiding this comment.
Yeah, i think for the network metrics, this is best. This will make it a little bit harder to group the metrics by cluster, but it will still be possible with post-analysis, so i think this is probably best.
Fix #8555
Summary by CodeRabbit
New Features
Tests
Chores