Fix BanyanDB runtime-rule self-heal + v2 MAL CounterWindow collision & Elvis falsy semantics by wu-sheng · Pull Request #13911 · apache/skywalking

wu-sheng · 2026-06-14T09:20:29Z

Fix BanyanDB runtime-rule schema-cache flood + v2 MAL CounterWindow collision and Elvis falsy semantics

Add a unit test to verify that the fix works.
Explain briefly why the bug exists and how to fix it.

Bundles related correctness fixes surfaced while validating BanyanDB self-observability against the live demo.

1. BanyanDB schema-cache self-heal. Peer nodes flooded <metric> is not registered errors when a node held a live persist worker but had never populated its local MetadataRegistry schema cache for that model. This happened when a withoutSchemaChange peer apply or a runtime-rule bundled fall-over rebuilt the dispatch worker but skipped the populate step; the registry never evicts, and the 30s reconcile only covers runtime-rule rows, so nothing repaired the gap. The persist DAOs now self-heal a missing entry once with an RPC-free local re-derivation (MetadataRegistry.repopulateLocally) before failing. Separately, the no-init defer poll loop now retries a transient backend probe error (isRetryableNoInitProbeFailure; default false, BanyanDB opts in for transient gRPC codes) instead of crash-looping the pod.

2. v2 MAL CounterWindow key collision. rate() / increase() / irate() keyed each counter's sliding window on the rule's output metric name (the same for every input metric of a rule) instead of the counter's own name. Two or more counters that reduce to the same label set after .sum(...) therefore shared one window slot and computed rates against each other's values, fabricating non-zero rates from unchanged counters (the BanyanDB liaison gRPC error rate reported a steady non-zero value from three frozen error counters). Fixed by keying on the counter's own metric name. BanyanDBErrorRateReproTest reproduces it with the real frozen values (966 → 0 after the fix).

3. v2 MAL Elvis ?: falsy semantics. The operator was compiled to Optional.ofNullable(primary).orElse(fallback), which applies the fallback only on null, so an empty-string primary kept "" (a BanyanDB liaison ServiceInstance stored node_type="" rather than n/a, because .sum([...,'node_type']) fills an absent group-by label with ""). The primary is now evaluated once and passed through MalRuntimeHelper.elvis / isTruthy, matching Groovy falsy semantics (null, false, numeric zero, empty string/collection/map/array). MALElvisFalsyTest covers empty/null/non-empty/side-effecting primaries.

4. banyandb otel-rules. Changed the rate window from PT15S to PT1M to match the collector scrape / OAP minute-bucket cadence (MAL rate() is a two-point CounterWindow delta, not PromQL).
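The key collision in item 2 can be illustrated with a minimal sketch. It is not the SkyWalking CounterWindow implementation: the Window class, the counter names, the label string, and the frozen values are all hypothetical, and the window is reduced to a two-point "previous value per key" map. It only shows why keying on the rule's output name lets frozen counters produce non-zero deltas, while keying on each counter's own name does not.

```java
import java.util.HashMap;
import java.util.Map;

class CounterWindowKeySketch {
    /** Hypothetical two-point window: remembers only the previous value per key. */
    static final class Window {
        private final Map<String, Double> previous = new HashMap<>();

        /** Returns current - previous for the key, or 0 on the first observation. */
        double increase(String key, double current) {
            Double prev = previous.put(key, current);
            return prev == null ? 0.0 : current - prev;
        }
    }

    /**
     * Feeds two scrapes of three frozen counters through the window and returns
     * the sum of absolute deltas observed on the second scrape.
     */
    static double secondScrapeDelta(boolean keyByCounterName) {
        Window window = new Window();
        String ruleMetric = "liaison_error_rate";            // rule output name (illustrative)
        String labels = "{node=liaison-0}";                  // same label set after .sum(...)
        String[] counters = {"errors_a", "errors_b", "errors_c"}; // illustrative names
        double[] frozen = {10, 40, 90};                      // illustrative values that never change
        double total = 0;
        for (int scrape = 0; scrape < 2; scrape++) {
            for (int i = 0; i < counters.length; i++) {
                String name = keyByCounterName ? counters[i] : ruleMetric;
                double delta = window.increase(name + labels, frozen[i]);
                if (scrape == 1) {
                    total += Math.abs(delta);
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("keyed by rule output name: " + secondScrapeDelta(false));
        System.out.println("keyed by counter name:     " + secondScrapeDelta(true));
    }
}
```

With the shared key, each counter's "previous" value is whatever another counter wrote last, so unchanged counters still yield deltas. Absolute values are summed here only to make the effect visible; how the real window treats negative deltas is not covered by this sketch.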

If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #.
Update the CHANGES log.

…& Elvis falsy semantics

* BanyanDB schema-cache self-heal: persist DAOs re-derive a missing local schema (RPC-free) once before failing; the no-init defer loop retries a transient backend probe error (isRetryableNoInitProbeFailure, default false / BanyanDB opt-in) instead of crash-looping the pod.
* v2 MAL CounterWindow key collision: rate()/increase()/irate() keyed each counter's sliding window on the rule's output metric name (shared by every input metric of a rule) instead of the counter's own name, so counters that reduce to the same labels after .sum() shared one window slot and rated against each other's values -- fabricating non-zero rates from frozen counters (BanyanDB liaison gRPC error rate). Now keyed by the counter's own metric name.
* v2 MAL Elvis ?: honored only null (Optional.ofNullable().orElse()); now Groovy-falsy via MalRuntimeHelper.elvis/isTruthy, single-evaluated -- fixes BanyanDB liaison node_type="" stored instead of "n/a".
* banyandb otel-rules: PT15S -> PT1M rate window.
* Tests: BanyanDBErrorRateReproTest, MALElvisFalsyTest, MetadataRegistryTest, ModelInstallerNoInitTest.
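The difference between the null-only and Groovy-falsy Elvis behaviors can be sketched as follows. The names elvis and isTruthy are taken from the commit message, but the signatures and bodies below are assumptions written for illustration, not the actual MalRuntimeHelper code; nullOnlyElvis is a hypothetical name for the old compiled form.

```java
import java.lang.reflect.Array;
import java.util.Collection;
import java.util.Map;
import java.util.Optional;
import java.util.function.Supplier;

class ElvisFalsySketch {
    /** Old behavior: the fallback applies only when the primary is null. */
    static String nullOnlyElvis(String primary, String fallback) {
        return Optional.ofNullable(primary).orElse(fallback);
    }

    /** Assumed Groovy-style truthiness: null, false, zero, and empty values are falsy. */
    static boolean isTruthy(Object v) {
        if (v == null) return false;
        if (v instanceof Boolean) return (Boolean) v;
        if (v instanceof Number) return ((Number) v).doubleValue() != 0.0;
        if (v instanceof CharSequence) return ((CharSequence) v).length() > 0;
        if (v instanceof Collection) return !((Collection<?>) v).isEmpty();
        if (v instanceof Map) return !((Map<?, ?>) v).isEmpty();
        if (v.getClass().isArray()) return Array.getLength(v) > 0;
        return true;
    }

    /** Evaluates the primary exactly once, then falls back if the result is falsy. */
    static <T> T elvis(Supplier<? extends T> primary, T fallback) {
        T value = primary.get();
        return isTruthy(value) ? value : fallback;
    }

    public static void main(String[] args) {
        // An absent group-by label arrives as "", not null.
        System.out.println("[" + nullOnlyElvis("", "n/a") + "]");
        System.out.println("[" + elvis(() -> "", "n/a") + "]");
    }
}
```

Taking the primary as a Supplier is one way to guarantee single evaluation when the primary expression has side effects; the real helper may achieve this differently.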

ASF infrastructure-actions approved_patterns.yml dropped the v3 SHAs for these actions, so the stale pins were rejected and the CI workflow failed with startup_failure. Updated to the newest approved v4 SHA each:

* docker/login-action v3.7.0 -> v4.2.0 (650006c6)
* docker/setup-buildx-action v3.12.0 -> v4.1.0 (d7f5e7f5)
* docker/setup-qemu-action v3.6.0 -> v4.1.0 (06116385)
* dorny/paths-filter v3.0.2 -> v4.0.1 (fbd0ab8f)

The v2 MAL CounterWindow collision fix re-keyed rate()/increase() windows on each counter's own sample name instead of the rule-level context.metricName. MALExpressionExecutionTest relied on context.metricName (set to a unique sourceFile/metricName) to keep each rule's prime/real pair isolated in the process-wide CounterWindow.INSTANCE singleton. The new keying ignores that field, so leftover samples from one rule leaked into the next across the ~1350 sequential dynamic tests, producing wrong/negative deltas (e.g. 8.333 = 50/6, a lower bound pulled from an earlier rule).

Reset CounterWindow.INSTANCE per rule (the pattern BanyanDBErrorRateReproTest already uses via @BeforeEach) and drop the now-dead setMetricName scaffolding (context.metricName has no readers after the keying change). No production code or expected values changed; 1350/1350 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
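The test-isolation problem described in this commit can be sketched with a process-wide singleton that retains samples between test cases. Everything below is hypothetical: the Window class, its increase signature, the sample name, and the numbers are illustrative, and the window is assumed to measure against the oldest retained sample, which may not match the real CounterWindow internals.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class SingletonResetSketch {
    /** Hypothetical sliding window shared by every test in the JVM. */
    static final class Window {
        static final Window INSTANCE = new Window();
        private final Map<String, Deque<double[]>> samples = new HashMap<>();

        /** Records (timeSec, value) and returns value minus the oldest retained value. */
        double increase(String sampleName, double timeSec, double value, double rangeSec) {
            Deque<double[]> queue = samples.computeIfAbsent(sampleName, k -> new ArrayDeque<>());
            queue.addLast(new double[] {timeSec, value});
            // Evict samples older than the range; the newest sample always survives.
            while (queue.peekFirst()[0] < timeSec - rangeSec) {
                queue.removeFirst();
            }
            return value - queue.peekFirst()[1];
        }

        void reset() {
            samples.clear();
        }
    }

    /** One "rule" test case: a priming sample at t=0, then the real sample at t=60. */
    static double runRule(double prime, double real, boolean resetFirst) {
        if (resetFirst) {
            Window.INSTANCE.reset();
        }
        Window.INSTANCE.increase("requests_total", 0, prime, 60);
        return Window.INSTANCE.increase("requests_total", 60, real, 60);
    }

    public static void main(String[] args) {
        System.out.println("rule 1 (reset):    " + runRule(100, 150, true));
        System.out.println("rule 2 (no reset): " + runRule(10, 30, false));
        System.out.println("rule 2 (reset):    " + runRule(10, 30, true));
    }
}
```

Once two rules share a sample name, the second rule's delta is measured against a sample left behind by the first unless the singleton is cleared, which is the role a per-test reset plays.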

wu-sheng closed this Jun 14, 2026

wu-sheng reopened this Jun 14, 2026

wu-sheng added the bug and backend labels Jun 14, 2026

wu-sheng added this to the 11.0.0 milestone Jun 14, 2026

wu-sheng wants to merge 3 commits into master from fix/runtime-rule-schema-cache-self-heal
