Fix BanyanDB runtime-rule self-heal + v2 MAL CounterWindow collision & Elvis falsy semantics #13911

Open
wu-sheng wants to merge 3 commits into master from fix/runtime-rule-schema-cache-self-heal

@wu-sheng

Member

Fix BanyanDB runtime-rule schema-cache flood + v2 MAL CounterWindow collision and Elvis falsy semantics

  • Add a unit test to verify that the fix works.
  • Explain briefly why the bug exists and how to fix it.

Bundles related correctness fixes surfaced while validating BanyanDB self-observability against the live demo.

1. BanyanDB schema-cache self-heal. Peer nodes flooded "<metric> is not registered" errors when a node held a live persist worker but had never populated its local MetadataRegistry schema cache for that model. This happened when a withoutSchemaChange peer apply or a runtime-rule bundled fall-over rebuilt the dispatch worker but skipped the populate step; the registry never evicts, and the 30s reconcile only covers runtime-rule rows. The persist DAOs now self-heal a missing entry once, using an RPC-free local re-derivation (MetadataRegistry.repopulateLocally), before failing. The no-init defer poll loop also retries a transient backend probe error (isRetryableNoInitProbeFailure, default false; BanyanDB opts in for transient gRPC codes) instead of crash-looping the pod.

2. v2 MAL CounterWindow key collision. rate() / increase() / irate() keyed each counter's sliding window on the rule's output metric name (the same for every input metric of a rule) instead of the counter's own name. Two or more counters that reduce to the same label set after .sum(...) therefore shared one window slot and computed rates against each other's values — fabricating non-zero rates from unchanged counters (the BanyanDB liaison gRPC error rate read a steady non-zero off three frozen error counters). Fixed by keying on the counter's own metric name. BanyanDBErrorRateReproTest reproduces it with the real frozen values (966 → 0 after the fix).

3. v2 MAL Elvis ?: falsy semantics. The operator was compiled to Optional.ofNullable(primary).orElse(fallback), which applies the fallback only on null, so an empty-string primary stayed "" (a BanyanDB liaison ServiceInstance stored node_type="" rather than n/a, because .sum([...,'node_type']) fills an absent group-by label with ""). It is now single-evaluated through MalRuntimeHelper.elvis / isTruthy, matching Groovy falsy semantics (null, false, numeric zero, empty string/collection/map/array). MALElvisFalsyTest covers empty, null, non-empty, and side-effecting primaries.

4. banyandb otel-rules. PT15S → PT1M rate window to match the collector scrape / OAP minute-bucket cadence (MAL rate() is a two-point CounterWindow delta, not PromQL).
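
The difference between the null-only and Groovy-falsy Elvis behavior in item 3 can be illustrated with a minimal standalone sketch. The class `ElvisSketch` and its method bodies are assumptions written for illustration; they are not the actual `MalRuntimeHelper` code, and only the falsy cases listed in item 3 are modeled.

```java
import java.lang.reflect.Array;
import java.util.Collection;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; not the MalRuntimeHelper implementation.
final class ElvisSketch {

    // Previous behavior as described in item 3: fallback applies only on null.
    static Object nullOnlyElvis(Object primary, Object fallback) {
        return Optional.ofNullable(primary).orElse(fallback);
    }

    // Groovy-style truthiness for the cases named in item 3.
    static boolean isTruthy(Object value) {
        if (value == null) {
            return false;
        }
        if (value instanceof Boolean) {
            return (Boolean) value;
        }
        if (value instanceof Number) {
            return ((Number) value).doubleValue() != 0.0d;
        }
        if (value instanceof CharSequence) {
            return ((CharSequence) value).length() > 0;
        }
        if (value instanceof Collection) {
            return !((Collection<?>) value).isEmpty();
        }
        if (value instanceof Map) {
            return !((Map<?, ?>) value).isEmpty();
        }
        if (value.getClass().isArray()) {
            return Array.getLength(value) > 0;
        }
        return true;
    }

    // The primary arrives as an already-computed value, so it is evaluated once.
    static Object elvis(Object primary, Object fallback) {
        return isTruthy(primary) ? primary : fallback;
    }

    public static void main(String[] args) {
        System.out.println("[" + nullOnlyElvis("", "n/a") + "]"); // prints "[]"
        System.out.println("[" + elvis("", "n/a") + "]");         // prints "[n/a]"
    }
}
```

Passing the primary as a value rather than re-reading the expression is what keeps a side-effecting primary from running twice in this sketch.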

  • If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #.
  • Update the CHANGES log.

…& Elvis falsy semantics

* BanyanDB schema-cache self-heal: persist DAOs re-derive a missing local schema (RPC-free) once before failing; the no-init defer loop retries a transient backend probe error (isRetryableNoInitProbeFailure, default false / BanyanDB opt-in) instead of crash-looping the pod.
* v2 MAL CounterWindow key collision: rate()/increase()/irate() keyed each counter's sliding window on the rule's output metric name (shared by every input metric of a rule) instead of the counter's own name, so counters that reduce to the same labels after .sum() shared one window slot and rated against each other's values -- fabricating non-zero rates from frozen counters (BanyanDB liaison gRPC error rate). Now keyed by the counter's own metric name.
* v2 MAL Elvis ?: honored only null (Optional.ofNullable().orElse()); now Groovy-falsy via MalRuntimeHelper.elvis/isTruthy, single-evaluated -- fixes BanyanDB liaison node_type="" stored instead of "n/a".
* banyandb otel-rules: PT15S -> PT1M rate window.
* Tests: BanyanDBErrorRateReproTest, MALElvisFalsyTest, MetadataRegistryTest, ModelInstallerNoInitTest.
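
The "re-derive once before failing" behavior from the first bullet can be sketched roughly as follows. Everything here is hypothetical: `SchemaCacheSketch`, `findOrSelfHeal`, and the `localDerivation` function are illustrative names, schemas are stood in for by plain strings, and the real `MetadataRegistry.repopulateLocally` signature is not shown in this description.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of a lookup that self-heals a missing cache entry once.
final class SchemaCacheSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Assumed to be a purely local, RPC-free derivation from the model definition.
    private final Function<String, Optional<String>> localDerivation;
    // Counts derivation attempts so the "once" behavior can be observed.
    int derivations = 0;

    SchemaCacheSketch(Function<String, Optional<String>> localDerivation) {
        this.localDerivation = localDerivation;
    }

    String findOrSelfHeal(String metric) {
        String schema = cache.get(metric);
        if (schema != null) {
            return schema;
        }
        // Cache miss: make a single local attempt before reporting the error.
        derivations++;
        Optional<String> rebuilt = localDerivation.apply(metric);
        if (rebuilt.isPresent()) {
            cache.put(metric, rebuilt.get());
            return rebuilt.get();
        }
        throw new IllegalStateException(metric + " is not registered");
    }

    public static void main(String[] args) {
        SchemaCacheSketch registry = new SchemaCacheSketch(
            name -> "known_metric".equals(name) ? Optional.of("schema:" + name) : Optional.empty());
        System.out.println(registry.findOrSelfHeal("known_metric"));
        System.out.println(registry.findOrSelfHeal("known_metric"));
        System.out.println("derivations=" + registry.derivations);
    }
}
```

After the first successful re-derivation the entry is cached, so later writes for the same metric take the fast path; a metric that cannot be derived locally still fails with the original error.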
@wu-sheng wu-sheng closed this Jun 14, 2026
@wu-sheng wu-sheng reopened this Jun 14, 2026
ASF infrastructure-actions approved_patterns.yml dropped the v3 SHAs for
these actions, so the stale pins were rejected and the CI workflow failed
with startup_failure. Updated to the newest approved v4 SHA each:

* docker/login-action       v3.7.0  -> v4.2.0 (650006c6)
* docker/setup-buildx-action v3.12.0 -> v4.1.0 (d7f5e7f5)
* docker/setup-qemu-action   v3.6.0  -> v4.1.0 (06116385)
* dorny/paths-filter         v3.0.2  -> v4.0.1 (fbd0ab8f)
@wu-sheng wu-sheng added the bug (Something isn't working and you are sure it's a bug!) and backend (OAP backend related.) labels Jun 14, 2026
@wu-sheng wu-sheng added this to the 11.0.0 milestone Jun 14, 2026
The v2 MAL CounterWindow collision fix re-keyed rate()/increase() windows on
each counter's own sample name instead of the rule-level context.metricName.
MALExpressionExecutionTest relied on context.metricName (set to a unique
sourceFile/metricName) to keep each rule's prime/real pair isolated in the
process-wide CounterWindow.INSTANCE singleton — the new keying ignores that
field, so leftover samples from one rule leaked into the next across the ~1350
sequential dynamic tests, producing wrong/negative deltas (e.g. 8.333 = 50/6,
a lower bound pulled from an earlier rule).

Reset CounterWindow.INSTANCE per rule (the pattern BanyanDBErrorRateReproTest
already uses via @BeforeEach) and drop the now-dead setMetricName scaffolding
(context.metricName has no readers after the keying change). No production code
or expected values changed; 1350/1350 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
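
The CounterWindow keying discussed in this thread, and why shared state needs a reset between rules, can be illustrated with a simplified two-point window. This is a sketch under assumptions: `CounterWindowSketch`, its `key` helper, the counter names, and the values 100 and 40 are all hypothetical, and the real CounterWindow also tracks timestamps, which are omitted here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalDouble;

// Hypothetical two-point window: remembers the previous value per key
// and reports the raw difference to the current value.
final class CounterWindowSketch {
    private final Map<String, Double> previous = new HashMap<>();

    // Returns empty on the first sample for a key, else current minus previous.
    OptionalDouble increase(String windowKey, double value) {
        Double prev = previous.put(windowKey, value);
        return prev == null ? OptionalDouble.empty() : OptionalDouble.of(value - prev);
    }

    // Clears all remembered samples, e.g. between independent test rules.
    void reset() {
        previous.clear();
    }

    static String key(String name, String labels) {
        return name + "|" + labels;
    }

    public static void main(String[] args) {
        String[] names = {"errors_a", "errors_b"};
        double[] frozen = {100.0, 40.0}; // neither counter changes between ticks
        CounterWindowSketch shared = new CounterWindowSketch();
        CounterWindowSketch own = new CounterWindowSketch();
        for (int tick = 0; tick < 2; tick++) {
            for (int i = 0; i < names.length; i++) {
                // Keyed on the rule output name: both counters share one slot.
                OptionalDouble collided = shared.increase(key("rule_output", "node=n1"), frozen[i]);
                // Keyed on the counter's own name: each counter has its own slot.
                OptionalDouble isolated = own.increase(key(names[i], "node=n1"), frozen[i]);
                System.out.println("tick " + tick + " " + names[i]
                    + " shared=" + collided + " own=" + isolated);
            }
        }
    }
}
```

With the shared key, each frozen counter is compared against the other counter's last value, so the differences are non-zero; with per-counter keys the second tick yields zero for both. Because the keys no longer include a per-rule name, a long-lived instance keeps samples across rules until `reset()` is called.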