Fix BanyanDB runtime-rule self-heal + v2 MAL CounterWindow collision & Elvis falsy semantics#13911
Open
wu-sheng wants to merge 3 commits into
Open
Fix BanyanDB runtime-rule self-heal + v2 MAL CounterWindow collision & Elvis falsy semantics#13911wu-sheng wants to merge 3 commits into
wu-sheng wants to merge 3 commits into
Conversation
…& Elvis falsy semantics * BanyanDB schema-cache self-heal: persist DAOs re-derive a missing local schema (RPC-free) once before failing; the no-init defer loop retries a transient backend probe error (isRetryableNoInitProbeFailure, default false / BanyanDB opt-in) instead of crash-looping the pod. * v2 MAL CounterWindow key collision: rate()/increase()/irate() keyed each counter's sliding window on the rule's output metric name (shared by every input metric of a rule) instead of the counter's own name, so counters that reduce to the same labels after .sum() shared one window slot and rated against each other's values -- fabricating non-zero rates from frozen counters (BanyanDB liaison gRPC error rate). Now keyed by the counter's own metric name. * v2 MAL Elvis ?: honored only null (Optional.ofNullable().orElse()); now Groovy-falsy via MalRuntimeHelper.elvis/isTruthy, single-evaluated -- fixes BanyanDB liaison node_type="" stored instead of "n/a". * banyandb otel-rules: PT15S -> PT1M rate window. * Tests: BanyanDBErrorRateReproTest, MALElvisFalsyTest, MetadataRegistryTest, ModelInstallerNoInitTest.
ASF infrastructure-actions approved_patterns.yml dropped the v3 SHAs for these actions, so the stale pins were rejected and the CI workflow failed with startup_failure. Updated to the newest approved v4 SHA each: * docker/login-action v3.7.0 -> v4.2.0 (650006c6) * docker/setup-buildx-action v3.12.0 -> v4.1.0 (d7f5e7f5) * docker/setup-qemu-action v3.6.0 -> v4.1.0 (06116385) * dorny/paths-filter v3.0.2 -> v4.0.1 (fbd0ab8f)
The v2 MAL CounterWindow collision fix re-keyed rate()/increase() windows on each counter's own sample name instead of the rule-level context.metricName. MALExpressionExecutionTest relied on context.metricName (set to a unique sourceFile/metricName) to keep each rule's prime/real pair isolated in the process-wide CounterWindow.INSTANCE singleton — the new keying ignores that field, so leftover samples from one rule leaked into the next across the ~1350 sequential dynamic tests, producing wrong/negative deltas (e.g. 8.333 = 50/6, a lower bound pulled from an earlier rule). Reset CounterWindow.INSTANCE per rule (the pattern BanyanDBErrorRateReproTest already uses via @beforeeach) and drop the now-dead setMetricName scaffolding (context.metricName has no readers after the keying change). No production code or expected values changed; 1350/1350 tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fix BanyanDB runtime-rule schema-cache flood + v2 MAL CounterWindow collision and Elvis falsy semantics
Bundles related correctness fixes surfaced while validating BanyanDB self-observability against the live demo.
1. BanyanDB schema-cache self-heal. Peer nodes flooded
<metric> is not registeredwhen a node held a live persist worker but never populated its localMetadataRegistryschema cache for that model (awithoutSchemaChangepeer apply or a runtime-rule bundled fall-over rebuilt the dispatch worker but skipped the populate; the registry never evicts and the 30s reconcile only covers runtime-rule rows). The persist DAOs now self-heal a missing entry once with an RPC-free local re-derivation (MetadataRegistry.repopulateLocally) before failing, and the no-init defer poll loop retries a transient backend probe error (isRetryableNoInitProbeFailure— defaultfalse, BanyanDB opts in for transient gRPC codes) instead of crash-looping the pod.2. v2 MAL
CounterWindowkey collision.rate()/increase()/irate()keyed each counter's sliding window on the rule's output metric name (the same for every input metric of a rule) instead of the counter's own name. Two or more counters that reduce to the same label set after.sum(...)therefore shared one window slot and computed rates against each other's values — fabricating non-zero rates from unchanged counters (the BanyanDB liaison gRPC error rate read a steady non-zero off three frozen error counters). Fixed by keying on the counter's own metric name.BanyanDBErrorRateReproTestreproduces it with the real frozen values (966 → 0 after the fix).3. v2 MAL Elvis
?:falsy semantics. Compiled toOptional.ofNullable(primary).orElse(fallback), applying the fallback only onnull, so an empty-string primary kept""(a BanyanDB liaisonServiceInstancestorednode_type=""rather thann/a, because.sum([...,'node_type'])fills an absent group-by label with""). Now single-evaluated throughMalRuntimeHelper.elvis/isTruthy, matching Groovy falsy (null, false, numeric zero, empty string/collection/map/array).MALElvisFalsyTestcovers empty/null/non-empty/side-effecting primaries.4. banyandb otel-rules.
PT15S→PT1Mrate window to match the collector scrape / OAP minute-bucket cadence (MALrate()is a two-pointCounterWindowdelta, not PromQL).CHANGESlog.