Fix runtime-rule (MAL/LAL) hot-update in no-init mode and k8s cluster node identity#13909
Open
wu-sheng wants to merge 1 commit into
…ode identity

Runtime-rule schema changes were inoperative in no-init mode (the mode every production OAP cluster runs), and runtime-rule cross-node writes failed on multi-replica Kubernetes clusters. Both are fixed here.

* no-init schema change: the storage installer's init-node poll loop (ModelInstaller.whenCreating) was gated on RunningMode, so a runtime withSchemaChange create / update / revert blocked forever on a no-init OAP. Gate it instead on a new StorageManipulationOpt.Flags.deferDDLToInitNode bit, set only on the static-boot schemaCreateIfAbsent opt and DRYed into ModelInstaller.deferDDLToInitNode(opt) (reused by the BanyanDB shape-check and group-DDL gates). The runtime-rule opts (withSchemaChange / verifySchemaOnly / withoutSchemaChange) are now driven by their flags and by cluster main-ness: no-init and default no longer differ for DSL DDL; init stays the dedicated initializer. DSLManager.tickStorageOpt is collapsed accordingly.
* k8s node identity: resolve the runtime-rule selfNodeId from the unique per-pod SKYWALKING_COLLECTOR_UID (pod UID, injected from metadata.uid) instead of the colliding telemetry id (0.0.0.0_11800 under a 0.0.0.0 gRPC bind host), in start() before any apply. This fixes HTTP 400 forward_self_loop on the cross-node Forward path; MainRouter already routes correctly off pod IPs.
* tests: add ModelInstallerNoInitTest (UT); convert the runtime-rule/cluster e2e from docker-compose (default mode, which exercised neither bug) to a kind + skywalking-helm no-init cluster (oap.replicas=2) covering the apply / STRUCTURAL / inactivate / delete lifecycle, cross-node convergence, and the Forward path.
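The no-init gating change described above can be pictured with a small, self-contained sketch. It is not the code from this PR: `RunningMode`, `Flag`, `Opt`, and both `pollsForInitNode*` methods are hypothetical stand-ins, and the assumption that the new helper combines the defer bit with a no-init check is mine, not something the description states outright.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative only: none of these types are SkyWalking's real classes.
class DeferDdlSketch {
    enum RunningMode { DEFAULT, INIT, NO_INIT }

    enum Flag { DEFER_DDL_TO_INIT_NODE, SCHEMA_CHANGE }

    /** Hypothetical stand-in for StorageManipulationOpt: a name plus intent flags. */
    static final class Opt {
        final String name;
        final Set<Flag> flags;

        Opt(String name, Set<Flag> flags) {
            this.name = name;
            this.flags = flags;
        }

        // Static boot-time opt: the only one that carries the defer bit.
        static Opt schemaCreateIfAbsent() {
            return new Opt("schemaCreateIfAbsent", EnumSet.of(Flag.DEFER_DDL_TO_INIT_NODE));
        }

        // Runtime-rule opt: expected to run its own DDL instead of waiting.
        static Opt withSchemaChange() {
            return new Opt("withSchemaChange", EnumSet.of(Flag.SCHEMA_CHANGE));
        }
    }

    /** Old gate: the running mode alone decides, so every create in no-init polls. */
    static boolean pollsForInitNodeOld(RunningMode mode, Opt opt) {
        return mode == RunningMode.NO_INIT;
    }

    /** New gate (assumed shape): poll only when the opt defers and the node is no-init. */
    static boolean pollsForInitNodeNew(RunningMode mode, Opt opt) {
        return mode == RunningMode.NO_INIT && opt.flags.contains(Flag.DEFER_DDL_TO_INIT_NODE);
    }

    public static void main(String[] args) {
        for (Opt opt : new Opt[] {Opt.schemaCreateIfAbsent(), Opt.withSchemaChange()}) {
            System.out.printf("%s in NO_INIT: old=%b new=%b%n", opt.name,
                pollsForInitNodeOld(RunningMode.NO_INIT, opt),
                pollsForInitNodeNew(RunningMode.NO_INIT, opt));
        }
    }
}
```

Under the old gate a runtime `withSchemaChange` on a no-init node waits for an init node that never acts on it; under the sketched new gate only the boot-time opt waits.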
Fix runtime-rule (MAL/LAL hot-update) schema changes in no-init mode, and the runtime-rule cluster node-identity collision on Kubernetes

Two bugs in the runtime-rule (DSL hot-update) cluster path, both confirmed end-to-end on a local kind cluster:

1. Runtime-rule schema changes were inoperative in no-init mode, the mode every production OAP cluster runs (a one-shot -Dmode=init Job creates the static schema; the OAP Deployment runs -Dmode=no-init). A runtime addOrUpdate introducing a new metric blocked forever in the storage installer's init-node poll loop (ModelInstaller.whenCreating), because the loop was gated on RunningMode rather than the operation's intent. /delete?mode=revertToBundled recreate and BanyanDB in-place shape updates were dead the same way. Fix: a new StorageManipulationOpt.Flags.deferDDLToInitNode bit, set only on the static boot-time schemaCreateIfAbsent() opt (DRYed into ModelInstaller.deferDDLToInitNode(opt), reused by the BanyanDB shape-check / group-DDL gates). The runtime-rule opts (withSchemaChange / verifySchemaOnly / withoutSchemaChange) are now driven by their flags and by cluster main-ness: no-init and default no longer differ for DSL DDL; init stays the dedicated initializer. DSLManager.tickStorageOpt is collapsed accordingly.

2. Runtime-rule cross-node writes failed with HTTP 400 forward_self_loop on a multi-replica Kubernetes cluster. Every OAP replica shared the cluster selfNodeId 0.0.0.0_11800 (derived from the 0.0.0.0 agent gRPC bind host via TelemetryRelatedContext), so the main's self-loop guard rejected a legitimate peer-to-peer Forward as if it had looped back. Fix: resolve the runtime-rule node identity from the unique per-pod SKYWALKING_COLLECTOR_UID (the pod UID injected by the helm chart / swck operator from metadata.uid), in start() before any apply; falls back to the telemetry id off-Kubernetes. MainRouter already routes correctly off the cluster peer addresses (pod IPs); only the self-loop identity needed to be unique.

Tests: new ModelInstallerNoInitTest (UT) for the no-init create chokepoint; the runtime-rule cluster e2e is converted from docker-compose (default mode, which never exercised either bug) to a kind + skywalking-helm no-init cluster (oap.replicas=2) driving the apply / STRUCTURAL / inactivate / delete lifecycle, cross-node convergence, and the cross-node Forward path.

CHANGES log.
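To make the node-identity collision in item 2 concrete, here is a minimal sketch of the resolution order: pod UID first, telemetry id as the fallback. Only the environment variable name `SKYWALKING_COLLECTOR_UID` and the id `0.0.0.0_11800` come from the description; `resolveSelfNodeId`, `isSelfLoop`, the sample UIDs, and the equality-based guard are assumptions made for illustration and may not match the real implementation.

```java
import java.util.Collections;
import java.util.Map;

// Illustrative only: a hypothetical model of the self-loop guard, not SkyWalking code.
class NodeIdentitySketch {
    // Environment variable named in the PR description (pod UID from metadata.uid).
    static final String UID_ENV = "SKYWALKING_COLLECTOR_UID";

    /** Prefer the per-pod UID; fall back to the telemetry id when it is absent or blank. */
    static String resolveSelfNodeId(Map<String, String> env, String telemetryId) {
        String podUid = env.get(UID_ENV);
        if (podUid != null && !podUid.trim().isEmpty()) {
            return podUid.trim();
        }
        return telemetryId;
    }

    /** Assumed guard: a Forward is a self-loop when sender and receiver ids are equal. */
    static boolean isSelfLoop(String selfNodeId, String senderNodeId) {
        return selfNodeId.equals(senderNodeId);
    }

    public static void main(String[] args) {
        String telemetryId = "0.0.0.0_11800"; // identical on every replica bound to 0.0.0.0

        // Before: both replicas derive the same id, so a peer Forward looks like a loop.
        String oldMain = resolveSelfNodeId(Collections.<String, String>emptyMap(), telemetryId);
        String oldPeer = resolveSelfNodeId(Collections.<String, String>emptyMap(), telemetryId);
        System.out.println("old ids collide: " + isSelfLoop(oldMain, oldPeer));

        // After: each pod carries its own UID (sample values are made up).
        String newMain = resolveSelfNodeId(Collections.singletonMap(UID_ENV, "uid-main"), telemetryId);
        String newPeer = resolveSelfNodeId(Collections.singletonMap(UID_ENV, "uid-peer"), telemetryId);
        System.out.println("new ids collide: " + isSelfLoop(newMain, newPeer));
    }
}
```

The sketch only models identity; routing to the peer is a separate concern that, per the description, MainRouter already handles off pod IPs.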