Fix runtime-rule (MAL/LAL) hot-update in no-init mode and k8s cluster node identity #13909

Open: wu-sheng wants to merge 1 commit into master from fix/runtime-rule-no-init-schema-change

@wu-sheng (Member)

Fix runtime-rule (MAL/LAL hot-update) schema changes in no-init mode, and the runtime-rule cluster node-identity collision on Kubernetes

  • Add a unit test to verify that the fix works.
  • Explain briefly why the bug exists and how to fix it.

Two bugs in the runtime-rule (DSL hot-update) cluster path, both confirmed end-to-end on a local kind cluster:

1. Runtime-rule schema changes were inoperative in no-init mode, the mode every production OAP cluster runs (a one-shot -Dmode=init Job creates the static schema; the OAP Deployment runs -Dmode=no-init). A runtime addOrUpdate introducing a new metric blocked forever in the storage installer's init-node poll loop (ModelInstaller.whenCreating), because the loop was gated on RunningMode rather than on the operation's intent. The /delete?mode=revertToBundled recreate path and BanyanDB in-place shape updates were blocked the same way. Fix: a new StorageManipulationOpt.Flags.deferDDLToInitNode bit, set only on the static boot-time schemaCreateIfAbsent() opt (DRYed into ModelInstaller.deferDDLToInitNode(opt), reused by the BanyanDB shape-check / group-DDL gates). The runtime-rule opts (withSchemaChange / verifySchemaOnly / withoutSchemaChange) are now driven by their flags and by cluster main-ness: no-init and default no longer differ for DSL DDL; init stays the dedicated initializer. DSLManager.tickStorageOpt is collapsed accordingly.

2. Runtime-rule cross-node writes failed with HTTP 400 forward_self_loop on a multi-replica Kubernetes cluster. Every OAP replica shared the cluster selfNodeId 0.0.0.0_11800 (derived from the 0.0.0.0 agent gRPC bind host via TelemetryRelatedContext), so the main's self-loop guard rejected a legitimate peer-to-peer Forward as if it had looped back. Fix: resolve the runtime-rule node identity from the unique per-pod SKYWALKING_COLLECTOR_UID (the pod UID injected by the helm chart / swck operator from metadata.uid) in start(), before any apply; off Kubernetes, it falls back to the telemetry id. MainRouter already routes correctly off the cluster peer addresses (pod IPs); only the self-loop identity needed to be unique.
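
For readers following item 2, the identity resolution and the self-loop guard can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: the class and method names (NodeIdentitySketch, resolveSelfNodeId, isSelfLoop) are hypothetical, and the environment is passed in as a map so the behavior is easy to exercise. Only the SKYWALKING_COLLECTOR_UID variable name and the 0.0.0.0_11800 telemetry id come from the description above.

```java
import java.util.Map;

/** Illustrative only: hypothetical names, not the classes touched by this PR. */
final class NodeIdentitySketch {
    /** Env var named in the PR description; carries the pod UID on Kubernetes. */
    static final String POD_UID_ENV = "SKYWALKING_COLLECTOR_UID";

    /** Prefer the per-pod UID; fall back to the telemetry id when it is absent or blank. */
    static String resolveSelfNodeId(Map<String, String> env, String telemetryId) {
        String podUid = env.get(POD_UID_ENV);
        if (podUid != null && !podUid.trim().isEmpty()) {
            return podUid.trim();
        }
        return telemetryId;
    }

    /** Assumed shape of the guard: a Forward is a self-loop only if the sender id equals ours. */
    static boolean isSelfLoop(String selfNodeId, String senderNodeId) {
        return selfNodeId.equals(senderNodeId);
    }

    public static void main(String[] args) {
        String telemetryId = "0.0.0.0_11800"; // identical on every replica bound to 0.0.0.0

        // Before: both replicas resolve to the telemetry id, so a peer Forward looks like a loop.
        String mainOld = resolveSelfNodeId(Map.of(), telemetryId);
        String peerOld = resolveSelfNodeId(Map.of(), telemetryId);
        System.out.println(isSelfLoop(mainOld, peerOld)); // true

        // After: each pod has its own UID (values here are made up), so the guard lets it through.
        String mainNew = resolveSelfNodeId(Map.of(POD_UID_ENV, "uid-a"), telemetryId);
        String peerNew = resolveSelfNodeId(Map.of(POD_UID_ENV, "uid-b"), telemetryId);
        System.out.println(isSelfLoop(mainNew, peerNew)); // false
    }
}
```

The sketch only changes the identity used by the guard; routing is untouched, which matches the note that MainRouter already routes off the peer addresses.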

Tests: new ModelInstallerNoInitTest (UT) for the no-init create chokepoint; the runtime-rule cluster e2e is converted from docker-compose (default mode — which never exercised either bug) to a kind + skywalking-helm no-init cluster (oap.replicas=2) driving the apply / STRUCTURAL / inactivate / delete lifecycle, cross-node convergence, and the cross-node Forward path.

  • If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #.
  • Update the CHANGES log.

…ode identity

Runtime-rule schema changes were inoperative in no-init mode (the mode every
production OAP cluster runs), and runtime-rule cross-node writes failed on
multi-replica Kubernetes clusters. Both are fixed here.

* no-init schema change: the storage installer's init-node poll loop
  (ModelInstaller.whenCreating) was gated on RunningMode, so a runtime
  withSchemaChange create / update / revert blocked forever on a no-init OAP.
  Gate it instead on a new StorageManipulationOpt.Flags.deferDDLToInitNode bit,
  set only on the static-boot schemaCreateIfAbsent opt and DRYed into
  ModelInstaller.deferDDLToInitNode(opt) (reused by the BanyanDB shape-check and
  group-DDL gates). The runtime-rule opts (withSchemaChange / verifySchemaOnly /
  withoutSchemaChange) are now driven by their flags and by cluster main-ness:
  no-init and default no longer differ for DSL DDL; init stays the dedicated
  initializer. DSLManager.tickStorageOpt is collapsed accordingly.

* k8s node identity: resolve the runtime-rule selfNodeId from the unique per-pod
  SKYWALKING_COLLECTOR_UID (pod UID, injected from metadata.uid) instead of the
  colliding telemetry id (0.0.0.0_11800 under a 0.0.0.0 gRPC bind host), in
  start() before any apply. This fixes HTTP 400 forward_self_loop on the
  cross-node Forward path; MainRouter already routes correctly off pod IPs.

* tests: add ModelInstallerNoInitTest (UT); convert the runtime-rule/cluster e2e
  from docker-compose (default mode, which exercised neither bug) to a kind +
  skywalking-helm no-init cluster (oap.replicas=2) covering the apply / STRUCTURAL
  / inactivate / delete lifecycle, cross-node convergence, and the Forward path.
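
The change of gate described in the first bullet of the commit message can be sketched as below. This is a simplified model under stated assumptions, not the actual ModelInstaller code: the RunningMode enum, the Opt class and both waitsForInitNode methods are hypothetical stand-ins, and the sketch assumes the new helper still consults the running mode in addition to the deferDDLToInitNode flag.

```java
/** Illustrative only: hypothetical stand-ins for the real storage installer types. */
final class DeferGateSketch {
    enum RunningMode { DEFAULT, INIT, NO_INIT }

    /** Stand-in for StorageManipulationOpt, carrying only the flag under discussion. */
    static final class Opt {
        final boolean deferDDLToInitNode;

        private Opt(boolean deferDDLToInitNode) {
            this.deferDDLToInitNode = deferDDLToInitNode;
        }

        /** Static boot-time path: the only opt that sets the flag. */
        static Opt schemaCreateIfAbsent() {
            return new Opt(true);
        }

        /** Runtime-rule path: performs its own DDL, so the flag stays off. */
        static Opt withSchemaChange() {
            return new Opt(false);
        }
    }

    /** Old gate: any create on a no-init node waits for the init node (opt is ignored). */
    static boolean waitsForInitNodeOld(RunningMode mode, Opt opt) {
        return mode == RunningMode.NO_INIT;
    }

    /** New gate (assumed shape): wait only when the operation itself asks to defer. */
    static boolean waitsForInitNodeNew(RunningMode mode, Opt opt) {
        return mode == RunningMode.NO_INIT && opt.deferDDLToInitNode;
    }

    public static void main(String[] args) {
        Opt boot = Opt.schemaCreateIfAbsent();
        Opt runtime = Opt.withSchemaChange();

        // Old behavior: a runtime schema change on a no-init node polls forever.
        System.out.println(waitsForInitNodeOld(RunningMode.NO_INIT, runtime)); // true
        // New behavior: the runtime change proceeds; boot-time creation still defers.
        System.out.println(waitsForInitNodeNew(RunningMode.NO_INIT, runtime)); // false
        System.out.println(waitsForInitNodeNew(RunningMode.NO_INIT, boot));    // true
    }
}
```

Keying the decision on the operation rather than on the process-wide mode is what lets boot-time and runtime DDL behave differently inside the same no-init OAP.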

wu-sheng added this to the 11.0.0 milestone on Jun 13, 2026
wu-sheng added the bug label (Something isn't working and you are sure it's a bug!) on Jun 13, 2026