mesh-granularity=location: silently degraded leader blackholes relayed traffic with no failover #489

@kvaps

Summary

With --mesh-granularity=location, all cross-location traffic to non-leader
nodes is relayed through the location's WireGuard leader. Kilo only re-elects
a leader when the previous one goes NotReady or is deleted from the cluster.
It has no mechanism to detect a leader whose Kubernetes state is Ready but
whose underlying packet forwarding is broken. When this happens, every
non-leader node in the location stays unreachable from other locations until
the failing node is removed, which typically takes several minutes.

Reproducer / observed scenario

Hybrid cluster: on-prem location (3 nodes) + cloud-burst location (Azure VMSS,
2-3 nodes). Cilium with VXLAN overlay as CNI, Kilo for the inter-location
WireGuard mesh, --mesh-granularity=location, --compatibility=cilium.

The cloud leader experienced a kernel/hypervisor-level VXLAN forwarding fault:

  • WireGuard tunnel up, handshakes OK
  • kubelet/API server responsive (Ready)
  • ICMP ping to the leader and through it: OK
  • Cilium control-plane health: OK (existing flows)
  • New VXLAN-encapsulated flows: enter cilium_vxlan on the leader but never
    appear on the underlying NIC
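
Since ICMP and existing flows kept working in this failure, a check that would have caught it has to open a new flow on every attempt. The snippet below is a minimal sketch of such a check, not anything Kilo ships: `probe_new_flow` is a hypothetical helper, and it assumes that a TCP connection to a pod or service address behind the leader takes the same relayed path as the traffic that was being dropped.

```python
import socket


def probe_new_flow(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a brand-new TCP connection to (host, port) completes.

    Hypothetical helper for illustration only; not part of Kilo or Cilium.
    """
    try:
        # create_connection() binds a fresh ephemeral source port on each call,
        # so every probe is a new 5-tuple and cannot ride on an existing flow.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable routes.
        return False
```

Run from a node in the other location against an address hosted on a non-leader node, a check like this would fail while a ping to the same node still succeeds.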

Effect: every pod on a non-leader node in that location was unreachable from
the other location for the full window (~5–15 min), until the node eventually
went NotReady and the autoscaler removed it. From Kilo's point of view the
leader was healthy throughout, so no re-election happened.

Why the current model doesn't catch it

Leader election is driven by Kubernetes node readiness, not by the actual
forwarding capability of the leader. There's no equivalent of a periodic
forwarding probe between leaders/peers, so any fault below the kubelet level
(NIC driver, hypervisor, kernel datapath) can blackhole relayed traffic for
as long as the node looks Ready.
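
The gap can be shown with a toy model. This is a simplified sketch, not Kilo's actual election code: the `Node` type, the `forwards` field, and the "first Ready node by name" rule are all assumptions made for illustration. The point is only that an election that reads nothing but readiness cannot tell the two nodes apart.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    ready: bool     # what the Kubernetes API reports
    forwards: bool  # whether the datapath really relays packets (not visible to the API)


def elect_by_readiness(nodes: List[Node]) -> Optional[Node]:
    """Pick the first Ready node by name; forwarding state is never consulted."""
    candidates = sorted((n for n in nodes if n.ready), key=lambda n: n.name)
    return candidates[0] if candidates else None


cloud = [
    Node("cloud-0", ready=True, forwards=False),  # degraded leader from the scenario
    Node("cloud-1", ready=True, forwards=True),
]
leader = elect_by_readiness(cloud)
print(leader.name, leader.forwards)  # cloud-0 False
```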

Possible directions (open for discussion)

  • Active forwarding probe between location leaders (and from leaders to
    non-leaders) — promote a non-leader if the current leader fails the probe,
    independent of node readiness.
  • Allow per-location leader redundancy: e.g. multiple leaders per location
    with ECMP/anycast on the WG endpoint, so a degraded one is bypassed by
    routing rather than re-election.
  • A hybrid granularity: leaders per location for non-leader nodes that don't
    have a public IP, plus direct tunnels between any nodes that do have one.
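
To make the first direction concrete, here is a rough sketch of probe-driven promotion. Everything in it is hypothetical: `LeaderMonitor`, the `probe` callback, and the `max_failures` threshold are illustrative names, and requiring several consecutive failures before switching is one possible way to avoid flapping on a single lost probe, not a proposal for the exact policy.

```python
from typing import Callable, List, Optional


class LeaderMonitor:
    """Tracks one location's leader and replaces it after repeated probe failures."""

    def __init__(self, candidates: List[str], probe: Callable[[str], bool],
                 max_failures: int = 3) -> None:
        if not candidates:
            raise ValueError("at least one candidate node is required")
        self.candidates = list(candidates)
        self.probe = probe              # returns True if the node forwards traffic
        self.max_failures = max_failures
        self.leader = self.candidates[0]
        self.failures = 0

    def tick(self) -> str:
        """Run one probe round and return the leader in effect afterwards."""
        if self.probe(self.leader):
            self.failures = 0
            return self.leader
        self.failures += 1
        if self.failures >= self.max_failures:
            replacement = self._next_healthy()
            if replacement is not None:  # keep the old leader if nothing better exists
                self.leader = replacement
                self.failures = 0
        return self.leader

    def _next_healthy(self) -> Optional[str]:
        for name in self.candidates:
            if name != self.leader and self.probe(name):
                return name
        return None


# Simulated location where cloud-0 is Ready but does not forward.
broken = {"cloud-0"}
monitor = LeaderMonitor(["cloud-0", "cloud-1"], probe=lambda n: n not in broken)
history = [monitor.tick() for _ in range(4)]
print(history)  # ['cloud-0', 'cloud-0', 'cloud-1', 'cloud-1']
```

Node readiness never appears in this loop, which is the property the first bullet asks for: the leader is replaced on the third failed probe even though it would still be reported Ready.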

Environment

  • Kilo 0.7.0 (upstream)
  • Cilium with VXLAN overlay, kubeProxyReplacement=true
  • Talos Linux on cloud-burst VMSS, on-prem control planes
  • --mesh-granularity=location, --compatibility=cilium,
    --encapsulate=crosssubnet, --mtu=auto, --internal-cidr=$(NODE_IP)/32
