Summary
With --mesh-granularity=location, all cross-location traffic to non-leader
nodes is relayed through the location's WireGuard leader. Kilo only re-elects
a leader when the previous one is fully removed from the cluster
(NotReady/deleted). It has no mechanism to detect a leader whose Kubernetes
state is Ready but whose underlying packet forwarding is broken. When this
happens, every non-leader node in the location becomes unreachable from other
locations until the failing node is fully removed — typically several minutes.
Reproducer / observed scenario
Hybrid cluster: on-prem location (3 nodes) + cloud-burst location (Azure VMSS,
2-3 nodes). Cilium with VXLAN overlay as CNI, Kilo for the inter-location
WireGuard mesh, --mesh-granularity=location, --compatibility=cilium.
The cloud leader experienced a kernel/hypervisor-level VXLAN forwarding fault:
- WireGuard tunnel up, handshakes OK
- kubelet/API server responsive (Ready)
- ICMP ping to the leader and through it: OK
- Cilium control-plane health: OK (existing flows)
- New VXLAN-encapsulated flows: enter cilium_vxlan on the leader but never
appear on the underlying NIC
Effect: every pod on a non-leader node in that location was unreachable from
the other location for the full window (~5–15 min), until the node eventually
went NotReady and the autoscaler removed it. From Kilo's point of view the
leader was healthy throughout, so no re-election happened.
Why the current model doesn't catch it
Leader election is driven by Kubernetes node readiness, not by the actual
forwarding capability of the leader. There's no equivalent of a periodic
forwarding probe between leaders/peers, so any fault below the kubelet level
(NIC driver, hypervisor, kernel datapath) can blackhole relayed traffic
indefinitely while the node looks Ready.
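To make the gap concrete, here is a deliberately simplified model of readiness-driven election. It is a sketch, not Kilo's actual election code: the `Node` type, the `Forwarding` field, the node names, and the "first Ready node by name" rule are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical, simplified view of a cluster node (not a Kilo type).
type Node struct {
	Name       string
	Ready      bool // Kubernetes node readiness, the only signal the election sees
	Forwarding bool // whether relayed packets actually leave the NIC; invisible to Kubernetes
}

// electByReadiness returns the first Ready node in name order.
// It never looks at Forwarding, which is the gap described above.
func electByReadiness(nodes []Node) string {
	sorted := append([]Node(nil), nodes...) // copy so the caller's slice is not reordered
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })
	for _, n := range sorted {
		if n.Ready {
			return n.Name
		}
	}
	return "" // no eligible node
}

func main() {
	nodes := []Node{
		{Name: "vmss-0", Ready: true, Forwarding: false}, // faulty leader, still Ready
		{Name: "vmss-1", Ready: true, Forwarding: true},
	}
	fmt.Println(electByReadiness(nodes)) // vmss-0: the broken node keeps the role

	nodes[0].Ready = false               // only once the node finally goes NotReady...
	fmt.Println(electByReadiness(nodes)) // vmss-1: ...does the role move
}
```

In this model the healthy node is passed over for as long as the faulty one stays Ready, which matches the window observed in the incident.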
Possible directions (open for discussion)
- Active forwarding probe between location leaders (and from leaders to
non-leaders) — promote a non-leader if the current leader fails the probe,
independent of node readiness.
- Allow per-location leader redundancy: e.g. multiple leaders per location
with ECMP/anycast on the WG endpoint, so a degraded one is bypassed by
routing rather than re-election.
- A hybrid granularity: per-location leaders relay for nodes that don't
have a public IP, plus direct tunnels between any nodes that do have one.
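As a rough illustration of the first direction, the sketch below tracks consecutive probe failures and stops trusting the leader once a threshold is reached. Everything here is hypothetical: `ProbeFunc`, `tcpProbe`, `LeaderHealth`, the threshold, and the target address are invented for this example, and nothing like this exists in Kilo today. The probe opens a fresh TCP connection each time because, in the incident above, ICMP and existing flows kept working while new encapsulated flows were dropped.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// ProbeFunc reports whether a single probe through the leader succeeded.
type ProbeFunc func() bool

// tcpProbe returns a ProbeFunc that opens a new TCP connection to addr.
// addr is assumed to be a pod or service behind the remote leader, so the
// connection has to traverse the relayed (encapsulated) path.
func tcpProbe(addr string, timeout time.Duration) ProbeFunc {
	return func() bool {
		conn, err := net.DialTimeout("tcp", addr, timeout)
		if err != nil {
			return false
		}
		conn.Close()
		return true
	}
}

// LeaderHealth counts consecutive probe failures for one leader.
type LeaderHealth struct {
	Threshold int // consecutive failures after which the leader is no longer trusted
	failures  int
}

// Observe records one probe result and reports whether the leader is still trusted.
// Any success resets the failure counter.
func (h *LeaderHealth) Observe(ok bool) bool {
	if ok {
		h.failures = 0
	} else {
		h.failures++
	}
	return h.failures < h.Threshold
}

func main() {
	// Hypothetical target; not called here so the example runs offline.
	probe := tcpProbe("10.4.1.10:8080", 2*time.Second)
	_ = probe

	h := &LeaderHealth{Threshold: 3}
	// Simulated probe results: one success, then a sustained failure.
	for i, ok := range []bool{true, false, false, false} {
		fmt.Printf("probe %d ok=%v trusted=%v\n", i, ok, h.Observe(ok))
	}
	// The fourth line prints trusted=false: three failures in a row.
}
```

Requiring several consecutive failures is one way to avoid flapping on a single lost packet. The right threshold and probe interval would depend on how quickly a location can tolerate a leader change.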
Environment
- Kilo 0.7.0 (upstream)
- Cilium with VXLAN overlay, kubeProxyReplacement=true
- Talos Linux on cloud-burst VMSS, on-prem control planes
- Kilo flags: --mesh-granularity=location, --compatibility=cilium,
--encapsulate=crosssubnet, --mtu=auto, --internal-cidr=$(NODE_IP)/32