mesh-granularity=location: silently degraded leader blackholes relayed traffic with no failover

## Summary

With `--mesh-granularity=location`, all cross-location traffic to non-leader
nodes is relayed through the location's WireGuard leader. Kilo re-elects a
leader only when the current one goes `NotReady` or is deleted. It has no
mechanism to detect a leader whose Kubernetes state is `Ready` but whose
underlying packet forwarding is broken. When this happens, every non-leader
node in the location becomes unreachable from other locations until the
failing node is removed, which typically takes several minutes.

## Reproducer / observed scenario

Hybrid cluster: on-prem location (3 nodes) + cloud-burst location (Azure VMSS,
2-3 nodes). Cilium with VXLAN overlay as CNI, Kilo for the inter-location
WireGuard mesh, `--mesh-granularity=location`, `--compatibility=cilium`.

The cloud leader experienced a kernel/hypervisor-level VXLAN forwarding fault:
- WireGuard tunnel up, handshakes OK
- kubelet/API server responsive (`Ready`)
- ICMP ping to the leader and through it: OK
- Cilium control-plane health: OK (existing flows)
- New VXLAN-encapsulated flows: enter `cilium_vxlan` on the leader but never
  appear on the underlying NIC

Effect: every pod on a non-leader node in that location was unreachable from
the other location for the full window (~5–15 min), until the leader
eventually went `NotReady` and the autoscaler removed it. From Kilo's point of
view the leader was healthy throughout, so no re-election happened.

## Why the current model doesn't catch it

Leader election is driven by Kubernetes node readiness, not by the actual
forwarding capability of the leader. There's no equivalent of a periodic
forwarding probe between leaders/peers, so any fault below the kubelet level
(NIC driver, hypervisor, kernel datapath) can blackhole relayed traffic
indefinitely while the node looks `Ready`.
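
To make the gap concrete, here is a minimal Go sketch of a readiness-only election. It is not Kilo's actual code: the `Node` type, its `Forwarding` field, and `electByReadiness` are hypothetical names for illustration, and the "first `Ready` node by name" rule is an assumed simplification of the real selection logic.

```go
package main

import "sort"

// Node is a hypothetical, simplified view of a cluster node for this sketch.
type Node struct {
	Name       string
	Location   string
	Ready      bool // what the Kubernetes API reports
	Forwarding bool // whether the datapath really moves packets; invisible to the election
}

// electByReadiness returns the name of the first Ready node (sorted by name)
// in the given location, or "" if no node there is Ready.
// Note that Forwarding is never consulted.
func electByReadiness(nodes []Node, location string) string {
	var candidates []string
	for _, n := range nodes {
		if n.Location == location && n.Ready {
			candidates = append(candidates, n.Name)
		}
	}
	if len(candidates) == 0 {
		return ""
	}
	sort.Strings(candidates)
	return candidates[0]
}
```

With a leader that is `Ready: true, Forwarding: false`, this function keeps returning the same node until `Ready` flips to `false`, which matches the behaviour observed above.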

## Possible directions (open for discussion)

- Active forwarding probe between location leaders (and from leaders to
  non-leaders) — promote a non-leader if the current leader fails the probe,
  independent of node readiness.
- Allow per-location leader redundancy: e.g. multiple leaders per location
  with ECMP/anycast on the WG endpoint, so a degraded one is bypassed by
  routing rather than re-election.
- A hybrid granularity: relay through a per-location leader for nodes that
  don't have a public IP, plus direct tunnels between any nodes that do have
  one.
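
The first direction could look roughly like the Go sketch below. Everything in it is an assumption for discussion: `Candidate`, `ProbeTracker`, and `Elect` are hypothetical names, not existing Kilo APIs, and how the probe result is obtained (for example a new encapsulated flow sent through the leader) is left out. The sketch only shows the election side: a leader is skipped after a configurable number of consecutive probe failures, regardless of node readiness.

```go
package main

import "sort"

// Candidate is a hypothetical leader candidate within one location.
type Candidate struct {
	Name  string
	Ready bool
}

// ProbeTracker counts consecutive forwarding-probe failures per node.
type ProbeTracker struct {
	Threshold int // consecutive failures after which a node is considered degraded
	failures  map[string]int
}

// NewProbeTracker returns a tracker with the given failure threshold.
func NewProbeTracker(threshold int) *ProbeTracker {
	return &ProbeTracker{Threshold: threshold, failures: make(map[string]int)}
}

// Record stores one probe result; a success resets the failure counter.
func (t *ProbeTracker) Record(node string, ok bool) {
	if ok {
		delete(t.failures, node)
		return
	}
	t.failures[node]++
}

// Healthy reports whether the node is still below the failure threshold.
func (t *ProbeTracker) Healthy(node string) bool {
	return t.failures[node] < t.Threshold
}

// Elect returns the first Ready candidate (sorted by name) that passes the
// forwarding probe. If every Ready candidate is failing, it falls back to
// readiness-only election so the location is never left without a leader.
func Elect(cands []Candidate, t *ProbeTracker) string {
	var ready []string
	for _, c := range cands {
		if c.Ready {
			ready = append(ready, c.Name)
		}
	}
	sort.Strings(ready)
	for _, name := range ready {
		if t.Healthy(name) {
			return name
		}
	}
	if len(ready) > 0 {
		return ready[0]
	}
	return ""
}
```

The threshold avoids flapping on a single lost probe, and the fallback covers the case where the prober itself is the broken party. Both choices are open to debate.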

## Environment

- Kilo `0.7.0` (upstream)
- Cilium with VXLAN overlay, `kubeProxyReplacement=true`
- Talos Linux on cloud-burst VMSS, on-prem control planes
- `--mesh-granularity=location`, `--compatibility=cilium`,
  `--encapsulate=crosssubnet`, `--mtu=auto`, `--internal-cidr=$(NODE_IP)/32`

mesh-granularity=location: silently degraded leader blackholes relayed traffic with no failover #489

Summary

Reproducer / observed scenario

Why the current model doesn't catch it

Possible directions (open for discussion)

Environment

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

mesh-granularity=location: silently degraded leader blackholes relayed traffic with no failover #489

Description

Summary

Reproducer / observed scenario

Why the current model doesn't catch it

Possible directions (open for discussion)

Environment

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions