Conversation
Signed-off-by: Vipul Singh <singhvipul@microsoft.com>
Hi, thanks for the proposal! I skimmed through this CFP and left a few comments; this is more of a quick pass than a full review.
Overall, I think this proposal mostly focuses on the why and seems to be missing some key details on the how. For example, would you use the pkg/clustermesh code and treat the local cluster almost like any other remote cluster? Something else? There's also the Service aspect: you didn't say explicitly whether this mode uses data from Kubernetes or from the clustermesh-apiserver. EDIT: never mind, this is discussed below.
> the same cluster whose agents depend on it creates a bootstrap problem:
> agents need ClusterMesh etcd for control plane state, but the ClusterMesh
> API Server itself may need functioning agents for pod networking or Service
> IP routing. To address this, we propose two options:
Unless there is a clear advantage to having different modes, I would suggest supporting only one. There are already many options and combinations within clustermesh/kvstore/kvstoremesh, and adding more options and ways of operating sounds difficult to maintain and keep in mind in the long run. So IMO we should have a very clear reason to support multiple options here if we want to go forward with that.
> * **Multi-cluster integration**: Ensure the single-cluster KVStore mode
>   composes cleanly with existing multi-cluster ClusterMesh deployments,
>   allowing clusters that use this optimization to also participate in
>   cross-cluster service discovery and identity sharing.
I would tend to think that this should be fairly straightforward; any reason to leave it for future milestones?
> * **Centralized control plane operations**: Move compute-intensive control
>   plane activities such as network policy calculation and ipcache computation
>   into the ClusterMesh API Server, reducing per-agent CPU overhead and
>   enabling consistent, cluster-wide policy evaluation. (Currently a non-goal
>   for this CFP, but can be a natural evolution of the architecture.)
Is that desirable? The clustermesh-apiserver is treated as a cache here, so each clustermesh-apiserver would need to compute the same thing independently (and those computations would have to be consistent enough), or we would also have to use the kube-apiserver to perform it, which would be similar to the operator 🤔
Agree with Arthur that this looks out of scope here, and likely problematic with the cache-based approach.
> * **Direct writes to ClusterMesh etcd**: Eliminate the duplication overhead
>   where every CRD mutation traverses both the Kubernetes API server and
>   ClusterMesh etcd. Agents and the operator would write Cilium state directly
>   to the ClusterMesh etcd, removing the extra API server round-trip and
>   reducing end-to-end propagation latency. (Currently a non-goal for this
>   CFP, but can be an evolution of this work.)
Marco, Tamilmani and I were discussing on Slack that this just looks like the kvstore mode, and that due to the ephemeral nature of ClusterMesh etcd it would not really work unless you make it persistent, which would make it almost exactly the kvstore mode. So I don't think it's a desirable milestone, and I would suggest removing it, or possibly changing it to, as Marco was saying on Slack, improving the existing kvstore mode.
Agree with Arthur that this would be an improvement of the existing KVStore mode, and not compatible with the approach of using ephemeral etcd instances.
> | Parameter | Value |
> | --- | --- |
> | **Namespaces** | 1,000 |
> | **Pods** | 40,000 |
> | **Deployments** | 4,000 |
> | **Pod Deployment Rate** | 100 pods/sec |
Does that deploy some Services as well? If not, it would be interesting to include them in this benchmark to better match reality and reflect which part of the load we are offloading to the clustermesh-apiserver.
> as they do today. Additionally, Kubernetes-native resources such as Services,
> Endpoints, Pods, and Nodes are not covered by this proposal; agents continue
> to watch these directly from the API server.
Could you add some details on the reasoning behind this part?
> *(charts: clustermesh-apiserver CPU and memory usage during the test)*
> ClusterMesh API server containers (i.e. apiserver and etcd) CPU and memory usage remained stable during the test, with no significant spikes during workload churn.
It might be nice to have some quantile numbers or a graph of memory/CPU usage. For instance, it is said here that the clustermesh-apiserver etcd was stable in terms of CPU; does that mean it was constantly running at 3 CPUs (the max on the chart provided)? A nice way (if actually possible) to show this would be a graph of CPU usage in the two modes, for both the k8s apiserver and the clustermesh-apiserver, on the same graph, and likewise for memory.
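For what it's worth, one hedged way to extract such quantile numbers from the metrics already listed in the CFP could be PromQL along these lines (the `container` label selector and windows are assumptions; adjust to the actual deployment):

```promql
# p95 of the 5m-averaged CPU usage of the clustermesh-apiserver etcd
# container over a 1h test window (label selector is an assumption)
quantile_over_time(0.95,
  rate(container_cpu_usage_seconds_total{container="etcd"}[5m])[1h:5m])

# p95 memory usage over the same window
quantile_over_time(0.95,
  container_memory_usage_bytes{container="etcd"}[1h:5m])
```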
@MrFreezeex Thanks for the review. Yes, the implementation reuses
**giorio94** left a comment
Thanks for the proposal!
A few initial high-level comments from my side as well.
> @@ -0,0 +1,315 @@
> # CFP-44774: Cilium Control Plane at Scale
>
> **SIG: SIG Scalability, SIG-ClusterMesh**
It makes sense IMO to involve the Scalability and ClusterMesh SIGs for the initial design discussion, but I don't think it is sustainable to extend the scope of these SIGs to also oversee the new logic possibly resulting from this CFP. Both because it falls outside their main scope (it does not pertain to multi-cluster, and SIG Scalability should oversee scalability-related topics but not end up owning every improvement that happens to target scalability), and because both teams are seriously understaffed at the moment.
> separately provisioned and managed etcd cluster, and it only partially reduces
> API server pressure because endpoint slices and node objects continue to be
> served through CRD watches. A solution that offloads all agent CRD watches
I don't follow the second part of the sentence (assuming you refer to CiliumEndpointSlice and CiliumNodes). When Cilium runs in KVStore mode, identity, endpoint and node entries are all watched from etcd, rather than the Kubernetes API Server. There are certain limitations and feature incompatibilities given that the overall trend in recent years has been to focus more on CRD mode, but these are orthogonal aspects that can be resolved separately.
> * Utilize ClusterMesh etcd as an alternative datastore, avoiding the need to
>   provision and maintain a separate external KVStore.
To me, a key prerequisite for adding support for a new datastore is a unified abstraction (likely based on statedb) that decouples the downstream consumers from the actual ingestion logic, which I don't see discussed in this CFP.
Currently, the code paths for ingesting data in CRD and KVStore mode (as well as from clustermesh) are diverse and fairly independent, mostly for historical reasons. However, that has proven to be a constant pain point, leading to feature incompatibilities, caveats that apply only to one method or the other (e.g., around connection disruption on agent restart), and overall complexity in general.
I don't think it is viable to introduce yet another datastore without carefully rethinking the supporting architecture, as it would lead to even more feature incompatibilities and complexity.
> To reduce Kubernetes API server load, we propose utilizing the existing
> ClusterMesh etcd as a non-persistent, cache-backed datastore for agent
From a deployment perspective, I think it would be appropriate to keep the proposed datastore separate from the existing clustermesh-apiserver, for better isolation both in terms of permissions and to prevent possible interference if, e.g., certain replicas end up overloaded by either in-cluster or cross-cluster data synchronization. Potential failures also have very different consequences and implications in the two cases. The underlying binary may be reused if appropriate, though.
> 1. **Pod networking mode (recommended)**: The ClusterMesh API Server runs as
>    a regular pod (no host networking) and is exposed via a Kubernetes
>    Service. Agents attempt to connect to ClusterMesh etcd at startup; if the
>    connection fails or is not yet available, agents fall back to reading
>    CRDs directly from the Kubernetes API server. Once the ClusterMesh API
>    Server becomes reachable, agents switch over to reading from ClusterMesh
>    etcd. This mode requires no special scheduling or networking
>    configuration and is operationally identical to a standard Cilium
>    deployment.
As mentioned in Slack, Cilium historically included a handover mechanism similar to the one proposed here, to support KVStore mode with etcd running in pod network. This turned out to be a never-ending source of bugs and inconsistencies, because the two sources are not perfectly aligned at any point in time. Additionally, it leads to unnecessary overhead on the Kubernetes API Server if the informers are started only to be terminated immediately after successfully connecting to etcd. That support got ripped out a few versions ago, and IMO we really need to stick to a single source of truth, either K8s or etcd, and not do handovers in case of failures or the like.
> directly to those IPs. This avoids depending on Service IP datapath
> programming (which Cilium would need to provide) but
> introduces host port conflicts and requires agents to maintain a
> lightweight watch on the ClusterMesh API Server pods for IP discovery
> and failover when pods reschedule to different nodes.
We already bypass datapath-level service load balancing when connecting to KVStoreMesh, performing Service DNS name resolution and backend selection via a custom dialer, to break possible chicken-and-egg dependencies. The same logic could potentially be adopted here, without having to watch the pods themselves.
> | Parameter | Value |
> | --- | --- |
> | **Namespaces** | 1,000 |
> | **Pods** | 40,000 |
> | **Deployments** | 4,000 |
> | **Pod Deployment Rate** | 100 pods/sec |
Which settings are you using for the clustermesh-apiserver? I assume not the default ones, as the etcd client is configured with 20 QPS, which could not sustain a pod deployment rate of 100/sec.
> | Signal | Prometheus metric |
> | --- | --- |
> | **API Server Request/Response Latency** | `apiserver_request_duration_seconds_bucket` |
> | **ClusterMesh API Server CPU** | `container_cpu_usage_seconds_total` |
> | **ClusterMesh API Server Memory** | `container_memory_usage_bytes` |
> | **Pod Startup Latency** | `kubelet_pod_start_sli_duration_seconds_bucket` |
How did you measure pod readiness? Time before the pod is marked as ready, or time before another pod is able to successfully connect to the new pod, subject to network policies (i.e., all information propagated to the other agents)? The former may be misleading, if the hosting agent does not require any information to be able to start the pod.
> * **Direct writes to ClusterMesh etcd**: Eliminate the duplication overhead
>   where every CRD mutation traverses both the Kubernetes API server and
>   ClusterMesh etcd. Agents and the operator would write Cilium state directly
>   to the ClusterMesh etcd, removing the extra API server round-trip and
>   reducing end-to-end propagation latency. (Currently a non-goal for this
>   CFP, but can be an evolution of this work.)
Agree with Arthur that this would be an improvement of the existing KVStore mode, and not compatible with the approach of using ephemeral etcd instances.
> * **Centralized control plane operations**: Move compute-intensive control
>   plane activities such as network policy calculation and ipcache computation
>   into the ClusterMesh API Server, reducing per-agent CPU overhead and
>   enabling consistent, cluster-wide policy evaluation. (Currently a non-goal
>   for this CFP, but can be a natural evolution of the architecture.)
Agree with Arthur that this looks out of scope here, and likely problematic with the cache-based approach.
This CFP proposes reusing the existing ClusterMesh etcd as a non-persistent, cache-backed data distribution layer so agents read Cilium CRDs from etcd instead of the API server, reducing API server load at scale.
Issue: cilium/cilium#44774