CFP-44774: Cilium Control Plane at Scale #88

Open · vipul-21 wants to merge 1 commit into cilium:main from vipul-21:singvipul/ccp-CFP

Conversation

@vipul-21

This CFP proposes reusing the existing ClusterMesh etcd as a non-persistent, cache-backed data distribution layer so agents read Cilium CRDs from etcd instead of the API server, reducing API server load at scale.

Issue: cilium/cilium#44774

Signed-off-by: Vipul Singh <singhvipul@microsoft.com>
@MrFreezeex (Member) left a comment:

Hi, thanks for the proposal! I skimmed through this CFP and left a few comments; this is more of a quick pass than a full review.

Overall I think this proposal is mostly focused on the why, and it sounds like it's missing some key details on how you would implement this. For example, would you use the pkg/clustermesh code and treat the local cluster almost like any other remote cluster? Something else? There's also the Service aspect: you didn't say explicitly whether you would use the data from Kubernetes or from the clustermesh-apiserver in this mode. EDIT: never mind, this is discussed.

> the same cluster whose agents depend on it creates a bootstrap problem:
> agents need ClusterMesh etcd for control plane state, but the ClusterMesh
> API Server itself may need functioning agents for pod networking or Service
> IP routing. To address this, we propose two options:
@MrFreezeex (Member) commented Mar 13, 2026:

Unless there is a clear advantage to having different modes, I would suggest supporting only one. There are already many different options and combinations within clustermesh/kvstore/kvstoremesh, and adding more options and code paths sounds difficult to maintain and keep in mind in the long run. So IMO we should have a very clear reason to support multiple options here if we want to go forward with that.

Comment on lines +312 to +315
> * **Multi-cluster integration**: Ensure the single-cluster KVStore mode
>   composes cleanly with existing multi-cluster ClusterMesh deployments,
>   allowing clusters that use this optimization to also participate in
>   cross-cluster service discovery and identity sharing.
A member commented:

I would tend to think that this should be fairly straightforward; any reason to leave this for future milestones?

Comment on lines +307 to +311
> * **Centralized control plane operations**: Move compute-intensive control
>   plane activities such as network policy calculation and ipcache computation
>   into the ClusterMesh API Server, reducing per-agent CPU overhead and
>   enabling consistent, cluster-wide policy evaluation. (Currently a non-goal
>   for this CFP, but can be a natural evolution of the architecture.)
@MrFreezeex (Member) commented Mar 13, 2026:

Is that desirable? The clustermesh-apiserver is treated as a cache here, so we would need each clustermesh-apiserver to compute the same thing independently (and to ensure those results are consistent enough), or to also use the kube-apiserver for that computation, which would be similar to the operator 🤔

A member commented:

Agree with Arthur that this looks out of scope here, and likely problematic with the cache-based approach.

Comment on lines +301 to +306
> * **Direct writes to ClusterMesh etcd**: Eliminate the duplication overhead
>   where every CRD mutation traverses both the Kubernetes API server and
>   ClusterMesh etcd. Agents and the operator would write Cilium state directly
>   to the ClusterMesh etcd, removing the extra API server round-trip and
>   reducing end-to-end propagation latency. (Currently a non-goal for this
>   CFP, but can be an evolution of this work.)
@MrFreezeex (Member) commented Mar 13, 2026:

Marco, Tamilmani, and I were discussing on Slack that this just looks like kvstore mode, and that due to the ephemeral nature of the ClusterMesh etcd it would not really work unless you make it persistent, which would make it almost exactly the kvstore mode. So I don't think it's a desirable milestone; I would suggest removing it, or possibly changing it (as Marco was saying on Slack) to improving the existing kvstore mode.

A member commented:

Agree with Arthur that this would be an improvement of the existing KVStore mode, and not compatible with the approach of using ephemeral etcd instances.

Comment on lines +224 to +227
> | **Namespaces** | 1,000 |
> | **Pods** | 40,000 |
> | **Deployments** | 4,000 |
> | **Pod Deployment Rate** | 100 pods/sec |
A member commented:

Does that deploy some Services as well? If not, it sounds interesting to add them to this benchmark, to better match reality and reflect what part of the load we are offloading to the clustermesh-apiserver.

Comment on lines +79 to +81
> as they do today. Additionally, Kubernetes-native resources such as Services,
> Endpoints, Pods, and Nodes are not covered by this proposal; agents continue
> to watch these directly from the API server.
A member commented:

Could you add some details on the reasoning behind this part?


> ![Embedded etcd CPU and memory usage over time in ClusterMesh mode](./images/CFP-44774-cilium-clustermesh-etcd-cpu-memory.png)
>
> Clustermesh API server containers (i.e., apiserver and etcd) CPU and memory usage remained stable during the test, with no significant spikes during workload churn.
@MrFreezeex (Member) commented Mar 13, 2026:

It might be nice to have some quantile numbers or a graph of memory/CPU usage. For instance, it is said here that the clustermesh-apiserver etcd was stable in terms of CPU; does that mean it's always running at 3 CPUs (the max on the chart provided)? A nice way (if actually possible) to show it could be a graph of CPU for the two modes, for both the k8s apiserver and the clustermesh-apiserver, on the same chart, and likewise for memory.

@vipul-21 (Author) commented Mar 16, 2026:

@MrFreezeex Thanks for the review. Yes, the implementation reuses pkg/clustermesh. The local cluster data is synced to the clustermesh API server’s etcd (similar to how remote cluster data is handled). Each agent then watches the same etcd to retrieve the required information.
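For illustration, the sync-then-watch flow described here can be sketched with an in-memory store standing in for the ClusterMesh etcd. All names below are hypothetical and do not reflect the actual pkg/clustermesh API; the point is only the shape of the pattern (operator/sync side writes, agents subscribe instead of watching the k8s API server):

```go
package main

import (
	"fmt"
	"sync"
)

// Store is an in-memory stand-in for the ClusterMesh etcd: sync logic
// mirrors local-cluster state into it, and agents subscribe to changes.
type Store struct {
	mu       sync.Mutex
	data     map[string]string
	watchers []chan string // each event carries the key that changed
}

func NewStore() *Store { return &Store{data: map[string]string{}} }

// Sync mirrors a resource into the store (what the local-cluster sync
// would do) and notifies all registered watchers.
func (s *Store) Sync(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
	for _, w := range s.watchers {
		w <- key
	}
}

// Watch registers an agent-side watcher and returns its event channel.
func (s *Store) Watch() <-chan string {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch := make(chan string, 16)
	s.watchers = append(s.watchers, ch)
	return ch
}

// Get reads the current value for a key, as an agent would on an event.
func (s *Store) Get(key string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[key]
}

func main() {
	store := NewStore()
	events := store.Watch() // agent subscribes before data arrives

	// Sync side: mirror a CiliumIdentity-like record into the store.
	store.Sync("cilium/state/identities/v1/id/1234", "app=frontend")

	key := <-events // agent is woken by the watch event, not by polling k8s
	fmt.Println(key, "->", store.Get(key))
}
```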

@giorio94 (Member) left a comment:

Thanks for the proposal!

A few initial high-level comments from my side as well.

> @@ -0,0 +1,315 @@
> # CFP-44774: Cilium Control Plane at Scale
>
> **SIG: SIG Scalability, SIG-ClusterMesh**
A member commented:

It makes sense IMO to involve the Scalability and ClusterMesh SIGs in the initial design discussion, but I don't think it is sustainable to extend the scope of these SIGs to also oversee the new logic possibly resulting from this CFP. Both because it is outside of their main scope (it does not pertain to multi-cluster, and SIG Scalability should oversee scalability-related topics, but not end up owning every improvement potentially targeting improved scalability), and because both teams are seriously understaffed at the moment.

Comment on lines +40 to +42
> separately provisioned and managed etcd cluster, and it only partially reduces
> API server pressure because endpoint slices and node objects continue to be
> served through CRD watches. A solution that offloads all agent CRD watches
A member commented:

I don't follow the second part of the sentence (assuming you refer to CiliumEndpointSlices and CiliumNodes). When Cilium runs in KVStore mode, identity, endpoint, and node entries are all watched from etcd rather than the Kubernetes API Server. There are certain limitations and feature incompatibilities, given that the overall trend in recent years has been to focus more on CRD mode, but those are orthogonal aspects that can be resolved separately.

Comment on lines +51 to +52
> * Utilize ClusterMesh etcd as an alternative datastore, avoiding the need to
>   provision and maintain a separate external KVStore.
A member commented:

To me, a key prerequisite for adding support for a new datastore is a unified abstraction (likely based on statedb) that decouples the downstream consumers from the actual ingestion logic, which I don't see discussed in this CFP.

Currently, the code paths for ingesting data in CRD and KVStore mode (as well as from clustermesh) are diverse and fairly independent, mostly for historical reasons. However, that has proved to be a constant pain point, leading to feature incompatibilities, caveats that only apply to one method or the other (e.g., around connection disruption on agent restart), and overall complexity in general.

I don't think it is viable to introduce yet another datastore without carefully rethinking the supporting architecture, as it would lead to even more feature incompatibilities and complexity.

Comment on lines +65 to +66
> To reduce Kubernetes API server load, we propose utilizing the existing
> ClusterMesh etcd as a non-persistent, cache-backed datastore for agent
A member commented:

From a deployment perspective, I think it would be appropriate to keep the proposed datastore separate from the existing clustermesh-apiserver for better isolation, both in terms of permissions and to prevent possible interference in case, e.g., certain replicas end up overloaded due to either in-cluster or cross-cluster data synchronization. Potential failures also have very different consequences and implications in the two cases. The underlying binary may be reused if appropriate, though.

Comment on lines +101 to +109
> 1. **Pod networking mode (recommended)**: The ClusterMesh API Server runs as
>    a regular pod (no host networking) and is exposed via a Kubernetes
>    Service. Agents attempt to connect to ClusterMesh etcd at startup; if the
>    connection fails or is not yet available, agents fall back to reading
>    CRDs directly from the Kubernetes API server. Once the ClusterMesh API
>    Server becomes reachable, agents switch over to reading from ClusterMesh
>    etcd. This mode requires no special scheduling or networking
>    configuration and is operationally identical to a standard Cilium
>    deployment.
A member commented:

As mentioned in Slack, Cilium historically included a handover mechanism similar to the one proposed here, to support KVStore mode with etcd running in pod network. This turned out to be a never-ending source of bugs and inconsistencies, because the two sources are not perfectly aligned at any point in time. Additionally, it also leads to unnecessary overhead on the Kubernetes API Server if the informers are started only to be terminated immediately afterwards, once successfully connected to etcd. That support got ripped out a few versions ago, and IMO we really need to stick to a single source of truth, either K8s or etcd, and not do handovers in case of failures or alike.

Comment on lines +115 to +119
> directly to those IPs. This avoids depending on Service IP datapath
> programming (which Cilium would need to provide) but
> introduces host port conflicts and requires agents to maintain a
> lightweight watch on the ClusterMesh API Server pods for IP discovery
> and failover when pods reschedule to different nodes.
A member commented:

We already bypass datapath-level service load balancing when connecting to KVStoreMesh, performing service DNS name resolution and backend selection via a custom dialer, to break possible chicken-and-egg dependencies. The same logic may be potentially adopted here, without having to watch for the pods themselves.
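The dialer approach could look roughly like the sketch below. `resolveFunc` and `pickBackend` are hypothetical names (not the actual Cilium dialer code), and a stub resolver stands in for the real DNS resolution of the Service name; the point is only that backend selection happens in the client, bypassing datapath-level Service load balancing:

```go
package main

import (
	"fmt"
	"math/rand"
)

// resolveFunc stands in for DNS resolution of the clustermesh-apiserver
// Service name (real code would use net.Resolver against the Service).
type resolveFunc func(host string) ([]string, error)

// pickBackend resolves the service name and selects one backend IP to
// dial directly, so no datapath Service IP translation is needed.
func pickBackend(resolve resolveFunc, host string) (string, error) {
	ips, err := resolve(host)
	if err != nil {
		return "", err
	}
	if len(ips) == 0 {
		return "", fmt.Errorf("no backends for %s", host)
	}
	// Random backend selection; real code could prefer healthy endpoints.
	return ips[rand.Intn(len(ips))], nil
}

func main() {
	// Stub resolver returning two hypothetical backend pod IPs.
	stub := func(string) ([]string, error) {
		return []string{"10.0.1.5", "10.0.2.7"}, nil
	}
	ip, err := pickBackend(stub, "clustermesh-apiserver.kube-system.svc")
	if err != nil {
		panic(err)
	}
	fmt.Println("dialing backend", ip)
}
```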

> | **Namespaces** | 1,000 |
> | **Pods** | 40,000 |
> | **Deployments** | 4,000 |
> | **Pod Deployment Rate** | 100 pods/sec |
A member commented:

Which settings are you using for the clustermesh-apiserver? I assume not the default ones, as the etcd client is configured with 20 QPS, which could not sustain a pod deployment rate of 100/sec.
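A back-of-the-envelope illustration of the mismatch, using the numbers from this comment (the queueing model is deliberately simplistic and assumes one etcd write per pod event):

```go
package main

import "fmt"

// backlogAfter returns how many writes queue up after the given number of
// seconds when events arrive at eventRate/sec but the client is capped at
// qps/sec. A negative difference means the client keeps up, so zero backlog.
func backlogAfter(eventRate, qps, seconds int) int {
	backlog := (eventRate - qps) * seconds
	if backlog < 0 {
		return 0
	}
	return backlog
}

func main() {
	// 100 pods/sec of churn against a 20 QPS etcd client limit:
	fmt.Println(backlogAfter(100, 20, 60), "writes queued after one minute") // 4800
}
```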

> | **API Server Request/Response Latency** | `apiserver_request_duration_seconds_bucket` |
> | **ClusterMesh API Server CPU** | `container_cpu_usage_seconds_total` |
> | **ClusterMesh API Server Memory** | `container_memory_usage_bytes` |
> | **Pod Startup Latency** | `kubelet_pod_start_sli_duration_seconds_bucket` |
A member commented:

How did you measure pod readiness? Time before the pod is marked as ready, or time before another pod is able to successfully connect to the new pod, subject to network policies (i.e., all information propagated to the other agents)? The former may be misleading, if the hosting agent does not require any information to be able to start the pod.

