Conversation
Signed-off-by: Vipul Singh <singhvipul@microsoft.com>
Hi, thanks for the proposal! I skimmed through this CFP and left a few comments; this is more of a quick pass than a full review.
Overall, I think this proposal mostly focuses on the why and seems to be missing some key details on the how. For example, would you use the pkg/clustermesh code and treat the local cluster almost like any other remote cluster? Something else? There's also the Service aspect: you didn't say explicitly whether this mode uses data from Kubernetes or from the clustermesh-apiserver. EDIT: never mind, this is discussed below.
> the same cluster whose agents depend on it creates a bootstrap problem:
> agents need ClusterMesh etcd for control plane state, but the ClusterMesh
> API Server itself may need functioning agents for pod networking or Service
> IP routing. To address this, we propose two options:
Unless there is a clear advantage to having different modes, I would suggest supporting only one. There are already many options and combinations within clustermesh/kvstore/kvstoremesh, and adding more options and ways of operating sounds difficult to maintain and keep in mind in the long run. So IMO we should have a very clear reason to support multiple options here if we want to go forward with that.
> * **Multi-cluster integration**: Ensure the single-cluster KVStore mode
>   composes cleanly with existing multi-cluster ClusterMesh deployments,
>   allowing clusters that use this optimization to also participate in
>   cross-cluster service discovery and identity sharing.
I would tend to think that this should be fairly straightforward; any reason to leave it for future milestones?
> * **Centralized control plane operations**: Move compute-intensive control
>   plane activities such as network policy calculation and ipcache computation
>   into the ClusterMesh API Server, reducing per-agent CPU overhead and
>   enabling consistent, cluster-wide policy evaluation. (Currently a non-goal
>   for this CFP, but can be a natural evolution of the architecture.)
Is that desirable? The clustermesh-apiserver is treated as a cache here, so each clustermesh-apiserver would need to compute the same thing independently (and those computations would have to be consistent enough), or we would also have to use the kube-apiserver to perform it, which would be similar to the operator 🤔
Agree with Arthur that this looks out of scope here, and likely problematic with the cache-based approach.
> * **Direct writes to ClusterMesh etcd**: Eliminate the duplication overhead
>   where every CRD mutation traverses both the Kubernetes API server and
>   ClusterMesh etcd. Agents and the operator would write Cilium state directly
>   to the ClusterMesh etcd, removing the extra API server round-trip and
>   reducing end-to-end propagation latency. (Currently a non-goal for this
>   CFP, but can be an evolution of this work.)
Marco, Tamilmani and I were discussing on Slack that this just looks like the kvstore mode, and that due to the ephemeral nature of ClusterMesh etcd it would not really work unless you make it persistent, which would make it almost exactly the kvstore mode. So I don't think it's a desirable milestone, and I would suggest removing it, or possibly changing it to, as Marco was saying on Slack, improving the existing kvstore mode.
Agree with Arthur that this would be an improvement of the existing KVStore mode, and not compatible with the approach of using ephemeral etcd instances.
> | Parameter | Value |
> | --- | --- |
> | **Namespaces** | 1,000 |
> | **Pods** | 40,000 |
> | **Deployments** | 4,000 |
> | **Pod Deployment Rate** | 100 pods/sec |
Does that deploy some Services as well? If not, it would be interesting to include them in this benchmark to better match reality and reflect which part of the load we are offloading to the clustermesh-apiserver.
> as they do today. Additionally, Kubernetes-native resources such as Services,
> Endpoints, Pods, and Nodes are not covered by this proposal; agents continue
> to watch these directly from the API server.
Could you add some details on the reasoning behind this part?
> *(charts: clustermesh-apiserver CPU and memory usage during the test)*
> ClusterMesh API server containers (i.e. apiserver and etcd) CPU and memory usage remained stable during the test, with no significant spikes during workload churn.
It might be nice to have some quantile numbers or a graph of memory/CPU usage. For instance, it is said here that the clustermesh-apiserver etcd was stable in terms of CPU; does that mean it was constantly running at 3 CPUs (the max on the chart provided)? A nice way (if actually possible) to show this would be a graph of CPU usage in the two modes, for both the k8s apiserver and the clustermesh-apiserver, on the same graph, and likewise for memory.
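For what it's worth, one hedged way to extract such quantile numbers from the metrics already listed in the CFP could be PromQL along these lines (the `container` label selector and windows are assumptions; adjust to the actual deployment):

```promql
# p95 of the 5m-averaged CPU usage of the clustermesh-apiserver etcd
# container over a 1h test window (label selector is an assumption)
quantile_over_time(0.95,
  rate(container_cpu_usage_seconds_total{container="etcd"}[5m])[1h:5m])

# p95 memory usage over the same window
quantile_over_time(0.95,
  container_memory_usage_bytes{container="etcd"}[1h:5m])
```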
@MrFreezeex Thanks for the review. Yes, the implementation reuses
**giorio94** left a comment
Thanks for the proposal!
A few initial high-level comments from my side as well.
> @@ -0,0 +1,315 @@
> # CFP-44774: Cilium Control Plane at Scale
>
> **SIG: SIG Scalability, SIG-ClusterMesh**
It makes sense IMO to involve the Scalability and ClusterMesh SIGs for the initial design discussion, but I don't think it is sustainable to extend the scope of these SIGs to also oversee the new logic possibly resulting from this CFP. Both because it falls outside their main scope (it does not pertain to multi-cluster, and SIG Scalability should oversee scalability-related topics but not end up owning every improvement that happens to target scalability), and because both teams are seriously understaffed at the moment.
> separately provisioned and managed etcd cluster, and it only partially reduces
> API server pressure because endpoint slices and node objects continue to be
> served through CRD watches. A solution that offloads all agent CRD watches
I don't follow the second part of the sentence (assuming you refer to CiliumEndpointSlice and CiliumNodes). When Cilium runs in KVStore mode, identity, endpoint and node entries are all watched from etcd, rather than the Kubernetes API Server. There are certain limitations and feature incompatibilities given that the overall trend in recent years has been to focus more on CRD mode, but these are orthogonal aspects that can be resolved separately.
> * Utilize ClusterMesh etcd as an alternative datastore, avoiding the need to
>   provision and maintain a separate external KVStore.
To me, a key prerequisite for adding support for a new datastore is a unified abstraction (likely based on statedb) that decouples the downstream consumers from the actual ingestion logic, which I don't see discussed in this CFP.
Currently, the code paths for ingesting data in CRD and KVStore mode (as well as from clustermesh) are diverse and fairly independent, mostly for historical reasons. However, that has proven to be a constant pain point, leading to feature incompatibilities, caveats that apply only to one method or the other (e.g., around connection disruption on agent restart), and overall complexity in general.
I don't think it is viable to introduce yet another datastore without carefully rethinking the supporting architecture, as it would lead to even more feature incompatibilities and complexity.
> To reduce Kubernetes API server load, we propose utilizing the existing
> ClusterMesh etcd as a non-persistent, cache-backed datastore for agent
From a deployment perspective, I think it would be appropriate to keep the proposed datastore separate from the existing clustermesh-apiserver, for better isolation both in terms of permissions and to prevent possible interference if, e.g., certain replicas end up overloaded by either in-cluster or cross-cluster data synchronization. Potential failures also have very different consequences and implications in the two cases. The underlying binary may be reused if appropriate, though.
> 1. **Pod networking mode (recommended)**: The ClusterMesh API Server runs as
>    a regular pod (no host networking) and is exposed via a Kubernetes
>    Service. Agents attempt to connect to ClusterMesh etcd at startup; if the
>    connection fails or is not yet available, agents fall back to reading
>    CRDs directly from the Kubernetes API server. Once the ClusterMesh API
>    Server becomes reachable, agents switch over to reading from ClusterMesh
>    etcd. This mode requires no special scheduling or networking
>    configuration and is operationally identical to a standard Cilium
>    deployment.
As mentioned in Slack, Cilium historically included a handover mechanism similar to the one proposed here, to support KVStore mode with etcd running in pod network. This turned out to be a never-ending source of bugs and inconsistencies, because the two sources are not perfectly aligned at any point in time. Additionally, it leads to unnecessary overhead on the Kubernetes API Server if the informers are started only to be terminated immediately after successfully connecting to etcd. That support got ripped out a few versions ago, and IMO we really need to stick to a single source of truth, either K8s or etcd, and not do handovers in case of failures or the like.
> directly to those IPs. This avoids depending on Service IP datapath
> programming (which Cilium would need to provide) but
> introduces host port conflicts and requires agents to maintain a
> lightweight watch on the ClusterMesh API Server pods for IP discovery
> and failover when pods reschedule to different nodes.
We already bypass datapath-level service load balancing when connecting to KVStoreMesh, performing Service DNS name resolution and backend selection via a custom dialer, to break possible chicken-and-egg dependencies. The same logic could potentially be adopted here, without having to watch the pods themselves.
> | Parameter | Value |
> | --- | --- |
> | **Namespaces** | 1,000 |
> | **Pods** | 40,000 |
> | **Deployments** | 4,000 |
> | **Pod Deployment Rate** | 100 pods/sec |
Which settings are you using for the clustermesh-apiserver? I assume not the default ones, as the etcd client is configured with 20 QPS, which could not sustain a pod deployment rate of 100/sec.
> | Signal | Prometheus metric |
> | --- | --- |
> | **API Server Request/Response Latency** | `apiserver_request_duration_seconds_bucket` |
> | **ClusterMesh API Server CPU** | `container_cpu_usage_seconds_total` |
> | **ClusterMesh API Server Memory** | `container_memory_usage_bytes` |
> | **Pod Startup Latency** | `kubelet_pod_start_sli_duration_seconds_bucket` |
How did you measure pod readiness? Time before the pod is marked as ready, or time before another pod is able to successfully connect to the new pod, subject to network policies (i.e., all information propagated to the other agents)? The former may be misleading, if the hosting agent does not require any information to be able to start the pod.
> * **Direct writes to ClusterMesh etcd**: Eliminate the duplication overhead
>   where every CRD mutation traverses both the Kubernetes API server and
>   ClusterMesh etcd. Agents and the operator would write Cilium state directly
>   to the ClusterMesh etcd, removing the extra API server round-trip and
>   reducing end-to-end propagation latency. (Currently a non-goal for this
>   CFP, but can be an evolution of this work.)
Agree with Arthur that this would be an improvement of the existing KVStore mode, and not compatible with the approach of using ephemeral etcd instances.
> * **Centralized control plane operations**: Move compute-intensive control
>   plane activities such as network policy calculation and ipcache computation
>   into the ClusterMesh API Server, reducing per-agent CPU overhead and
>   enabling consistent, cluster-wide policy evaluation. (Currently a non-goal
>   for this CFP, but can be a natural evolution of the architecture.)
Agree with Arthur that this looks out of scope here, and likely problematic with the cache-based approach.
This CFP proposes reusing the existing ClusterMesh etcd as a non-persistent, cache-backed data distribution layer so agents read Cilium CRDs from etcd instead of the API server, reducing API server load at scale.
Issue: cilium/cilium#44774