[Feature] Add Network Policy and mTLS via Cert Manager Support to RayCluster #4566
kryanbeane wants to merge 1 commit into ray-project:master
Conversation
Force-pushed c6da256 to 25f359c, then 25f359c to 4a63057.
Hi @kryanbeane,

@machichima yes I am working on them as we speak!
andrewsykim left a comment:
This PR does not reflect the API proposed in this doc: https://docs.google.com/document/d/1z6safIXUgs7ZaUG-Vt9cFLQ_nryjq3ktAdXUoNRh4G4/edit?usp=sharing
Please update it to reflect the changes there.
seanlaii left a comment:

Thanks for the great work on this PR!
@@ -50,14 +50,6 @@ rules:
- ""
resources:
- secrets
The operator has get/list/delete on all secrets, with no distinction for the mTLS CA secret, which contains the CA private key.
Maybe we should document that namespaces with mTLS RayClusters should restrict who has secret read access, or add labels to CA/head/worker secrets for future policy-based protection?
Unfortunately I think this will be required because we want to support BYOC, which can reference arbitrary secrets.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e536b72a4c
Pull request overview
This PR adds optional per-RayCluster security controls to KubeRay by introducing NetworkPolicy-based network isolation and mTLS support (cert-manager auto-generation or BYOC), along with controllers, API/CRD updates, samples, and tests.
Changes:
- Extend RayCluster (and embedded RayClusterSpec usage in RayJob/RayService/RayCronJob) with networkIsolation, enableMTLS, and mTLSOptions.
- Add new controllers: NetworkPolicyController and RayClusterMTLSController, plus pod-template injection for TLS env/volume.
- Add/expand unit + integration/e2e tests and update RBAC, Helm CRDs, and docs/samples.
Reviewed changes
Copilot reviewed 43 out of 45 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| ray-operator/test/e2e/raycluster_tls_test.go | New e2e coverage for mTLS auto-gen/BYOC and pod-level TLS wiring. |
| ray-operator/pkg/client/applyconfiguration/utils.go | Register applyconfig types for new API structs. |
| ray-operator/pkg/client/applyconfiguration/ray/v1/rayclusterspec.go | Applyconfig updates for new RayClusterSpec fields and builders. |
| ray-operator/pkg/client/applyconfiguration/ray/v1/networkisolationconfig.go | New generated applyconfig type for NetworkIsolationConfig. |
| ray-operator/pkg/client/applyconfiguration/ray/v1/mtlsoptions.go | New generated applyconfig type for MTLSOptions. |
| ray-operator/main.go | Wire cert-manager types into scheme and register new controllers. |
| ray-operator/go.mod | Add cert-manager dependency; bump indirect deps. |
| ray-operator/go.sum | Dependency lock updates for cert-manager + indirect bumps. |
| ray-operator/controllers/ray/utils/validation_test.go | Add validation tests for NetworkIsolation constraints. |
| ray-operator/controllers/ray/utils/validation.go | Validate NetworkIsolation + mTLS option consistency. |
| ray-operator/controllers/ray/utils/util.go | Add helpers for mTLS enablement and TLS secret naming. |
| ray-operator/controllers/ray/utils/constant.go | Add TLS env/volume constants + new event types. |
| ray-operator/controllers/ray/suite_test.go | Register networking API scheme + set up NetworkPolicy controller in envtest. |
| ray-operator/controllers/ray/raycluster_mtls_controller.go | New controller to reconcile cert-manager Issuers/Certificates + secret cleanup finalizer. |
| ray-operator/controllers/ray/raycluster_mtls_controller_test.go | Unit tests for mTLS controller behaviors (auto-gen, BYOC, deletion, SAN updates). |
| ray-operator/controllers/ray/raycluster_controller.go | Block pod creation on TLS secret readiness; detect and recreate pods lacking mTLS config. |
| ray-operator/controllers/ray/raycluster_controller_unit_test.go | Add cert-manager types to schemes used in unit tests. |
| ray-operator/controllers/ray/networkpolicy_controller.go | New controller to create/update/delete head/worker NetworkPolicies from RayCluster spec. |
| ray-operator/controllers/ray/networkpolicy_controller_unit_test.go | Unit tests for NetworkPolicy rule construction and naming. |
| ray-operator/controllers/ray/networkpolicy_controller_test.go | Integration tests verifying NetworkPolicy reconciliation in envtest. |
| ray-operator/controllers/ray/common/pod.go | Inject TLS env/volume into head/worker/init/autoscaler containers when mTLS enabled. |
| ray-operator/controllers/ray/common/pod_test.go | Unit tests for TLS injection behavior and autoscaler/init-container coverage. |
| ray-operator/config/samples/ray-cluster.network-isolation-monitoring.yaml | Sample for network isolation with monitoring scrape allowlist. |
| ray-operator/config/samples/ray-cluster.network-isolation-deny-ingress.yaml | Sample for deny-ingress isolation mode. |
| ray-operator/config/samples/ray-cluster.network-isolation-deny-all.yaml | Sample for deny-all isolation mode. |
| ray-operator/config/samples/ray-cluster.network-isolation-custom-rules.yaml | Sample for custom ingress rules. |
| ray-operator/config/samples/ray-cluster.network-isolation-custom-ports.yaml | Sample demonstrating auto-resolved dashboard/client ports. |
| ray-operator/config/samples/ray-cluster.network-isolation-complex-rules.yaml | Sample with multi-namespace/selector ingress rules. |
| ray-operator/config/samples/ray-cluster.mtls.yaml | Sample enabling cert-manager-backed mTLS. |
| ray-operator/config/samples/ray-cluster.mtls-byoc.yaml | Sample enabling BYOC mTLS using a user-provided secret. |
| ray-operator/config/rbac/role.yaml | Expand RBAC for cert-manager resources + networkpolicies and secret delete. |
| ray-operator/config/crd/bases/ray.io_rayclusters.yaml | CRD schema update for networkIsolation + mTLS fields. |
| ray-operator/config/crd/bases/ray.io_rayjobs.yaml | CRD schema update for embedded RayClusterSpec fields. |
| ray-operator/config/crd/bases/ray.io_rayservices.yaml | CRD schema update for embedded RayClusterSpec fields. |
| ray-operator/config/crd/bases/ray.io_raycronjobs.yaml | CRD schema update for embedded RayClusterSpec fields. |
| ray-operator/apis/ray/v1/raycluster_types.go | API types: NetworkIsolationConfig, MTLSOptions, and new RayClusterSpec fields. |
| ray-operator/apis/ray/v1/zz_generated.deepcopy.go | Generated deepcopy updates for new API fields/types. |
| helm-chart/kuberay-operator/templates/_helpers.tpl | Helm RBAC template updates for cert-manager + networkpolicy permissions. |
| helm-chart/kuberay-operator/crds/ray.io_rayclusters.yaml | Helm CRD bundle update for new fields. |
| helm-chart/kuberay-operator/crds/ray.io_rayjobs.yaml | Helm CRD bundle update for new fields. |
| helm-chart/kuberay-operator/crds/ray.io_rayservices.yaml | Helm CRD bundle update for new fields. |
| helm-chart/kuberay-operator/crds/ray.io_raycronjobs.yaml | Helm CRD bundle update for new fields. |
| docs/reference/api.md | API reference docs for NetworkIsolationConfig + MTLSOptions + RayClusterSpec fields. |
| go.mod | Root module indirect dependency bumps (otel/x/sys). |
| go.sum | Root module dependency lock updates. |
Force-pushed dcdb4eb to 5841cc8, then 5841cc8 to 47463cc.
// Check if cert-manager API is available before registering the mTLS controller.
// If cert-manager is not installed, the controller's Certificate/Issuer cache would
// never sync and the manager would fail to start (e.g. in E2E environments without cert-manager).
certManagerAvailable := certManagerAPIAvailable(restConfig)
The discovery API check is nice, but I'm wondering if accessing cert-manager APIs should still be gated by a flag in KubeRay that is also plumbed into the helm charts, i.e. this feature requires an opt-in when running helm install kuberay-operator instead of automatically working if cert-manager is available.
Are there any risks to auto-enabling cert-manager, or any reason this shouldn't be on by default?
Setting tlsOptions without cert-manager installed would leave the cluster stuck indefinitely waiting for secrets. I've added a guard that rejects it with a clear warning event in 4ea99a5. BYOC mode (via certificateSecretName) still works without cert-manager, since the user manages their own certs. Re the helm charts: k8s ignores RBAC rules for API groups that don't exist in the cluster AFAIK, so the cert-manager rules should be fine to ship unconditionally. Happy to add a feature gate if you'd still prefer an explicit opt-in.
Yeah not worried about RBAC breaking and I don't think we need a feature gate, I was thinking of a config or flag to toggle it. I'm wondering if a cluster admin would want some sort of opt-in to allow KubeRay to start managing cert-manager Certificates. I'm thinking that as long as Ray clusters do not touch cert-manager certificates by default it would be fine, but I'm curious if others have thoughts on this.
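The opt-in being discussed could be reduced to a simple conjunction of an admin flag and the discovery check. A minimal sketch, assuming a hypothetical `--enable-cert-manager-integration` flag (this flag name and wiring are illustrative, not part of this PR):

```go
package main

import (
	"flag"
	"fmt"
)

// shouldRegisterMTLSController sketches the opt-in discussed above: the
// cert-manager-backed controller is registered only when the cluster admin
// explicitly enabled it AND the cert-manager API group was discovered.
func shouldRegisterMTLSController(optIn, certManagerAvailable bool) bool {
	return optIn && certManagerAvailable
}

func main() {
	// Hypothetical operator flag; default off so KubeRay never touches
	// cert-manager Certificates unless an admin opts in.
	optIn := flag.Bool("enable-cert-manager-integration", false,
		"allow KubeRay to manage cert-manager Issuers/Certificates (hypothetical)")
	flag.Parse()
	fmt.Println(shouldRegisterMTLSController(*optIn, true))
}
```

With this shape, a cluster without the flag behaves exactly as today even if cert-manager happens to be installed.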
// alpha: v1.6
//
// Enables NetworkPolicy-based network isolation and mTLS for RayClusters.
EnhancedSecurityPrimitives featuregate.Feature = "EnhancedSecurityPrimitives"
I think we should treat mTLS and network policy as two distinct features, which probably requires separate feature gates.
However, given this is likely going to miss v1.6, we can re-evaluate for v1.7 if the feature gates are even required
Thanks Andrew, I'd imagine just aiming for 1.7 and dropping the feature gate is probably the best approach for us going forward. WDYT?
I am testing with the following YAML locally, and I keep getting warnings like "Failed to connect to GCS at address 10.244.0.41:6379" in the head pod, and the head pod cannot get ready. I printed out the events:

107s Normal MTLSPKIReady raycluster/rayjob-mtls-origin-4ffkg mTLS PKI is ready: CA, head, and worker certificates issued
76s Normal CreatedHeadPod raycluster/rayjob-mtls-origin-4ffkg Created head Pod default/rayjob-mtls-origin-4ffkg-head-rckkn
75s Normal Started pod/rayjob-mtls-origin-4ffkg-head-rckkn Started container ray-head
75s Normal Issuing certificate/ray-head-cert-rayjob-mtls-origin-4ffkg Fields on existing CertificateRequest resource not up to date: [spec.ipAddresses]
75s Normal Requested certificate/ray-head-cert-rayjob-mtls-origin-4ffkg Created new CertificateRequest resource "ray-head-cert-rayjob-mtls-origin-4ffkg-5"
74s Normal CertificateIssued certificaterequest/ray-head-cert-rayjob-mtls-origin-4ffkg-5 Certificate fetched from issuer successfully
73s Normal Issuing certificate/ray-head-cert-rayjob-mtls-origin-4ffkg The certificate has been successfully issued
35s Warning Unhealthy pod/rayjob-mtls-origin-4ffkg-head-rckkn Liveness probe failed:
5s Warning Unhealthy pod/rayjob-mtls-origin-4ffkg-head-rckkn Readiness probe failed:

The events suggest the following order. Based on this, I think the issue is that the initial head certificate may not yet include the final pod IPs. After the head pod is created, KubeRay updates the certificate's IP SANs and cert-manager re-issues it. Since Ray does not hot-reload TLS material, the head process may continue using the earlier certificate state and fail to connect to GCS over the pod IP.

Test YAML I used:

apiVersion: ray.io/v1
kind: RayJob
metadata:
name: rayjob-mtls-verification
spec:
entrypoint: python /home/ray/samples/worker_to_worker_tls_test.py
shutdownAfterJobFinishes: true
ttlSecondsAfterFinished: 60
rayClusterSpec:
rayVersion: '2.52.0'
networkIsolation:
mode: denyAllIngress
tlsOptions: {}
headGroupSpec:
rayStartParams:
dashboard-host: "0.0.0.0"
template:
spec:
containers:
- name: ray-head
image: rayproject/ray:2.52.0
resources:
limits:
cpu: "1"
memory: "2Gi"
requests:
cpu: "500m"
memory: "1Gi"
volumeMounts:
- mountPath: /home/ray/samples
name: tls-test-code
volumes:
- name: tls-test-code
configMap:
name: rayjob-mtls-verification-code
items:
- key: worker_to_worker_tls_test.py
path: worker_to_worker_tls_test.py
workerGroupSpecs:
- replicas: 2
minReplicas: 2
maxReplicas: 2
groupName: small-group
rayStartParams: {}
template:
spec:
containers:
- name: ray-worker
image: rayproject/ray:2.52.0
resources:
limits:
cpu: "1"
memory: "1Gi"
requests:
cpu: "500m"
memory: "1Gi"
---
apiVersion: v1
kind: ConfigMap
metadata:
name: rayjob-mtls-verification-code
data:
worker_to_worker_tls_test.py: |
"""
Verify head-to-worker and worker-to-worker communication across distinct Ray pods.
Usage:
python worker_to_worker_tls_test.py
"""
import os
import socket
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy
ctx = ray.init()
print("Dashboard:", ctx.dashboard_url)
def pin_to(node_id: str) -> NodeAffinitySchedulingStrategy:
return NodeAffinitySchedulingStrategy(node_id=node_id, soft=False)
def alive_node_ids():
return [n["NodeID"] for n in ray.nodes() if n["Alive"]]
driver = {
"node_id": ray.get_runtime_context().get_node_id(),
"hostname": socket.gethostname(),
"pid": os.getpid(),
}
print("Driver:", driver)
all_nodes = alive_node_ids()
worker_nodes = [nid for nid in all_nodes if nid != driver["node_id"]]
if len(worker_nodes) < 2:
raise RuntimeError(
f"Need at least 2 worker nodes/pods besides the driver node. "
f"driver={driver['node_id']} all_nodes={all_nodes}"
)
worker_a, worker_b = worker_nodes[:2]
print(f"Using worker_a={worker_a}")
print(f"Using worker_b={worker_b}")
# --- Driver/Head <-> Worker: force task onto worker_a ---
@ray.remote
def where_am_i(tag: str):
return {
"tag": tag,
"node_id": ray.get_runtime_context().get_node_id(),
"hostname": socket.gethostname(),
"pid": os.getpid(),
}
head_to_worker = ray.get(
where_am_i.options(scheduling_strategy=pin_to(worker_a)).remote("head-to-worker")
)
assert head_to_worker["node_id"] == worker_a
assert head_to_worker["node_id"] != driver["node_id"]
print("Head -> Worker -> Head:", head_to_worker)
print("Same node as driver?", head_to_worker["node_id"] == driver["node_id"])
print("Pinned to worker_a?", head_to_worker["node_id"] == worker_a)
# --- Worker -> Worker: chained tasks forced across worker_a -> worker_b ---
@ray.remote
def double(x):
return {
"value": x * 2,
"node_id": ray.get_runtime_context().get_node_id(),
"hostname": socket.gethostname(),
"pid": os.getpid(),
}
@ray.remote
def chain(x, target_node_id):
here = {
"node_id": ray.get_runtime_context().get_node_id(),
"hostname": socket.gethostname(),
"pid": os.getpid(),
}
out = ray.get(
double.options(scheduling_strategy=pin_to(target_node_id)).remote(x)
)
return {
"chain_node": here,
"double_node": out,
}
chain_result = ray.get(
chain.options(scheduling_strategy=pin_to(worker_a)).remote(21, worker_b)
)
assert chain_result["chain_node"]["node_id"] == worker_a
assert chain_result["double_node"]["node_id"] == worker_b
assert chain_result["chain_node"]["node_id"] != chain_result["double_node"]["node_id"]
print("Worker -> Worker chained:", chain_result)
print("chain() on worker_a?", chain_result["chain_node"]["node_id"] == worker_a)
print("double() on worker_b?", chain_result["double_node"]["node_id"] == worker_b)
print(
"chain() and double() same node?",
chain_result["chain_node"]["node_id"] == chain_result["double_node"]["node_id"],
)
# --- Worker <-> Worker: actor-to-actor bidirectional across worker_a <-> worker_b ---
@ray.remote
class Ping:
def __init__(self, name):
self.name = name
self.node_id = ray.get_runtime_context().get_node_id()
self.hostname = socket.gethostname()
self.pid = os.getpid()
def info(self):
return {
"name": self.name,
"node_id": self.node_id,
"hostname": self.hostname,
"pid": self.pid,
}
def greet(self, msg, sender):
return {
"receiver": self.name,
"receiver_node": self.node_id,
"receiver_host": self.hostname,
"receiver_pid": self.pid,
"msg": msg,
"sender": sender,
}
def send(self, other, msg):
sender = self.info()
response = ray.get(other.greet.remote(msg, self.name))
return {
"sender": sender,
"response": response,
}
a = Ping.options(scheduling_strategy=pin_to(worker_a)).remote("A")
b = Ping.options(scheduling_strategy=pin_to(worker_b)).remote("B")
a_info, b_info = ray.get([a.info.remote(), b.info.remote()])
assert a_info["node_id"] == worker_a
assert b_info["node_id"] == worker_b
assert a_info["node_id"] != b_info["node_id"]
print("Actor A:", a_info)
print("Actor B:", b_info)
print("A pinned to worker_a?", a_info["node_id"] == worker_a)
print("B pinned to worker_b?", b_info["node_id"] == worker_b)
print("A and B same node?", a_info["node_id"] == b_info["node_id"])
ab = ray.get(a.send.remote(b, "hello"))
ba = ray.get(b.send.remote(a, "world"))
assert ab["sender"]["node_id"] != ab["response"]["receiver_node"]
assert ba["sender"]["node_id"] != ba["response"]["receiver_node"]
print("A->B:", ab)
print("B->A:", ba)
print(
"A send() and B greet() same node?",
ab["sender"]["node_id"] == ab["response"]["receiver_node"],
)
print(
"B send() and A greet() same node?",
ba["sender"]["node_id"] == ba["response"]["receiver_node"],
)
print("ALL OK")
Yeah, I've run into the same issue and your assessment is accurate. I think when I had been running this, the liveness probe identified the failing pod and restarted it, so the issue wasn't as apparent; it was intermittent. This becomes a bit of a chicken-and-egg situation. I'll see if I can find a solution and report back!
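The timing issue machichima describes comes down to which SANs are knowable before the head pod exists. A sketch of that split, assuming the `<cluster>-head-svc` service naming convention that appears later in this thread (illustration only, not the PR's controller code):

```go
package main

import (
	"fmt"
	"sort"
)

// desiredHeadSANs splits the head certificate's SANs into what is derivable
// from the cluster name alone (DNS SANs, known before any pod exists) and
// what is only known after the head pod starts (its pod IP). The late pod IP
// is what forces the cert-manager re-issue seen in the events above.
func desiredHeadSANs(rayClusterName, namespace string, podIPs []string) (dnsNames, ipAddresses []string) {
	dnsNames = []string{
		// Head service FQDN, following the <cluster>-head-svc convention.
		fmt.Sprintf("%s-head-svc.%s.svc.cluster.local", rayClusterName, namespace),
		"localhost",
	}
	ipAddresses = append([]string{"127.0.0.1"}, podIPs...)
	sort.Strings(dnsNames)
	sort.Strings(ipAddresses)
	return dnsNames, ipAddresses
}

func main() {
	// Before the head pod exists: DNS SANs are complete, IP SANs are not.
	dns, ips := desiredHeadSANs("rayjob-mtls-origin-4ffkg", "default", nil)
	fmt.Println(dns)
	fmt.Println(ips)
	// After the pod starts, its IP joins the set, triggering a cert update.
	_, ips = desiredHeadSANs("rayjob-mtls-origin-4ffkg", "default", []string{"10.244.0.41"})
	fmt.Println(ips)
}
```

This is why anchoring GCS connections on the service FQDN (a SAN present from the first issuance) sidesteps the re-issue race, while pod-IP-based connections cannot.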
if rayNodeType == rayv1.HeadNode {
	rayContainer := &podTemplate.Spec.Containers[utils.RayContainerIndex]
	if !utils.EnvVarExists("MY_POD_IP", rayContainer.Env) {
		rayContainer.Env = append(rayContainer.Env, corev1.EnvVar{
			Name: "MY_POD_IP",
			ValueFrom: &corev1.EnvVarSource{
				FieldRef: &corev1.ObjectFieldSelector{FieldPath: "status.podIP"},
			},
		})
	}
}
I'm wondering why we need to set MY_POD_IP here? Is this used anywhere?
Hi @chipspeak, I'm thinking a possible fix would be to gate the head container startup with an init container. This would follow the same general pattern as the existing worker pod (kuberay/ray-operator/controllers/ray/common/pod.go, lines 466 to 491 in 05a1a7c).
cc @rueian @andrewsykim @Future-Outlier to check if this is a good approach.
Thanks @machichima! We'd been trying to avoid init containers, largely because of the performance hit. I've checked our midstream and we'd solved it there via a setting; our midstream PR might give some indication of what this will look like.
// when mTLS is enabled: certs are created before pods exist, so the initial cert only
// contains DNS SANs and 127.0.0.1. Using the FQDN ensures GCS connections match a
// DNS SAN that is present from the start.
if _, ok := rayStartParams["node-ip-address"]; !ok && fqdnRayIP != "" {
If the user sets node-ip-address in rayStartParams, they may still hit the issue. We will need to add this to the documentation.
Absolutely. I think there'll be a lot of this to be documented in any case!
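For reference, the defaulting in the quoted snippet can be sketched as a standalone helper (the function name and standalone form are illustrative; the logic mirrors the quoted condition):

```go
package main

import "fmt"

// defaultNodeIPAddress points Ray's node-ip-address at the head service FQDN
// only when the user has not set it themselves, so GCS connections match a
// DNS SAN present in the initial certificate. A user-supplied value wins,
// which is exactly the caveat the review asks to document.
func defaultNodeIPAddress(rayStartParams map[string]string, fqdnRayIP string) {
	if _, ok := rayStartParams["node-ip-address"]; !ok && fqdnRayIP != "" {
		rayStartParams["node-ip-address"] = fqdnRayIP
	}
}

func main() {
	params := map[string]string{}
	defaultNodeIPAddress(params, "demo-head-svc.default.svc.cluster.local")
	fmt.Println(params["node-ip-address"]) // FQDN applied

	userSet := map[string]string{"node-ip-address": "10.0.0.5"}
	defaultNodeIPAddress(userSet, "demo-head-svc.default.svc.cluster.local")
	fmt.Println(userSet["node-ip-address"]) // user value preserved
}
```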
docs/reference/api.md (outdated)
| `enableInTreeAutoscaling` _boolean_ | EnableInTreeAutoscaling indicates whether operator should create in tree autoscaling configs | | |
| `gcsFaultToleranceOptions` _[GcsFaultToleranceOptions](#gcsfaulttoleranceoptions)_ | GcsFaultToleranceOptions for enabling GCS FT | | |
| `networkIsolation` _[NetworkIsolationConfig](#networkisolationconfig)_ | NetworkIsolation specifies optional configuration for network isolation.<br />When set, NetworkPolicies will be created to control traffic to/from Ray pods.<br />The reconciler always ensures intra-cluster and KubeRay operator communication is permitted. | | |
| `tlsOptions` _[TLSOptions](#tlsoptions)_ | TLSOptions specifies optional TLS encryption settings for the RayCluster.<br />If omitted, TLS is disabled. When set, the mode field controls the<br />security level (defaults to "mTLS" for mutual TLS). | | |
I'm not sure if this has been discussed previously: why does the controller create a self-signed CA certificate per RayCluster instead of using a per-controller self-signed CA to sign the head's and workers' certificates?
Purely a "belt and braces" approach. This way, if the CA were compromised, it would have a fairly small blast radius (just the one RayCluster) rather than the larger scope of a controller-level CA.
headFQDN := ""
if utils.IsTLSEnabled(&instance.Spec) {
	headFQDN = fqdnRayIP
}
podConf := common.DefaultHeadPodTemplate(ctx, instance, instance.Spec.HeadGroupSpec, podName, headPort, headFQDN)
I wonder how a user would apply a RayJob with BYOC. How would the certificate be signed? The RayJob creates the corresponding RayCluster with a hash suffix that is hard to predict, so the name of the head service is also hard to predict.
I tried to overwrite node-ip-address with another, predictable service, but it didn't work: that service is created later.
Or did I miss something?
I think this is already an existing limitation. It's essentially the same constraint that already applies to IP SANs, just for DNS instead.
RE node-ip-address: that's consumed by Ray at startup time. If it's set to a service that doesn't exist, Ray can't resolve the DNS and it will fail. The service will get created by KubeRay afterwards, but that's after the failure. Our solution gets around this by relying on the head service, which is created prior to the head pod's creation, ensuring that by the time the pod comes up the service exists and node-ip-address can be resolved.
@laurafitzgerald Yeah, please reference ray-cluster.mtls-byoc.yaml.
Take the above RayJob rayjob-mtls-verification as an example. It creates the RayCluster rayjob-mtls-verification-pc79r, which is suffixed with a truncated hash. As a result, it also creates services as follows:
rayjob-mtls-verification-head-svc ClusterIP None <none> 10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP 5s
rayjob-mtls-verification-pc79r-head-svc ClusterIP None <none> 10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP 28s
The current solution leverages --node-ip-address=rayjob-mtls-verification-pc79r-head-svc.default.svc.cluster.local to make it work.
In BYOC mode, the DNS name should go in the certificate's SANs. However, the hash suffix of the corresponding RayCluster is unknown before it is created.
Additionally, using rayjob-mtls-verification-head-svc didn't work: as mentioned by @chipspeak, it is created by the RayCluster, which is after the failure.
Hi @chipspeak @laurafitzgerald, sorry if I missed any prior discussion. I was wondering if we could split this large PR into two: one for network isolation and the other for mTLS. Smaller PRs are preferable in OSS, as they are much easier to review than large ones. For this PR, having to consider both network isolation and mTLS at the same time will make the review process much longer and harder to discuss. Thank you!
Want to check whether you have considered the following scenario:
Since the mTLS controller appears to reconcile periodically (30s / 1m), there may be a window where a new worker starts with a certificate whose SANs do not yet include its current Pod IP. Have you tested this case explicitly? If so, could you share how this race is avoided?
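One way the race could be closed is to re-run the SAN comparison on every pod event rather than waiting for the periodic resync. A minimal sketch of that check; `needsReissue` is illustrative, not the PR's actual controller code:

```go
package main

import "fmt"

// needsReissue reports whether the certificate must be updated: if any live
// pod IP is missing from the certificate's current IP SANs, a new worker
// would otherwise come up with an identity the cert does not cover.
// Triggering this on pod create/update events shrinks the race window from
// the resync period (30s / 1m) to a single reconcile.
func needsReissue(certIPSANs, livePodIPs []string) bool {
	covered := make(map[string]bool, len(certIPSANs))
	for _, ip := range certIPSANs {
		covered[ip] = true
	}
	for _, ip := range livePodIPs {
		if !covered[ip] {
			return true
		}
	}
	return false
}

func main() {
	cert := []string{"127.0.0.1", "10.244.0.41"}
	fmt.Println(needsReissue(cert, []string{"10.244.0.41"}))                  // covered: no reissue
	fmt.Println(needsReissue(cert, []string{"10.244.0.41", "10.244.0.77"})) // new worker IP: reissue
}
```

Even with event-driven reissue, the window only fully closes if pods wait for the updated secret, since Ray does not hot-reload TLS material.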
if instance.DeletionTimestamp != nil && !instance.DeletionTimestamp.IsZero() {
	if controllerutil.ContainsFinalizer(instance, utils.MTLSCleanupFinalizer) {
		// Only delete secrets in auto-generate mode; BYOC secrets belong to the user.
		if !utils.IsTLSBYOC(&instance.Spec) {
If a cluster starts in auto-generate mode (secrets created, finalizer added), then the user switches to BYOC via kubectl apply, and later runs kubectl delete raycluster:
would the CA/head/worker secrets be permanently orphaned in the namespace, since IsTLSBYOC is true and deleteTLSSecrets() is skipped?
sort.Strings(desiredDNSNames)
desiredIPAddresses := normalizeIPs(podIPs)

// Known limitation: Ray reads TLS files once at startup and does not hot-reload them.
Should we document this behavior? The known limitation is acknowledged in code comments but does not appear in any docs (docs/reference/api.md). Otherwise users might assume certificates are renewed end-to-end automatically without pod restarts.
Absolutely. I just have it in the comments for now. I wanted to wait until everything was locked in before updating documentation as I wanted to find out what the preference is from you folks in terms of docs.
Hi team. We see a request to split out the PR again; we will take a look at that. We will need to split and rename the feature gate, as there is currently one controlling both features. I'll make a stab at names for those, but they can be changed. On a separate note, and apologies if this was discussed in threads, but I see a lot of discussion around BYOC. In our original proposal we did not consider BYOC as being in scope for this change, as the idea is that a user can switch this on and have all the work done for them. So I have a few questions about BYOC: what use case might a user have for bringing their own cert but then also relying on the automated mTLS config that's being added? If there is a valid use case, could we ask that a design for this is added to the existing or a new proposal doc and agreed before we start to make changes for that use case as part of this PR? Or is there potential to document it as out of scope for now, to reduce the scope of the changes for this PR, i.e. cert-manager-dependent mTLS configuration and no BYOC?
// and recreated with the same name, it gets a new CA secret rather than reusing a
// potentially stale one from a previous instance.
func GetCASecretName(clusterName string, clusterUID types.UID) string {
	uidSuffix := string(clusterUID)[:8]
Potential panic slicing UID shorter than eight characters
Low Severity
GetCASecretName slices clusterUID with [:8] without a length guard. If the UID is empty or shorter than 8 characters (possible in unit tests or edge cases), this panics with an index-out-of-range error. Real Kubernetes UIDs are 36 characters, but the function has no defensive check.
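A defensive variant addressing this finding might look like the following sketch; the `-ca-` naming format and use of a plain string for the UID are illustrative, not necessarily the PR's exact code:

```go
package main

import "fmt"

// getCASecretName sketches the bounds-checked version suggested by the
// finding: truncate the UID suffix only when it is long enough, so empty or
// synthetic test UIDs cannot trigger an index-out-of-range panic.
func getCASecretName(clusterName, clusterUID string) string {
	suffix := clusterUID
	if len(suffix) > 8 {
		suffix = suffix[:8]
	}
	return fmt.Sprintf("%s-ca-%s", clusterName, suffix)
}

func main() {
	// Real Kubernetes UIDs are 36 characters, so the suffix truncates normally.
	fmt.Println(getCASecretName("demo", "0f47c3a1-1234-5678-9abc-def012345678"))
	// A short synthetic UID (common in unit tests) no longer panics.
	fmt.Println(getCASecretName("demo", "abc"))
}
```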
Force-pushed 3bcc39d to dd43b59.
Hi @laurafitzgerald, thanks for your reply. It seems the design doc does mention BYOC; thanks to @AndySung320 for the support.
After discussing with the team offline, would you mind splitting the BYOC-related part into another PR? It would help us focus on certificate auto-generation in this PR.
Force-pushed dd43b59 to 829d8a1.
- list
- patch
- update
- watch
Helm template has duplicate networking.k8s.io ingresses RBAC entry
Medium Severity
The helm template now has two separate RBAC rules granting permissions on ingresses: one under the extensions apiGroup (lines 260-271, which previously also included networking.k8s.io) and a new duplicate entry under networking.k8s.io (lines 291-302). The original combined extensions + networking.k8s.io entry was split, but networking.k8s.io was removed from the first entry without removing the now-redundant second copy. Meanwhile, the generated role.yaml correctly keeps them combined (lines 117-129). This divergence between the helm chart and kustomize RBAC definitions means the helm-deployed operator gets a redundant rule, and future RBAC changes may drift.
// and recreated with the same name, it gets a new CA secret rather than reusing a
// potentially stale one from a previous instance.
func GetCASecretName(clusterName string, clusterUID types.UID) string {
	uidSuffix := string(clusterUID)[:8]
GetCASecretName panics if UID shorter than 8 chars
Low Severity
GetCASecretName slices clusterUID with [:8] without a length guard. If the UID string is ever shorter than 8 characters (e.g., in tests with synthetic UIDs or unexpected states), this causes an out-of-bounds panic. While Kubernetes UIDs are typically 36 characters, a defensive bounds check would prevent a runtime crash.
- name: AsyncJobInfoQuery
  enabled: false
- name: RayClusterMTLS
  enabled: false
Helm values.yaml feature gate list indentation is wrong
Medium Severity
The new feature gate entries AsyncJobInfoQuery and RayClusterMTLS appear to be at the wrong indentation level compared to the preceding list items (lines 135-144). The existing entries use - name: with consistent indentation under a parent key, but the new entries lack the leading indentation, which would place them outside the feature gates list and cause incorrect helm rendering.
I've dropped the recent BYOC-specific material from this branch, so what's left should be the auto-generating mTLS work.
Thank you! @chipspeak
| `headServiceAnnotations` _object (keys:string, values:string)_ | | | |
| `enableInTreeAutoscaling` _boolean_ | EnableInTreeAutoscaling indicates whether operator should create in tree autoscaling configs | | |
| `gcsFaultToleranceOptions` _[GcsFaultToleranceOptions](#gcsfaulttoleranceoptions)_ | GcsFaultToleranceOptions for enabling GCS FT | | |
| `tlsOptions` _[TLSOptions](#tlsoptions)_ | TLSOptions specifies optional TLS encryption settings for the RayCluster.<br />If omitted, TLS is disabled. When set, the mode field controls the<br />security level (defaults to "mTLS" for mutual TLS).<br />Requires the RayClusterMTLS feature gate on the operator. | | |
There is a redundant line break.
| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `mode` _string_ | Mode controls the TLS security level.<br />- "mTLS": Enables mutual TLS (client & server authentication). | mTLS | Enum: [mTLS] <br /> |
| `certificateSecretName` _string_ | CertificateSecretName is a user-provided Kubernetes Secret containing<br />tls.crt, tls.key, and ca.crt for the head node (and workers, if<br />WorkerCertificateSecretName is not set).<br />When WorkerCertificateSecretName is also set, this secret is mounted only<br />on head pods. When WorkerCertificateSecretName is omitted, this single secret<br />is mounted on both head and worker pods (shared-secret BYOC mode).<br />The certificate SANs must cover the head node identities<br />(head service DNS, pod IPs or wildcards, localhost, 127.0.0.1).<br />When set, the operator skips cert-manager PKI and mounts this secret directly. | | |
There are some redundant line breaks.
| --- | --- | --- | --- |
| `mode` _string_ | Mode controls the TLS security level.<br />- "mTLS": Enables mutual TLS (client & server authentication). | mTLS | Enum: [mTLS] <br /> |
| `certificateSecretName` _string_ | CertificateSecretName is a user-provided Kubernetes Secret containing<br />tls.crt, tls.key, and ca.crt for the head node (and workers, if<br />WorkerCertificateSecretName is not set).<br />When WorkerCertificateSecretName is also set, this secret is mounted only<br />on head pods. When WorkerCertificateSecretName is omitted, this single secret<br />is mounted on both head and worker pods (shared-secret BYOC mode).<br />The certificate SANs must cover the head node identities<br />(head service DNS, pod IPs or wildcards, localhost, 127.0.0.1).<br />When set, the operator skips cert-manager PKI and mounts this secret directly. | | |
| `workerCertificateSecretName` _string_ | WorkerCertificateSecretName is an optional user-provided Kubernetes Secret<br />containing tls.crt, tls.key, and ca.crt for worker nodes.<br />When set, workers use this secret instead of CertificateSecretName, giving<br />head and worker pods separate TLS identities. This prevents a compromised<br />worker key from impersonating the head node at the TLS layer.<br />The certificate SANs must cover worker node identities<br />(worker pod IPs or wildcards). Both secrets must share the same CA. | | |
There are some redundant line breaks.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-byoc
Should this file be moved to another PR?
Why are these changes needed?
The Ray docs describe the assumption that users run their Ray workloads in an environment with a secure, trusted network configuration. This PR adds network policy and mTLS support via a RayCluster API change, alongside new reconcilers, tests, and docs updates. This addition gives users the ability to configure these security features from within each RayCluster.
Related issue number
Checks