
[Feature] Add Network Policy and mTLS via Cert Manager Support to RayCluster#4566

Open
kryanbeane wants to merge 1 commit into ray-project:master from opendatahub-io:enhanced-security-features

Conversation

@kryanbeane
Contributor

Why are these changes needed?

The Ray docs assume that users run their Ray workloads in an environment with a secure, trusted network configuration. This PR adds NetworkPolicy and mTLS support via a RayCluster API change, alongside new reconcilers, tests, and docs updates. It gives users the ability to configure these security features from within each RayCluster.
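For illustration, opting in might look roughly like this (field names are taken from the samples and test YAML later in this thread; the exact API shape was still changing during review, so treat this as a sketch rather than the final schema):

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-secure
spec:
  rayVersion: '2.52.0'
  # Deny all ingress to the Ray pods except what KubeRay's generated
  # NetworkPolicies explicitly allow.
  networkIsolation:
    mode: denyAllIngress
  # An empty tlsOptions requests cert-manager auto-generated mTLS material.
  tlsOptions: {}
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.52.0
```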

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kryanbeane force-pushed the enhanced-security-features branch from c6da256 to 25f359c on March 5, 2026 10:27
@kryanbeane force-pushed the enhanced-security-features branch from 25f359c to 4a63057 on March 5, 2026 10:39
@machichima
Collaborator

Hi @kryanbeane,
Could you fix the CI errors?
Thank you!

@kryanbeane
Contributor Author

@machichima yes I am working on them as we speak!

Member

@andrewsykim left a comment


This PR does not reflect the API proposed in this doc: https://docs.google.com/document/d/1z6safIXUgs7ZaUG-Vt9cFLQ_nryjq3ktAdXUoNRh4G4/edit?usp=sharing

Please update to reflect the changes there.

Contributor

@seanlaii left a comment


Thanks for the great work on this PR!

@@ -50,14 +50,6 @@ rules:
- ""
resources:
- secrets
Contributor


The operator has get/list/delete on all secrets, with no distinction for the mTLS CA secret, which contains the CA private key.
Maybe we should document that namespaces with mTLS RayClusters should restrict who has secret read access, or add labels to the CA/head/worker secrets for future policy-based protection?
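The labeling idea could be as simple as stamping every generated secret with well-known labels so admins can find and guard them with audit or policy tooling; a minimal sketch (label keys are hypothetical, not part of this PR):

```go
package main

import "fmt"

// mtlsSecretLabels returns labels a controller could attach to the CA, head,
// and worker TLS secrets it generates, so cluster admins can locate and
// protect them (e.g. with policy rules keyed on these labels).
func mtlsSecretLabels(clusterName, role string) map[string]string {
	return map[string]string{
		"ray.io/cluster":         clusterName,
		"ray.io/tls-secret-role": role, // "ca", "head", or "worker"
		"ray.io/managed-by":      "kuberay-mtls-controller",
	}
}

func main() {
	labels := mtlsSecretLabels("raycluster-secure", "ca")
	fmt.Println(labels["ray.io/tls-secret-role"]) // prints "ca"
}
```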

Member


Unfortunately I think this will be required because we want to support BYOC, which can reference arbitrary secrets.

@Future-Outlier
Member

@codex review


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e536b72a4c


Contributor

Copilot AI left a comment


Pull request overview

This PR adds optional per-RayCluster security controls to KubeRay by introducing NetworkPolicy-based network isolation and mTLS support (cert-manager auto-generation or BYOC), along with controllers, API/CRD updates, samples, and tests.

Changes:

  • Extend RayCluster (and embedded RayClusterSpec usage in RayJob/RayService/RayCronJob) with networkIsolation, enableMTLS, and mTLSOptions.
  • Add new controllers: NetworkPolicyController and RayClusterMTLSController, plus pod-template injection for TLS env/volume.
  • Add/expand unit + integration/e2e tests and update RBAC, Helm CRDs, and docs/samples.

Reviewed changes

Copilot reviewed 43 out of 45 changed files in this pull request and generated 2 comments.

File Description
ray-operator/test/e2e/raycluster_tls_test.go New e2e coverage for mTLS auto-gen/BYOC and pod-level TLS wiring.
ray-operator/pkg/client/applyconfiguration/utils.go Register applyconfig types for new API structs.
ray-operator/pkg/client/applyconfiguration/ray/v1/rayclusterspec.go Applyconfig updates for new RayClusterSpec fields and builders.
ray-operator/pkg/client/applyconfiguration/ray/v1/networkisolationconfig.go New generated applyconfig type for NetworkIsolationConfig.
ray-operator/pkg/client/applyconfiguration/ray/v1/mtlsoptions.go New generated applyconfig type for MTLSOptions.
ray-operator/main.go Wire cert-manager types into scheme and register new controllers.
ray-operator/go.mod Add cert-manager dependency; bump indirect deps.
ray-operator/go.sum Dependency lock updates for cert-manager + indirect bumps.
ray-operator/controllers/ray/utils/validation_test.go Add validation tests for NetworkIsolation constraints.
ray-operator/controllers/ray/utils/validation.go Validate NetworkIsolation + mTLS option consistency.
ray-operator/controllers/ray/utils/util.go Add helpers for mTLS enablement and TLS secret naming.
ray-operator/controllers/ray/utils/constant.go Add TLS env/volume constants + new event types.
ray-operator/controllers/ray/suite_test.go Register networking API scheme + set up NetworkPolicy controller in envtest.
ray-operator/controllers/ray/raycluster_mtls_controller.go New controller to reconcile cert-manager Issuers/Certificates + secret cleanup finalizer.
ray-operator/controllers/ray/raycluster_mtls_controller_test.go Unit tests for mTLS controller behaviors (auto-gen, BYOC, deletion, SAN updates).
ray-operator/controllers/ray/raycluster_controller.go Block pod creation on TLS secret readiness; detect and recreate pods lacking mTLS config.
ray-operator/controllers/ray/raycluster_controller_unit_test.go Add cert-manager types to schemes used in unit tests.
ray-operator/controllers/ray/networkpolicy_controller.go New controller to create/update/delete head/worker NetworkPolicies from RayCluster spec.
ray-operator/controllers/ray/networkpolicy_controller_unit_test.go Unit tests for NetworkPolicy rule construction and naming.
ray-operator/controllers/ray/networkpolicy_controller_test.go Integration tests verifying NetworkPolicy reconciliation in envtest.
ray-operator/controllers/ray/common/pod.go Inject TLS env/volume into head/worker/init/autoscaler containers when mTLS enabled.
ray-operator/controllers/ray/common/pod_test.go Unit tests for TLS injection behavior and autoscaler/init-container coverage.
ray-operator/config/samples/ray-cluster.network-isolation-monitoring.yaml Sample for network isolation with monitoring scrape allowlist.
ray-operator/config/samples/ray-cluster.network-isolation-deny-ingress.yaml Sample for deny-ingress isolation mode.
ray-operator/config/samples/ray-cluster.network-isolation-deny-all.yaml Sample for deny-all isolation mode.
ray-operator/config/samples/ray-cluster.network-isolation-custom-rules.yaml Sample for custom ingress rules.
ray-operator/config/samples/ray-cluster.network-isolation-custom-ports.yaml Sample demonstrating auto-resolved dashboard/client ports.
ray-operator/config/samples/ray-cluster.network-isolation-complex-rules.yaml Sample with multi-namespace/selector ingress rules.
ray-operator/config/samples/ray-cluster.mtls.yaml Sample enabling cert-manager-backed mTLS.
ray-operator/config/samples/ray-cluster.mtls-byoc.yaml Sample enabling BYOC mTLS using a user-provided secret.
ray-operator/config/rbac/role.yaml Expand RBAC for cert-manager resources + networkpolicies and secret delete.
ray-operator/config/crd/bases/ray.io_rayclusters.yaml CRD schema update for networkIsolation + mTLS fields.
ray-operator/config/crd/bases/ray.io_rayjobs.yaml CRD schema update for embedded RayClusterSpec fields.
ray-operator/config/crd/bases/ray.io_rayservices.yaml CRD schema update for embedded RayClusterSpec fields.
ray-operator/config/crd/bases/ray.io_raycronjobs.yaml CRD schema update for embedded RayClusterSpec fields.
ray-operator/apis/ray/v1/raycluster_types.go API types: NetworkIsolationConfig, MTLSOptions, and new RayClusterSpec fields.
ray-operator/apis/ray/v1/zz_generated.deepcopy.go Generated deepcopy updates for new API fields/types.
helm-chart/kuberay-operator/templates/_helpers.tpl Helm RBAC template updates for cert-manager + networkpolicy permissions.
helm-chart/kuberay-operator/crds/ray.io_rayclusters.yaml Helm CRD bundle update for new fields.
helm-chart/kuberay-operator/crds/ray.io_rayjobs.yaml Helm CRD bundle update for new fields.
helm-chart/kuberay-operator/crds/ray.io_rayservices.yaml Helm CRD bundle update for new fields.
helm-chart/kuberay-operator/crds/ray.io_raycronjobs.yaml Helm CRD bundle update for new fields.
docs/reference/api.md API reference docs for NetworkIsolationConfig + MTLSOptions + RayClusterSpec fields.
go.mod Root module indirect dependency bumps (otel/x/sys).
go.sum Root module dependency lock updates.


@kryanbeane force-pushed the enhanced-security-features branch from dcdb4eb to 5841cc8 on March 9, 2026 11:23
@kryanbeane force-pushed the enhanced-security-features branch from 5841cc8 to 47463cc on March 9, 2026 11:37
// Check if cert-manager API is available before registering the mTLS controller.
// If cert-manager is not installed, the controller's Certificate/Issuer cache would
// never sync and the manager would fail to start (e.g. in E2E environments without cert-manager).
certManagerAvailable := certManagerAPIAvailable(restConfig)
Member


The discovery API check is nice, but I'm wondering if accessing cert-manager APIs should still be gated by a flag in KubeRay that is also plumbed into the Helm charts, i.e. this feature requires an explicit opt-in when running helm install kuberay-operator instead of automatically working whenever cert-manager is available.

Are there any risks to auto-enabling cert-manager, or any reason this shouldn't be on by default?
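The quoted gate presumably reduces to a membership test over the API group/versions that the server's discovery endpoint advertises; a stdlib-only sketch of that core logic (function name hypothetical; the real check would go through client-go's discovery client against restConfig):

```go
package main

import "fmt"

// apiGroupAvailable reports whether a target group/version (e.g.
// "cert-manager.io/v1") appears in the list of group/versions the
// API server's discovery endpoint advertises.
func apiGroupAvailable(advertised []string, target string) bool {
	for _, gv := range advertised {
		if gv == target {
			return true
		}
	}
	return false
}

func main() {
	groups := []string{"ray.io/v1", "networking.k8s.io/v1"}
	// Without cert-manager installed, the mTLS controller is skipped.
	fmt.Println(apiGroupAvailable(groups, "cert-manager.io/v1")) // prints "false"
	groups = append(groups, "cert-manager.io/v1")
	fmt.Println(apiGroupAvailable(groups, "cert-manager.io/v1")) // prints "true"
}
```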

Contributor Author


Setting tlsOptions without cert-manager installed would leave the cluster stuck indefinitely waiting for secrets. I've added a guard that rejects it with a clear warning event in 4ea99a5. BYOC mode (via certificateSecretName) still works without cert-manager since the user manages their own certs. Re the Helm charts: Kubernetes ignores RBAC rules for API groups that don't exist in the cluster AFAIK, so the cert-manager rules should be fine to ship unconditionally. Happy to add a feature gate if you'd still prefer an explicit opt-in.
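The guard described here might look roughly like the following (function signature and behavior are a sketch inferred from this comment, not the actual code in 4ea99a5):

```go
package main

import (
	"errors"
	"fmt"
)

// validateTLSOptions rejects cert-manager-backed mTLS when cert-manager is
// not installed, while still allowing BYOC mode where the user supplies
// their own certificate secret.
func validateTLSOptions(tlsOptionsSet bool, certificateSecretName string, certManagerAvailable bool) error {
	if !tlsOptionsSet {
		return nil // mTLS not requested; nothing to validate
	}
	if certificateSecretName != "" {
		return nil // BYOC: the user manages certs, cert-manager not needed
	}
	if !certManagerAvailable {
		return errors.New("tlsOptions requires cert-manager to be installed, " +
			"or set certificateSecretName to bring your own certificates")
	}
	return nil
}

func main() {
	err := validateTLSOptions(true, "", false)
	fmt.Println(err != nil) // prints "true": rejected with a clear error
}
```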

Member


Yeah, I'm not worried about RBAC breaking, and I don't think we need a feature gate; I was thinking of a config option or flag to toggle it. I'm wondering if a cluster admin would want some sort of opt-in before KubeRay starts managing cert-manager Certificates. As long as Ray clusters don't touch cert-manager certificates by default it should be fine, but I'm curious if others have thoughts on this.

// alpha: v1.6
//
// Enables NetworkPolicy-based network isolation and mTLS for RayClusters.
EnhancedSecurityPrimitives featuregate.Feature = "EnhancedSecurityPrimitives"
Member


I think we should treat mTLS and network policy as two distinct features, which probably requires separate feature gates.
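Splitting the gate could look like this (gate names and defaults are hypothetical, following the featuregate.Feature string-constant pattern shown in the diff; the stdlib-only types here stand in for k8s.io/component-base/featuregate):

```go
package main

import "fmt"

// Feature mirrors featuregate.Feature: a string-typed gate name.
type Feature string

const (
	// alpha: NetworkPolicy-based network isolation for RayClusters.
	NetworkIsolation Feature = "NetworkIsolation"
	// alpha: cert-manager-backed mTLS for RayClusters.
	RayClusterMTLS Feature = "RayClusterMTLS"
)

// defaultGates holds the default state; alpha features ship disabled.
var defaultGates = map[Feature]bool{
	NetworkIsolation: false,
	RayClusterMTLS:   false,
}

func enabled(f Feature) bool { return defaultGates[f] }

func main() {
	fmt.Println(enabled(NetworkIsolation), enabled(RayClusterMTLS)) // prints "false false"
}
```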

Member


However, given this is likely going to miss v1.6, we can re-evaluate for v1.7 whether the feature gates are even required.



Thanks Andrew, I'd imagine just aiming for 1.7 and dropping the feature gate is probably the best approach for us going forward. WDYT?

@machichima
Collaborator

I am testing with the following YAML locally, and I keep getting warnings like "Failed to connect to GCS at address 10.244.0.41:6379" in the head pod; the head pod cannot become ready.

I printed out the events with kubectl get events -n default --sort-by=.lastTimestamp and collected the relevant ones below:

107s        Normal    MTLSPKIReady               raycluster/rayjob-mtls-origin-4ffkg                                mTLS PKI is ready: CA, head, and worker certificates issued
76s         Normal    CreatedHeadPod             raycluster/rayjob-mtls-origin-4ffkg                                Created head Pod default/rayjob-mtls-origin-4ffkg-head-rckkn
75s         Normal    Started                    pod/rayjob-mtls-origin-4ffkg-head-rckkn                            Started container ray-head
75s         Normal    Issuing                    certificate/ray-head-cert-rayjob-mtls-origin-4ffkg                 Fields on existing CertificateRequest resource not up to date: [spec.ipAddresses]
75s         Normal    Requested                  certificate/ray-head-cert-rayjob-mtls-origin-4ffkg                 Created new CertificateRequest resource "ray-head-cert-rayjob-mtls-origin-4ffkg-5"
74s         Normal    CertificateIssued          certificaterequest/ray-head-cert-rayjob-mtls-origin-4ffkg-5        Certificate fetched from issuer successfully
73s         Normal    Issuing                    certificate/ray-head-cert-rayjob-mtls-origin-4ffkg                 The certificate has been successfully issued
35s         Warning   Unhealthy                  pod/rayjob-mtls-origin-4ffkg-head-rckkn                            Liveness probe failed:
5s          Warning   Unhealthy                  pod/rayjob-mtls-origin-4ffkg-head-rckkn                            Readiness probe failed:

The events suggest the following order:

  1. KubeRay reports mTLS PKI ready.
  2. The head pod starts.
  3. The head certificate is re-issued because spec.ipAddresses changes.
  4. The head pod later becomes unhealthy.

Based on this, I think the issue is that the initial head certificate may not yet include the final pod IPs. After the head pod is created, KubeRay updates the certificate's IP SANs and cert-manager re-issues it. Since Ray does not hot-reload TLS material, the head process may continue using the earlier certificate state and fail to connect to GCS over the pod IP.


Test YAML I used

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-mtls-verification
spec:
  entrypoint: python /home/ray/samples/worker_to_worker_tls_test.py
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 60
  rayClusterSpec:
    rayVersion: '2.52.0'
    networkIsolation:
      mode: denyAllIngress
    tlsOptions: {}
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.52.0
            resources:
              limits:
                cpu: "1"
                memory: "2Gi"
              requests:
                cpu: "500m"
                memory: "1Gi"
            volumeMounts:
            - mountPath: /home/ray/samples
              name: tls-test-code
          volumes:
          - name: tls-test-code
            configMap:
              name: rayjob-mtls-verification-code
              items:
              - key: worker_to_worker_tls_test.py
                path: worker_to_worker_tls_test.py
    workerGroupSpecs:
    - replicas: 2
      minReplicas: 2
      maxReplicas: 2
      groupName: small-group
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.52.0
            resources:
              limits:
                cpu: "1"
                memory: "1Gi"
              requests:
                cpu: "500m"
                memory: "1Gi"

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rayjob-mtls-verification-code
data:
  worker_to_worker_tls_test.py: |
    """
    Verify head-to-worker and worker-to-worker communication across distinct Ray pods.

    Usage:
      python worker_to_worker_tls_test.py
    """

    import os
    import socket

    import ray
    from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

    ctx = ray.init()

    print("Dashboard:", ctx.dashboard_url)


    def pin_to(node_id: str) -> NodeAffinitySchedulingStrategy:
        return NodeAffinitySchedulingStrategy(node_id=node_id, soft=False)


    def alive_node_ids():
        return [n["NodeID"] for n in ray.nodes() if n["Alive"]]


    driver = {
        "node_id": ray.get_runtime_context().get_node_id(),
        "hostname": socket.gethostname(),
        "pid": os.getpid(),
    }
    print("Driver:", driver)

    all_nodes = alive_node_ids()
    worker_nodes = [nid for nid in all_nodes if nid != driver["node_id"]]

    if len(worker_nodes) < 2:
        raise RuntimeError(
            f"Need at least 2 worker nodes/pods besides the driver node. "
            f"driver={driver['node_id']} all_nodes={all_nodes}"
        )

    worker_a, worker_b = worker_nodes[:2]
    print(f"Using worker_a={worker_a}")
    print(f"Using worker_b={worker_b}")


    # --- Driver/Head <-> Worker: force task onto worker_a ---

    @ray.remote
    def where_am_i(tag: str):
        return {
            "tag": tag,
            "node_id": ray.get_runtime_context().get_node_id(),
            "hostname": socket.gethostname(),
            "pid": os.getpid(),
        }

    head_to_worker = ray.get(
        where_am_i.options(scheduling_strategy=pin_to(worker_a)).remote("head-to-worker")
    )
    assert head_to_worker["node_id"] == worker_a
    assert head_to_worker["node_id"] != driver["node_id"]

    print("Head -> Worker -> Head:", head_to_worker)
    print("Same node as driver?", head_to_worker["node_id"] == driver["node_id"])
    print("Pinned to worker_a?", head_to_worker["node_id"] == worker_a)


    # --- Worker -> Worker: chained tasks forced across worker_a -> worker_b ---
    @ray.remote
    def double(x):
        return {
            "value": x * 2,
            "node_id": ray.get_runtime_context().get_node_id(),
            "hostname": socket.gethostname(),
            "pid": os.getpid(),
        }


    @ray.remote
    def chain(x, target_node_id):
        here = {
            "node_id": ray.get_runtime_context().get_node_id(),
            "hostname": socket.gethostname(),
            "pid": os.getpid(),
        }
        out = ray.get(
            double.options(scheduling_strategy=pin_to(target_node_id)).remote(x)
        )
        return {
            "chain_node": here,
            "double_node": out,
        }


    chain_result = ray.get(
        chain.options(scheduling_strategy=pin_to(worker_a)).remote(21, worker_b)
    )

    assert chain_result["chain_node"]["node_id"] == worker_a
    assert chain_result["double_node"]["node_id"] == worker_b
    assert chain_result["chain_node"]["node_id"] != chain_result["double_node"]["node_id"]

    print("Worker -> Worker chained:", chain_result)
    print("chain() on worker_a?", chain_result["chain_node"]["node_id"] == worker_a)
    print("double() on worker_b?", chain_result["double_node"]["node_id"] == worker_b)
    print(
        "chain() and double() same node?",
        chain_result["chain_node"]["node_id"] == chain_result["double_node"]["node_id"],
    )


    # --- Worker <-> Worker: actor-to-actor bidirectional across worker_a <-> worker_b ---
    @ray.remote
    class Ping:
        def __init__(self, name):
            self.name = name
            self.node_id = ray.get_runtime_context().get_node_id()
            self.hostname = socket.gethostname()
            self.pid = os.getpid()

        def info(self):
            return {
                "name": self.name,
                "node_id": self.node_id,
                "hostname": self.hostname,
                "pid": self.pid,
            }

        def greet(self, msg, sender):
            return {
                "receiver": self.name,
                "receiver_node": self.node_id,
                "receiver_host": self.hostname,
                "receiver_pid": self.pid,
                "msg": msg,
                "sender": sender,
            }

        def send(self, other, msg):
            sender = self.info()
            response = ray.get(other.greet.remote(msg, self.name))
            return {
                "sender": sender,
                "response": response,
            }


    a = Ping.options(scheduling_strategy=pin_to(worker_a)).remote("A")
    b = Ping.options(scheduling_strategy=pin_to(worker_b)).remote("B")

    a_info, b_info = ray.get([a.info.remote(), b.info.remote()])
    assert a_info["node_id"] == worker_a
    assert b_info["node_id"] == worker_b
    assert a_info["node_id"] != b_info["node_id"]

    print("Actor A:", a_info)
    print("Actor B:", b_info)
    print("A pinned to worker_a?", a_info["node_id"] == worker_a)
    print("B pinned to worker_b?", b_info["node_id"] == worker_b)
    print("A and B same node?", a_info["node_id"] == b_info["node_id"])

    ab = ray.get(a.send.remote(b, "hello"))
    ba = ray.get(b.send.remote(a, "world"))

    assert ab["sender"]["node_id"] != ab["response"]["receiver_node"]
    assert ba["sender"]["node_id"] != ba["response"]["receiver_node"]

    print("A->B:", ab)
    print("B->A:", ba)
    print(
        "A send() and B greet() same node?",
        ab["sender"]["node_id"] == ab["response"]["receiver_node"],
    )
    print(
        "B send() and A greet() same node?",
        ba["sender"]["node_id"] == ba["response"]["receiver_node"],
    )

    print("ALL OK")

@chipspeak


Yeah, I've run into the same and your assessment is accurate. I think when I had been running this, the liveness probe identified the failing pod and restarted it, so the issue was intermittent and less apparent. This becomes a bit of a chicken-and-egg situation, I think. I'll see if I can find a solution and report back!

Comment on lines +384 to +394
if rayNodeType == rayv1.HeadNode {
	rayContainer := &podTemplate.Spec.Containers[utils.RayContainerIndex]
	if !utils.EnvVarExists("MY_POD_IP", rayContainer.Env) {
		rayContainer.Env = append(rayContainer.Env, corev1.EnvVar{
			Name: "MY_POD_IP",
			ValueFrom: &corev1.EnvVarSource{
				FieldRef: &corev1.ObjectFieldSelector{FieldPath: "status.podIP"},
			},
		})
	}
}
Collaborator

I'm wondering why we need to set MY_POD_IP here? Is this used anywhere?

@machichima
Collaborator

Hi @chipspeak

I’m thinking a possible fix would be to gate the head container startup with an init container:

  1. Create the pod first so the pod IP exists.
  2. Wait until the mounted head certificate SAN contains the pod IP.
  3. Once the certificate is ready, exit the init container and allow the ray-head container to start.

This would follow the same general pattern as the existing worker pod wait-gcs-ready init container, which blocks worker startup until GCS is reachable

initContainer := corev1.Container{
	Name:            "wait-gcs-ready",
	Image:           podTemplate.Spec.Containers[utils.RayContainerIndex].Image,
	ImagePullPolicy: podTemplate.Spec.Containers[utils.RayContainerIndex].ImagePullPolicy,
	Command:         utils.GetContainerCommand([]string{}),
	Args: []string{
		fmt.Sprintf(`
SECONDS=0
while true; do
    if (( SECONDS <= 120 )); then
        if ray health-check --address %s:%s > /dev/null 2>&1; then
            echo "GCS is ready."
            break
        fi
        echo "$SECONDS seconds elapsed: Waiting for GCS to be ready."
    else
        if ray health-check --address %s:%s; then
            echo "GCS is ready. Any error messages above can be safely ignored."
            break
        fi
        echo "$SECONDS seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://docs.ray.io/en/master/cluster/kubernetes/troubleshooting.html."
    fi
    sleep 5
done
`, fqdnRayIP, headPort, fqdnRayIP, headPort),
	},

cc @rueian @andrewsykim @Future-Outlier to check if this is a good approach
Thanks!

@chipspeak


Thanks @machichima! We'd been trying to avoid init containers because of the performance hit. I've checked our midstream and we'd solved it there by setting --node-ip-address to the head service FQDN in setMissingRayStartParams when the user hasn't explicitly provided one. This way the head node advertises its stable service address ({cluster}-head-svc.{ns}.svc.cluster.local) instead of the pod IP.
The DNS SANs for that FQDN are already present in the certificate from the moment it's created, so there's no race condition: the cert is valid before the pod even exists. Workers are unaffected since the existing wait-gcs-ready init container provides enough of a delay for the controller to update the worker cert with pod IPs before the main container reads it.
I'll add this to the PR soon, separate from the other review feedback. The change is minimal: just passing fqdnRayIP through to DefaultHeadPodTemplate and adding the node-ip-address default in setMissingRayStartParams for head nodes.

Our midstream PR might give some indication of what this will look like.

// when mTLS is enabled: certs are created before pods exist, so the initial cert only
// contains DNS SANs and 127.0.0.1. Using the FQDN ensures GCS connections match a
// DNS SAN that is present from the start.
if _, ok := rayStartParams["node-ip-address"]; !ok && fqdnRayIP != "" {
	rayStartParams["node-ip-address"] = fqdnRayIP
}
Collaborator

@machichima Mar 19, 2026

If the user sets node-ip-address in rayStartParams themselves, they may still hit the issue. We will need to add this to the documentation.


Absolutely. I think there'll be a lot of this to be documented in any case!

@machichima
Collaborator

I can run the RayJob successfully with the newest changes. Thank you for the quick update!

Tested with the following script to ensure cross-node communication. It looks good for the following three paths:

  1. head -> worker
  2. worker -> worker (chained task, one direction)
  3. worker <-> worker (actor-to-actor, bidirectional)

YAML:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-mtls-verification
spec:
  entrypoint: python /home/ray/samples/worker_to_worker_tls_test.py
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 60
  rayClusterSpec:
    rayVersion: '2.52.0'
    networkIsolation:
      mode: denyAllIngress
    tlsOptions: {}
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.52.0
            resources:
              limits:
                cpu: "1"
                memory: "2Gi"
              requests:
                cpu: "500m"
                memory: "1Gi"
            volumeMounts:
            - mountPath: /home/ray/samples
              name: tls-test-code
          volumes:
          - name: tls-test-code
            configMap:
              name: rayjob-mtls-verification-code
              items:
              - key: worker_to_worker_tls_test.py
                path: worker_to_worker_tls_test.py
    workerGroupSpecs:
    - replicas: 2
      minReplicas: 2
      maxReplicas: 2
      groupName: small-group
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.52.0
            resources:
              limits:
                cpu: "1"
                memory: "1Gi"
              requests:
                cpu: "500m"
                memory: "1Gi"

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rayjob-mtls-verification-code
data:
  worker_to_worker_tls_test.py: |
    """
    Verify head-to-worker and worker-to-worker communication across distinct Ray pods.

    Usage:
      python worker_to_worker_tls_test.py
    """

    import os
    import socket

    import ray
    from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

    ctx = ray.init()

    print("Dashboard:", ctx.dashboard_url)


    def pin_to(node_id: str) -> NodeAffinitySchedulingStrategy:
        return NodeAffinitySchedulingStrategy(node_id=node_id, soft=False)


    def alive_node_ids():
        return [n["NodeID"] for n in ray.nodes() if n["Alive"]]


    driver = {
        "node_id": ray.get_runtime_context().get_node_id(),
        "hostname": socket.gethostname(),
        "pid": os.getpid(),
    }
    print("Driver:", driver)

    all_nodes = alive_node_ids()
    worker_nodes = [nid for nid in all_nodes if nid != driver["node_id"]]

    if len(worker_nodes) < 2:
        raise RuntimeError(
            f"Need at least 2 worker nodes/pods besides the driver node. "
            f"driver={driver['node_id']} all_nodes={all_nodes}"
        )

    worker_a, worker_b = worker_nodes[:2]
    print(f"Using worker_a={worker_a}")
    print(f"Using worker_b={worker_b}")


    # --- Driver/Head <-> Worker: force task onto worker_a ---

    @ray.remote
    def where_am_i(tag: str):
        return {
            "tag": tag,
            "node_id": ray.get_runtime_context().get_node_id(),
            "hostname": socket.gethostname(),
            "pid": os.getpid(),
        }

    head_to_worker = ray.get(
        where_am_i.options(scheduling_strategy=pin_to(worker_a)).remote("head-to-worker")
    )
    assert head_to_worker["node_id"] == worker_a
    assert head_to_worker["node_id"] != driver["node_id"]

    print("Head -> Worker -> Head:", head_to_worker)
    print("Same node as driver?", head_to_worker["node_id"] == driver["node_id"])
    print("Pinned to worker_a?", head_to_worker["node_id"] == worker_a)


    # --- Worker -> Worker: chained tasks forced across worker_a -> worker_b ---
    @ray.remote
    def double(x):
        return {
            "value": x * 2,
            "node_id": ray.get_runtime_context().get_node_id(),
            "hostname": socket.gethostname(),
            "pid": os.getpid(),
        }


    @ray.remote
    def chain(x, target_node_id):
        here = {
            "node_id": ray.get_runtime_context().get_node_id(),
            "hostname": socket.gethostname(),
            "pid": os.getpid(),
        }
        out = ray.get(
            double.options(scheduling_strategy=pin_to(target_node_id)).remote(x)
        )
        return {
            "chain_node": here,
            "double_node": out,
        }


    chain_result = ray.get(
        chain.options(scheduling_strategy=pin_to(worker_a)).remote(21, worker_b)
    )

    assert chain_result["chain_node"]["node_id"] == worker_a
    assert chain_result["double_node"]["node_id"] == worker_b
    assert chain_result["chain_node"]["node_id"] != chain_result["double_node"]["node_id"]

    print("Worker -> Worker chained:", chain_result)
    print("chain() on worker_a?", chain_result["chain_node"]["node_id"] == worker_a)
    print("double() on worker_b?", chain_result["double_node"]["node_id"] == worker_b)
    print(
        "chain() and double() same node?",
        chain_result["chain_node"]["node_id"] == chain_result["double_node"]["node_id"],
    )


    # --- Worker <-> Worker: actor-to-actor bidirectional across worker_a <-> worker_b ---
    @ray.remote
    class Ping:
        def __init__(self, name):
            self.name = name
            self.node_id = ray.get_runtime_context().get_node_id()
            self.hostname = socket.gethostname()
            self.pid = os.getpid()

        def info(self):
            return {
                "name": self.name,
                "node_id": self.node_id,
                "hostname": self.hostname,
                "pid": self.pid,
            }

        def greet(self, msg, sender):
            return {
                "receiver": self.name,
                "receiver_node": self.node_id,
                "receiver_host": self.hostname,
                "receiver_pid": self.pid,
                "msg": msg,
                "sender": sender,
            }

        def send(self, other, msg):
            sender = self.info()
            response = ray.get(other.greet.remote(msg, self.name))
            return {
                "sender": sender,
                "response": response,
            }


    a = Ping.options(scheduling_strategy=pin_to(worker_a)).remote("A")
    b = Ping.options(scheduling_strategy=pin_to(worker_b)).remote("B")

    a_info, b_info = ray.get([a.info.remote(), b.info.remote()])
    assert a_info["node_id"] == worker_a
    assert b_info["node_id"] == worker_b
    assert a_info["node_id"] != b_info["node_id"]

    print("Actor A:", a_info)
    print("Actor B:", b_info)
    print("A pinned to worker_a?", a_info["node_id"] == worker_a)
    print("B pinned to worker_b?", b_info["node_id"] == worker_b)
    print("A and B same node?", a_info["node_id"] == b_info["node_id"])

    ab = ray.get(a.send.remote(b, "hello"))
    ba = ray.get(b.send.remote(a, "world"))

    assert ab["sender"]["node_id"] != ab["response"]["receiver_node"]
    assert ba["sender"]["node_id"] != ba["response"]["receiver_node"]

    print("A->B:", ab)
    print("B->A:", ba)
    print(
        "A send() and B greet() same node?",
        ab["sender"]["node_id"] == ab["response"]["receiver_node"],
    )
    print(
        "B send() and A greet() same node?",
        ba["sender"]["node_id"] == ba["response"]["receiver_node"],
    )

    print("ALL OK")

| `enableInTreeAutoscaling` _boolean_ | EnableInTreeAutoscaling indicates whether operator should create in tree autoscaling configs | | |
| `gcsFaultToleranceOptions` _[GcsFaultToleranceOptions](#gcsfaulttoleranceoptions)_ | GcsFaultToleranceOptions for enabling GCS FT | | |
| `networkIsolation` _[NetworkIsolationConfig](#networkisolationconfig)_ | NetworkIsolation specifies optional configuration for network isolation.<br />When set, NetworkPolicies will be created to control traffic to/from Ray pods.<br />The reconciler always ensures intra-cluster and KubeRay operator communication is permitted. | | |
| `tlsOptions` _[TLSOptions](#tlsoptions)_ | TLSOptions specifies optional TLS encryption settings for the RayCluster.<br />If omitted, TLS is disabled. When set, the mode field controls the<br />security level (defaults to "mTLS" for mutual TLS). | | |
Collaborator

It seems there is a redundant line break.


And there are some in this file.


I'll take a look thanks!


Addressed in the latest commit.

@fscnick
Collaborator

fscnick commented Mar 19, 2026

I'm not sure if this has been discussed previously: why does the controller create a self-signed CA certificate per RayCluster instead of using a single self-signed CA per controller to sign the head and worker certificates?

@chipspeak

chipspeak commented Mar 19, 2026

Purely a "belt and braces" approach. This way, if the CA were compromised, it would have a fairly small blast radius (just the RayCluster) rather than the larger scope of a controller-level CA.

Comment on lines +1441 to +1445
headFQDN := ""
if utils.IsTLSEnabled(&instance.Spec) {
	headFQDN = fqdnRayIP
}
podConf := common.DefaultHeadPodTemplate(ctx, instance, instance.Spec.HeadGroupSpec, podName, headPort, headFQDN)
Collaborator

I wonder how a user would apply a RayJob with BYOC. How would the certificate be signed? The RayJob creates the corresponding RayCluster with a hash suffix that is hard to predict, so the name of the service is hard to predict too.

I tried to overwrite node-ip-address with another, predictable service name, but it didn't work: that service would only be created later.

Or, did I miss something?

@laurafitzgerald Mar 20, 2026

@fscnick can I ask what you mean by BYOC? Bring your own cert? The proposal and implementation rely on certificates created via cert-manager, which is handled inside the controller, where the names of the cluster and service are known.


I think this is already an existing limitation: essentially the same issue already exists for IP SANs, just for DNS instead.

Re: node-ip-address, that's consumed by Ray at startup time. If it's set to a service that doesn't exist, Ray can't resolve the DNS and should fail. The service will get created by KubeRay after the fact, but that's after the failure. Our solution gets around this by relying on the head service, which is created prior to the head pod, ensuring that by the time the pod comes up, the service exists and node-ip-address can be resolved.

Collaborator

@laurafitzgerald Yeah, please reference this ray-cluster.mtls-byoc.yaml.

Take the above RayJob rayjob-mtls-verification as an example. It will create a RayCluster rayjob-mtls-verification-pc79r, suffixed with a truncated hash. As a result, it also creates the following services:

rayjob-mtls-verification-head-svc         ClusterIP   None            <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   5s
rayjob-mtls-verification-pc79r-head-svc   ClusterIP   None            <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   28s

In the current solution, it leverages --node-ip-address=rayjob-mtls-verification-pc79r-head-svc.default.svc.cluster.local to make it work.

In BYOC mode, the DNS names need to be in the SANs of the certificate. However, the hash suffix of the corresponding RayCluster is unknown before it is created.

Additionally, using rayjob-mtls-verification-head-svc didn't work, as mentioned by @chipspeak: it is created by the RayCluster, which happens after the failure.
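One way to sidestep the unpredictable hash suffix in BYOC mode could be a wildcard DNS SAN covering every head service in the namespace. A minimal cert-manager sketch, where all names (secret, issuer, namespace) are illustrative assumptions rather than anything from this PR:

```yaml
# Hypothetical cert-manager Certificate for BYOC: the wildcard DNS SAN
# matches any <cluster>-head-svc.default.svc.cluster.local, so the
# RayCluster hash suffix does not need to be known in advance.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ray-byoc-cert
spec:
  secretName: ray-byoc-tls        # would be referenced via certificateSecretName
  dnsNames:
    - "*.default.svc.cluster.local"
    - localhost
  ipAddresses:
    - 127.0.0.1
  issuerRef:
    name: my-ca-issuer            # assumed pre-existing Issuer
    kind: Issuer
```

Whether this satisfies SAN checks for worker pod IPs is a separate question; the sketch only addresses the head service DNS name.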

@machichima
Collaborator

Hi @chipspeak @laurafitzgerald,

Sorry if I missed any prior discussion. I was wondering if we could split this large PR into two: one for network isolation and the other for mTLS. Smaller PRs are preferable in OSS, as they are much easier to review compared to large ones.

For this PR, having to consider both network isolation and mTLS at the same time will make the review process much longer and harder to discuss.

Thank you!

cc @rueian @Future-Outlier @andrewsykim

@machichima
Collaborator

Want to check whether you have considered the following scenario:

  1. The head and worker Pods are already running with a correct certificate.
  2. A worker Pod is then deleted and recreated, or a new worker Pod is added during autoscaling.
  3. The new worker Pod may get mounted with the existing worker Secret before the mTLS controller updates the worker certificate SAN list.

Since the mTLS controller appears to reconcile periodically (30s / 1m), there may be a window where the new worker starts with a certificate whose SANs do not yet include its current Pod IP.

Have you tested this case explicitly? If so, could you share how this race is avoided?
Thank you!

if instance.DeletionTimestamp != nil && !instance.DeletionTimestamp.IsZero() {
	if controllerutil.ContainsFinalizer(instance, utils.MTLSCleanupFinalizer) {
		// Only delete secrets in auto-generate mode; BYOC secrets belong to the user.
		if !utils.IsTLSBYOC(&instance.Spec) {
Contributor

If a cluster starts in auto-generate mode (secrets created, finalizer added), then the user switches to BYOC via kubectl apply and later runs kubectl delete raycluster, would the CA/head/worker secrets be permanently orphaned in the namespace, since IsTLSBYOC is true and deleteTLSSecrets() is skipped?

sort.Strings(desiredDNSNames)
desiredIPAddresses := normalizeIPs(podIPs)

// Known limitation: Ray reads TLS files once at startup and does not hot-reload them.
Contributor

Should we document this behavior? The known limitation is acknowledged in code comments but does not appear in any docs (docs/reference/api.md). Otherwise users might assume certificates are renewed end-to-end automatically without pod restarts.


Absolutely. I just have it in the comments for now. I wanted to wait until everything was locked in before updating the documentation, as I wanted to find out your preference in terms of docs.

@laurafitzgerald

Hi team. We see a request to split out the PR again. We will take a look at that. We will need to split and rename the feature gate, as there is currently one controlling both features. I'll make a stab at names for those, but they can be changed.

On a separate note, and apologies if this was discussed in threads, but I see a lot of discussion around BYOC. In our original proposal we did not consider BYOC as in scope for this change, as the idea is that a user can switch this on and have all the work done for them. So I have a few questions about BYOC.

What use case might a user have to bring their own cert but then also want to rely on the automated mTLS config that's being added?

If there is a valid use case, could we ask that a design for it is added to the existing or a new proposal doc and agreed before we start making changes for that use case as part of this PR? Or is there potential to document it as out of scope for now to reduce the scope of the changes for this PR, i.e. cert-manager-dependent mTLS configuration with no BYOC.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.


// and recreated with the same name, it gets a new CA secret rather than reusing a
// potentially stale one from a previous instance.
func GetCASecretName(clusterName string, clusterUID types.UID) string {
	uidSuffix := string(clusterUID)[:8]

Potential panic slicing UID shorter than eight characters

Low Severity

GetCASecretName slices clusterUID with [:8] without a length guard. If the UID is empty or shorter than 8 characters (possible in unit tests or edge cases), this panics with an index-out-of-range error. Real Kubernetes UIDs are 36 characters, but the function has no defensive check.
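A defensive variant with a bounds guard might look like the following self-contained sketch. The UID type is declared locally so the snippet compiles on its own, and the "<cluster>-ca-cert-<uid8>" secret-name format is an illustrative assumption, not the PR's actual format:

```go
package main

import "fmt"

// UID mirrors k8s.io/apimachinery/pkg/types.UID (a string alias),
// declared locally so this sketch compiles on its own.
type UID string

// GetCASecretName builds a per-cluster CA secret name from the first eight
// characters of the cluster UID. The length guard means short or empty UIDs
// (e.g. synthetic UIDs in unit tests) cannot trigger an out-of-range panic.
// The "<cluster>-ca-cert-<uid8>" format is assumed for illustration.
func GetCASecretName(clusterName string, clusterUID UID) string {
	uidSuffix := string(clusterUID)
	if len(uidSuffix) > 8 {
		uidSuffix = uidSuffix[:8]
	}
	return fmt.Sprintf("%s-ca-cert-%s", clusterName, uidSuffix)
}

func main() {
	fmt.Println(GetCASecretName("demo", "0123456789abcdef")) // demo-ca-cert-01234567
	fmt.Println(GetCASecretName("demo", ""))                 // demo-ca-cert- (no panic)
}
```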


@chipspeak

networkPolicy PR can be found here
cc. @rueian @Future-Outlier @andrewsykim @machichima @kevin85421 @MortalHappiness @fscnick

@laurafitzgerald laurafitzgerald force-pushed the enhanced-security-features branch 2 times, most recently from 3bcc39d to dd43b59 Compare March 23, 2026 16:41
@fscnick
Collaborator

fscnick commented Mar 28, 2026


Hi @laurafitzgerald, thanks for your reply. It seems the design doc has mentioned BYOC; thanks to @AndySung320 for the support.

SecretName
The name of the secret to be used for the RayCluster, if the user chooses to bring their own certificates

After discussing with the team offline, would you mind splitting the BYOC-related part into another PR? It would help us focus on auto-generated certificates in this PR.

@chipspeak chipspeak force-pushed the enhanced-security-features branch from dd43b59 to 829d8a1 Compare March 30, 2026 08:31
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.


- list
- patch
- update
- watch

Helm template has duplicate networking.k8s.io ingresses RBAC entry

Medium Severity

The helm template now has two separate RBAC rules granting permissions on ingresses: one under the extensions apiGroup (lines 260-271, which previously also included networking.k8s.io) and a new duplicate entry under networking.k8s.io (lines 291-302). The original combined extensions + networking.k8s.io entry was split, but networking.k8s.io was removed from the first entry without removing the now-redundant second copy. Meanwhile, the generated role.yaml correctly keeps them combined (lines 117-129). This divergence between the helm chart and kustomize RBAC definitions means the helm-deployed operator gets a redundant rule, and future RBAC changes may drift.

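For reference, a single combined rule would mirror the generated role.yaml and avoid the redundant copy. A sketch, where the verb list is assumed from the surrounding RBAC entries:

```yaml
# One rule covering both API groups, matching the generated role.yaml,
# so the helm chart does not carry a duplicate networking.k8s.io entry.
- apiGroups:
    - extensions
    - networking.k8s.io
  resources:
    - ingresses
  verbs:
    - create
    - delete
    - get
    - list
    - patch
    - update
    - watch
```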

// and recreated with the same name, it gets a new CA secret rather than reusing a
// potentially stale one from a previous instance.
func GetCASecretName(clusterName string, clusterUID types.UID) string {
	uidSuffix := string(clusterUID)[:8]

GetCASecretName panics if UID shorter than 8 chars

Low Severity

GetCASecretName slices clusterUID with [:8] without a length guard. If the UID string is ever shorter than 8 characters (e.g., in tests with synthetic UIDs or unexpected states), this causes an out-of-bounds panic. While Kubernetes UIDs are typically 36 characters, a defensive bounds check would prevent a runtime crash.


- name: AsyncJobInfoQuery
enabled: false
- name: RayClusterMTLS
enabled: false

Helm values.yaml feature gate list indentation is wrong

Medium Severity

The new feature gate entries AsyncJobInfoQuery and RayClusterMTLS appear to be at the wrong indentation level compared to the preceding list items (lines 135-144). The existing entries use - name: with consistent indentation under a parent key, but the new entries lack the leading indentation, which would place them outside the feature gates list and cause incorrect helm rendering.
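A corrected fragment might look like this sketch, where the parent key and the pre-existing sibling entry are assumptions based on the description above:

```yaml
featureGates:
  - name: RayJobDeletionPolicy   # assumed existing entry, for illustration
    enabled: false
  - name: AsyncJobInfoQuery
    enabled: false
  - name: RayClusterMTLS
    enabled: false
```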


@chipspeak

I've dropped the recent, BYOC-specific material from this branch so what's left should be the auto-generating mTLS work.

@machichima
Collaborator

Thank you! @chipspeak
Could you please also update the PR subject and description as this PR is only focusing on mTLS support now?

| `headServiceAnnotations` _object (keys:string, values:string)_ | | | |
| `enableInTreeAutoscaling` _boolean_ | EnableInTreeAutoscaling indicates whether operator should create in tree autoscaling configs | | |
| `gcsFaultToleranceOptions` _[GcsFaultToleranceOptions](#gcsfaulttoleranceoptions)_ | GcsFaultToleranceOptions for enabling GCS FT | | |
| `tlsOptions` _[TLSOptions](#tlsoptions)_ | TLSOptions specifies optional TLS encryption settings for the RayCluster.<br />If omitted, TLS is disabled. When set, the mode field controls the<br />security level (defaults to "mTLS" for mutual TLS).<br />Requires the RayClusterMTLS feature gate on the operator. | | |
Collaborator

There is a redundant line break.

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `mode` _string_ | Mode controls the TLS security level.<br />- "mTLS": Enables mutual TLS (client & server authentication). | mTLS | Enum: [mTLS] <br /> |
| `certificateSecretName` _string_ | CertificateSecretName is a user-provided Kubernetes Secret containing<br />tls.crt, tls.key, and ca.crt for the head node (and workers, if<br />WorkerCertificateSecretName is not set).<br />When WorkerCertificateSecretName is also set, this secret is mounted only<br />on head pods. When WorkerCertificateSecretName is omitted, this single secret<br />is mounted on both head and worker pods (shared-secret BYOC mode).<br />The certificate SANs must cover the head node identities<br />(head service DNS, pod IPs or wildcards, localhost, 127.0.0.1).<br />When set, the operator skips cert-manager PKI and mounts this secret directly. | | |
Collaborator

There are some redundant line breaks.

| --- | --- | --- | --- |
| `mode` _string_ | Mode controls the TLS security level.<br />- "mTLS": Enables mutual TLS (client & server authentication). | mTLS | Enum: [mTLS] <br /> |
| `certificateSecretName` _string_ | CertificateSecretName is a user-provided Kubernetes Secret containing<br />tls.crt, tls.key, and ca.crt for the head node (and workers, if<br />WorkerCertificateSecretName is not set).<br />When WorkerCertificateSecretName is also set, this secret is mounted only<br />on head pods. When WorkerCertificateSecretName is omitted, this single secret<br />is mounted on both head and worker pods (shared-secret BYOC mode).<br />The certificate SANs must cover the head node identities<br />(head service DNS, pod IPs or wildcards, localhost, 127.0.0.1).<br />When set, the operator skips cert-manager PKI and mounts this secret directly. | | |
| `workerCertificateSecretName` _string_ | WorkerCertificateSecretName is an optional user-provided Kubernetes Secret<br />containing tls.crt, tls.key, and ca.crt for worker nodes.<br />When set, workers use this secret instead of CertificateSecretName, giving<br />head and worker pods separate TLS identities. This prevents a compromised<br />worker key from impersonating the head node at the TLS layer.<br />The certificate SANs must cover worker node identities<br />(worker pod IPs or wildcards). Both secrets must share the same CA. | | |
Collaborator

There are some redundant line breaks.

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-byoc
Collaborator

Should this file be moved to another PR?

