
[Feature] Add Network Policy and mTLS via Cert Manager Support to RayCluster#4566

Open
kryanbeane wants to merge 1 commit into ray-project:master from opendatahub-io:enhanced-security-features

Conversation

@kryanbeane
Contributor

Why are these changes needed?

The Ray docs assume that users run their Ray workloads in an environment with a secure, trusted network configuration. This PR adds NetworkPolicy and mTLS support via a RayCluster API change, alongside new reconcilers, tests, and docs updates. It gives users the ability to configure these security features from within each RayCluster.
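For illustration, opting in might look roughly like this (field names are taken from the samples and test YAML later in this thread; the exact API shape was still changing during review, so treat this as a sketch rather than the final schema):

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-secure
spec:
  rayVersion: '2.52.0'
  # Deny all ingress to the Ray pods except what KubeRay's generated
  # NetworkPolicies explicitly allow.
  networkIsolation:
    mode: denyAllIngress
  # An empty tlsOptions requests cert-manager auto-generated mTLS material.
  tlsOptions: {}
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.52.0
```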

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kryanbeane force-pushed the enhanced-security-features branch from c6da256 to 25f359c on March 5, 2026 10:27
@kryanbeane force-pushed the enhanced-security-features branch from 25f359c to 4a63057 on March 5, 2026 10:39
@machichima
Collaborator

Hi @kryanbeane,
Could you fix the CI errors?
Thank you!

@kryanbeane
Contributor Author

@machichima yes I am working on them as we speak!

Member

@andrewsykim left a comment


This PR does not reflect the API proposed in this doc: https://docs.google.com/document/d/1z6safIXUgs7ZaUG-Vt9cFLQ_nryjq3ktAdXUoNRh4G4/edit?usp=sharing

Please update to reflect the changes there.

Contributor

@seanlaii left a comment


Thanks for the great work on this PR!

@@ -50,14 +50,6 @@ rules:
- ""
resources:
- secrets
Contributor


The operator has get/list/delete on all secrets, with no distinction for the mTLS CA secret, which contains the CA private key.
Maybe we should document that namespaces with mTLS RayClusters should restrict who has secret read access, or add labels to the CA/head/worker secrets for future policy-based protection?
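The labeling idea could be as simple as stamping every generated secret with well-known labels so admins can find and guard them with audit or policy tooling; a minimal sketch (label keys are hypothetical, not part of this PR):

```go
package main

import "fmt"

// mtlsSecretLabels returns labels a controller could attach to the CA, head,
// and worker TLS secrets it generates, so cluster admins can locate and
// protect them (e.g. with policy rules keyed on these labels).
func mtlsSecretLabels(clusterName, role string) map[string]string {
	return map[string]string{
		"ray.io/cluster":         clusterName,
		"ray.io/tls-secret-role": role, // "ca", "head", or "worker"
		"ray.io/managed-by":      "kuberay-mtls-controller",
	}
}

func main() {
	labels := mtlsSecretLabels("raycluster-secure", "ca")
	fmt.Println(labels["ray.io/tls-secret-role"]) // prints "ca"
}
```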

Member


Unfortunately I think this will be required because we want to support BYOC, which can reference arbitrary secrets.

@Future-Outlier
Member

@codex review


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e536b72a4c


Contributor

Copilot AI left a comment


Pull request overview

This PR adds optional per-RayCluster security controls to KubeRay by introducing NetworkPolicy-based network isolation and mTLS support (cert-manager auto-generation or BYOC), along with controllers, API/CRD updates, samples, and tests.

Changes:

  • Extend RayCluster (and embedded RayClusterSpec usage in RayJob/RayService/RayCronJob) with networkIsolation, enableMTLS, and mTLSOptions.
  • Add new controllers: NetworkPolicyController and RayClusterMTLSController, plus pod-template injection for TLS env/volume.
  • Add/expand unit + integration/e2e tests and update RBAC, Helm CRDs, and docs/samples.

Reviewed changes

Copilot reviewed 43 out of 45 changed files in this pull request and generated 2 comments.

File Description
ray-operator/test/e2e/raycluster_tls_test.go New e2e coverage for mTLS auto-gen/BYOC and pod-level TLS wiring.
ray-operator/pkg/client/applyconfiguration/utils.go Register applyconfig types for new API structs.
ray-operator/pkg/client/applyconfiguration/ray/v1/rayclusterspec.go Applyconfig updates for new RayClusterSpec fields and builders.
ray-operator/pkg/client/applyconfiguration/ray/v1/networkisolationconfig.go New generated applyconfig type for NetworkIsolationConfig.
ray-operator/pkg/client/applyconfiguration/ray/v1/mtlsoptions.go New generated applyconfig type for MTLSOptions.
ray-operator/main.go Wire cert-manager types into scheme and register new controllers.
ray-operator/go.mod Add cert-manager dependency; bump indirect deps.
ray-operator/go.sum Dependency lock updates for cert-manager + indirect bumps.
ray-operator/controllers/ray/utils/validation_test.go Add validation tests for NetworkIsolation constraints.
ray-operator/controllers/ray/utils/validation.go Validate NetworkIsolation + mTLS option consistency.
ray-operator/controllers/ray/utils/util.go Add helpers for mTLS enablement and TLS secret naming.
ray-operator/controllers/ray/utils/constant.go Add TLS env/volume constants + new event types.
ray-operator/controllers/ray/suite_test.go Register networking API scheme + set up NetworkPolicy controller in envtest.
ray-operator/controllers/ray/raycluster_mtls_controller.go New controller to reconcile cert-manager Issuers/Certificates + secret cleanup finalizer.
ray-operator/controllers/ray/raycluster_mtls_controller_test.go Unit tests for mTLS controller behaviors (auto-gen, BYOC, deletion, SAN updates).
ray-operator/controllers/ray/raycluster_controller.go Block pod creation on TLS secret readiness; detect and recreate pods lacking mTLS config.
ray-operator/controllers/ray/raycluster_controller_unit_test.go Add cert-manager types to schemes used in unit tests.
ray-operator/controllers/ray/networkpolicy_controller.go New controller to create/update/delete head/worker NetworkPolicies from RayCluster spec.
ray-operator/controllers/ray/networkpolicy_controller_unit_test.go Unit tests for NetworkPolicy rule construction and naming.
ray-operator/controllers/ray/networkpolicy_controller_test.go Integration tests verifying NetworkPolicy reconciliation in envtest.
ray-operator/controllers/ray/common/pod.go Inject TLS env/volume into head/worker/init/autoscaler containers when mTLS enabled.
ray-operator/controllers/ray/common/pod_test.go Unit tests for TLS injection behavior and autoscaler/init-container coverage.
ray-operator/config/samples/ray-cluster.network-isolation-monitoring.yaml Sample for network isolation with monitoring scrape allowlist.
ray-operator/config/samples/ray-cluster.network-isolation-deny-ingress.yaml Sample for deny-ingress isolation mode.
ray-operator/config/samples/ray-cluster.network-isolation-deny-all.yaml Sample for deny-all isolation mode.
ray-operator/config/samples/ray-cluster.network-isolation-custom-rules.yaml Sample for custom ingress rules.
ray-operator/config/samples/ray-cluster.network-isolation-custom-ports.yaml Sample demonstrating auto-resolved dashboard/client ports.
ray-operator/config/samples/ray-cluster.network-isolation-complex-rules.yaml Sample with multi-namespace/selector ingress rules.
ray-operator/config/samples/ray-cluster.mtls.yaml Sample enabling cert-manager-backed mTLS.
ray-operator/config/samples/ray-cluster.mtls-byoc.yaml Sample enabling BYOC mTLS using a user-provided secret.
ray-operator/config/rbac/role.yaml Expand RBAC for cert-manager resources + networkpolicies and secret delete.
ray-operator/config/crd/bases/ray.io_rayclusters.yaml CRD schema update for networkIsolation + mTLS fields.
ray-operator/config/crd/bases/ray.io_rayjobs.yaml CRD schema update for embedded RayClusterSpec fields.
ray-operator/config/crd/bases/ray.io_rayservices.yaml CRD schema update for embedded RayClusterSpec fields.
ray-operator/config/crd/bases/ray.io_raycronjobs.yaml CRD schema update for embedded RayClusterSpec fields.
ray-operator/apis/ray/v1/raycluster_types.go API types: NetworkIsolationConfig, MTLSOptions, and new RayClusterSpec fields.
ray-operator/apis/ray/v1/zz_generated.deepcopy.go Generated deepcopy updates for new API fields/types.
helm-chart/kuberay-operator/templates/_helpers.tpl Helm RBAC template updates for cert-manager + networkpolicy permissions.
helm-chart/kuberay-operator/crds/ray.io_rayclusters.yaml Helm CRD bundle update for new fields.
helm-chart/kuberay-operator/crds/ray.io_rayjobs.yaml Helm CRD bundle update for new fields.
helm-chart/kuberay-operator/crds/ray.io_rayservices.yaml Helm CRD bundle update for new fields.
helm-chart/kuberay-operator/crds/ray.io_raycronjobs.yaml Helm CRD bundle update for new fields.
docs/reference/api.md API reference docs for NetworkIsolationConfig + MTLSOptions + RayClusterSpec fields.
go.mod Root module indirect dependency bumps (otel/x/sys).
go.sum Root module dependency lock updates.


@kryanbeane force-pushed the enhanced-security-features branch from dcdb4eb to 5841cc8 on March 9, 2026 11:23
@kryanbeane force-pushed the enhanced-security-features branch from 5841cc8 to 47463cc on March 9, 2026 11:37
// Check if cert-manager API is available before registering the mTLS controller.
// If cert-manager is not installed, the controller's Certificate/Issuer cache would
// never sync and the manager would fail to start (e.g. in E2E environments without cert-manager).
certManagerAvailable := certManagerAPIAvailable(restConfig)
Member


The discovery API check is nice, but I'm wondering if accessing cert-manager APIs should still be gated by a flag in KubeRay that is also plumbed into the Helm charts, i.e. this feature requires an explicit opt-in when running helm install kuberay-operator instead of automatically working whenever cert-manager is available.

Are there any risks to auto-enabling cert-manager, or any reason this shouldn't be on by default?
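The quoted gate presumably reduces to a membership test over the API group/versions that the server's discovery endpoint advertises; a stdlib-only sketch of that core logic (function name hypothetical; the real check would go through client-go's discovery client against restConfig):

```go
package main

import "fmt"

// apiGroupAvailable reports whether a target group/version (e.g.
// "cert-manager.io/v1") appears in the list of group/versions the
// API server's discovery endpoint advertises.
func apiGroupAvailable(advertised []string, target string) bool {
	for _, gv := range advertised {
		if gv == target {
			return true
		}
	}
	return false
}

func main() {
	groups := []string{"ray.io/v1", "networking.k8s.io/v1"}
	// Without cert-manager installed, the mTLS controller is skipped.
	fmt.Println(apiGroupAvailable(groups, "cert-manager.io/v1")) // prints "false"
	groups = append(groups, "cert-manager.io/v1")
	fmt.Println(apiGroupAvailable(groups, "cert-manager.io/v1")) // prints "true"
}
```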

Contributor Author


Setting tlsOptions without cert-manager installed would leave the cluster stuck indefinitely waiting for secrets. I've added a guard that rejects it with a clear warning event in 4ea99a5. BYOC mode (via certificateSecretName) still works without cert-manager since the user manages their own certs. Re the Helm charts: Kubernetes ignores RBAC rules for API groups that don't exist in the cluster AFAIK, so the cert-manager rules should be fine to ship unconditionally. Happy to add a feature gate if you'd still prefer an explicit opt-in.
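The guard described here might look roughly like the following (function signature and behavior are a sketch inferred from this comment, not the actual code in 4ea99a5):

```go
package main

import (
	"errors"
	"fmt"
)

// validateTLSOptions rejects cert-manager-backed mTLS when cert-manager is
// not installed, while still allowing BYOC mode where the user supplies
// their own certificate secret.
func validateTLSOptions(tlsOptionsSet bool, certificateSecretName string, certManagerAvailable bool) error {
	if !tlsOptionsSet {
		return nil // mTLS not requested; nothing to validate
	}
	if certificateSecretName != "" {
		return nil // BYOC: the user manages certs, cert-manager not needed
	}
	if !certManagerAvailable {
		return errors.New("tlsOptions requires cert-manager to be installed, " +
			"or set certificateSecretName to bring your own certificates")
	}
	return nil
}

func main() {
	err := validateTLSOptions(true, "", false)
	fmt.Println(err != nil) // prints "true": rejected with a clear error
}
```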

Member


Yeah, I'm not worried about RBAC breaking, and I don't think we need a feature gate; I was thinking of a config option or flag to toggle it. I'm wondering if a cluster admin would want some sort of opt-in before KubeRay starts managing cert-manager Certificates. As long as Ray clusters don't touch cert-manager certificates by default it should be fine, but I'm curious if others have thoughts on this.

// alpha: v1.6
//
// Enables NetworkPolicy-based network isolation and mTLS for RayClusters.
EnhancedSecurityPrimitives featuregate.Feature = "EnhancedSecurityPrimitives"
Member


I think we should treat mTLS and network policy as two distinct features, which probably requires separate feature gates.
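Splitting the gate could look like this (gate names and defaults are hypothetical, following the featuregate.Feature string-constant pattern shown in the diff; the stdlib-only types here stand in for k8s.io/component-base/featuregate):

```go
package main

import "fmt"

// Feature mirrors featuregate.Feature: a string-typed gate name.
type Feature string

const (
	// alpha: NetworkPolicy-based network isolation for RayClusters.
	NetworkIsolation Feature = "NetworkIsolation"
	// alpha: cert-manager-backed mTLS for RayClusters.
	RayClusterMTLS Feature = "RayClusterMTLS"
)

// defaultGates holds the default state; alpha features ship disabled.
var defaultGates = map[Feature]bool{
	NetworkIsolation: false,
	RayClusterMTLS:   false,
}

func enabled(f Feature) bool { return defaultGates[f] }

func main() {
	fmt.Println(enabled(NetworkIsolation), enabled(RayClusterMTLS)) // prints "false false"
}
```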

Member


However, given this is likely going to miss v1.6, we can re-evaluate for v1.7 whether the feature gates are even required.



Thanks Andrew, I'd imagine just aiming for 1.7 and dropping the feature gate is probably the best approach for us going forward. WDYT?

@machichima
Collaborator

I am testing with the following YAML locally, and I keep getting warnings like "Failed to connect to GCS at address 10.244.0.41:6379" in the head pod; the head pod cannot become ready.

I printed out the events with kubectl get events -n default --sort-by=.lastTimestamp and collected the relevant ones below:

107s        Normal    MTLSPKIReady               raycluster/rayjob-mtls-origin-4ffkg                                mTLS PKI is ready: CA, head, and worker certificates issued
76s         Normal    CreatedHeadPod             raycluster/rayjob-mtls-origin-4ffkg                                Created head Pod default/rayjob-mtls-origin-4ffkg-head-rckkn
75s         Normal    Started                    pod/rayjob-mtls-origin-4ffkg-head-rckkn                            Started container ray-head
75s         Normal    Issuing                    certificate/ray-head-cert-rayjob-mtls-origin-4ffkg                 Fields on existing CertificateRequest resource not up to date: [spec.ipAddresses]
75s         Normal    Requested                  certificate/ray-head-cert-rayjob-mtls-origin-4ffkg                 Created new CertificateRequest resource "ray-head-cert-rayjob-mtls-origin-4ffkg-5"
74s         Normal    CertificateIssued          certificaterequest/ray-head-cert-rayjob-mtls-origin-4ffkg-5        Certificate fetched from issuer successfully
73s         Normal    Issuing                    certificate/ray-head-cert-rayjob-mtls-origin-4ffkg                 The certificate has been successfully issued
35s         Warning   Unhealthy                  pod/rayjob-mtls-origin-4ffkg-head-rckkn                            Liveness probe failed:
5s          Warning   Unhealthy                  pod/rayjob-mtls-origin-4ffkg-head-rckkn                            Readiness probe failed:

The events suggest the following order:

  1. KubeRay reports mTLS PKI ready.
  2. The head pod starts.
  3. The head certificate is re-issued because spec.ipAddresses changes.
  4. The head pod later becomes unhealthy.

Based on this, I think the issue is that the initial head certificate may not yet include the final pod IPs. After the head pod is created, KubeRay updates the certificate's IP SANs and cert-manager re-issues it. Since Ray does not hot-reload TLS material, the head process may continue using the earlier certificate state and fail to connect to GCS over the pod IP.


Test YAML I used

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-mtls-verification
spec:
  entrypoint: python /home/ray/samples/worker_to_worker_tls_test.py
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 60
  rayClusterSpec:
    rayVersion: '2.52.0'
    networkIsolation:
      mode: denyAllIngress
    tlsOptions: {}
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.52.0
            resources:
              limits:
                cpu: "1"
                memory: "2Gi"
              requests:
                cpu: "500m"
                memory: "1Gi"
            volumeMounts:
            - mountPath: /home/ray/samples
              name: tls-test-code
          volumes:
          - name: tls-test-code
            configMap:
              name: rayjob-mtls-verification-code
              items:
              - key: worker_to_worker_tls_test.py
                path: worker_to_worker_tls_test.py
    workerGroupSpecs:
    - replicas: 2
      minReplicas: 2
      maxReplicas: 2
      groupName: small-group
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.52.0
            resources:
              limits:
                cpu: "1"
                memory: "1Gi"
              requests:
                cpu: "500m"
                memory: "1Gi"

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rayjob-mtls-verification-code
data:
  worker_to_worker_tls_test.py: |
    """
    Verify head-to-worker and worker-to-worker communication across distinct Ray pods.

    Usage:
      python worker_to_worker_tls_test.py
    """

    import os
    import socket

    import ray
    from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

    ctx = ray.init()

    print("Dashboard:", ctx.dashboard_url)


    def pin_to(node_id: str) -> NodeAffinitySchedulingStrategy:
        return NodeAffinitySchedulingStrategy(node_id=node_id, soft=False)


    def alive_node_ids():
        return [n["NodeID"] for n in ray.nodes() if n["Alive"]]


    driver = {
        "node_id": ray.get_runtime_context().get_node_id(),
        "hostname": socket.gethostname(),
        "pid": os.getpid(),
    }
    print("Driver:", driver)

    all_nodes = alive_node_ids()
    worker_nodes = [nid for nid in all_nodes if nid != driver["node_id"]]

    if len(worker_nodes) < 2:
        raise RuntimeError(
            f"Need at least 2 worker nodes/pods besides the driver node. "
            f"driver={driver['node_id']} all_nodes={all_nodes}"
        )

    worker_a, worker_b = worker_nodes[:2]
    print(f"Using worker_a={worker_a}")
    print(f"Using worker_b={worker_b}")


    # --- Driver/Head <-> Worker: force task onto worker_a ---

    @ray.remote
    def where_am_i(tag: str):
        return {
            "tag": tag,
            "node_id": ray.get_runtime_context().get_node_id(),
            "hostname": socket.gethostname(),
            "pid": os.getpid(),
        }

    head_to_worker = ray.get(
        where_am_i.options(scheduling_strategy=pin_to(worker_a)).remote("head-to-worker")
    )
    assert head_to_worker["node_id"] == worker_a
    assert head_to_worker["node_id"] != driver["node_id"]

    print("Head -> Worker -> Head:", head_to_worker)
    print("Same node as driver?", head_to_worker["node_id"] == driver["node_id"])
    print("Pinned to worker_a?", head_to_worker["node_id"] == worker_a)


    # --- Worker -> Worker: chained tasks forced across worker_a -> worker_b ---
    @ray.remote
    def double(x):
        return {
            "value": x * 2,
            "node_id": ray.get_runtime_context().get_node_id(),
            "hostname": socket.gethostname(),
            "pid": os.getpid(),
        }


    @ray.remote
    def chain(x, target_node_id):
        here = {
            "node_id": ray.get_runtime_context().get_node_id(),
            "hostname": socket.gethostname(),
            "pid": os.getpid(),
        }
        out = ray.get(
            double.options(scheduling_strategy=pin_to(target_node_id)).remote(x)
        )
        return {
            "chain_node": here,
            "double_node": out,
        }


    chain_result = ray.get(
        chain.options(scheduling_strategy=pin_to(worker_a)).remote(21, worker_b)
    )

    assert chain_result["chain_node"]["node_id"] == worker_a
    assert chain_result["double_node"]["node_id"] == worker_b
    assert chain_result["chain_node"]["node_id"] != chain_result["double_node"]["node_id"]

    print("Worker -> Worker chained:", chain_result)
    print("chain() on worker_a?", chain_result["chain_node"]["node_id"] == worker_a)
    print("double() on worker_b?", chain_result["double_node"]["node_id"] == worker_b)
    print(
        "chain() and double() same node?",
        chain_result["chain_node"]["node_id"] == chain_result["double_node"]["node_id"],
    )


    # --- Worker <-> Worker: actor-to-actor bidirectional across worker_a <-> worker_b ---
    @ray.remote
    class Ping:
        def __init__(self, name):
            self.name = name
            self.node_id = ray.get_runtime_context().get_node_id()
            self.hostname = socket.gethostname()
            self.pid = os.getpid()

        def info(self):
            return {
                "name": self.name,
                "node_id": self.node_id,
                "hostname": self.hostname,
                "pid": self.pid,
            }

        def greet(self, msg, sender):
            return {
                "receiver": self.name,
                "receiver_node": self.node_id,
                "receiver_host": self.hostname,
                "receiver_pid": self.pid,
                "msg": msg,
                "sender": sender,
            }

        def send(self, other, msg):
            sender = self.info()
            response = ray.get(other.greet.remote(msg, self.name))
            return {
                "sender": sender,
                "response": response,
            }


    a = Ping.options(scheduling_strategy=pin_to(worker_a)).remote("A")
    b = Ping.options(scheduling_strategy=pin_to(worker_b)).remote("B")

    a_info, b_info = ray.get([a.info.remote(), b.info.remote()])
    assert a_info["node_id"] == worker_a
    assert b_info["node_id"] == worker_b
    assert a_info["node_id"] != b_info["node_id"]

    print("Actor A:", a_info)
    print("Actor B:", b_info)
    print("A pinned to worker_a?", a_info["node_id"] == worker_a)
    print("B pinned to worker_b?", b_info["node_id"] == worker_b)
    print("A and B same node?", a_info["node_id"] == b_info["node_id"])

    ab = ray.get(a.send.remote(b, "hello"))
    ba = ray.get(b.send.remote(a, "world"))

    assert ab["sender"]["node_id"] != ab["response"]["receiver_node"]
    assert ba["sender"]["node_id"] != ba["response"]["receiver_node"]

    print("A->B:", ab)
    print("B->A:", ba)
    print(
        "A send() and B greet() same node?",
        ab["sender"]["node_id"] == ab["response"]["receiver_node"],
    )
    print(
        "B send() and A greet() same node?",
        ba["sender"]["node_id"] == ba["response"]["receiver_node"],
    )

    print("ALL OK")

@chipspeak


Yeah, I've run into the same and your assessment is accurate. I think when I had been running this, the liveness probe identified the failing pod and restarted it, so the issue was intermittent and less apparent. This becomes a bit of a chicken-and-egg situation, I think. I'll see if I can find a solution and report back!

Comment on lines +384 to +394
if rayNodeType == rayv1.HeadNode {
	rayContainer := &podTemplate.Spec.Containers[utils.RayContainerIndex]
	if !utils.EnvVarExists("MY_POD_IP", rayContainer.Env) {
		rayContainer.Env = append(rayContainer.Env, corev1.EnvVar{
			Name: "MY_POD_IP",
			ValueFrom: &corev1.EnvVarSource{
				FieldRef: &corev1.ObjectFieldSelector{FieldPath: "status.podIP"},
			},
		})
	}
}
Collaborator

I'm wondering why we need to set MY_POD_IP here? Is this used anywhere?

@machichima
Collaborator

Hi @chipspeak

I’m thinking a possible fix would be to gate the head container startup with an init container:

  1. Create the pod first so the pod IP exists.
  2. Wait until the mounted head certificate SAN contains the pod IP.
  3. Once the certificate is ready, exit the init container and allow the ray-head container to start.

This would follow the same general pattern as the existing worker pod wait-gcs-ready init container, which blocks worker startup until GCS is reachable

initContainer := corev1.Container{
	Name:            "wait-gcs-ready",
	Image:           podTemplate.Spec.Containers[utils.RayContainerIndex].Image,
	ImagePullPolicy: podTemplate.Spec.Containers[utils.RayContainerIndex].ImagePullPolicy,
	Command:         utils.GetContainerCommand([]string{}),
	Args: []string{
		fmt.Sprintf(`
SECONDS=0
while true; do
    if (( SECONDS <= 120 )); then
        if ray health-check --address %s:%s > /dev/null 2>&1; then
            echo "GCS is ready."
            break
        fi
        echo "$SECONDS seconds elapsed: Waiting for GCS to be ready."
    else
        if ray health-check --address %s:%s; then
            echo "GCS is ready. Any error messages above can be safely ignored."
            break
        fi
        echo "$SECONDS seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://docs.ray.io/en/master/cluster/kubernetes/troubleshooting.html."
    fi
    sleep 5
done
`, fqdnRayIP, headPort, fqdnRayIP, headPort),
	},

cc @rueian @andrewsykim @Future-Outlier to check if this is a good approach
Thanks!

@chipspeak


Thanks @machichima! We'd been trying to avoid init containers because of the performance hit. I've checked our midstream and we'd solved it there by setting --node-ip-address to the head service FQDN in setMissingRayStartParams when the user hasn't explicitly provided one. This way the head node advertises its stable service address ({cluster}-head-svc.{ns}.svc.cluster.local) instead of the pod IP.
The DNS SANs for that FQDN are already present in the certificate from the moment it's created, so there's no race condition: the cert is valid before the pod even exists. Workers are unaffected since the existing wait-gcs-ready init container provides enough of a delay for the controller to update the worker cert with pod IPs before the main container reads it.
I'll add this to the PR soon, separate from the other review feedback. The change is minimal: just passing fqdnRayIP through to DefaultHeadPodTemplate and adding the node-ip-address default in setMissingRayStartParams for head nodes.

Our midstream PR might give some indication of what this will look like.

// when mTLS is enabled: certs are created before pods exist, so the initial cert only
// contains DNS SANs and 127.0.0.1. Using the FQDN ensures GCS connections match a
// DNS SAN that is present from the start.
if _, ok := rayStartParams["node-ip-address"]; !ok && fqdnRayIP != "" {
	rayStartParams["node-ip-address"] = fqdnRayIP
}
Collaborator

@machichima Mar 19, 2026

If the user sets node-ip-address in rayStartParams themselves, they may still hit the issue. We will need to add this to the documentation.


Absolutely. I think there'll be a lot of this to be documented in any case!

@machichima
Collaborator

I can run the RayJob successfully with the newest changes. Thank you for the quick update!

Tested with the following script to ensure cross-node communication. It looks good for the following three paths:

  1. head -> worker
  2. worker -> worker (chained task, one direction)
  3. worker <-> worker (actor-to-actor, bidirectional)

YAML:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-mtls-verification
spec:
  entrypoint: python /home/ray/samples/worker_to_worker_tls_test.py
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 60
  rayClusterSpec:
    rayVersion: '2.52.0'
    networkIsolation:
      mode: denyAllIngress
    tlsOptions: {}
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.52.0
            resources:
              limits:
                cpu: "1"
                memory: "2Gi"
              requests:
                cpu: "500m"
                memory: "1Gi"
            volumeMounts:
            - mountPath: /home/ray/samples
              name: tls-test-code
          volumes:
          - name: tls-test-code
            configMap:
              name: rayjob-mtls-verification-code
              items:
              - key: worker_to_worker_tls_test.py
                path: worker_to_worker_tls_test.py
    workerGroupSpecs:
    - replicas: 2
      minReplicas: 2
      maxReplicas: 2
      groupName: small-group
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.52.0
            resources:
              limits:
                cpu: "1"
                memory: "1Gi"
              requests:
                cpu: "500m"
                memory: "1Gi"

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rayjob-mtls-verification-code
data:
  worker_to_worker_tls_test.py: |
    """
    Verify head-to-worker and worker-to-worker communication across distinct Ray pods.

    Usage:
      python worker_to_worker_tls_test.py
    """

    import os
    import socket

    import ray
    from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

    ctx = ray.init()

    print("Dashboard:", ctx.dashboard_url)


    def pin_to(node_id: str) -> NodeAffinitySchedulingStrategy:
        return NodeAffinitySchedulingStrategy(node_id=node_id, soft=False)


    def alive_node_ids():
        return [n["NodeID"] for n in ray.nodes() if n["Alive"]]


    driver = {
        "node_id": ray.get_runtime_context().get_node_id(),
        "hostname": socket.gethostname(),
        "pid": os.getpid(),
    }
    print("Driver:", driver)

    all_nodes = alive_node_ids()
    worker_nodes = [nid for nid in all_nodes if nid != driver["node_id"]]

    if len(worker_nodes) < 2:
        raise RuntimeError(
            f"Need at least 2 worker nodes/pods besides the driver node. "
            f"driver={driver['node_id']} all_nodes={all_nodes}"
        )

    worker_a, worker_b = worker_nodes[:2]
    print(f"Using worker_a={worker_a}")
    print(f"Using worker_b={worker_b}")


    # --- Driver/Head <-> Worker: force task onto worker_a ---

    @ray.remote
    def where_am_i(tag: str):
        return {
            "tag": tag,
            "node_id": ray.get_runtime_context().get_node_id(),
            "hostname": socket.gethostname(),
            "pid": os.getpid(),
        }

    head_to_worker = ray.get(
        where_am_i.options(scheduling_strategy=pin_to(worker_a)).remote("head-to-worker")
    )
    assert head_to_worker["node_id"] == worker_a
    assert head_to_worker["node_id"] != driver["node_id"]

    print("Head -> Worker -> Head:", head_to_worker)
    print("Same node as driver?", head_to_worker["node_id"] == driver["node_id"])
    print("Pinned to worker_a?", head_to_worker["node_id"] == worker_a)


    # --- Worker -> Worker: chained tasks forced across worker_a -> worker_b ---
    @ray.remote
    def double(x):
        return {
            "value": x * 2,
            "node_id": ray.get_runtime_context().get_node_id(),
            "hostname": socket.gethostname(),
            "pid": os.getpid(),
        }


    @ray.remote
    def chain(x, target_node_id):
        here = {
            "node_id": ray.get_runtime_context().get_node_id(),
            "hostname": socket.gethostname(),
            "pid": os.getpid(),
        }
        out = ray.get(
            double.options(scheduling_strategy=pin_to(target_node_id)).remote(x)
        )
        return {
            "chain_node": here,
            "double_node": out,
        }


    chain_result = ray.get(
        chain.options(scheduling_strategy=pin_to(worker_a)).remote(21, worker_b)
    )

    assert chain_result["chain_node"]["node_id"] == worker_a
    assert chain_result["double_node"]["node_id"] == worker_b
    assert chain_result["chain_node"]["node_id"] != chain_result["double_node"]["node_id"]

    print("Worker -> Worker chained:", chain_result)
    print("chain() on worker_a?", chain_result["chain_node"]["node_id"] == worker_a)
    print("double() on worker_b?", chain_result["double_node"]["node_id"] == worker_b)
    print(
        "chain() and double() same node?",
        chain_result["chain_node"]["node_id"] == chain_result["double_node"]["node_id"],
    )


    # --- Worker <-> Worker: actor-to-actor bidirectional across worker_a <-> worker_b ---
    @ray.remote
    class Ping:
        def __init__(self, name):
            self.name = name
            self.node_id = ray.get_runtime_context().get_node_id()
            self.hostname = socket.gethostname()
            self.pid = os.getpid()

        def info(self):
            return {
                "name": self.name,
                "node_id": self.node_id,
                "hostname": self.hostname,
                "pid": self.pid,
            }

        def greet(self, msg, sender):
            return {
                "receiver": self.name,
                "receiver_node": self.node_id,
                "receiver_host": self.hostname,
                "receiver_pid": self.pid,
                "msg": msg,
                "sender": sender,
            }

        def send(self, other, msg):
            sender = self.info()
            response = ray.get(other.greet.remote(msg, self.name))
            return {
                "sender": sender,
                "response": response,
            }


    a = Ping.options(scheduling_strategy=pin_to(worker_a)).remote("A")
    b = Ping.options(scheduling_strategy=pin_to(worker_b)).remote("B")

    a_info, b_info = ray.get([a.info.remote(), b.info.remote()])
    assert a_info["node_id"] == worker_a
    assert b_info["node_id"] == worker_b
    assert a_info["node_id"] != b_info["node_id"]

    print("Actor A:", a_info)
    print("Actor B:", b_info)
    print("A pinned to worker_a?", a_info["node_id"] == worker_a)
    print("B pinned to worker_b?", b_info["node_id"] == worker_b)
    print("A and B same node?", a_info["node_id"] == b_info["node_id"])

    ab = ray.get(a.send.remote(b, "hello"))
    ba = ray.get(b.send.remote(a, "world"))

    assert ab["sender"]["node_id"] != ab["response"]["receiver_node"]
    assert ba["sender"]["node_id"] != ba["response"]["receiver_node"]

    print("A->B:", ab)
    print("B->A:", ba)
    print(
        "A send() and B greet() same node?",
        ab["sender"]["node_id"] == ab["response"]["receiver_node"],
    )
    print(
        "B send() and A greet() same node?",
        ba["sender"]["node_id"] == ba["response"]["receiver_node"],
    )

    print("ALL OK")

| `enableInTreeAutoscaling` _boolean_ | EnableInTreeAutoscaling indicates whether operator should create in tree autoscaling configs | | |
| `gcsFaultToleranceOptions` _[GcsFaultToleranceOptions](#gcsfaulttoleranceoptions)_ | GcsFaultToleranceOptions for enabling GCS FT | | |
| `networkIsolation` _[NetworkIsolationConfig](#networkisolationconfig)_ | NetworkIsolation specifies optional configuration for network isolation.<br />When set, NetworkPolicies will be created to control traffic to/from Ray pods.<br />The reconciler always ensures intra-cluster and KubeRay operator communication is permitted. | | |
| `tlsOptions` _[TLSOptions](#tlsoptions)_ | TLSOptions specifies optional TLS encryption settings for the RayCluster.<br />If omitted, TLS is disabled. When set, the mode field controls the<br />security level (defaults to "mTLS" for mutual TLS). | | |
Collaborator

It seems there is a redundant line break.


And there are some in this file.


I'll take a look thanks!


Addressed in the latest commit.

@fscnick
Collaborator

fscnick commented Mar 19, 2026

I'm not sure if this has been discussed previously: why does the controller create a self-signed CA certificate per RayCluster instead of using a single self-signed CA per controller to sign the head and worker certificates?

@chipspeak

chipspeak commented Mar 19, 2026

Purely a "belt and braces" approach. This way, if the CA were compromised, it would have a fairly small blast radius (just the RayCluster) rather than the larger scope of a controller-level CA.

Comment on lines +1441 to +1445
headFQDN := ""
if utils.IsTLSEnabled(&instance.Spec) {
	headFQDN = fqdnRayIP
}
podConf := common.DefaultHeadPodTemplate(ctx, instance, instance.Spec.HeadGroupSpec, podName, headPort, headFQDN)
Collaborator

I wonder how a user would apply a RayJob with BYOC. How would the certificate be signed? The RayJob creates the corresponding RayCluster with a hash suffix that is hard to predict, so the name of the service is hard to predict too.

I tried to overwrite node-ip-address with another, predictable service name, but it didn't work: that service would only be created later.

Or, did I miss something?

@laurafitzgerald Mar 20, 2026

@fscnick can I ask what you mean by BYOC? Bring your own cert? The proposal and implementation rely on certificates created via cert-manager, which is handled inside the controller, where the names of the cluster and service are known.


I think this is already an existing limitation: essentially the same issue already exists for IP SANs, just for DNS instead.

Re: node-ip-address, that's consumed by Ray at startup time. If it's set to a service that doesn't exist, Ray can't resolve the DNS and should fail. The service will get created by KubeRay after the fact, but that's after the failure. Our solution gets around this by relying on the head service, which is created prior to the head pod, ensuring that by the time the pod comes up, the service exists and node-ip-address can be resolved.

Collaborator

@laurafitzgerald Yeah, please reference this ray-cluster.mtls-byoc.yaml.

Take the above RayJob rayjob-mtls-verification as an example. It will create a RayCluster rayjob-mtls-verification-pc79r, suffixed with a truncated hash. As a result, it also creates the following services:

rayjob-mtls-verification-head-svc         ClusterIP   None            <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   5s
rayjob-mtls-verification-pc79r-head-svc   ClusterIP   None            <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   28s

In the current solution, it leverages --node-ip-address=rayjob-mtls-verification-pc79r-head-svc.default.svc.cluster.local to make it work.

In BYOC mode, the DNS names need to be in the SANs of the certificate. However, the hash suffix of the corresponding RayCluster is unknown before it is created.

Additionally, using rayjob-mtls-verification-head-svc didn't work, as mentioned by @chipspeak: it is created by the RayCluster, which happens after the failure.
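One way to sidestep the unpredictable hash suffix in BYOC mode could be a wildcard DNS SAN covering every head service in the namespace. A minimal cert-manager sketch, where all names (secret, issuer, namespace) are illustrative assumptions rather than anything from this PR:

```yaml
# Hypothetical cert-manager Certificate for BYOC: the wildcard DNS SAN
# matches any <cluster>-head-svc.default.svc.cluster.local, so the
# RayCluster hash suffix does not need to be known in advance.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ray-byoc-cert
spec:
  secretName: ray-byoc-tls        # would be referenced via certificateSecretName
  dnsNames:
    - "*.default.svc.cluster.local"
    - localhost
  ipAddresses:
    - 127.0.0.1
  issuerRef:
    name: my-ca-issuer            # assumed pre-existing Issuer
    kind: Issuer
```

Whether this satisfies SAN checks for worker pod IPs is a separate question; the sketch only addresses the head service DNS name.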

@machichima
Collaborator

Hi @chipspeak @laurafitzgerald,

Sorry if I missed any prior discussion. I was wondering if we could split this large PR into two: one for network isolation and the other for mTLS. Smaller PRs are preferable in OSS, as they are much easier to review compared to large ones.

For this PR, having to consider both network isolation and mTLS at the same time will make the review process much longer and harder to discuss.

Thank you!

cc @rueian @Future-Outlier @andrewsykim

@machichima
Collaborator

Want to check whether you have considered the following scenario:

  1. The head and worker Pods are already running with a correct certificate.
  2. A worker Pod is then deleted and recreated, or a new worker Pod is added during autoscaling.
  3. The new worker Pod may get mounted with the existing worker Secret before the mTLS controller updates the worker certificate SAN list.

Since the mTLS controller appears to reconcile periodically (30s / 1m), there may be a window where the new worker starts with a certificate whose SANs do not yet include its current Pod IP.

Have you tested this case explicitly? If so, could you share how this race is avoided?
Thank you!

if instance.DeletionTimestamp != nil && !instance.DeletionTimestamp.IsZero() {
	if controllerutil.ContainsFinalizer(instance, utils.MTLSCleanupFinalizer) {
		// Only delete secrets in auto-generate mode; BYOC secrets belong to the user.
		if !utils.IsTLSBYOC(&instance.Spec) {
Contributor

If a cluster starts in auto-generate mode (secrets created, finalizer added), then the user switches to BYOC via kubectl apply and later runs kubectl delete raycluster, would the CA/head/worker secrets be permanently orphaned in the namespace, since IsTLSBYOC is true and deleteTLSSecrets() is skipped?

sort.Strings(desiredDNSNames)
desiredIPAddresses := normalizeIPs(podIPs)

// Known limitation: Ray reads TLS files once at startup and does not hot-reload them.
Contributor

Should we document this behavior? The known limitation is acknowledged in code comments but does not appear in any docs (docs/reference/api.md). Otherwise users might assume certificates are renewed end-to-end automatically without pod restarts.


Absolutely. I just have it in the comments for now. I wanted to wait until everything was locked in before updating the documentation, as I wanted to find out your preference in terms of docs.

@laurafitzgerald

Hi team. We see a request to split out the PR again. We will take a look at that. We will need to split and rename the feature gate, as there is currently one controlling both features. I'll make a stab at names for those, but they can be changed.

On a separate note, and apologies if this was discussed in threads, but I see a lot of discussion around BYOC. In our original proposal we did not consider BYOC as in scope for this change, as the idea is that a user can switch this on and have all the work done for them. So I have a few questions about BYOC.

What use case might a user have to bring their own cert but then also want to rely on the automated mTLS config that's being added?

If there is a valid use case, could we ask that a design for it is added to the existing or a new proposal doc and agreed before we start making changes for that use case as part of this PR? Or is there potential to document it as out of scope for now to reduce the scope of the changes for this PR, i.e. cert-manager-dependent mTLS configuration with no BYOC.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.


// and recreated with the same name, it gets a new CA secret rather than reusing a
// potentially stale one from a previous instance.
func GetCASecretName(clusterName string, clusterUID types.UID) string {
	uidSuffix := string(clusterUID)[:8]

Potential panic slicing UID shorter than eight characters

Low Severity

GetCASecretName slices clusterUID with [:8] without a length guard. If the UID is empty or shorter than 8 characters (possible in unit tests or edge cases), this panics with an index-out-of-range error. Real Kubernetes UIDs are 36 characters, but the function has no defensive check.
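A defensive variant with a bounds guard might look like the following self-contained sketch. The UID type is declared locally so the snippet compiles on its own, and the "<cluster>-ca-cert-<uid8>" secret-name format is an illustrative assumption, not the PR's actual format:

```go
package main

import "fmt"

// UID mirrors k8s.io/apimachinery/pkg/types.UID (a string alias),
// declared locally so this sketch compiles on its own.
type UID string

// GetCASecretName builds a per-cluster CA secret name from the first eight
// characters of the cluster UID. The length guard means short or empty UIDs
// (e.g. synthetic UIDs in unit tests) cannot trigger an out-of-range panic.
// The "<cluster>-ca-cert-<uid8>" format is assumed for illustration.
func GetCASecretName(clusterName string, clusterUID UID) string {
	uidSuffix := string(clusterUID)
	if len(uidSuffix) > 8 {
		uidSuffix = uidSuffix[:8]
	}
	return fmt.Sprintf("%s-ca-cert-%s", clusterName, uidSuffix)
}

func main() {
	fmt.Println(GetCASecretName("demo", "0123456789abcdef")) // demo-ca-cert-01234567
	fmt.Println(GetCASecretName("demo", ""))                 // demo-ca-cert- (no panic)
}
```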


@chipspeak

networkPolicy PR can be found here
cc. @rueian @Future-Outlier @andrewsykim @machichima @kevin85421 @MortalHappiness @fscnick

@laurafitzgerald laurafitzgerald force-pushed the enhanced-security-features branch 2 times, most recently from 3bcc39d to dd43b59 Compare March 23, 2026 16:41
@fscnick
Collaborator

fscnick commented Mar 28, 2026


Hi @laurafitzgerald, thanks for your reply. It seems the design doc has mentioned BYOC; thanks to @AndySung320 for the support.

SecretName
The name of the secret to be used for the RayCluster, if the user chooses to bring their own certificates

After discussing with the team offline, would you mind splitting the BYOC-related part into another PR? It would help us focus on auto-generated certificates in this PR.

@chipspeak chipspeak force-pushed the enhanced-security-features branch from dd43b59 to 829d8a1 Compare March 30, 2026 08:31
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.


- list
- patch
- update
- watch

Helm template has duplicate networking.k8s.io ingresses RBAC entry

Medium Severity

The helm template now has two separate RBAC rules granting permissions on ingresses: one under the extensions apiGroup (lines 260-271, which previously also included networking.k8s.io) and a new duplicate entry under networking.k8s.io (lines 291-302). The original combined extensions + networking.k8s.io entry was split, but networking.k8s.io was removed from the first entry without removing the now-redundant second copy. Meanwhile, the generated role.yaml correctly keeps them combined (lines 117-129). This divergence between the helm chart and kustomize RBAC definitions means the helm-deployed operator gets a redundant rule, and future RBAC changes may drift.

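For reference, a single combined rule would mirror the generated role.yaml and avoid the redundant copy. A sketch, where the verb list is assumed from the surrounding RBAC entries:

```yaml
# One rule covering both API groups, matching the generated role.yaml,
# so the helm chart does not carry a duplicate networking.k8s.io entry.
- apiGroups:
    - extensions
    - networking.k8s.io
  resources:
    - ingresses
  verbs:
    - create
    - delete
    - get
    - list
    - patch
    - update
    - watch
```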

// and recreated with the same name, it gets a new CA secret rather than reusing a
// potentially stale one from a previous instance.
func GetCASecretName(clusterName string, clusterUID types.UID) string {
	uidSuffix := string(clusterUID)[:8]

GetCASecretName panics if UID shorter than 8 chars

Low Severity

GetCASecretName slices clusterUID with [:8] without a length guard. If the UID string is ever shorter than 8 characters (e.g., in tests with synthetic UIDs or unexpected states), this causes an out-of-bounds panic. While Kubernetes UIDs are typically 36 characters, a defensive bounds check would prevent a runtime crash.


- name: AsyncJobInfoQuery
enabled: false
- name: RayClusterMTLS
enabled: false

Helm values.yaml feature gate list indentation is wrong

Medium Severity

The new feature gate entries AsyncJobInfoQuery and RayClusterMTLS appear to be at the wrong indentation level compared to the preceding list items (lines 135-144). The existing entries use - name: with consistent indentation under a parent key, but the new entries lack the leading indentation, which would place them outside the feature gates list and cause incorrect helm rendering.
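A corrected fragment might look like this sketch, where the parent key and the pre-existing sibling entry are assumptions based on the description above:

```yaml
featureGates:
  - name: RayJobDeletionPolicy   # assumed existing entry, for illustration
    enabled: false
  - name: AsyncJobInfoQuery
    enabled: false
  - name: RayClusterMTLS
    enabled: false
```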


@chipspeak

I've dropped the recent, BYOC-specific material from this branch so what's left should be the auto-generating mTLS work.

@machichima
Collaborator

Thank you! @chipspeak
Could you please also update the PR subject and description as this PR is only focusing on mTLS support now?

| `headServiceAnnotations` _object (keys:string, values:string)_ | | | |
| `enableInTreeAutoscaling` _boolean_ | EnableInTreeAutoscaling indicates whether operator should create in tree autoscaling configs | | |
| `gcsFaultToleranceOptions` _[GcsFaultToleranceOptions](#gcsfaulttoleranceoptions)_ | GcsFaultToleranceOptions for enabling GCS FT | | |
| `tlsOptions` _[TLSOptions](#tlsoptions)_ | TLSOptions specifies optional TLS encryption settings for the RayCluster.<br />If omitted, TLS is disabled. When set, the mode field controls the<br />security level (defaults to "mTLS" for mutual TLS).<br />Requires the RayClusterMTLS feature gate on the operator. | | |
Collaborator

There is a redundant line break.

| Field | Description | Default | Validation |
| --- | --- | --- | --- |
| `mode` _string_ | Mode controls the TLS security level.<br />- "mTLS": Enables mutual TLS (client & server authentication). | mTLS | Enum: [mTLS] <br /> |
| `certificateSecretName` _string_ | CertificateSecretName is a user-provided Kubernetes Secret containing<br />tls.crt, tls.key, and ca.crt for the head node (and workers, if<br />WorkerCertificateSecretName is not set).<br />When WorkerCertificateSecretName is also set, this secret is mounted only<br />on head pods. When WorkerCertificateSecretName is omitted, this single secret<br />is mounted on both head and worker pods (shared-secret BYOC mode).<br />The certificate SANs must cover the head node identities<br />(head service DNS, pod IPs or wildcards, localhost, 127.0.0.1).<br />When set, the operator skips cert-manager PKI and mounts this secret directly. | | |
Collaborator

There are some redundant line breaks.

| --- | --- | --- | --- |
| `mode` _string_ | Mode controls the TLS security level.<br />- "mTLS": Enables mutual TLS (client & server authentication). | mTLS | Enum: [mTLS] <br /> |
| `certificateSecretName` _string_ | CertificateSecretName is a user-provided Kubernetes Secret containing<br />tls.crt, tls.key, and ca.crt for the head node (and workers, if<br />WorkerCertificateSecretName is not set).<br />When WorkerCertificateSecretName is also set, this secret is mounted only<br />on head pods. When WorkerCertificateSecretName is omitted, this single secret<br />is mounted on both head and worker pods (shared-secret BYOC mode).<br />The certificate SANs must cover the head node identities<br />(head service DNS, pod IPs or wildcards, localhost, 127.0.0.1).<br />When set, the operator skips cert-manager PKI and mounts this secret directly. | | |
| `workerCertificateSecretName` _string_ | WorkerCertificateSecretName is an optional user-provided Kubernetes Secret<br />containing tls.crt, tls.key, and ca.crt for worker nodes.<br />When set, workers use this secret instead of CertificateSecretName, giving<br />head and worker pods separate TLS identities. This prevents a compromised<br />worker key from impersonating the head node at the TLS layer.<br />The certificate SANs must cover worker node identities<br />(worker pod IPs or wildcards). Both secrets must share the same CA. | | |
Collaborator

There are some redundant line breaks.

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-byoc
Collaborator

Should this file be moved to another PR?

