feat: protect status server with network policy by arryllopez · Pull Request #3372 · kubeflow/trainer

arryllopez · 2026-03-22T00:25:25Z

What this PR does / why we need it:
Adds a NetworkPolicy to restrict ingress to the status server (port 10443) to only TrainJob pods. Also adds pod-level labels to runtime YAMLs so the NetworkPolicy can match against them via podSelector.
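For readers without the diff open, a policy of the kind described above might look roughly like the sketch below. The port and the two `trainer.kubeflow.org/...` labels come from this PR's diff excerpts; the policy name, namespace, and manager pod label are assumptions made for illustration and may not match the actual manifest. Whether the PR places `namespaceSelector` and `podSelector` in the same `from` element is also not visible in the excerpts, so that layout is a guess.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: status-server-ingress          # hypothetical name
  namespace: kubeflow-system           # assumed install namespace
spec:
  # Selects the pods the policy protects (the controller manager).
  podSelector:
    matchLabels:
      app.kubernetes.io/name: trainer  # assumed manager pod label
  policyTypes:
  - Ingress
  ingress:
  - from:
    # Both selectors sit in one `from` element, so a peer must match both.
    - namespaceSelector:
        matchLabels:
          trainer.kubeflow.org/trainjob-namespace: "true"
      podSelector:
        matchLabels:
          trainer.kubeflow.org/trainjob-ancestor-step: trainer
    ports:
    - protocol: TCP
      port: 10443                      # status server port
```

Because a NetworkPolicy `podSelector` matches labels on Pods rather than on the Jobs that create them, the label has to appear in the pod template, which is why the runtime YAMLs are touched as well.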

Which issue(s) this PR fixes:
Fixes #3345

Checklist:

Docs included if any changes are user facing

google-oss-prow · 2026-03-22T00:25:32Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions · 2026-03-22T00:25:34Z

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Slack: Join our #kubeflow-trainer Slack channel.
Meetings: Attend the Kubeflow AutoML and Training Working Group bi-weekly meetings.

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copilot

Pull request overview

Adds a Kubernetes NetworkPolicy intended to limit access to the trainer controller’s status server (10443) and updates runtime manifests so TrainJob pods carry labels that the policy can match.

Changes:

Added a NetworkPolicy selecting the manager pods and allowing ingress to port 10443 from labeled TrainJob pods.
Added pod-template labels (trainer.kubeflow.org/trainjob-ancestor-step: trainer) across runtime manifests so TrainJob pods can be selected by podSelector.
Wired the new NetworkPolicy into the manager kustomization.
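The second change in the list can be pictured with an abbreviated runtime manifest. This is a sketch under assumptions: the nesting follows the general ClusterTrainingRuntime layout (a JobSet template containing replicated Jobs), the runtime name, job name, and image are placeholders, and only the label key and value are taken from this PR.

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: example-runtime                # placeholder name
spec:
  template:
    spec:
      replicatedJobs:
      - name: node                     # assumed job name
        template:                      # Job template
          spec:
            template:                  # Pod template
              metadata:
                labels:
                  # Pod-level label, so a NetworkPolicy podSelector can match it.
                  trainer.kubeflow.org/trainjob-ancestor-step: trainer
              spec:
                containers:
                - name: node
                  image: example.com/trainer:latest   # placeholder image
```

A label set only on the Job template metadata would end up on the Job object and not on its Pods, so the policy's `podSelector` would not see it.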

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

File	Description
manifests/base/manager/network_policy.yaml	Introduces ingress restrictions for the status server port.
manifests/base/manager/kustomization.yaml	Includes the new NetworkPolicy in the manager base.
manifests/base/runtimes/torch_distributed.yaml	Adds pod-template label for NetworkPolicy matching.
manifests/base/runtimes/xgboost_distributed.yaml	Adds pod-template label for NetworkPolicy matching.
manifests/base/runtimes/jax_distributed.yaml	Adds pod-template label for NetworkPolicy matching.
manifests/base/runtimes/mlx_distributed.yaml	Adds pod-template label for NetworkPolicy matching.
manifests/base/runtimes/deepspeed_distributed.yaml	Adds pod-template label (where applicable) for NetworkPolicy matching.
manifests/base/runtimes/data-cache/torch_distributed_with_cache.yaml	Adds pod-template label for NetworkPolicy matching.
manifests/base/runtimes/torchtune/qwen2_5/qwen2_5_1.5B.yaml	Adds pod-template label for NetworkPolicy matching.
manifests/base/runtimes/torchtune/llama3_2/llama3_2_3B.yaml	Adds pod-template label for NetworkPolicy matching.
manifests/base/runtimes/torchtune/llama3_2/llama3_2_1B.yaml	Adds pod-template label for NetworkPolicy matching.

Copilot · 2026-03-22T00:28:34Z

manifests/base/manager/network_policy.yaml

+          trainer.kubeflow.org/trainjob-ancestor-step: trainer
+    ports:
+    - protocol: TCP
+      port: 10443


This NetworkPolicy currently allows ingress only to port 10443. Because it selects the controller-manager pods, it will also block required inbound traffic to the webhook (9443, via Service port 443) and metrics (8443), potentially breaking admission and monitoring. Add explicit ingress rules for the other required ports while keeping 10443 restricted.

Suggested change

-      port: 10443
+      port: 10443
+  - ports:
+    - protocol: TCP
+      port: 443
+  - ports:
+    - protocol: TCP
+      port: 8443
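To make the shape of this suggestion concrete, here is a sketch of just the `ingress` list with the status server restricted and the other two ports left open. The port numbers come from the review comment. One assumption is flagged in the comments: NetworkPolicy rules are generally evaluated against the destination port on the pod, so the webhook rule below uses the container port 9443 rather than Service port 443. That should be verified against the actual manager Deployment and the cluster's CNI before relying on it.

```yaml
# Sketch of the `ingress` list only; the rest of the policy is omitted.
ingress:
# Status server: only labeled TrainJob pods in labeled namespaces may connect.
- from:
  - namespaceSelector:
      matchLabels:
        trainer.kubeflow.org/trainjob-namespace: "true"
    podSelector:
      matchLabels:
        trainer.kubeflow.org/trainjob-ancestor-step: trainer
  ports:
  - protocol: TCP
    port: 10443
# Webhook: no `from` clause, so any source is allowed on this port.
- ports:
  - protocol: TCP
    port: 9443    # assumption: pod port behind Service port 443
# Metrics: likewise open to any source.
- ports:
  - protocol: TCP
    port: 8443
```

Once any policy with `policyTypes: Ingress` selects a pod, traffic not matched by some rule is dropped, which is why the extra rules are needed at all.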

Copilot · 2026-03-22T00:28:34Z

manifests/base/manager/network_policy.yaml

+    - namespaceSelector:
+        matchLabels:
+          trainer.kubeflow.org/trainjob-namespace: "true"


The namespaceSelector matches only namespaces labeled trainer.kubeflow.org/trainjob-namespace="true", but that label is not applied anywhere else in this repo. As-is, status updates from TrainJob pods in typical user namespaces will be denied unless users manually label their namespaces. Either ensure this label is automatically applied and documented, or broaden the namespaceSelector (e.g., match all namespaces) and rely on the podSelector alone.

Suggested change

-    - namespaceSelector:
-        matchLabels:
-          trainer.kubeflow.org/trainjob-namespace: "true"
+    - namespaceSelector: {}
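How the broadened selector behaves depends on whether it shares a `from` element with the `podSelector`. The two fragments below illustrate the difference; they are a sketch, not the PR's actual manifest, and the excerpts here do not show which layout the PR uses.

```yaml
# Layout A: one `from` element. A peer must satisfy BOTH selectors,
# i.e. labeled pods in any namespace.
ingress:
- from:
  - namespaceSelector: {}       # empty selector matches every namespace
    podSelector:
      matchLabels:
        trainer.kubeflow.org/trainjob-ancestor-step: trainer
  ports:
  - protocol: TCP
    port: 10443
---
# Layout B: two `from` elements. A peer may satisfy EITHER one, so the
# first element alone admits every pod in every namespace.
ingress:
- from:
  - namespaceSelector: {}
  - podSelector:                # on its own, matches only the policy's namespace
      matchLabels:
        trainer.kubeflow.org/trainjob-ancestor-step: trainer
  ports:
  - protocol: TCP
    port: 10443
```

Layout A keeps the restriction the PR is after while dropping the need for a namespace label. Layout B would effectively open port 10443 to the whole cluster, so the single leading dash matters.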

Copilot AI review requested due to automatic review settings March 22, 2026 00:25

google-oss-prow bot requested review from jinchihe and kuizhiqing March 22, 2026 00:25

google-oss-prow bot added the size/M label Mar 22, 2026

Copilot started reviewing on behalf of arryllopez March 22, 2026 00:25

arryllopez changed the title ~~Fix/status server network policy~~ feat: protect status server with network policy Mar 22, 2026

arryllopez added 2 commits March 21, 2026 20:27

v1: bump labels down to pod level allowing podSelector to match on la…

9416c87

…bels namespaceSelector set to default until decided on how the namespace value is set Signed-off-by: arryllopez <arryllopez7@gmail.com>

tidying yaml structure

68fc131

Signed-off-by: arryllopez <arryllopez7@gmail.com>

Copilot AI reviewed Mar 22, 2026

arryllopez force-pushed the fix/status-server-network-policy branch from f2e62c0 to 68fc131 Compare March 22, 2026 00:29

arryllopez wants to merge 2 commits into kubeflow:master from arryllopez:fix/status-server-network-policy
