feat: protect status server with network policy#3372
feat: protect status server with network policy#3372arryllopez wants to merge 2 commits intokubeflow:masterfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification! |
…bels namespaceSelector set to default until decided on how the namespace value is set Signed-off-by: arryllopez <arryllopez7@gmail.com>
Signed-off-by: arryllopez <arryllopez7@gmail.com>
There was a problem hiding this comment.
Pull request overview
Adds a Kubernetes NetworkPolicy intended to limit access to the trainer controller’s status server (10443) and updates runtime manifests so TrainJob pods carry labels that the policy can match.
Changes:
- Added a
NetworkPolicyselecting the manager pods and allowing ingress to port10443from labeled TrainJob pods. - Added pod-template labels (
trainer.kubeflow.org/trainjob-ancestor-step: trainer) across runtime manifests so TrainJob pods can be selected bypodSelector. - Wired the new
NetworkPolicyinto the manager kustomization.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| manifests/base/manager/network_policy.yaml | Introduces ingress restrictions for the status server port. |
| manifests/base/manager/kustomization.yaml | Includes the new NetworkPolicy in the manager base. |
| manifests/base/runtimes/torch_distributed.yaml | Adds pod-template label for NetworkPolicy matching. |
| manifests/base/runtimes/xgboost_distributed.yaml | Adds pod-template label for NetworkPolicy matching. |
| manifests/base/runtimes/jax_distributed.yaml | Adds pod-template label for NetworkPolicy matching. |
| manifests/base/runtimes/mlx_distributed.yaml | Adds pod-template label for NetworkPolicy matching. |
| manifests/base/runtimes/deepspeed_distributed.yaml | Adds pod-template label (where applicable) for NetworkPolicy matching. |
| manifests/base/runtimes/data-cache/torch_distributed_with_cache.yaml | Adds pod-template label for NetworkPolicy matching. |
| manifests/base/runtimes/torchtune/qwen2_5/qwen2_5_1.5B.yaml | Adds pod-template label for NetworkPolicy matching. |
| manifests/base/runtimes/torchtune/llama3_2/llama3_2_3B.yaml | Adds pod-template label for NetworkPolicy matching. |
| manifests/base/runtimes/torchtune/llama3_2/llama3_2_1B.yaml | Adds pod-template label for NetworkPolicy matching. |
| trainer.kubeflow.org/trainjob-ancestor-step: trainer | ||
| ports: | ||
| - protocol: TCP | ||
| port: 10443 |
There was a problem hiding this comment.
This NetworkPolicy currently allows ingress only to port 10443; since it selects the controller-manager pods, it will also block required inbound traffic to the webhook (9443 via Service port 443) and metrics (8443), potentially breaking admission and monitoring—add explicit ingress rules for the other required ports while keeping 10443 restricted.
| port: 10443 | |
| port: 10443 | |
| - ports: | |
| - protocol: TCP | |
| port: 443 | |
| - ports: | |
| - protocol: TCP | |
| port: 8443 |
| - namespaceSelector: | ||
| matchLabels: | ||
| trainer.kubeflow.org/trainjob-namespace: "true" |
There was a problem hiding this comment.
The namespaceSelector matches only namespaces labeled trainer.kubeflow.org/trainjob-namespace="true", but that label is not applied anywhere else in this repo; as-is, status updates from TrainJob pods in typical user namespaces will be denied unless users manually label their namespaces—either ensure this label is automatically applied/documented, or broaden the namespaceSelector (e.g., match all namespaces) and rely on the podSelector alone.
| - namespaceSelector: | |
| matchLabels: | |
| trainer.kubeflow.org/trainjob-namespace: "true" | |
| - namespaceSelector: {} |
f2e62c0 to
68fc131
Compare
What this PR does / why we need it:
Adds a NetworkPolicy to restrict ingress to the status server (port 10443) to only TrainJob pods. Also adds pod-level labels to runtime YAMLs so the NetworkPolicy can match against them via podSelector.
Which issue(s) this PR fixes:
Fixes #3345
Checklist: