feat(deployment): Primus on local/K8s/SLURM#99
Merged
Conversation
Add first-class Primus launcher support for Kubernetes and SLURM: infer backend and examples overlay from model/config, sanitize Job/Service/ ConfigMap/container names (k8s_names), and extend templates and mixin command generation. Docker: resolve repo-root build context for Primus Dockerfiles; align container run log filenames with build logs when the image is a full registry/repo:tag reference (container_runner_helpers). Docs: launchers and k8s-config examples README. Tests: integration (Docker context), unit coverage for execution helpers, K8s secrets/PVC/names, and consolidated Primus tests (test_primus).
There was a problem hiding this comment.
Pull request overview
Adds first-class Primus launcher support across local, Kubernetes, and SLURM deployments, plus supporting infra for K8s-safe naming and improved Docker context/log handling so Primus images/configs run consistently across environments.
Changes:
- Introduce Primus backend/config inference helpers and integrate Primus launcher command/env generation for Kubernetes and SLURM.
- Add Kubernetes name/label/container sanitization utilities and update K8s templates to use sanitized labels/container names.
- Improve Docker build/run ergonomics for Primus (repo-root build context; normalize run log naming for full image refs) and add corresponding tests/docs.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_primus.py | New unit coverage for Primus backend/overlay inference and K8s Primus command generation |
| tests/unit/test_k8s.py | Adds unit tests for K8s name/label/container sanitization utilities |
| tests/unit/test_execution.py | Extends log naming tests to cover full registry refs mapping to CI tag |
| tests/integration/test_docker_integration.py | Integration test ensuring Primus Docker builds use repo-root context |
| src/madengine/orchestration/run_orchestrator.py | Restores docker_env_vars from manifest context during local run |
| src/madengine/execution/docker_builder.py | Chooses repo-root Docker build context for Primus dockerfiles by hint |
| src/madengine/execution/container_runner_helpers.py | Normalizes docker image refs for run log filename consistency |
| src/madengine/execution/container_runner.py | Adds Primus env passthrough; resolves missing Primus images via fallback; preserves resolved image in results |
| src/madengine/deployment/templates/slurm/job.sh.j2 | Avoids printing MAD_MULTI_NODE_RUNNER for launchers that don’t use it |
| src/madengine/deployment/templates/kubernetes/service.yaml.j2 | Switches model label to sanitized model_label |
| src/madengine/deployment/templates/kubernetes/job.yaml.j2 | Switches model label to model_label; uses main_container_name; includes primus/torchtitan in distributed env block |
| src/madengine/deployment/templates/kubernetes/configmap.yaml.j2 | Switches model label to sanitized model_label |
| src/madengine/deployment/slurm.py | Adds Primus launcher env generation (PRIMUS_CONFIG_PATH/CLI_EXTRA/BACKEND inference) |
| src/madengine/deployment/primus_backend.py | New helper module: merges Primus config; infers BACKEND and examples overlay subdirs |
| src/madengine/deployment/kubernetes_launcher_mixin.py | Adds K8s Primus launcher env/command generation |
| src/madengine/deployment/kubernetes.py | Adds K8s name sanitization; bundles Primus examples overlay into ConfigMap; wires Primus launcher |
| src/madengine/deployment/k8s_names.py | New K8s naming utilities for object names, label values, and container names |
| src/madengine/deployment/common.py | Adds “primus” to valid launcher list |
| examples/k8s-configs/README.md | Documents Primus-on-K8s behavior, caveats, and example configs |
| docs/launchers.md | Adds Primus launcher documentation and updates launcher lists/tables |
Comments suppressed due to low confidence (1)
src/madengine/deployment/templates/kubernetes/service.yaml.j2:12
- service.yaml.j2 uses service_name as metadata.name and job.yaml.j2 uses the same value via pod.spec.subdomain/launcher DNS. Kubernetes Service names must be DNS labels (no dots, <=63). If service_name is derived from job_name that may contain dots (e.g. '1.7b') or exceed 63, the Service will be rejected and multi-node DNS (pod-0....) will fail. Ensure service_name is sanitized as a DNS label and keep subdomain/launcher DNS consistent with that sanitized value.
name: {{ service_name }}
namespace: {{ namespace }}
labels:
app: madengine
model: {{ model_label }}
spec:
clusterIP: None # Headless service for torchrun coordination
selector:
job-name: {{ job_name }}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Add first-class Primus launcher support for Kubernetes and SLURM: infer backend and examples overlay from model/config, sanitize Job/Service/ ConfigMap/container names (k8s_names), and extend templates and mixin command generation.
Docker: resolve repo-root build context for Primus Dockerfiles; align container run log filenames with build logs when the image is a full registry/repo:tag reference (container_runner_helpers).
Docs: launchers and k8s-config examples README.
Tests: integration (Docker context), unit coverage for execution helpers, K8s secrets/PVC/names, and consolidated Primus tests (test_primus).