madengine v2 add Primus launcher support#80

Open
coketaste wants to merge 6 commits into coketaste/refactor-dis from coketaste/refactor-dis-primus

@coketaste
Collaborator

  • madengine: add a "primus" launcher for SLURM and Kubernetes, configured only
    through environment variables (PRIMUS_CONFIG_PATH, PRIMUS_CLI_EXTRA);
    container_runner passes PRIMUS_* variables through.
  • container_runner: optionally fall back to the image
    ci-primus_pretrain_primus.ubuntu.amd for primus_pretrain/* models when the
    model-specific image is missing; preserve the resolved docker_image in
    run_results for perf/reports; create_run_details_dict uses the docker_image
    from run_results.
  • docs: add a Primus section to launchers.md and minimal SLURM/K8s example configs.
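
For reviewers, the fallback-image behavior described above could look roughly like the sketch below. It is an illustration only: `resolve_docker_image`, `record_resolved_image`, and the `available_images` parameter are hypothetical names, not the actual container_runner API. Only the image name, the `primus_pretrain/` prefix, and the `docker_image` key come from this description.

```python
from typing import Optional, Set

# Image name and model prefix are taken from the PR description.
FALLBACK_IMAGE = "ci-primus_pretrain_primus.ubuntu.amd"
PRIMUS_MODEL_PREFIX = "primus_pretrain/"


def resolve_docker_image(
    model_name: str,
    model_image: str,
    available_images: Set[str],
    allow_fallback: bool = True,
) -> Optional[str]:
    """Hypothetical helper: prefer the model-specific image, else the fallback."""
    if model_image in available_images:
        return model_image
    # Assumption: the fallback applies only to primus_pretrain/* models
    # and only when the fallback image itself is available.
    if (
        allow_fallback
        and model_name.startswith(PRIMUS_MODEL_PREFIX)
        and FALLBACK_IMAGE in available_images
    ):
        return FALLBACK_IMAGE
    return None


def record_resolved_image(run_results: dict, image: str) -> dict:
    """Return a copy of run_results that carries the image actually used."""
    updated = dict(run_results)
    updated["docker_image"] = image
    return updated
```

Keeping the resolved name in `run_results` means downstream reporting reads the image that was actually run rather than the one originally requested.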

@coketaste self-assigned this on Mar 6, 2026
@coketaste changed the title from "madengine v2: add Primus launcher support" to "madengine v2 add Primus launcher support" on Mar 6, 2026
@coketaste changed the base branch from main to coketaste/refactor-dis on March 10, 2026 22:32
coketaste and others added 3 commits April 7, 2026 15:13
Resolved conflicts and integrated Primus launcher support into refactored
codebase architecture. The refactor-dis branch introduced cleaner code
organization by extracting common utilities and launcher logic into
dedicated modules.

Conflict Resolutions:
- src/madengine/deployment/common.py: Added "primus" to VALID_LAUNCHERS
- src/madengine/deployment/kubernetes_launcher_mixin.py: Added _generate_primus_command()
- src/madengine/deployment/slurm.py: Removed duplicate functions, now imports from common.py
- src/madengine/deployment/kubernetes.py: Uses KubernetesLauncherMixin, removed duplicates
- src/madengine/execution/container_runner.py: Integrated helper functions while preserving Primus features
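
As a rough illustration of the VALID_LAUNCHERS change, launcher validation might be sketched as follows. This is an assumption about the shape of the code, not a copy of deployment/common.py: `normalize_launcher` is a hypothetical name, and the set contents are the launcher names this commit message mentions.

```python
# Launcher names as listed in this commit message; the real constant in
# deployment/common.py may differ in type or contents.
VALID_LAUNCHERS = frozenset(
    {"torchrun", "vllm", "sglang", "deepspeed", "megatron", "torchtitan", "primus"}
)


def normalize_launcher(name: str) -> str:
    """Hypothetical helper: lowercase, trim, and validate a launcher name."""
    normalized = name.strip().lower()
    if normalized not in VALID_LAUNCHERS:
        raise ValueError(
            f"Unknown launcher {name!r}; expected one of {sorted(VALID_LAUNCHERS)}"
        )
    return normalized
```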

Key Changes:
- Primus launcher fully integrated into refactored architecture
- Maintained Primus-specific features: image resolution, config path, CLI extra args
- Merged environment variables: PRIMUS_CONFIG_PATH, PRIMUS_CLI_EXTRA, TORCH_ELASTIC_RDZV_TIMEOUT
- Used refactored helpers: resolve_run_timeout(), make_run_log_file_path(), _resolve_multiple_results_path()
- Preserved all existing launcher functionality (torchrun, vllm, sglang, deepspeed, megatron, torchtitan)
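
A minimal sketch of how the merged environment variables could be forwarded to a container, assuming a Docker-style `-e KEY=VALUE` flag. The function names are hypothetical and the actual container_runner logic may differ; only the variable names come from this commit message.

```python
from typing import Dict, List, Mapping

# Non-PRIMUS variable named in this commit message.
EXTRA_PASSTHROUGH = frozenset({"TORCH_ELASTIC_RDZV_TIMEOUT"})


def collect_primus_env(environ: Mapping[str, str]) -> Dict[str, str]:
    """Hypothetical helper: pick PRIMUS_* plus the extra pass-through variables."""
    return {
        key: value
        for key, value in environ.items()
        if key.startswith("PRIMUS_") or key in EXTRA_PASSTHROUGH
    }


def docker_env_flags(env: Mapping[str, str]) -> List[str]:
    """Render variables as `-e KEY=VALUE` arguments in a stable (sorted) order."""
    flags: List[str] = []
    for key in sorted(env):
        flags.extend(["-e", f"{key}={env[key]}"])
    return flags
```

In practice the first function would be called with `os.environ`; returning an argument list rather than a single string avoids shell-quoting problems when it is passed to `subprocess.run`.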

Architecture Improvements:
- Common launcher utilities in deployment/common.py
- Kubernetes launcher logic in kubernetes_launcher_mixin.py
- Container runner helpers in execution/container_runner_helpers.py
- Run details utilities in utils/run_details.py and utils/path_utils.py

Validation:
- All Python files compile without syntax errors
- All imports successful
- Launcher normalization works for all launchers including Primus
- Helper functions tested and working
- No conflict markers remaining

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coketaste changed the base branch from coketaste/refactor-dis to main on April 9, 2026 16:40
@coketaste changed the base branch from main to coketaste/refactor-dis on April 9, 2026 16:47