Add Primus pretrain integration (primus_pretrain)#146
Add Primus pretrain integration (primus_pretrain)#146coketaste wants to merge 2 commits intoROCm:developfrom
Conversation
- Add scripts/Primus submodule (AMD-AGI/Primus) and docker/primus.ubuntu.amd.Dockerfile - Add scripts/primus_pretrain (run.sh, get_models_json.py, extract_primus_perf.py) - Register primus_pretrain in models.json before existing primus_pyt_* entries - Un-ignore tracked perf manifests and models.json in .gitignore run.sh: resolve local submodule via ../Primus (sibling of primus_pretrain under scripts/).
There was a problem hiding this comment.
Pull request overview
Integrates Primus pretraining into the MAD/madengine workflow by adding a dedicated runner script bundle, a new Primus-based Docker image, and registering a new primus_pretrain model entry so Primus pretrain jobs can be launched consistently across environments.
Changes:
- Added
scripts/primus_pretrain/wrappers to run Primus pretrain and emitmultiple_resultsperf CSV. - Added a new
docker/primus.ubuntu.amd.Dockerfileand registeredprimus_pretraininmodels.json. - Added
scripts/Primusas a git submodule and adjusted.gitignoreto track key perf/model manifests.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
scripts/primus_pretrain/run.sh |
Wrapper to resolve Primus root/config, run pretrain, and post-process perf metrics. |
scripts/primus_pretrain/get_models_json.py |
Optional discovery of Primus configs as runnable models via globbing. |
scripts/primus_pretrain/extract_primus_perf.py |
Parses training logs and writes a multiple_results CSV. |
models.json |
Registers primus_pretrain ahead of existing Primus PyTorch training entries. |
docker/primus.ubuntu.amd.Dockerfile |
Defines a Primus image intended to bake in the Primus submodule. |
.gitmodules |
Adds the scripts/Primus submodule definition. |
.gitignore |
Stops ignoring tracked perf manifests and models.json. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Pass --job.dump_folder so Torchtitan writes checkpoints to RUN_DIR/outputs (not scripts/Primus/outputs). | ||
| cd "$PRIMUS_ROOT" && bash "$PRIMUS_ROOT/examples/run_pretrain.sh" "$@" --job.dump_folder "$RUN_DIR/outputs" | ||
| exitcode=$? |
There was a problem hiding this comment.
Because the script uses set -e, if examples/run_pretrain.sh exits non-zero the wrapper will terminate immediately and never reach exitcode=$? or the perf extraction. If you want to always capture the exit code and run the extractor, temporarily disable -e around the training invocation (or otherwise guard the command) and then re-enable it as needed.
| # Pass --job.dump_folder so Torchtitan writes checkpoints to RUN_DIR/outputs (not scripts/Primus/outputs). | |
| cd "$PRIMUS_ROOT" && bash "$PRIMUS_ROOT/examples/run_pretrain.sh" "$@" --job.dump_folder "$RUN_DIR/outputs" | |
| exitcode=$? | |
| # Pass --job.dump_folder so Torchtitan writes checkpoints to RUN_DIR/outputs (not scripts/Primus/outputs). | |
| set +e | |
| cd "$PRIMUS_ROOT" && bash "$PRIMUS_ROOT/examples/run_pretrain.sh" "$@" --job.dump_folder "$RUN_DIR/outputs" | |
| exitcode=$? | |
| set -e |
| # Pass --job.dump_folder so Torchtitan writes checkpoints to RUN_DIR/outputs (not scripts/Primus/outputs). | ||
| cd "$PRIMUS_ROOT" && bash "$PRIMUS_ROOT/examples/run_pretrain.sh" "$@" --job.dump_folder "$RUN_DIR/outputs" | ||
| exitcode=$? | ||
| # Extract tps/tflops/mfu from training log into primus_perf_output.csv (one row: model, performance, metric, tflops, model_flops_utilization) |
There was a problem hiding this comment.
This comment describes the output as a single row with extra columns ("tflops, model_flops_utilization"), but extract_primus_perf.py writes a 3-column CSV with one row per metric (model,performance,metric). Please update the comment to match the actual CSV schema so consumers (madengine multiple_results) aren’t misled.
| # Extract tps/tflops/mfu from training log into primus_perf_output.csv (one row: model, performance, metric, tflops, model_flops_utilization) | |
| # Extract tps/tflops/mfu from training log into primus_perf_output.csv (CSV rows: model, performance, metric; one row per metric) |
| models = [] | ||
| if not os.path.isdir(PRIMUS_ROOT): | ||
| return models | ||
| for yaml_path in sorted(glob.glob(CONFIGS_GLOB)): |
There was a problem hiding this comment.
CONFIGS_GLOB uses **, but glob.glob() only treats ** as “match any directories” when called with recursive=True. As written, this will typically discover zero YAMLs, so list_models() returns an empty list even when configs exist. Use glob.glob(CONFIGS_GLOB, recursive=True) (or avoid **) to make discovery work as intended.
| for yaml_path in sorted(glob.glob(CONFIGS_GLOB)): | |
| for yaml_path in sorted(glob.glob(CONFIGS_GLOB, recursive=True)): |
| # rocm/primus base often has /workspace/Primus as a full git clone (.git is a directory). | ||
| # A submodule checkout uses .git as a file (gitlink). COPY cannot replace that tree — remove first. | ||
| RUN rm -rf /workspace/Primus | ||
|
|
||
| # Bake Primus from the build context (submodule). No git clone — matches CI and local builds. | ||
| COPY scripts/Primus/ /workspace/Primus/ | ||
|
|
There was a problem hiding this comment.
COPY scripts/Primus/ /workspace/Primus/ requires the repo root as the Docker build context, but the in-repo runner (tools/run_models.py) builds images with context ./docker (docker_context = "./docker"). With that context, scripts/Primus is outside the build context and the build will fail. Either update the build tooling to use context . for this model, or move/bundle the Primus sources under docker/ so they’re within the build context.
| # rocm/primus base often has /workspace/Primus as a full git clone (.git is a directory). | |
| # A submodule checkout uses .git as a file (gitlink). COPY cannot replace that tree — remove first. | |
| RUN rm -rf /workspace/Primus | |
| # Bake Primus from the build context (submodule). No git clone — matches CI and local builds. | |
| COPY scripts/Primus/ /workspace/Primus/ | |
| # The in-repo runner builds Dockerfiles from the ./docker context, so sources under | |
| # scripts/Primus are not available to COPY here. Keep the Primus checkout provided by | |
| # the base image and validate that the expected launcher entrypoint exists. |
run.sh: resolve local submodule via ../Primus (sibling of primus_pretrain under scripts/).