Skip to content

Add Primus pretrain integration (primus_pretrain)#146

Open
coketaste wants to merge 2 commits intoROCm:developfrom
coketaste:coketaste/primus
Open

Add Primus pretrain integration (primus_pretrain)#146
coketaste wants to merge 2 commits intoROCm:developfrom
coketaste:coketaste/primus

Conversation

@coketaste
Copy link
Copy Markdown
Contributor

  • Add scripts/Primus submodule (AMD-AGI/Primus) and docker/primus.ubuntu.amd.Dockerfile
  • Add scripts/primus_pretrain (run.sh, get_models_json.py, extract_primus_perf.py)
  • Register primus_pretrain in models.json before existing primus_pyt_* entries
  • Un-ignore tracked perf manifests and models.json in .gitignore

run.sh: resolve local submodule via ../Primus (sibling of primus_pretrain under scripts/).

- Add scripts/Primus submodule (AMD-AGI/Primus) and docker/primus.ubuntu.amd.Dockerfile
- Add scripts/primus_pretrain (run.sh, get_models_json.py, extract_primus_perf.py)
- Register primus_pretrain in models.json before existing primus_pyt_* entries
- Un-ignore tracked perf manifests and models.json in .gitignore

run.sh: resolve local submodule via ../Primus (sibling of primus_pretrain under scripts/).
@coketaste coketaste self-assigned this Apr 15, 2026
@coketaste coketaste requested a review from gargrahul as a code owner April 15, 2026 19:59
Copilot AI review requested due to automatic review settings April 15, 2026 19:59
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Integrates Primus pretraining into the MAD/madengine workflow by adding a dedicated runner script bundle, a new Primus-based Docker image, and registering a new primus_pretrain model entry so Primus pretrain jobs can be launched consistently across environments.

Changes:

  • Added scripts/primus_pretrain/ wrappers to run Primus pretrain and emit multiple_results perf CSV.
  • Added a new docker/primus.ubuntu.amd.Dockerfile and registered primus_pretrain in models.json.
  • Added scripts/Primus as a git submodule and adjusted .gitignore to track key perf/model manifests.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
scripts/primus_pretrain/run.sh Wrapper to resolve Primus root/config, run pretrain, and post-process perf metrics.
scripts/primus_pretrain/get_models_json.py Optional discovery of Primus configs as runnable models via globbing.
scripts/primus_pretrain/extract_primus_perf.py Parses training logs and writes a multiple_results CSV.
models.json Registers primus_pretrain ahead of existing Primus PyTorch training entries.
docker/primus.ubuntu.amd.Dockerfile Defines a Primus image intended to bake in the Primus submodule.
.gitmodules Adds the scripts/Primus submodule definition.
.gitignore Stops ignoring tracked perf manifests and models.json.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +80 to +82
# Pass --job.dump_folder so Torchtitan writes checkpoints to RUN_DIR/outputs (not scripts/Primus/outputs).
cd "$PRIMUS_ROOT" && bash "$PRIMUS_ROOT/examples/run_pretrain.sh" "$@" --job.dump_folder "$RUN_DIR/outputs"
exitcode=$?
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the script uses set -e, if examples/run_pretrain.sh exits non-zero the wrapper will terminate immediately and never reach exitcode=$? or the perf extraction. If you want to always capture the exit code and run the extractor, temporarily disable -e around the training invocation (or otherwise guard the command) and then re-enable it as needed.

Suggested change
# Pass --job.dump_folder so Torchtitan writes checkpoints to RUN_DIR/outputs (not scripts/Primus/outputs).
cd "$PRIMUS_ROOT" && bash "$PRIMUS_ROOT/examples/run_pretrain.sh" "$@" --job.dump_folder "$RUN_DIR/outputs"
exitcode=$?
# Pass --job.dump_folder so Torchtitan writes checkpoints to RUN_DIR/outputs (not scripts/Primus/outputs).
set +e
cd "$PRIMUS_ROOT" && bash "$PRIMUS_ROOT/examples/run_pretrain.sh" "$@" --job.dump_folder "$RUN_DIR/outputs"
exitcode=$?
set -e

Copilot uses AI. Check for mistakes.
# Pass --job.dump_folder so Torchtitan writes checkpoints to RUN_DIR/outputs (not scripts/Primus/outputs).
cd "$PRIMUS_ROOT" && bash "$PRIMUS_ROOT/examples/run_pretrain.sh" "$@" --job.dump_folder "$RUN_DIR/outputs"
exitcode=$?
# Extract tps/tflops/mfu from training log into primus_perf_output.csv (one row: model, performance, metric, tflops, model_flops_utilization)
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment describes the output as a single row with extra columns ("tflops, model_flops_utilization"), but extract_primus_perf.py writes a 3-column CSV with one row per metric (model,performance,metric). Please update the comment to match the actual CSV schema so consumers (madengine multiple_results) aren’t misled.

Suggested change
# Extract tps/tflops/mfu from training log into primus_perf_output.csv (one row: model, performance, metric, tflops, model_flops_utilization)
# Extract tps/tflops/mfu from training log into primus_perf_output.csv (CSV rows: model, performance, metric; one row per metric)

Copilot uses AI. Check for mistakes.
models = []
if not os.path.isdir(PRIMUS_ROOT):
return models
for yaml_path in sorted(glob.glob(CONFIGS_GLOB)):
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONFIGS_GLOB uses **, but glob.glob() only treats ** as “match any directories” when called with recursive=True. As written, this will typically discover zero YAMLs, so list_models() returns an empty list even when configs exist. Use glob.glob(CONFIGS_GLOB, recursive=True) (or avoid **) to make discovery work as intended.

Suggested change
for yaml_path in sorted(glob.glob(CONFIGS_GLOB)):
for yaml_path in sorted(glob.glob(CONFIGS_GLOB, recursive=True)):

Copilot uses AI. Check for mistakes.
Comment on lines +33 to +39
# rocm/primus base often has /workspace/Primus as a full git clone (.git is a directory).
# A submodule checkout uses .git as a file (gitlink). COPY cannot replace that tree — remove first.
RUN rm -rf /workspace/Primus

# Bake Primus from the build context (submodule). No git clone — matches CI and local builds.
COPY scripts/Primus/ /workspace/Primus/

Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

COPY scripts/Primus/ /workspace/Primus/ requires the repo root as the Docker build context, but the in-repo runner (tools/run_models.py) builds images with context ./docker (docker_context = "./docker"). With that context, scripts/Primus is outside the build context and the build will fail. Either update the build tooling to use context . for this model, or move/bundle the Primus sources under docker/ so they’re within the build context.

Suggested change
# rocm/primus base often has /workspace/Primus as a full git clone (.git is a directory).
# A submodule checkout uses .git as a file (gitlink). COPY cannot replace that tree — remove first.
RUN rm -rf /workspace/Primus
# Bake Primus from the build context (submodule). No git clone — matches CI and local builds.
COPY scripts/Primus/ /workspace/Primus/
# The in-repo runner builds Dockerfiles from the ./docker context, so sources under
# scripts/Primus are not available to COPY here. Keep the Primus checkout provided by
# the base image and validate that the expected launcher entrypoint exists.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants