remove environment-specific manifests from public branch; #87
Open
mkuznet1 wants to merge 12 commits intoROCm:aicomnet_devfrom
Open
remove environment-specific manifests from public branch; #87mkuznet1 wants to merge 12 commits intoROCm:aicomnet_devfrom
mkuznet1 wants to merge 12 commits intoROCm:aicomnet_devfrom
Conversation
* Added therock model for testing TheRock image * Added therock model * Modified the Dockerfile of TheRock only install core runtime and hip runtime * Fixe the generate-sys-env-details arg in mad * Redsign the rocEnvTool to work with TheRock based image * Keep compatible to the csv parser * Fixed the csv parser * Updated README of rocEnvTool accordingly
* Implemented a module to parse config inputs and creat perf_entry_super.json and upload dataset to MongoDB * Implement update perf superset * fix unit tests of super set * Fixed the perf superset data collection and MongoDB update
Resolve merge conflicts by keeping refactor-dis (v2) and discarding main (v1) changes: - Remove src/madengine/mad.py and src/madengine/tools/run_models.py (deleted in v2, accept deletion over main's modifications) - Resolve rocenv_tool.py conflict: keep current-branch version for unknown GPU device handling - Resolve tests/fixtures/dummy/models.json: keep v2 fixture set (dummy_superset and full model list) over main's therock-only entry
…file has been restored
There was a problem hiding this comment.
Pull request overview
This PR appears to clean up local/runtime artifacts by removing previously committed run-manifest JSON files and an environment file, and updating .gitignore to ignore additional generated content.
Changes:
- Removed two committed
run_manifest_*.jsonfiles undermanifests/. - Removed
manifests/mad.env(shell environment exports). - Updated
.gitignore(adds*.json, and fixes formatting for.madengine_session_start).
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
manifests/run_manifest_pyt_vllm_dissag_llama-3.1-8b_3node_rdma_localimage.json |
Removed a committed run-manifest JSON (likely local/generated). |
manifests/run_manifest_primus_2node_qwen_localimage.json |
Removed a committed run-manifest JSON (likely local/generated). |
manifests/mad.env |
Removed a committed environment export file (likely local setup). |
.gitignore |
Ignores additional files (notably *.json) and normalizes an entry’s formatting. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
- Introduced per-node artifact staging to a dedicated results directory. - Implemented a mechanism to wait for all nodes to complete staging before merging results. - Added logic to merge performance CSV files from multiple nodes, selecting the best file based on content. - Updated the master node's result collection process to reflect these changes, ensuring comprehensive data aggregation. This update aims to improve the reliability and accuracy of performance reporting in distributed SLURM runs.
- Removed redundant file-based synchronization mechanism for node readiness. - Simplified the barrier waiting process by directly utilizing TCP for image readiness. - Adjusted timeout handling to ensure consistent behavior across node synchronization. This change enhances the efficiency of multi-node operations by streamlining the readiness check process.
- Added logic to prefer files with the most non-empty performance values during result aggregation. - Implemented dynamic column index retrieval for the "performance" column in CSV files, ensuring accurate counting of non-empty performance entries. - Maintained backward compatibility by falling back to the previous method if the performance column is not found. This update aims to enhance the accuracy of performance metrics in multi-node training scenarios.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The .gitignore file has been restored, and the manifest files have been deleted