[Staging] Aurora tweaks 2 #1713

Draft
tdavidcl wants to merge 28 commits into Shamrock-code:main from tdavidcl:aurora-test4

Conversation

@tdavidcl
Member

No description provided.

@github-actions
Contributor

Thanks @tdavidcl for opening this PR!

You can do multiple things directly here:
1 - Comment pre-commit.ci run to run pre-commit checks.
2 - Comment pre-commit.ci autofix to apply fixes.
3 - Add label autofix.ci to fix authorship & pre-commit for every commit made.
4 - Add label light-ci to only trigger a reduced & faster version of the CI (need the full one before merge).
5 - Add label trigger-ci to create an empty commit to trigger the CI.

Once the workflow completes, a message will appear displaying information related to the run.

Also the PR gets automatically reviewed by gemini, you can:
1 - Comment /gemini review to trigger a review
2 - Comment /gemini summary for a summary
3 - Tag it using @gemini-code-assist either in the PR or in review comments on files

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant improvements to how system performance metrics, especially energy consumption and wall time, are collected, processed, and reported. It centralizes the logic for gathering and aggregating these metrics across MPI ranks, making the system more robust and easier to extend. Additionally, it enhances the output of a key benchmark to provide more comprehensive performance data and updates the Aurora build configuration for better debugging.

Highlights

  • System Metrics Refactoring: Refactored the system metrics collection, aggregation, and reporting logic into dedicated utility functions within the shamsys namespace. This centralizes the handling of energy consumption and wall time data across ranks.
  • Enhanced Benchmark Output: Updated the SPH weak scale benchmark script to output detailed system metrics, including calculated power consumption, in a structured dictionary format for easier analysis.
  • Aurora Configuration Updates: Modified the Aurora environment setup to enable debug symbols (-g) and CPPTRACE in CMake flags, improving crash reporting and debugging capabilities on the Aurora system.
  • Accurate Timing for System Metrics: Adjusted the order of system metrics and MPI timer calls in the SPH solver to ensure precise measurements, particularly by accounting for MPI barriers.

Changelog
  • env/machine/argonne/aurora/env_oneapi.sh
    • Added a comment explaining the purpose of the -g flag for debug symbols on Aurora.
    • Included the -g flag in CMAKE_CXX_FLAGS for debug symbol generation.
    • Enabled SHAMROCK_USE_CPPTRACE for improved stack tracing.
  • examples/benchmarks/sph_weak_scale_test.py
    • Introduced a dic_out dictionary to store and output system metrics.
    • Modified average power calculation to use metrics_duration instead of step_time.
    • Added system metrics and power values to the dic_out dictionary.
  • src/shammodels/common/src/timestep_report.cpp
    • Replaced manual power gathering with shamsys::gather_rank_metrics.
    • Replaced manual power aggregation and formatting with shamsys::aggregate_rank_metrics and shamsys::format_system_metrics.
    • Updated table population to use the new formatted system metrics structures.
  • src/shammodels/sph/src/Solver.cpp
    • Reordered system_metrics_start initialization to occur before other timers, with a comment explaining the need for barrier synchronization.
    • Adjusted the timing of delta_mpi_timer calculation relative to mem_perf_infos_end and system_metrics_end.
  • src/shammodels/sph/src/SolverLog.cpp
    • Replaced manual gathering and aggregation of system metrics with calls to shamsys::gather_rank_metrics and shamsys::aggregate_rank_metrics.
  • src/shamsys/include/shamsys/system_metrics.hpp
    • Added wall_time member to the SystemMetrics struct.
    • Modified get_system_metrics signature to accept an optional barrier parameter.
    • Declared new functions: gather_rank_metrics, aggregate_rank_metrics, and format_system_metrics.
    • Defined FormattedSystemMetrics struct for string-formatted metric output.
    • Updated operator- for SystemMetrics to include wall_time subtraction.
  • src/shamsys/src/system_metrics.cpp
    • Included new headers: shambase/stacktrace.hpp, shamalgs/collective/reduction.hpp, shamcomm/wrapper.hpp.
    • Implemented get_system_metrics to record wall_time and optionally use MPI barriers.
    • Implemented gather_rank_metrics to collect SystemMetrics from all MPI ranks.
    • Implemented aggregate_rank_metrics to sum energy metrics and find the maximum wall time across ranks.
    • Implemented format_system_metrics to convert SystemMetrics into human-readable strings, including power calculations.
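
The aggregation described in the changelog above (per-rank deltas, summed energy, maximum wall time, then average power) can be sketched as follows. This is a minimal Python illustration, not the PR's C++ code: the single `energy` field, the `average_power` helper, and the sample numbers are assumptions made for the example, and the real functions gather data over MPI rather than from a list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SystemMetrics:
    """Hypothetical stand-in for the C++ SystemMetrics struct."""
    wall_time: float                 # seconds
    energy: Optional[float] = None   # joules; None when no counter is available

    def __sub__(self, other: "SystemMetrics") -> "SystemMetrics":
        # Mirrors the idea of operator-: subtract wall_time as well as energy.
        if self.energy is None or other.energy is None:
            energy = None
        else:
            energy = self.energy - other.energy
        return SystemMetrics(self.wall_time - other.wall_time, energy)


def aggregate_rank_metrics(per_rank: List[SystemMetrics]) -> SystemMetrics:
    """Sum energy over ranks and keep the largest wall time."""
    energies = [m.energy for m in per_rank if m.energy is not None]
    return SystemMetrics(
        wall_time=max(m.wall_time for m in per_rank),
        energy=sum(energies) if energies else None,
    )


def average_power(m: SystemMetrics) -> Optional[float]:
    """Average power in watts over the measured duration (assumed helper)."""
    if m.energy is None or m.wall_time <= 0:
        return None
    return m.energy / m.wall_time


# Made-up readings for two ranks at the start and end of a step.
start = [SystemMetrics(10.0, 1000.0), SystemMetrics(10.5, 2000.0)]
end = [SystemMetrics(12.0, 1300.0), SystemMetrics(13.0, 2500.0)]

deltas = [e - s for s, e in zip(start, end)]
total = aggregate_rank_metrics(deltas)
print(average_power(total))  # 320.0
```

Dividing by the wall time recorded alongside the energy counters, rather than by a separately measured step time, keeps both quantities on the same interval, which appears to be the motivation for the switch to metrics_duration in the benchmark script.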
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature               Command               Description
Code Review           /gemini review        Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary       Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist   Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                  /gemini help          Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces several tweaks for the Aurora machine, focusing on improving metrics reporting and performance analysis. The changes include enabling debug symbols for better crash analysis, refactoring system metrics collection into a centralized module to reduce code duplication, and adding wall_time to metrics for more accurate power calculations. My review identifies a correctness bug in the new metrics formatting logic where energy units were incorrect, and also points out opportunities to further reduce code duplication in both C++ and Python code to improve maintainability.

@github-actions
Copy link
Contributor

Workflow report

workflow report corresponding to commit 9079339
Committer email is timothee.davidcleris@proton.me

Light CI is enabled. This will only run the basic tests and not the full tests.
Merging a PR requires the job "on PR / all" to pass, which is disabled in this case.

Pre-commit check report

Pre-commit check: ✅

trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check for merge conflicts................................................Passed
check that executables have shebangs.....................................Passed
check that scripts with shebangs are executable..........................Passed
check for added large files..............................................Passed
check for case conflicts.................................................Passed
check for broken symlinks................................................Passed
check yaml...............................................................Passed
detect private key.......................................................Passed
No-tabs checker..........................................................Passed
Tabs remover.............................................................Passed
Validate GitHub Workflows................................................Passed
clang-format.............................................................Passed
ruff check...............................................................Passed
ruff format..............................................................Passed
Check doxygen headers....................................................Passed
Check license headers....................................................Passed
Check #pragma once.......................................................Passed
Check SYCL #include......................................................Passed
No ssh in git submodules remote..........................................Passed
No UTF-8 in files (except for authors)...................................Passed

Test pipeline can run.

Doxygen diff with main

Removed warnings : 9
New warnings : 9
Warnings count : 8366 → 8366 (0.0%)

Detailed changes :
- src/shamalgs/src/collective/sparse_exchange.cpp:113: warning: Member build_sparse_exchange_table(const std::vector< CommMessageInfo > &messages_send, size_t max_alloc_size) (function) of namespace shamalgs::collective is not documented.
- src/shamalgs/src/collective/sparse_exchange.cpp:113: warning: Member build_sparse_exchange_table(const std::vector< CommMessageInfo > &messages_send, size_t max_alloc_size) (function) of namespace shamalgs::collective is not documented.
+ src/shamalgs/src/collective/sparse_exchange.cpp:116: warning: Member build_sparse_exchange_table(const std::vector< CommMessageInfo > &messages_send, size_t max_alloc_size) (function) of namespace shamalgs::collective is not documented.
+ src/shamalgs/src/collective/sparse_exchange.cpp:116: warning: Member build_sparse_exchange_table(const std::vector< CommMessageInfo > &messages_send, size_t max_alloc_size) (function) of namespace shamalgs::collective is not documented.
- src/shamalgs/src/collective/sparse_exchange.cpp:247: warning: Member sparse_exchange(std::shared_ptr< sham::DeviceScheduler > dev_sched, const std::vector< const u8 * > &bytebuffer_send, const std::vector< u8 * > &bytebuffer_recv, const CommTable &comm_table) (function) of namespace shamalgs::collective is not documented.
+ src/shamalgs/src/collective/sparse_exchange.cpp:250: warning: Member sparse_exchange(std::shared_ptr< sham::DeviceScheduler > dev_sched, const std::vector< const u8 * > &bytebuffer_send, const std::vector< u8 * > &bytebuffer_recv, const CommTable &comm_table) (function) of namespace shamalgs::collective is not documented.
- src/shamalgs/src/collective/sparse_exchange.cpp:296: warning: Member sparse_exchange(std::shared_ptr< sham::DeviceScheduler > dev_sched, std::vector< std::unique_ptr< sham::DeviceBuffer< u8, target > > > &bytebuffer_send, std::vector< std::unique_ptr< sham::DeviceBuffer< u8, target > > > &bytebuffer_recv, const CommTable &comm_table) (function) of namespace shamalgs::collective is not documented.
- src/shamalgs/src/collective/sparse_exchange.cpp:296: warning: Member sparse_exchange(std::shared_ptr< sham::DeviceScheduler > dev_sched, std::vector< std::unique_ptr< sham::DeviceBuffer< u8, target > > > &bytebuffer_send, std::vector< std::unique_ptr< sham::DeviceBuffer< u8, target > > > &bytebuffer_recv, const CommTable &comm_table) (function) of namespace shamalgs::collective is not documented.
+ src/shamalgs/src/collective/sparse_exchange.cpp:299: warning: Member sparse_exchange(std::shared_ptr< sham::DeviceScheduler > dev_sched, std::vector< std::unique_ptr< sham::DeviceBuffer< u8, target > > > &bytebuffer_send, std::vector< std::unique_ptr< sham::DeviceBuffer< u8, target > > > &bytebuffer_recv, const CommTable &comm_table) (function) of namespace shamalgs::collective is not documented.
+ src/shamalgs/src/collective/sparse_exchange.cpp:299: warning: Member sparse_exchange(std::shared_ptr< sham::DeviceScheduler > dev_sched, std::vector< std::unique_ptr< sham::DeviceBuffer< u8, target > > > &bytebuffer_send, std::vector< std::unique_ptr< sham::DeviceBuffer< u8, target > > > &bytebuffer_recv, const CommTable &comm_table) (function) of namespace shamalgs::collective is not documented.
- src/shamalgs/src/collective/sparse_exchange.cpp:384: warning: Member sparse_exchange< sham::device >(std::shared_ptr< sham::DeviceScheduler > dev_sched, std::vector< std::unique_ptr< sham::DeviceBuffer< u8, sham::device > > > &bytebuffer_send, std::vector< std::unique_ptr< sham::DeviceBuffer< u8, sham::device > > > &bytebuffer_recv, const CommTable &comm_table) (function) of namespace shamalgs::collective is not documented.
+ src/shamalgs/src/collective/sparse_exchange.cpp:387: warning: Member sparse_exchange< sham::device >(std::shared_ptr< sham::DeviceScheduler > dev_sched, std::vector< std::unique_ptr< sham::DeviceBuffer< u8, sham::device > > > &bytebuffer_send, std::vector< std::unique_ptr< sham::DeviceBuffer< u8, sham::device > > > &bytebuffer_recv, const CommTable &comm_table) (function) of namespace shamalgs::collective is not documented.
- src/shamalgs/src/collective/sparse_exchange.cpp:390: warning: Member sparse_exchange< sham::host >(std::shared_ptr< sham::DeviceScheduler > dev_sched, std::vector< std::unique_ptr< sham::DeviceBuffer< u8, sham::host > > > &bytebuffer_send, std::vector< std::unique_ptr< sham::DeviceBuffer< u8, sham::host > > > &bytebuffer_recv, const CommTable &comm_table) (function) of namespace shamalgs::collective is not documented.
+ src/shamalgs/src/collective/sparse_exchange.cpp:393: warning: Member sparse_exchange< sham::host >(std::shared_ptr< sham::DeviceScheduler > dev_sched, std::vector< std::unique_ptr< sham::DeviceBuffer< u8, sham::host > > > &bytebuffer_send, std::vector< std::unique_ptr< sham::DeviceBuffer< u8, sham::host > > > &bytebuffer_recv, const CommTable &comm_table) (function) of namespace shamalgs::collective is not documented.
- src/shamrock/include/shamrock/scheduler/SerialPatchTree.hpp:312: warning: Member dump_dat() (function) of class SerialPatchTree is not documented.
+ src/shamrock/include/shamrock/scheduler/SerialPatchTree.hpp:317: warning: Member dump_dat() (function) of class SerialPatchTree is not documented.
- src/shamrock/include/shamrock/scheduler/SerialPatchTree.hpp:335: warning: Member compute_patch_owner(sham::DeviceScheduler_ptr dev_sched, sham::DeviceBuffer< fp_prec_vec > &position_buffer, u32 len) (function) of class SerialPatchTree is not documented.
+ src/shamrock/include/shamrock/scheduler/SerialPatchTree.hpp:340: warning: Member compute_patch_owner(sham::DeviceScheduler_ptr dev_sched, sham::DeviceBuffer< fp_prec_vec > &position_buffer, u32 len) (function) of class SerialPatchTree is not documented.
