[WIP] Spatial parallelism unit test #719

Draft
elynnwu wants to merge 62 commits into main from update/sp-unit-test-torchrun
elynnwu commented Jan 5, 2026

Short description of why the PR is needed and how it satisfies those requirements, in sentence form.

Changes:

  • symbol (e.g. fme.core.my_function) or script and concise description of changes or added feature

  • Can group multiple related symbols on a single bullet

  • Tests added

  • If dependencies changed, "deps only" image rebuilt and "latest_deps_only_image.txt" file updated

Resolves # (delete if none)

odiazib and others added 30 commits December 2, 2025 08:46
…run ACE using PhysicsNemo. It works, but it does not utilize spatial parallelism yet.
… unit test that divides the dataset into four parts, subsequently comparing the results with the original dataset.
…s implementation using unit tests based on those developed by Makani.
…g with spatial parallelism. The unit tests ran, but I have not checked for correctness.
- Ensure the distribute class, which produces a global singleton, is initialized only once.

- Set spatial parallelism parameters (i.e., h and w) as environment variables.

- Emphasize the necessity of saving and loading checkpoints.

- Allow part of the save_checkpoint routine to be executed by all processors for spatial parallelism.
Co-authored-by: Jeremy McGibbon <mcgibbon@uw.edu>
…training slower by 10 seconds for each epoch.
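
The commit notes above mention two related ideas: the distributed object is a global singleton that should be initialized only once, and the spatial parallelism sizes h and w are read from environment variables. A minimal Python sketch of that pattern follows. It is not the code from this PR: the names `SpatialLayout`, `get_layout`, `H_PARALLEL_SIZE`, and `W_PARALLEL_SIZE` are hypothetical and chosen only for illustration.

```python
import os
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class SpatialLayout:
    """Hypothetical holder for the spatial parallelism sizes."""

    h: int  # number of splits along the first spatial axis
    w: int  # number of splits along the second spatial axis


_LAYOUT = None
_LOCK = threading.Lock()


def get_layout(environ=None):
    """Return the process-wide layout, creating it on the first call only.

    The variable names H_PARALLEL_SIZE and W_PARALLEL_SIZE are assumptions
    made for this sketch; both default to 1 (no spatial parallelism).
    """
    global _LAYOUT
    env = os.environ if environ is None else environ
    with _LOCK:
        if _LAYOUT is None:
            _LAYOUT = SpatialLayout(
                h=int(env.get("H_PARALLEL_SIZE", "1")),
                w=int(env.get("W_PARALLEL_SIZE", "1")),
            )
        # Later calls ignore `environ` and reuse the first layout.
        return _LAYOUT
```

Guarding the creation with a lock and a module-level variable means that repeated calls, for example from several test modules in one pytest session, all see the same object rather than re-initializing it.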
@elynnwu elynnwu mentioned this pull request Jan 5, 2026
2 tasks
elynnwu commented Jan 14, 2026

Here's the command used to run the unit tests:

export PHYSICSNEMO_DISTRIBUTED_INITIALIZATION_METHOD=ENV
torchrun --standalone --nproc_per_node=4 --no_python -- pytest fme/ace/test_spatial_parallelism/
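
With `--nproc_per_node=4`, torchrun starts four processes and gives each one a `RANK` and a `WORLD_SIZE` environment variable. One way a test could map those ranks onto a 2 x 2 spatial grid, and work out which part of a global field each rank owns, is sketched below. This is an illustration only: the helper names (`grid_position`, `split_bounds`, `local_tile`) are hypothetical, and the row-major rank ordering and the 2 x 2 layout are assumptions rather than something confirmed by this PR.

```python
import os


def grid_position(rank, h, w):
    """Map a flat rank to (row, col) on an h x w grid (row-major, assumed)."""
    if not 0 <= rank < h * w:
        raise ValueError(f"rank {rank} does not fit on a {h} x {w} grid")
    return divmod(rank, w)


def split_bounds(n, parts, index):
    """Half-open [start, stop) of chunk `index` when n points are split
    into `parts` near-equal chunks (earlier chunks take the remainder)."""
    base, extra = divmod(n, parts)
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


def local_tile(rank, h, w, n_lat, n_lon):
    """Return ((lat_start, lat_stop), (lon_start, lon_stop)) for a rank."""
    row, col = grid_position(rank, h, w)
    return split_bounds(n_lat, h, row), split_bounds(n_lon, w, col)


if __name__ == "__main__":
    # torchrun sets RANK and WORLD_SIZE; the defaults cover a plain run.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    # Assumed layout: 2 x 2 for four processes, otherwise a single row.
    h, w = (2, 2) if world_size == 4 else (1, world_size)
    print(rank, local_tile(rank, h, w, n_lat=4, n_lon=8))
```

A test built on this idea can have each rank process only its own tile and then check that the tiles, taken together, reproduce the result computed on the full field.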
