
[Feature] Multi-GPU support #55

Open
yoshikisd wants to merge 19 commits into master from multigpu

Conversation

@yoshikisd
Collaborator

The spiritual successor to #17, this PR enables reconstruction scripts to be executed as a multi-GPU job initialized with torchrun or torch.multiprocessing.spawn.

Currently, only AdamReconstructor can take advantage of multi-GPU acceleration.

Multi-GPU processing is based on distributed data parallelism, where each subprocess/GPU

  1. works on the same model,
  2. works on a different part of the dataset but, together with the other subprocesses, samples all the diffraction patterns within each epoch, and
  3. synchronizes gradients (after loss.backward() calls), losses, and learning rates (if a scheduler is enabled) across all participating GPUs.
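For context, step 2 above matches the behavior of PyTorch's DistributedSampler, which gives each rank a disjoint shard that together covers the whole dataset each epoch (whether this PR uses that exact class internally is an assumption; the sketch below just illustrates the sharding):

```python
# Illustrative only: simulate the per-rank dataset shards for a 2-GPU job
# without actually launching any subprocesses.
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10))

# Passing num_replicas/rank explicitly avoids needing a live process group.
shards = [
    set(int(i) for i in DistributedSampler(
        dataset, num_replicas=2, rank=r, shuffle=False))
    for r in (0, 1)
]
# Together the two shards cover all 10 samples with no overlap.
```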

What's new

  • cdtools.tools.multigpu: Contains functions that enable CDTools reconstructions to run as multi-GPU jobs.
  • Plotting and saving are compatible with multi-GPU: Ptycho2DDataset, CDataset, and CDIModel only save/plot when running on the rank-0 GPU subprocess (i.e., plots are generated by one subprocess rather than all of them).
  • Multi-GPU jobs can be started with torchrun or spawn: Two example scripts, gold_ball_ptycho_torchrun.py and gold_ball_ptycho_spawn.py, show how to set up the scripts accordingly.
  • PyTests for multi-GPU: The flag --runmultigpu starts up the multi-GPU pytests, which use up to 2 GPUs.
  • Check GPU-dependent reconstruction performance with cdtools.tools.multigpu.run_speed_test: The example script gold_ball_ptycho_speedtest.py shows how to set up the test. The implementation is also much simpler than in [feature] Multi-GPU support #17.
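The rank-0 save/plot guard described above usually boils down to a check like the following (the helper name `is_rank0` and the `save_results` wrapper are assumptions for illustration; the `torch.distributed` calls are the standard way to make the check):

```python
# Sketch of the rank-0 I/O guard; only one subprocess saves/plots.
import torch.distributed as dist

def is_rank0():
    # Outside a distributed job (no process group initialized), behave
    # like rank 0 so single-GPU scripts keep saving/plotting as before.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank() == 0
    return True

def save_results(path, results):
    if not is_rank0():
        return  # non-zero ranks skip all file I/O and plotting
    # ... rank 0 performs the actual save here ...
```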

@yoshikisd yoshikisd linked an issue Nov 9, 2025 that may be closed by this pull request
@yoshikisd yoshikisd added the enhancement New feature or request label Nov 9, 2025

@allevitan allevitan left a comment


So, I finally got a chance to read through the code. All in all I think it's really close, but I'm still seeing too many issues to be comfortable pulling it in now.

The two big overall issues that I'm seeing:

  • It doesn't provide a speedup when run on the HPC node I have access to, a 4-GPU node with V100 cards, which was allocated exclusively. See the benchmark results below:
    [Image: benchmark results]
  • It occasionally fails to kill the child processes on control-C.

And there are a few smaller comments as well.

@yoshikisd, I know you won't have time going forward to continue development, so my plan is to take over the development of this branch (if that's okay with you) and try to work through these comments and issues before folding it in at a later time. Sorry for not having more time to get to the review earlier!

for param in model.parameters():
    if param.requires_grad:
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= world_size

Are we sure this should be an average, and not a sum? I would have expected that we'd want to sum gradients from the models across different GPUs.
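One way to see the relationship (assuming each rank computes a mean loss over its own shard, which is the usual DDP convention): the gradient of the global mean loss equals the average of the per-rank gradients, while summing would scale the effective step size by the world size. A small numeric sketch:

```python
# Compare the gradient of a global mean loss against the average of
# per-shard gradients (what all_reduce(SUM) followed by /world_size does).
import torch

x = torch.arange(8, dtype=torch.float32)       # full "dataset"
w = torch.tensor(1.0, requires_grad=True)

# Gradient of the mean loss over all 8 samples
loss_full = ((w * x) ** 2).mean()
loss_full.backward()
g_full = w.grad.clone()

# Two simulated ranks, each taking a mean over its 4-sample shard
g_ranks = [torch.autograd.grad(((w * shard) ** 2).mean(), w)[0]
           for shard in (x[:4], x[4:])]
g_avg = (g_ranks[0] + g_ranks[1]) / 2          # averaged sync
g_sum = g_ranks[0] + g_ranks[1]                # summed sync: 2x too large
```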


def sync_loss(loss):
    """
    Synchronizes Rank 0 GPU's learning rate to all participating GPUs.

Typo - this docstring should say it synchronizes the loss, not the learning rate.

world_size = get_world_size()

t.cuda.set_device(rank)
if rank == 0:

needs a check for verbosity (and the other print statements)

    print('[INFO]: RNG seed synchronized across all subprocesses.')


def cleanup():

add a verbosity flag, default True

backend: str = 'nccl',
timeout: int = 30,
seed: int = None,
verbose: bool = False):

I think a default of "True" will be best


This looks good overall, but the more I think about it, the more I feel like the multi-GPU print protections might be more compact if they lived elsewhere. I want to think about it a bit.


# Avg and sync gradients for multi-GPU jobs
if self.multi_gpu_used:
    multigpu.sync_and_avg_grads(model=self.model,

Check on consequences of averaging vs summing - look here as well when potentially changing the earlier line

@@ -0,0 +1,70 @@
import cdtools

Add a comment or docstring indicating how it should be run


I'm inclined to suggest moving much of the plot protection/save protection somewhere else, but I need to think on it a bit more.


loss_mean_list, loss_std_list, \
_, _, speed_up_mean_list, speed_up_std_list\
= multigpu.run_speed_test(fn=reconstruct,

I get an error on this line when running the test,

RuntimeError: torch.multiprocessing.spawn was detected as the launching
method, but either rank, world_size, master_addr, or master_port has not
been explicitly defined. Please ensure that either these parameters have
been explicitly defined, MASTER_ADDR/MASTER_PORT have been defined as
environment variables, or launch the multi-GPU job with torchrun.

@yoshikisd
Collaborator Author

Hey Abe, please go ahead and work your magic with this project! I won't have time in the future to work with this on a GPU cluster.



Successfully merging this pull request may close these issues.

Add support for multi-GPU
