
[Feature] Multi-GPU support #55

Open
yoshikisd wants to merge 19 commits into master from multigpu

Conversation

@yoshikisd
Collaborator

The spiritual successor to #17, this PR enables reconstruction scripts to be executed as a multi-GPU job initialized with torchrun or torch.multiprocessing.spawn.

Currently, only AdamReconstructor can take advantage of multi-GPU acceleration.

Multi-GPU processing is based on distributed data parallelism, where each subprocess/GPU

  1. works on the same model,
  2. works on a different part of the dataset but, together with the other subprocesses, samples all the diffraction patterns within each epoch, and
  3. synchronizes gradients (after loss.backward() calls), losses, and learning rates (if a scheduler is enabled) across all participating GPUs.
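For context, step 2 above matches the behavior of PyTorch's DistributedSampler, which gives each rank a disjoint shard that together covers the whole dataset each epoch (whether this PR uses that exact class internally is an assumption; the sketch below just illustrates the sharding):

```python
# Illustrative only: simulate the per-rank dataset shards for a 2-GPU job
# without actually launching any subprocesses.
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10))

# Passing num_replicas/rank explicitly avoids needing a live process group.
shards = [
    set(int(i) for i in DistributedSampler(
        dataset, num_replicas=2, rank=r, shuffle=False))
    for r in (0, 1)
]
# Together the two shards cover all 10 samples with no overlap.
```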

What's new

  • cdtools.tools.multigpu: Contains functions that enable CDTools reconstructions to run as multi-GPU jobs.
  • Plotting and saving are compatible with multi-GPU: Ptycho2DDataset, CDataset, and CDIModel only save/plot when running on the rank-0 GPU subprocess (i.e., plots are generated by one subprocess rather than all of them).
  • Multi-GPU jobs can be started with torchrun or spawn: Two example scripts, gold_ball_ptycho_torchrun.py and gold_ball_ptycho_spawn.py, show how to set up the scripts accordingly.
  • PyTests for multi-GPU: The flag --runmultigpu starts up the multi-GPU pytests, which use up to 2 GPUs.
  • Check GPU-dependent reconstruction performance with cdtools.tools.multigpu.run_speed_test: The example script gold_ball_ptycho_speedtest.py shows how to set up the test. The implementation is also much simpler than in [feature] Multi-GPU support #17.
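The rank-0 save/plot guard described above usually boils down to a check like the following (the helper name `is_rank0` and the `save_results` wrapper are assumptions for illustration; the `torch.distributed` calls are the standard way to make the check):

```python
# Sketch of the rank-0 I/O guard; only one subprocess saves/plots.
import torch.distributed as dist

def is_rank0():
    # Outside a distributed job (no process group initialized), behave
    # like rank 0 so single-GPU scripts keep saving/plotting as before.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank() == 0
    return True

def save_results(path, results):
    if not is_rank0():
        return  # non-zero ranks skip all file I/O and plotting
    # ... rank 0 performs the actual save here ...
```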

@yoshikisd yoshikisd linked an issue Nov 9, 2025 that may be closed by this pull request
@yoshikisd yoshikisd added the enhancement New feature or request label Nov 9, 2025

@allevitan allevitan left a comment


So, I finally got a chance to read through the code. All in all I think it's really close, but I'm still seeing too many issues to be comfortable pulling it in now.

The two big overall issues that I'm seeing:

  • It doesn't provide a speedup when run on the HPC node I have access to, a 4-GPU node with V100 cards, which was allocated exclusively. See the benchmark results below:
    [Image: benchmark results]
  • It occasionally fails to kill the child processes on control-C.

And there are a few smaller comments as well.

@yoshikisd, I know you won't have time going forward to continue development, so my plan is to take over the development of this branch (if that's okay with you) and try to work through these comments and issues before folding it in at a later time. Sorry for not having more time to get to the review earlier!

for param in model.parameters():
    if param.requires_grad:
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= world_size

Are we sure this should be an average, and not a sum? I would have expected that we'd want to sum gradients from the models across different GPUs.
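One way to see the relationship (assuming each rank computes a mean loss over its own shard, which is the usual DDP convention): the gradient of the global mean loss equals the average of the per-rank gradients, while summing would scale the effective step size by the world size. A small numeric sketch:

```python
# Compare the gradient of a global mean loss against the average of
# per-shard gradients (what all_reduce(SUM) followed by /world_size does).
import torch

x = torch.arange(8, dtype=torch.float32)       # full "dataset"
w = torch.tensor(1.0, requires_grad=True)

# Gradient of the mean loss over all 8 samples
loss_full = ((w * x) ** 2).mean()
loss_full.backward()
g_full = w.grad.clone()

# Two simulated ranks, each taking a mean over its 4-sample shard
g_ranks = [torch.autograd.grad(((w * shard) ** 2).mean(), w)[0]
           for shard in (x[:4], x[4:])]
g_avg = (g_ranks[0] + g_ranks[1]) / 2          # averaged sync
g_sum = g_ranks[0] + g_ranks[1]                # summed sync: 2x too large
```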


def sync_loss(loss):
    """
    Synchronizes Rank 0 GPU's learning rate to all participating GPUs.

Typo - this docstring should say it synchronizes the loss, not the learning rate.

world_size = get_world_size()

t.cuda.set_device(rank)
if rank == 0:

needs a check for verbosity (and the other print statements)

    print('[INFO]: RNG seed synchronized across all subprocesses.')


def cleanup():

add a verbosity flag, default True

backend: str = 'nccl',
timeout: int = 30,
seed: int = None,
verbose: bool = False):

I think a default of "True" will be best


This looks good overall, but the more I think about it, the more I feel like the multi-GPU print protections might be more compact if they lived elsewhere. I want to think about it a bit.


# Avg and sync gradients for multi-GPU jobs
if self.multi_gpu_used:
    multigpu.sync_and_avg_grads(model=self.model,

Check on consequences of averaging vs summing - look here as well when potentially changing the earlier line

@@ -0,0 +1,70 @@
import cdtools

Add a comment or docstring indicating how it should be run


I'm inclined to suggest moving much of the plot protection/save protection somewhere else, but I need to think on it a bit more.


loss_mean_list, loss_std_list, \
_, _, speed_up_mean_list, speed_up_std_list\
= multigpu.run_speed_test(fn=reconstruct,

I get an error on this line when running the test,

RuntimeError: torch.multiprocessing.spawn was detected as the launching
method, but either rank, world_size, master_addr, or master_port has not
been explicitly defined. Please ensure that either these parameters have
been explicitly defined, MASTER_ADDR/MASTER_PORT have been defined as
environment variables, or launch the multi-GPU job with torchrun.

@yoshikisd
Collaborator Author

Hey Abe, please go ahead and work your magic with this project! I won't have time in the future to work with this on a GPU cluster.



Successfully merging this pull request may close these issues.

Add support for multi-GPU
