Make SFT hardware-agnostic #749
Conversation
Thanks for the PR @DamianSzwichtenberg, this is very useful. We are on an offsite this week, can I get back to you in a few days?

@felipemello1 No problem, enjoy!
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
b5a18aa to 576ec7a
@felipemello1 Just a gentle bump on this.
# Set device isolation using the appropriate environment variable
env_vars.update(DeviceProxy.get_isolation_env_vars(gpu_ids))
The key is still tested, just indirectly. Here's the chain:
- the autouse fixture mocks the current accelerator to return cuda
- mock_patch_visible_devices_var sets CUDA_VISIBLE_DEVICES
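For reference, a rough sketch of that setup, assuming pytest and unittest.mock; the fixture bodies and patch target are illustrative, not the repo's actual test code:

```python
from unittest.mock import patch

import pytest
import torch


@pytest.fixture(autouse=True)
def mock_current_accelerator():
    # Every test resolves the current accelerator to CUDA, regardless of host hardware.
    with patch(
        "torch.accelerator.current_accelerator",
        return_value=torch.device("cuda"),
    ):
        yield


@pytest.fixture
def mock_patch_visible_devices_var(monkeypatch):
    # Pins the isolation variable so assertions on device visibility stay deterministic.
    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "0,1")
```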
  ) # (name, future, submission_index)
  self._durations: list[tuple[str, float]] = []
- self._chain_start: torch.cuda.Event | None = None
+ self._chain_start: torch.Event | None = None
Not familiar with these APIs, is this correct? Just double checking.
Yes, torch.Event will query the current accelerator type.
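For anyone else double-checking, here is a minimal timing sketch using the device-agnostic event, assuming a PyTorch version recent enough to ship torch.Event and torch.accelerator; the workload is made up:

```python
import torch

device = torch.accelerator.current_accelerator()  # e.g. cuda or xpu, whichever backend is active

start = torch.Event(enable_timing=True)
end = torch.Event(enable_timing=True)

x = torch.randn(2048, 2048, device=device)
start.record()
y = x @ x
end.record()

torch.accelerator.synchronize()  # wait until both events have completed
print(f"matmul took {start.elapsed_time(end):.2f} ms on {device}")
```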
torch.accelerator.reset_peak_memory_stats()
self._start_mem = torch.accelerator.memory_allocated()
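As an aside, the memory-accounting pattern in this hunk can be exercised stand-alone with something like the sketch below, assuming a PyTorch build where torch.accelerator exposes these memory APIs; the workload is illustrative:

```python
import torch

device = torch.accelerator.current_accelerator()

torch.accelerator.reset_peak_memory_stats()
start_mem = torch.accelerator.memory_allocated()

# Any accelerator workload; a matmul stands in for a training step here.
x = torch.randn(8, 4096, device=device)
w = torch.randn(4096, 4096, device=device)
y = x @ w

print(f"allocated during step: {torch.accelerator.memory_allocated() - start_mem} bytes")
```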
Curious to see a wandb run with main vs this branch and compare results to make sure they are the same.
At first glance the PR looks great! The provisioner looks much cleaner.
I would feel more confident merging it if we had a wandb run comparing main vs this branch for the same setup, anything with >1 GPU. Would you have the resources to provide that, if it's not asking too much? (You can say no.)
I also feel like I should have Claude take a look at this and see if we missed some important .cuda vs .accelerator calls, and whether the test changes are "masking" any errors. I will get this done this week.
Thanks for contributing this PR!

This PR enables the SFT (Supervised Fine-Tuning) application to work on XPU hardware by abstracting device management from CUDA-specific APIs to hardware-agnostic PyTorch accelerator APIs.
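Roughly, the substitution pattern looks like the sketch below; the exact call sites in the SFT app may differ, and the CUDA lines are shown only for contrast:

```python
import torch

# Before (CUDA-specific):
#   torch.cuda.synchronize()
#   torch.cuda.reset_peak_memory_stats()
#   start = torch.cuda.Event(enable_timing=True)

# After (hardware-agnostic, works on CUDA, XPU, ...):
if torch.accelerator.is_available():
    torch.accelerator.synchronize()
    torch.accelerator.reset_peak_memory_stats()
    start = torch.Event(enable_timing=True)
```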
Changes: