
Make SFT hardware-agnostic #749

Merged: felipemello1 merged 8 commits into meta-pytorch:main from DamianSzwichtenberg:dev/sft on Feb 23, 2026

Conversation

@DamianSzwichtenberg (Contributor):

This PR enables the SFT (Supervised Fine-Tuning) application to work on XPU hardware by abstracting device management from CUDA-specific APIs to hardware-agnostic PyTorch accelerator APIs.

Changes:

  • Replaced CUDA-specific API calls (torch.cuda.*) with hardware-agnostic accelerator APIs (torch.accelerator.*) throughout the codebase
  • Introduced a DeviceProxy class to handle device counting and environment-variable mapping for different hardware backends (CUDA, XPU); see the sketch after this list
  • Updated test files to use generic device-visibility terminology and to mock accelerator APIs instead of CUDA-specific mocks
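
To make this concrete, here is a minimal sketch of the idea behind such a proxy. It is illustrative only, not the PR's exact implementation, and the XPU isolation variable is an assumption:

```python
# Illustrative sketch; the PR's actual DeviceProxy may differ.
import torch


class DeviceProxy:
    # Assumed mapping: CUDA_VISIBLE_DEVICES is standard for NVIDIA GPUs;
    # ZE_AFFINITY_MASK is the Level Zero variable commonly used to restrict
    # which Intel XPU devices a process can see.
    _ISOLATION_KEYS = {"cuda": "CUDA_VISIBLE_DEVICES", "xpu": "ZE_AFFINITY_MASK"}

    @staticmethod
    def device_count() -> int:
        # Counts devices for whichever accelerator backend is active.
        return torch.accelerator.device_count()

    @classmethod
    def get_isolation_env_vars(cls, gpu_ids: list[int]) -> dict[str, str]:
        # Pick the env-var key from the current accelerator type and render
        # the device ids as the usual comma-separated list.
        device_type = torch.accelerator.current_accelerator().type
        return {cls._ISOLATION_KEYS[device_type]: ",".join(map(str, gpu_ids))}
```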

The meta-cla bot added the CLA Signed label on Feb 3, 2026.
@felipemello1 (Contributor):

Thanks for the PR @DamianSzwichtenberg, this is very useful. We are on an offsite this week, can I get back to you in a few days?

@DamianSzwichtenberg (Contributor, Author):

@felipemello1 No problem, enjoy!

@DamianSzwichtenberg (Contributor, Author):

@felipemello1 Just a gentle bump on this.

Comment on lines +410 to +411:

```python
# Set device isolation using the appropriate environment variable
env_vars.update(DeviceProxy.get_isolation_env_vars(gpu_ids))
```
Contributor:

How sure are we that the key here is correct, since we don't test the key anymore?


Contributor (Author):

The key is still tested, just indirectly.
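
For illustration only, an indirect check of that key could look like the sketch below; the test name and the expected XPU variable are assumptions, not the actual test chain from this PR:

```python
# Hypothetical test sketch (illustrative; assumes DeviceProxy is importable
# from the provisioner module). The expected key is derived from the active
# backend rather than hard-coded, so the key/backend mapping stays covered.
import torch


def test_isolation_env_vars_match_backend():
    env = DeviceProxy.get_isolation_env_vars([0, 1])
    device_type = torch.accelerator.current_accelerator().type
    expected_key = {
        "cuda": "CUDA_VISIBLE_DEVICES",
        "xpu": "ZE_AFFINITY_MASK",  # assumed XPU isolation variable
    }[device_type]
    assert env == {expected_key: "0,1"}
```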

Comment on diff:

```diff
 ) # (name, future, submission_index)
 self._durations: list[tuple[str, float]] = []
-self._chain_start: torch.cuda.Event | None = None
+self._chain_start: torch.Event | None = None
```
Contributor:

Not familiar with these APIs; is this correct? Just double-checking.

Contributor (Author):

Yes, torch.Event will query the current accelerator type.
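
As a quick sketch of why this is device-agnostic: torch.Event resolves to the event type of the current accelerator, so the same timing code runs on CUDA and XPU alike (assuming a PyTorch version that ships the torch.accelerator API):

```python
import torch

# torch.Event dispatches to the active accelerator backend (CUDA, XPU, ...),
# so no backend-specific event class is needed.
device = torch.accelerator.current_accelerator()

start = torch.Event(enable_timing=True)
end = torch.Event(enable_timing=True)

start.record()
a = torch.randn(1024, 1024, device=device)
b = a @ a  # some accelerator work to time
end.record()

torch.accelerator.synchronize()  # ensure both events have completed
print(f"matmul took {start.elapsed_time(end):.2f} ms")
```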

Comment on lines +188 to +189:

```python
torch.accelerator.reset_peak_memory_stats()
self._start_mem = torch.accelerator.memory_allocated()
```
Contributor:

Curious to see a wandb run with main vs this branch, comparing results to make sure they are the same.

Contributor (Author):

Here you go: link
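
For reference, the two calls in this hunk port one-to-one from their torch.cuda counterparts. A minimal usage sketch, assuming a PyTorch build where torch.accelerator exposes these memory APIs:

```python
import torch

# Measure how much accelerator memory a block of work allocates, without
# referencing torch.cuda directly.
torch.accelerator.reset_peak_memory_stats()
start_mem = torch.accelerator.memory_allocated()

x = torch.randn(4096, 4096, device=torch.accelerator.current_accelerator())

used = torch.accelerator.memory_allocated() - start_mem
print(f"allocated {used / 1024**2:.1f} MiB during this step")
```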

@felipemello1 (Contributor) left a comment:

At first glance the PR looks great! The provisioner looks much cleaner.

I would feel more confident merging it if we had a wandb run comparing main vs this branch for the same setup, anything with >1 GPU. Would you have the resources to provide that, if it's not asking too much? (You can say no.)

I also feel like I should have Claude take a look at this and see if we missed some important .cuda vs .accelerator call, and whether the test changes are "masking" any errors. I will get this done this week.

Thanks for contributing this PR!

@felipemello1 (Contributor) left a comment:

Tested with RL too; everything looks good. Thanks for the great PR!

felipemello1 merged commit 3b233c1 into meta-pytorch:main on Feb 23, 2026. 10 checks passed.

Labels: CLA Signed (managed by the Meta Open Source bot)


3 participants