
Make SFT hardware-agnostic #749

Merged: felipemello1 merged 8 commits into meta-pytorch:main from DamianSzwichtenberg:dev/sft on Feb 23, 2026

Conversation

@DamianSzwichtenberg (Contributor):

This PR enables the SFT (Supervised Fine-Tuning) application to work on XPU hardware by abstracting device management from CUDA-specific APIs to hardware-agnostic PyTorch accelerator APIs.

Changes:

  • Replaced CUDA-specific API calls (torch.cuda.*) with hardware-agnostic accelerator APIs (torch.accelerator.*) throughout the codebase
  • Introduced a DeviceProxy class to handle device counting and environment-variable mapping for different hardware backends (CUDA, XPU); see the sketch after this list
  • Updated test files to use generic device-visibility terminology and to mock accelerator APIs instead of CUDA-specific mocks
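
To make this concrete, here is a minimal sketch of the idea behind such a proxy. It is illustrative only, not the PR's exact implementation, and the XPU isolation variable is an assumption:

```python
# Illustrative sketch; the PR's actual DeviceProxy may differ.
import torch


class DeviceProxy:
    # Assumed mapping: CUDA_VISIBLE_DEVICES is standard for NVIDIA GPUs;
    # ZE_AFFINITY_MASK is the Level Zero variable commonly used to restrict
    # which Intel XPU devices a process can see.
    _ISOLATION_KEYS = {"cuda": "CUDA_VISIBLE_DEVICES", "xpu": "ZE_AFFINITY_MASK"}

    @staticmethod
    def device_count() -> int:
        # Counts devices for whichever accelerator backend is active.
        return torch.accelerator.device_count()

    @classmethod
    def get_isolation_env_vars(cls, gpu_ids: list[int]) -> dict[str, str]:
        # Pick the env-var key from the current accelerator type and render
        # the device ids as the usual comma-separated list.
        device_type = torch.accelerator.current_accelerator().type
        return {cls._ISOLATION_KEYS[device_type]: ",".join(map(str, gpu_ids))}
```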

The meta-cla bot added the CLA Signed label on Feb 3, 2026.
@felipemello1 (Contributor):

Thanks for the PR @DamianSzwichtenberg, this is very useful. We are on an offsite this week, can I get back to you in a few days?

@DamianSzwichtenberg (Contributor, Author):

@felipemello1 No problem, enjoy!

@DamianSzwichtenberg (Contributor, Author):

@felipemello1 Just a gentle bump on this.

Comment on lines +410 to +411:

```python
# Set device isolation using the appropriate environment variable
env_vars.update(DeviceProxy.get_isolation_env_vars(gpu_ids))
```
Contributor:

How sure are we that the key here is correct, since we don't test the key anymore?


Contributor (Author):

The key is still tested, just indirectly.
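
For illustration only, an indirect check of that key could look like the sketch below; the test name and the expected XPU variable are assumptions, not the actual test chain from this PR:

```python
# Hypothetical test sketch (illustrative; assumes DeviceProxy is importable
# from the provisioner module). The expected key is derived from the active
# backend rather than hard-coded, so the key/backend mapping stays covered.
import torch


def test_isolation_env_vars_match_backend():
    env = DeviceProxy.get_isolation_env_vars([0, 1])
    device_type = torch.accelerator.current_accelerator().type
    expected_key = {
        "cuda": "CUDA_VISIBLE_DEVICES",
        "xpu": "ZE_AFFINITY_MASK",  # assumed XPU isolation variable
    }[device_type]
    assert env == {expected_key: "0,1"}
```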

Comment on diff:

```diff
 ) # (name, future, submission_index)
 self._durations: list[tuple[str, float]] = []
-self._chain_start: torch.cuda.Event | None = None
+self._chain_start: torch.Event | None = None
```
Contributor:

Not familiar with these APIs; is this correct? Just double-checking.

Contributor (Author):

Yes, torch.Event will query the current accelerator type.
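
As a quick sketch of why this is device-agnostic: torch.Event resolves to the event type of the current accelerator, so the same timing code runs on CUDA and XPU alike (assuming a PyTorch version that ships the torch.accelerator API):

```python
import torch

# torch.Event dispatches to the active accelerator backend (CUDA, XPU, ...),
# so no backend-specific event class is needed.
device = torch.accelerator.current_accelerator()

start = torch.Event(enable_timing=True)
end = torch.Event(enable_timing=True)

start.record()
a = torch.randn(1024, 1024, device=device)
b = a @ a  # some accelerator work to time
end.record()

torch.accelerator.synchronize()  # ensure both events have completed
print(f"matmul took {start.elapsed_time(end):.2f} ms")
```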

Comment on lines +188 to +189:

```python
torch.accelerator.reset_peak_memory_stats()
self._start_mem = torch.accelerator.memory_allocated()
```
Contributor:

Curious to see a wandb run with main vs this branch, comparing results to make sure they are the same.

Contributor (Author):

Here you go: link
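
For reference, the two calls in this hunk port one-to-one from their torch.cuda counterparts. A minimal usage sketch, assuming a PyTorch build where torch.accelerator exposes these memory APIs:

```python
import torch

# Measure how much accelerator memory a block of work allocates, without
# referencing torch.cuda directly.
torch.accelerator.reset_peak_memory_stats()
start_mem = torch.accelerator.memory_allocated()

x = torch.randn(4096, 4096, device=torch.accelerator.current_accelerator())

used = torch.accelerator.memory_allocated() - start_mem
print(f"allocated {used / 1024**2:.1f} MiB during this step")
```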

@felipemello1 (Contributor) left a comment:

At first glance the PR looks great! The provisioner looks much cleaner.

I would feel more confident merging it if we had a wandb run comparing main vs this branch for the same setup, anything with >1 GPU. Would you have the resources to provide that, if it's not asking too much? (You can say no.)

I also feel like I should have Claude take a look at this and see if we missed some important .cuda vs .accelerator call, and whether the test changes are "masking" any errors. I will get this done this week.

Thanks for contributing this PR!

@felipemello1 (Contributor) left a comment:

Tested with RL too; everything looks good. Thanks for the great PR!

felipemello1 merged commit 3b233c1 into meta-pytorch:main on Feb 23, 2026. 10 checks passed.

Labels: CLA Signed (managed by the Meta Open Source bot)


3 participants