[Pallas] Exclude output-only tensors from pallas_call inputs #1998

Draft
norx1991 wants to merge 1 commit into main from yifeixu/pallas-vmem-fix-v2

norx1991 (Contributor) commented Apr 9, 2026

Summary

Builds on #1984 (HBM donate for output-only tensors). Output-only tensors are now excluded from pallas_call inputs entirely, and the launcher returns the pallas_call results as torch tensors. The generated host code captures the return value.

#1984 already eliminated VMEM pressure via HBM in_specs, but the donated tensor still triggered OpSplitMode::kSplitBoth in torch_tpu, inserting an empty.1 broadcast op (~127 us overhead). This PR removes that overhead by leaving input_output_aliases empty for output-only kernels: with no donation, there is no graph split.
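To make the aliasing change concrete, here is a minimal sketch of how a launcher might derive the pallas_call input list and the input_output_aliases map. `plan_call` and its parameters are hypothetical and not taken from this PR. The sketch assumes that the in-place indices are a subset of the output indices and that aliases map an input position to an output position, as JAX's pallas_call expects.

```python
from typing import Dict, List, Sequence, Tuple


def plan_call(
    num_args: int,
    output_indices: Sequence[int],
    inplace_indices: Sequence[int],
) -> Tuple[List[int], Dict[int, int]]:
    """Hypothetical helper: choose kernel inputs and input/output aliases.

    output_indices: positions of arguments the kernel writes, in output order.
    inplace_indices: assumed subset of output_indices that the kernel also reads.
    """
    inplace = set(inplace_indices)
    output_only = set(output_indices) - inplace

    # Output-only arguments are never passed in, so they cannot be donated.
    input_args = [i for i in range(num_args) if i not in output_only]
    input_pos = {arg: pos for pos, arg in enumerate(input_args)}

    # Alias only the tensors that are both read and written.
    aliases = {
        input_pos[arg]: out_pos
        for out_pos, arg in enumerate(output_indices)
        if arg in inplace
    }
    return input_args, aliases


# Output-only kernel (x, out): only x is an input and nothing is aliased.
inputs_a, aliases_a = plan_call(2, output_indices=[1], inplace_indices=[])

# Mixed kernel (x, out, acc) where acc is updated in place.
inputs_b, aliases_b = plan_call(3, output_indices=[1, 2], inplace_indices=[2])
```

In the first call the alias map is empty, which is the condition the description ties to avoiding the graph split. In the second call only the in-place argument is aliased.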

Generated code before (#1984):

out = torch.empty_like(x)
_launcher(kernel, grid, x, out, _output_indices=[1], _inplace_indices=[], ...)
return out

Generated code after:

out = torch.empty_like(x)
out = _launcher(kernel, grid, x, out, _output_indices=[1], _inplace_indices=[], ...)
return out

The launcher excludes out from pallas_call inputs and returns the result directly. For multiple output-only tensors: out1, out2 = _launcher(...).
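The calling convention can be illustrated with a small, dependency-free sketch. `launch`, `run_kernel`, and `output_only_indices` are hypothetical stand-ins for `_launcher`, the pallas_call invocation, and `_output_indices`. Plain Python lists stand in for tensors, and the sketch assumes the kernel returns one result per output-only argument.

```python
from typing import Any, Callable, Sequence


def launch(
    run_kernel: Callable[..., Sequence[Any]],
    args: Sequence[Any],
    output_only_indices: Sequence[int],
) -> Any:
    """Hypothetical launcher: drop output-only args and return kernel results."""
    skipped = set(output_only_indices)
    # Only arguments the kernel actually reads are forwarded as inputs.
    inputs = [a for i, a in enumerate(args) if i not in skipped]
    results = list(run_kernel(*inputs))
    if len(results) != len(skipped):
        raise ValueError("expected one result per output-only argument")
    # A single output is returned bare; several outputs are returned as a tuple.
    return results[0] if len(results) == 1 else tuple(results)


def double(x):
    # Stand-in kernel with one output.
    return [[2 * v for v in x]]


def split(x):
    # Stand-in kernel with two outputs.
    return [x[:1], x[1:]]


x = [1, 2, 3]
out = launch(double, (x, None), output_only_indices=[1])
out1, out2 = launch(split, (x, None, None), output_only_indices=[1, 2])
```

The placeholder passed in the output slot is never forwarded to the kernel, which mirrors how the preallocated `out` is ignored as an input and replaced by the returned value.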

Authored with Claude Code.

meta-cla bot added the CLA Signed label Apr 9, 2026
norx1991 force-pushed the yifeixu/pallas-vmem-fix-v2 branch from 76fcda7 to 6a459f3 on April 9, 2026 21:42
Output-only tensors (allocated with empty/empty_like/new_empty and
never read by the kernel) are excluded from pallas_call inputs
entirely. The launcher returns the pallas_call results for these
tensors, and the generated host code captures the return value.

This eliminates both VMEM pressure (output-only tensor not loaded)
and the OpSplitMode::kSplitBoth graph split in torch_tpu (empty
input_output_aliases for output-only kernels).