AOT-Inductor compile of the full SAM3 pipeline#10
rbavery wants to merge 2 commits into
Loads the exported `.pt2` produced by export_sam3_full_pipeline.py and runs `torch._inductor.aoti_compile_and_package`. No workarounds (e.g. `split_reductions=False`); this is a baseline run to see what 2.10 / current main does.
Plain `torch._inductor.aoti_compile_and_package` + `aoti_load_package` both need an explicit `import torch._inductor.codecache` on torch 2.10; `torch.export.pt2_archive._package._load_aoti` hits an AttributeError without it. Also documents the nvidia-cuda-nvcc + nvidia-cuda-cccl install path, since torch wheels don't bundle nvcc.
Stacks on #9. Adds a one-file script that loads the exported `.pt2` and packages it with `torch._inductor.aoti_compile_and_package`.

Result
Plain AOTI compile worked on torch 2.10.0+cu128: no `split_reductions=False`, no math-SDPA forcing, no other workarounds. The crash from pytorch/pytorch#174608 didn't reproduce on this graph.

Likely why this works where #5 didn't:

- No `enable_math_sdp(True)` needed: PR #9 (Enable torch.export of the full SAM3 grounding pipeline) routes RPB cross-attn through `_cross_attn_with_rpb`, which lets the SDPA dispatcher pick efficient-attention by default and sidesteps the `triton_red_fused__safe_softmax_*` codegen path that crashed.
- Explicit `Dim("batch", min=1)` and `Dim("num_prompts", min=1)` instead of `Dim.AUTO` from a length-1 example.

Verification on RTX 3090
Dynamic shapes still work after AOTI compile
Performance (bs=2, np=3 on a 3090)
Numerical drift on real images
- truck.jpg + "truck"
- groceries.jpg + "fruit"

Top-1 boxes agree to 4 decimal places on both images. Random-Gaussian inputs produced much larger drift (max diff ~41 on masks), but real images stay in-distribution and the bf16 internals don't move scores past any reasonable confidence threshold.
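The agreement check boils down to a max-abs-diff over matched outputs; a minimal helper (the function name is mine, not from the script):

```python
import torch

def max_abs_diff(a: torch.Tensor, b: torch.Tensor) -> float:
    """Max elementwise absolute difference, the drift metric quoted above."""
    return (a.float() - b.float()).abs().max().item()

# Toy tensors standing in for eager vs AOTI boxes/scores
ref = torch.tensor([1.0, 2.0, 3.0])
aoti = torch.tensor([1.0, 2.5, 3.0])
assert max_abs_diff(ref, aoti) == 0.5
```

Casting to float32 before subtracting avoids bf16 rounding in the metric itself.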
Setup notes (CUDA toolkit not installed system-wide)
The torch wheel doesn't bundle `nvcc`, so on a fresh box install it from PyPI (`nvidia-cuda-nvcc` plus `nvidia-cuda-cccl`). Plus a torch-2.10 quirk in the loader: `aoti_load_package` needs an explicit `import torch._inductor.codecache` first, or `torch.export.pt2_archive._package._load_aoti` hits an AttributeError.
The script does this automatically; documented in the module docstring.
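A sketch of the fresh-box install (the `-cu12` package suffix and the pip-installed nvcc path are assumptions; match them to your torch wheel and verify locally):

```shell
# Install nvcc + CCCL headers from PyPI (cu12 variants assumed here)
pip install nvidia-cuda-nvcc-cu12 nvidia-cuda-cccl-cu12

# Point the toolchain at the pip-installed nvcc (path layout is an assumption)
export CUDA_HOME="$(python -c 'import os, nvidia.cuda_nvcc; print(os.path.dirname(nvidia.cuda_nvcc.__file__))')"
export PATH="$CUDA_HOME/bin:$PATH"
```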
Test plan
- `python scripts/compile_sam3_aoti.py --in artifacts/export/full_sam3_pipeline.pt2 --out artifacts/aoti/full_sam3_pipeline_aoti.pt2`
- `PT2ModelLoader` (it already handles `aoti_runners` first, `exported_programs` second — see model_loaders.py)
- `artifacts/aoti_compare/` for sanity (from the export-pipeline-minimal branch validation)

Follow-ups (out of scope here)
- The `.pt2` packages both `model` and `transforms` keys; this PR only AOTI-compiles `model`. `transforms` is just an `F.interpolate` so it's cheap, but a follow-up could AOTI-compile both and bundle them into one archive that drops in for the production `.pt2`.
- A test under `tests/export/` that compiles a tiny graph (not the full 3.5 GB pipeline) to keep CI honest.