feat: TRT 10.16.1.11 + FP8 quantization (V2V/CN/IPA) + CUDA hot-path optimizations#12
Open

forkni wants to merge 1 commit into SDTD_031_dev
Conversation
feat: TRT 10.16.1.11 + FP8 quantization (V2V/CN/IPA) + CUDA hot-path optimizations
- setup.py: bump onnx→1.19.1, onnx-graphsurgeon→0.6.1, polygraphy→0.49.26;
add onnxoptimizer/onnxslim/onnxscript/onnx-ir to tensorrt extras.
- tensorrt/__init__.py: CUDA_MODULE_LOADING=LAZY at module import.
- engine_manager.py: embed --trt{ver}--cc{SM} in engine dir name so upgrades
auto-invalidate stale caches; FP8 cache key bumped fp8v2→fp8v3.
- utilities.py: SM_120+ tactic-source mask (CUBLAS|CUBLAS_LT|JIT_CONV|EDGE_MASK,
CUDNN excluded); max_num_tactics=64, avg_timing_iterations=8 on Blackwell;
remove pre-existing enable_all_tactics kwarg bug on Engine.build().
- fp8_quantize.py: ONNX-export path for FP8 calibration (V2V + ControlNet +
IP-Adapter); calibration tensor synthesis with correct dims; OOB Gather fix
for ipadapter_scale; padded capture to ONNX-declared static dims.
- builder.py: builder_optimization_level param (0-5, default 4); STRONGLY_TYPED
support; FP8 precision flags wired through build config.
- wrapper.py, pipeline.py, config.py: FP8 flags, cached-attention KV bucketing
(cache_maxframes), ip_adapter_scale propagation, F821 undefined name fix.
- tools/gpu_profiler.py: new NVTX-annotated profiler with Nsight Compute support.
- scripts/profiling/: profile_nsys.py, profile_ncu.py, README.md.
- tools/summarize_audit.py: Nsight audit summarizer (1706 lines).
- configs/profiling/: 7 profiling config templates (FP16/FP8 cached/flexible/full).
- configs/td_config.yaml.example: TD config reference with all new params.
- .gitignore: add workspace hygiene entries (.claude/, Debug/, StreamDiffusion-
installer/, StreamDiffusionTD/, custom_processors/, batch files, =0.19.0,
Nsight outputs, SESSION_LOG.md, audit_reports/).
Engine cache invalidation: old --trt10.12.0.36 dirs are orphaned on disk;
new builds land under --trt10.16.1.11--cc{SM}. Blackwell-specific tactic
gating is a no-op on Ada (SM 8.9) — confirmed at 28.3 FPS / 52.8 FPS (FP16
CUDA graphs) / 60.7 FPS (FP8 plain) / 30+ FPS (FP8+IPA) on RTX 4090.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
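The engine_manager.py change described above keys the engine cache directory on the TensorRT version and the GPU compute capability, so upgrading either one can never pick up a stale plan. Below is a minimal sketch of that naming scheme; the helper name and base-directory layout are illustrative, not the PR's exact code:

```python
import tensorrt as trt
import torch

def engine_cache_dir(base: str, model_key: str) -> str:
    """Illustrative sketch: embed TRT version and SM in the cache dir name
    so upgrading either one automatically invalidates old engine plans."""
    major, minor = torch.cuda.get_device_capability()
    sm = f"{major}{minor}"        # e.g. "89" on Ada, "120" on Blackwell
    trt_ver = trt.__version__     # e.g. "10.16.1.11"
    return f"{base}/{model_key}--trt{trt_ver}--cc{sm}"

# e.g. "engines/unet-fp8v3--trt10.16.1.11--cc89" on an RTX 4090
print(engine_cache_dir("engines", "unet-fp8v3"))
```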
Summary
- Dependency pins: `onnx==1.19.1`, `onnx-graphsurgeon==0.6.1`, `polygraphy==0.49.26`, `nvidia-modelopt>=0.19`. TRT 10.16.1.11 is the first Blackwell-Windows-production release, fixing the 78% FP8 perf regression in 10.12–10.13 on SM_120.
- FP8 calibration for V2V, ControlNet, and IP-Adapter: OOB Gather fix for `ipadapter_scale`; padded capture to static dims.
- SM_120+ tactic-source mask (`CUBLAS|CUBLAS_LT|JIT_CONV|EDGE_MASK`, CUDNN excluded); `max_num_tactics=64`, `avg_timing_iterations=8` on Blackwell; all Blackwell-specific code is a no-op on Ada (SM 8.9). See the builder-config sketch after this list.
- Engine cache keying (`--trt{ver}--cc{SM}` embedded in engine dir name); upgrades auto-invalidate stale dirs; FP8 cache key bumped `fp8v2 → fp8v3`.
- `CUDA_MODULE_LOADING=LAZY` at module import; `builder_optimization_level` param (0-5, default 4); `STRONGLY_TYPED` support.
- Cached-attention KV bucketing (`cache_maxframes`), IP-Adapter scale propagation, F821 undefined name fix in wrapper.
- Profiling tooling: `tools/gpu_profiler.py` (NVTX), `scripts/profiling/` (Nsight Systems + Nsight Compute drivers + README), `tools/summarize_audit.py`, 7 profiling config templates.
- `configs/td_config.yaml.example` with all new parameters documented.
- .gitignore hygiene: `.claude/`, `Debug/`, `StreamDiffusion-installer/`, `StreamDiffusionTD/`, `custom_processors/`, batch files, Nsight outputs, `SESSION_LOG.md`, `audit_reports/`.
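For context on the Blackwell tactic gating and optimization-level knobs above, here is a minimal sketch of how such a builder config can be assembled with the TensorRT Python API. `make_builder_config` and the SM gate are illustrative, not the PR's actual code, and it assumes the `TacticSource` members named in this PR are exposed by the installed TRT build:

```python
import tensorrt as trt
import torch

def make_builder_config(builder: trt.Builder, opt_level: int = 4) -> trt.IBuilderConfig:
    """Illustrative only: gate Blackwell-specific tuning behind the compute capability."""
    config = builder.create_builder_config()
    config.builder_optimization_level = opt_level  # 0-5, default 4 in this PR

    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (12, 0):  # SM_120+ (Blackwell); no-op on Ada (8, 9)
        # Keep cuBLAS / cuBLASLt / JIT conv / edge-mask conv tactics, drop cuDNN.
        mask = (
            (1 << int(trt.TacticSource.CUBLAS))
            | (1 << int(trt.TacticSource.CUBLAS_LT))
            | (1 << int(trt.TacticSource.JIT_CONVOLUTIONS))
            | (1 << int(trt.TacticSource.EDGE_MASK_CONVOLUTIONS))
        )
        config.set_tactic_sources(mask)
        config.avg_timing_iterations = 8
        # The PR additionally caps tactic count (max_num_tactics=64) via its build plumbing.
    return config

# STRONGLY_TYPED networks are requested via a network-definition creation flag:
builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))
config = make_builder_config(builder)
```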
Cache Invalidation Notice

Old engine directories under the `--trt10.12.0.36` prefix are orphaned on disk; they remain readable, but new builds land under `--trt10.16.1.11--cc{SM}`. Rollback = revert this commit; old engine dirs immediately become live again.
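To see which caches will be orphaned before merging, a quick check along these lines works; the engine cache root path here is an assumption, adjust it to your install:

```python
from pathlib import Path

# Hypothetical cache root; point this at wherever your engines are stored.
ENGINE_ROOT = Path("engines")

orphaned = sorted(p.name for p in ENGINE_ROOT.glob("*--trt10.12.0.36*") if p.is_dir())
live = sorted(p.name for p in ENGINE_ROOT.glob("*--trt10.16.1.11--cc*") if p.is_dir())
print("orphaned:", *orphaned, sep="\n  ")
print("live:", *live, sep="\n  ")
```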
Verified Performance (RTX 4090, SDXL, Ada SM 8.9)

Confirmed at 28.3 FPS / 52.8 FPS (FP16 CUDA graphs) / 60.7 FPS (FP8 plain) / 30+ FPS (FP8+IPA); see the commit message above.

Review Hints

- `src/streamdiffusion/acceleration/tensorrt/fp8_quantize.py` and `utilities.py` (a rough calibration sketch follows this list)
- `acceleration/tensorrt/builder.py` (`builder_optimization_level`, `STRONGLY_TYPED`)
- `acceleration/tensorrt/engine_manager.py`
- `src/streamdiffusion/tools/gpu_profiler.py`, `scripts/profiling/`
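For reviewers new to the modelopt flow, this is roughly the shape of an FP8 calibrate-then-export path. It is a generic sketch (the model, calibration batches, output path, and opset are placeholders), not the PR's fp8_quantize.py:

```python
import torch
import modelopt.torch.quantization as mtq

def calibrate_and_export_fp8(model: torch.nn.Module,
                             calib_batches: list[tuple],
                             onnx_path: str = "unet_fp8.onnx") -> None:
    """Generic FP8 Q/DQ calibration + ONNX export sketch (placeholder names)."""
    def forward_loop(m: torch.nn.Module) -> None:
        # Run a handful of representative batches so modelopt can collect amax stats.
        with torch.no_grad():
            for args in calib_batches:
                m(*args)

    # Insert FP8 quantizers and calibrate them in place.
    mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

    # Export with the Q/DQ nodes baked in; the PR pads captures to the
    # ONNX-declared static dims before this step.
    torch.onnx.export(model, calib_batches[0], onnx_path, opset_version=17)
```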
git grep "fp8v3" src/streamdiffusion/acceleration/tensorrt/engine_manager.pyreturns a matchgit grep "10.16.1.11" setup.py src/streamdiffusion/tools/install-tensorrt.pyshows the pin in both filesgh pr diff <PR#> | wc -lin 9000–12000 range (no ruff noise)--trt10.16.1.11--cc89(Ada) or--cc120(Blackwell)python -c "from streamdiffusion.acceleration.tensorrt.fp8_quantize import *; print('OK')"passes (modelopt import gate)🤖 Generated with Claude Code