
feat: TRT 10.16.1.11 + FP8 quantization (V2V/CN/IPA) + CUDA hot-path optimizations#12

Open
forkni wants to merge 1 commit into SDTD_031_dev from dotsimulate/feat/trt10.16-fp8-perf

Conversation


@forkni (Collaborator) commented Apr 26, 2026

Summary

  • TRT 10.16.1.11 upgrade with onnx==1.19.1, onnx-graphsurgeon==0.6.1, polygraphy==0.49.26, nvidia-modelopt>=0.19; 10.16.1.11 is the first Blackwell-on-Windows production release and fixes the 78% FP8 perf regression that 10.12–10.13 showed on SM_120.
  • FP8 E4M3 quantization via the ONNX export path (V2V + ControlNet + IP-Adapter); calibration tensor synthesis with correct ONNX-declared dims; OOB Gather fix for ipadapter_scale; padded capture to static dims; see the quantization sketch after this list.
  • SM_120 tactic-source mask (CUBLAS|CUBLAS_LT|JIT_CONV|EDGE_MASK, CUDNN excluded); max_num_tactics=64 and avg_timing_iterations=8 on Blackwell; all Blackwell-specific code is a no-op on Ada (SM 8.9); see the builder-config sketch after this list.
  • TRT-keyed engine cache (--trt{ver}--cc{SM} embedded in engine dir name) — upgrades auto-invalidate stale dirs; FP8 cache key bumped fp8v2 → fp8v3.
  • CUDA_MODULE_LOADING=LAZY at module import; builder_optimization_level param (0-5, default 4); STRONGLY_TYPED support.
  • Cached-attention KV bucketing (cache_maxframes), IP-Adapter scale propagation, F821 undefined name fix in wrapper.
  • Profiling infrastructure: tools/gpu_profiler.py (NVTX), scripts/profiling/ (Nsight Systems + Nsight Compute drivers + README), tools/summarize_audit.py, 7 profiling config templates.
  • TD config reference: configs/td_config.yaml.example with all new parameters documented.
  • Workspace gitignore hygiene: .claude/, Debug/, StreamDiffusion-installer/, StreamDiffusionTD/, custom_processors/, batch files, Nsight outputs, SESSION_LOG.md, audit_reports/.
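
For orientation, here is a minimal sketch of FP8 (E4M3) calibration through nvidia-modelopt. It uses modelopt's generic torch quantization API rather than this PR's actual ONNX-export path, and `unet`, `make_calibration_inputs`, and the batch count are placeholders, not names from fp8_quantize.py.

```python
# Minimal sketch of modelopt FP8 (E4M3) quantization ahead of ONNX export.
# `unet` and `make_calibration_inputs()` stand in for the V2V/ControlNet/
# IP-Adapter modules and the calibration-tensor synthesis described above;
# the real implementation lives in fp8_quantize.py.
import torch
import modelopt.torch.quantization as mtq

def quantize_fp8(unet: torch.nn.Module, make_calibration_inputs, num_batches: int = 8):
    def forward_loop(model):
        # Run a handful of calibration batches whose shapes match the
        # ONNX-declared static dims (padded capture), so FP8 amax values
        # get collected for every quantized layer.
        for _ in range(num_batches):
            args, kwargs = make_calibration_inputs()
            model(*args, **kwargs)

    # FP8_DEFAULT_CFG quantizes Linear/Conv inputs and weights to E4M3.
    return mtq.quantize(unet, mtq.FP8_DEFAULT_CFG, forward_loop)
```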
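
And a sketch of the Blackwell-gated builder tuning plus the LAZY module-loading setting. The compute-capability check, the hasattr guard, and the helper name are illustrative assumptions; the real logic lives in acceleration/tensorrt/utilities.py and builder.py.

```python
# Sketch of the SM_120-gated builder tuning described above (not the repo's code).
import os
os.environ.setdefault("CUDA_MODULE_LOADING", "LAZY")  # must be set before the CUDA context exists

import torch
import tensorrt as trt

def apply_builder_tuning(config: trt.IBuilderConfig) -> None:
    config.builder_optimization_level = 4  # new param, range 0-5, default 4

    major, minor = torch.cuda.get_device_capability()
    if (major, minor) < (12, 0):
        return  # Blackwell-specific gating is a no-op on Ada (SM 8.9) and older

    # SM_120: exclude cuDNN tactics, keep cuBLAS / cuBLAS_LT / JIT conv / edge-mask conv.
    mask = (
        (1 << int(trt.TacticSource.CUBLAS))
        | (1 << int(trt.TacticSource.CUBLAS_LT))
        | (1 << int(trt.TacticSource.JIT_CONVOLUTIONS))
        | (1 << int(trt.TacticSource.EDGE_MASK_CONVOLUTIONS))
    )
    config.set_tactic_sources(mask)
    config.avg_timing_iterations = 8
    if hasattr(config, "max_num_tactics"):  # attribute name assumed; absent in older TRT builds
        config.max_num_tactics = 64

# STRONGLY_TYPED networks are created with the corresponding creation flag:
# builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))
```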

Cache Invalidation Notice

Old engine directories under the --trt10.12.0.36 prefix are orphaned on disk; they remain readable, but new builds land under --trt10.16.1.11--cc{SM}. To roll back, revert this commit and the old engine dirs immediately become live again.
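
A minimal sketch of the naming scheme, assuming the suffix is derived from tensorrt.__version__ and the device's compute capability (the helper name is hypothetical; the real schema lives in acceleration/tensorrt/engine_manager.py):

```python
# Illustrative sketch of the TRT-keyed engine cache suffix; names are assumptions.
import tensorrt as trt
import torch

FP8_CACHE_TAG = "fp8v3"  # bumped from fp8v2 by this PR

def engine_cache_suffix() -> str:
    major, minor = torch.cuda.get_device_capability()
    # e.g. "--trt10.16.1.11--cc89" on Ada, "--trt10.16.1.11--cc120" on Blackwell;
    # a TRT upgrade changes the suffix, so stale dirs are simply never reused.
    return f"--trt{trt.__version__}--cc{major}{minor}"
```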

Verified Performance (RTX 4090, SDXL, Ada SM 8.9)

Mode                  FPS
FP16 + CUDA graphs    52.8
FP8 plain             60.7
FP8 + IP-Adapter      30+

Review Hints

  • Bulk of FP8 work: src/streamdiffusion/acceleration/tensorrt/fp8_quantize.py and utilities.py
  • Builder changes: acceleration/tensorrt/builder.py (builder_optimization_level, STRONGLY_TYPED)
  • Cache key schema: acceleration/tensorrt/engine_manager.py
  • Profiling tools: src/streamdiffusion/tools/gpu_profiler.py, scripts/profiling/

Test Plan

  • git grep "fp8v3" src/streamdiffusion/acceleration/tensorrt/engine_manager.py returns a match
  • git grep "10.16.1.11" setup.py src/streamdiffusion/tools/install-tensorrt.py shows the pin in both files
  • gh pr diff <PR#> | wc -l falls in the 9000–12000 range (no ruff noise)
  • First engine build on fresh venv produces dir containing --trt10.16.1.11--cc89 (Ada) or --cc120 (Blackwell)
  • python -c "from streamdiffusion.acceleration.tensorrt.fp8_quantize import *; print('OK')" passes (modelopt import gate)
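
The last item exercises a soft-import gate. A typical shape for such a gate, with the flag and helper names assumed rather than taken from fp8_quantize.py, looks like this:

```python
# Sketch of a modelopt import gate so `from ... import *` succeeds without modelopt installed.
try:
    import modelopt.torch.quantization as mtq
    HAS_MODELOPT = True
except ImportError:
    mtq = None
    HAS_MODELOPT = False

def require_modelopt() -> None:
    # Called only on the FP8 build path, so the bare import check above stays cheap.
    if not HAS_MODELOPT:
        raise RuntimeError(
            "FP8 quantization requires nvidia-modelopt>=0.19; "
            "install the 'tensorrt' extras to enable it."
        )
```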

🤖 Generated with Claude Code

feat: TRT 10.16.1.11 + FP8 quantization (V2V/CN/IPA) + CUDA hot-path optimizations

- setup.py: bump onnx→1.19.1, onnx-graphsurgeon→0.6.1, polygraphy→0.49.26;
  add onnxoptimizer/onnxslim/onnxscript/onnx-ir to tensorrt extras.
- tensorrt/__init__.py: CUDA_MODULE_LOADING=LAZY at module import.
- engine_manager.py: embed --trt{ver}--cc{SM} in engine dir name so upgrades
  auto-invalidate stale caches; FP8 cache key bumped fp8v2→fp8v3.
- utilities.py: SM_120+ tactic-source mask (CUBLAS|CUBLAS_LT|JIT_CONV|EDGE_MASK,
  CUDNN excluded); max_num_tactics=64, avg_timing_iterations=8 on Blackwell;
  remove pre-existing enable_all_tactics kwarg bug on Engine.build().
- fp8_quantize.py: ONNX-export path for FP8 calibration (V2V + ControlNet +
  IP-Adapter); calibration tensor synthesis with correct dims; OOB Gather fix
  for ipadapter_scale; padded capture to ONNX-declared static dims.
- builder.py: builder_optimization_level param (0-5, default 4); STRONGLY_TYPED
  support; FP8 precision flags wired through build config.
- wrapper.py, pipeline.py, config.py: FP8 flags, cached-attention KV bucketing
  (cache_maxframes), ip_adapter_scale propagation, F821 undefined name fix.
- tools/gpu_profiler.py: new NVTX-annotated profiler with Nsight Compute support.
- scripts/profiling/: profile_nsys.py, profile_ncu.py, README.md.
- tools/summarize_audit.py: Nsight audit summarizer (1706 lines).
- configs/profiling/: 7 profiling config templates (FP16/FP8 cached/flexible/full).
- configs/td_config.yaml.example: TD config reference with all new params.
- .gitignore: add workspace hygiene entries (.claude/, Debug/, StreamDiffusion-
  installer/, StreamDiffusionTD/, custom_processors/, batch files, =0.19.0,
  Nsight outputs, SESSION_LOG.md, audit_reports/).

Engine cache invalidation: old --trt10.12.0.36 dirs are orphaned on disk;
new builds land under --trt10.16.1.11--cc{SM}. Blackwell-specific tactic
gating is a no-op on Ada (SM 8.9) — confirmed at 28.3 FPS / 52.8 FPS (FP16
CUDA graphs) / 60.7 FPS (FP8 plain) / 30+ FPS (FP8+IPA) on RTX 4090.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
