
feat: TRT 10.16.1.11 + FP8 quantization (V2V/CN/IPA) + CUDA hot-path optimizations#12

Open
forkni wants to merge 1 commit into SDTD_031_dev from dotsimulate/feat/trt10.16-fp8-perf

Conversation


@forkni (Collaborator) commented Apr 26, 2026

Summary

  • TRT 10.16.1.11 upgrade with onnx==1.19.1, onnx-graphsurgeon==0.6.1, polygraphy==0.49.26, nvidia-modelopt>=0.19; 10.16.1.11 is the first Blackwell-on-Windows production release and fixes the 78% FP8 perf regression that 10.12–10.13 showed on SM_120.
  • FP8 E4M3 quantization via the ONNX export path (V2V + ControlNet + IP-Adapter); calibration tensor synthesis with correct ONNX-declared dims; OOB Gather fix for ipadapter_scale; padded capture to static dims; see the quantization sketch after this list.
  • SM_120 tactic-source mask (CUBLAS|CUBLAS_LT|JIT_CONV|EDGE_MASK, CUDNN excluded); max_num_tactics=64 and avg_timing_iterations=8 on Blackwell; all Blackwell-specific code is a no-op on Ada (SM 8.9); see the builder-config sketch after this list.
  • TRT-keyed engine cache (--trt{ver}--cc{SM} embedded in engine dir name) — upgrades auto-invalidate stale dirs; FP8 cache key bumped fp8v2 → fp8v3.
  • CUDA_MODULE_LOADING=LAZY at module import; builder_optimization_level param (0-5, default 4); STRONGLY_TYPED support.
  • Cached-attention KV bucketing (cache_maxframes), IP-Adapter scale propagation, F821 undefined name fix in wrapper.
  • Profiling infrastructure: tools/gpu_profiler.py (NVTX), scripts/profiling/ (Nsight Systems + Nsight Compute drivers + README), tools/summarize_audit.py, 7 profiling config templates.
  • TD config reference: configs/td_config.yaml.example with all new parameters documented.
  • Workspace gitignore hygiene: .claude/, Debug/, StreamDiffusion-installer/, StreamDiffusionTD/, custom_processors/, batch files, Nsight outputs, SESSION_LOG.md, audit_reports/.
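
For orientation, here is a minimal sketch of FP8 (E4M3) calibration through nvidia-modelopt. It uses modelopt's generic torch quantization API rather than this PR's actual ONNX-export path, and `unet`, `make_calibration_inputs`, and the batch count are placeholders, not names from fp8_quantize.py.

```python
# Minimal sketch of modelopt FP8 (E4M3) quantization ahead of ONNX export.
# `unet` and `make_calibration_inputs()` stand in for the V2V/ControlNet/
# IP-Adapter modules and the calibration-tensor synthesis described above;
# the real implementation lives in fp8_quantize.py.
import torch
import modelopt.torch.quantization as mtq

def quantize_fp8(unet: torch.nn.Module, make_calibration_inputs, num_batches: int = 8):
    def forward_loop(model):
        # Run a handful of calibration batches whose shapes match the
        # ONNX-declared static dims (padded capture), so FP8 amax values
        # get collected for every quantized layer.
        for _ in range(num_batches):
            args, kwargs = make_calibration_inputs()
            model(*args, **kwargs)

    # FP8_DEFAULT_CFG quantizes Linear/Conv inputs and weights to E4M3.
    return mtq.quantize(unet, mtq.FP8_DEFAULT_CFG, forward_loop)
```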
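
And a sketch of the Blackwell-gated builder tuning plus the LAZY module-loading setting. The compute-capability check, the hasattr guard, and the helper name are illustrative assumptions; the real logic lives in acceleration/tensorrt/utilities.py and builder.py.

```python
# Sketch of the SM_120-gated builder tuning described above (not the repo's code).
import os
os.environ.setdefault("CUDA_MODULE_LOADING", "LAZY")  # must be set before the CUDA context exists

import torch
import tensorrt as trt

def apply_builder_tuning(config: trt.IBuilderConfig) -> None:
    config.builder_optimization_level = 4  # new param, range 0-5, default 4

    major, minor = torch.cuda.get_device_capability()
    if (major, minor) < (12, 0):
        return  # Blackwell-specific gating is a no-op on Ada (SM 8.9) and older

    # SM_120: exclude cuDNN tactics, keep cuBLAS / cuBLAS_LT / JIT conv / edge-mask conv.
    mask = (
        (1 << int(trt.TacticSource.CUBLAS))
        | (1 << int(trt.TacticSource.CUBLAS_LT))
        | (1 << int(trt.TacticSource.JIT_CONVOLUTIONS))
        | (1 << int(trt.TacticSource.EDGE_MASK_CONVOLUTIONS))
    )
    config.set_tactic_sources(mask)
    config.avg_timing_iterations = 8
    if hasattr(config, "max_num_tactics"):  # attribute name assumed; absent in older TRT builds
        config.max_num_tactics = 64

# STRONGLY_TYPED networks are created with the corresponding creation flag:
# builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))
```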

Cache Invalidation Notice

Old engine directories under the --trt10.12.0.36 prefix are orphaned on disk; they remain readable, but new builds land under --trt10.16.1.11--cc{SM}. To roll back, revert this commit and the old engine dirs immediately become live again.
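
A minimal sketch of the naming scheme, assuming the suffix is derived from tensorrt.__version__ and the device's compute capability (the helper name is hypothetical; the real schema lives in acceleration/tensorrt/engine_manager.py):

```python
# Illustrative sketch of the TRT-keyed engine cache suffix; names are assumptions.
import tensorrt as trt
import torch

FP8_CACHE_TAG = "fp8v3"  # bumped from fp8v2 by this PR

def engine_cache_suffix() -> str:
    major, minor = torch.cuda.get_device_capability()
    # e.g. "--trt10.16.1.11--cc89" on Ada, "--trt10.16.1.11--cc120" on Blackwell;
    # a TRT upgrade changes the suffix, so stale dirs are simply never reused.
    return f"--trt{trt.__version__}--cc{major}{minor}"
```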

Verified Performance (RTX 4090, SDXL, Ada SM 8.9)

Mode                  FPS
FP16 + CUDA graphs    52.8
FP8 plain             60.7
FP8 + IP-Adapter      30+

Review Hints

  • Bulk of FP8 work: src/streamdiffusion/acceleration/tensorrt/fp8_quantize.py and utilities.py
  • Builder changes: acceleration/tensorrt/builder.py (builder_optimization_level, STRONGLY_TYPED)
  • Cache key schema: acceleration/tensorrt/engine_manager.py
  • Profiling tools: src/streamdiffusion/tools/gpu_profiler.py, scripts/profiling/

Test Plan

  • git grep "fp8v3" src/streamdiffusion/acceleration/tensorrt/engine_manager.py returns a match
  • git grep "10.16.1.11" setup.py src/streamdiffusion/tools/install-tensorrt.py shows the pin in both files
  • gh pr diff <PR#> | wc -l falls in the 9000–12000 range (no ruff noise)
  • First engine build on fresh venv produces dir containing --trt10.16.1.11--cc89 (Ada) or --cc120 (Blackwell)
  • python -c "from streamdiffusion.acceleration.tensorrt.fp8_quantize import *; print('OK')" passes (modelopt import gate)
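
The last item exercises a soft-import gate. A typical shape for such a gate, with the flag and helper names assumed rather than taken from fp8_quantize.py, looks like this:

```python
# Sketch of a modelopt import gate so `from ... import *` succeeds without modelopt installed.
try:
    import modelopt.torch.quantization as mtq
    HAS_MODELOPT = True
except ImportError:
    mtq = None
    HAS_MODELOPT = False

def require_modelopt() -> None:
    # Called only on the FP8 build path, so the bare import check above stays cheap.
    if not HAS_MODELOPT:
        raise RuntimeError(
            "FP8 quantization requires nvidia-modelopt>=0.19; "
            "install the 'tensorrt' extras to enable it."
        )
```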

🤖 Generated with Claude Code

feat: TRT 10.16.1.11 + FP8 quantization (V2V/CN/IPA) + CUDA hot-path optimizations

- setup.py: bump onnx→1.19.1, onnx-graphsurgeon→0.6.1, polygraphy→0.49.26;
  add onnxoptimizer/onnxslim/onnxscript/onnx-ir to tensorrt extras.
- tensorrt/__init__.py: CUDA_MODULE_LOADING=LAZY at module import.
- engine_manager.py: embed --trt{ver}--cc{SM} in engine dir name so upgrades
  auto-invalidate stale caches; FP8 cache key bumped fp8v2→fp8v3.
- utilities.py: SM_120+ tactic-source mask (CUBLAS|CUBLAS_LT|JIT_CONV|EDGE_MASK,
  CUDNN excluded); max_num_tactics=64, avg_timing_iterations=8 on Blackwell;
  remove pre-existing enable_all_tactics kwarg bug on Engine.build().
- fp8_quantize.py: ONNX-export path for FP8 calibration (V2V + ControlNet +
  IP-Adapter); calibration tensor synthesis with correct dims; OOB Gather fix
  for ipadapter_scale; padded capture to ONNX-declared static dims.
- builder.py: builder_optimization_level param (0-5, default 4); STRONGLY_TYPED
  support; FP8 precision flags wired through build config.
- wrapper.py, pipeline.py, config.py: FP8 flags, cached-attention KV bucketing
  (cache_maxframes), ip_adapter_scale propagation, F821 undefined name fix.
- tools/gpu_profiler.py: new NVTX-annotated profiler with Nsight Compute support.
- scripts/profiling/: profile_nsys.py, profile_ncu.py, README.md.
- tools/summarize_audit.py: Nsight audit summarizer (1706 lines).
- configs/profiling/: 7 profiling config templates (FP16/FP8 cached/flexible/full).
- configs/td_config.yaml.example: TD config reference with all new params.
- .gitignore: add workspace hygiene entries (.claude/, Debug/, StreamDiffusion-
  installer/, StreamDiffusionTD/, custom_processors/, batch files, =0.19.0,
  Nsight outputs, SESSION_LOG.md, audit_reports/).

Engine cache invalidation: old --trt10.12.0.36 dirs are orphaned on disk;
new builds land under --trt10.16.1.11--cc{SM}. Blackwell-specific tactic
gating is a no-op on Ada (SM 8.9) — confirmed at 28.3 FPS / 52.8 FPS (FP16
CUDA graphs) / 60.7 FPS (FP8 plain) / 30+ FPS (FP8+IPA) on RTX 4090.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
