Bypass cuteDSL d=256 path on cuDNN 9.23+ by vedaanta · Pull Request #257 · NVIDIA/cudnn-frontend

vedaanta · 2026-05-28T17:00:05Z

cuDNN 9.23.0 added native d=256 SDPA fprop and bprop support in the graph backend, so the OSS (cuteDSL) kernels at
cudnn.experimental.ops.sdpa are no longer required when the linked backend is recent enough.

Add _cudnn_supports_native_d256() gated on
cudnn.backend_version() >= 92300 and require it to be False before routing fprop/bprop through the SM100 OSS wrappers. The pre-existing SM100+ device check is kept so older cuDNN versions still light up the OSS path on Blackwell.

The test_d256_uses_oss_forward_path test now skips on cuDNN 9.23+ since the OSS bypass is intentional, and a new
test_d256_uses_graph_path_on_cudnn_9_23_plus asserts that fprop/bprop populate the cuDNN graph cache (proving the OSS path is bypassed).

Also: _skip_if_unsupported_d256 and test_d256_uses_oss_forward_path used import cudnn.sdpa inside the function body, which made cudnn a local variable and shadowed the module-level import as soon as any earlier line referenced cudnn (e.g. the new cudnn.backend_version() check). Switch to importlib.import_module("cudnn.sdpa") to avoid the binding.

) Long pytest-xdist runs (e.g. test_mhas_v2 ~2.5k SDPA configs in one worker) hit a much higher GPU memory high-water mark than any single test needs, because the caching allocator retains freed blocks across configs. Setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, garbage_collection_threshold:0.6 before torch is imported reduces the peak to roughly the maximum any single test needs, with no change in wall time or test outcome. Use os.environ.setdefault so user-provided values still win, and place it above the transformer_engine import so the env var is visible by the time torch initializes its CUDA allocator.

Updated the link for DSA in the README to point to the correct directory.

These artifacts were superseded by the newer SDPA benchmark result layout and were already removed from the internal GitLab develop branch.

Two pre-existing bugs in the VariantPackTemplate, plus one defensive guard: 1. Graph copy -> dangling host pointers. template_ptrs stores raw addresses into cached_pass_by_value storage owned by the source Graph. Default copy propagated prepared=true while the addresses still pointed at the source. Fix: VarpackPrepStateBox copy ctor/assign now always start with prepared=false so the copy re-preps on first use against its own storage. 2. Re-deserialize on the same Graph -> stale template. deserialize(handle,...) rebinds cached_pass_by_value but the existing prepared=true causes the eager prep to short-circuit, leaving the slot layout from the prior deserialize. Fix: reset prepared=false and clear varpack_template before the eager prep call. 3. Null device_ptrs in raw-ptr create_variant_pack overloads. Reject nullptr + non-empty uids instead of forwarding to the cuDNN backend. Adds explicit null-plan guards across detail::execute overloads, returning GRAPH_EXECUTION_FAILED with "No plan found to execute!" instead of dereferencing plan via plan->getTag(). Ports https://gitlab-master.nvidia.com/cudnn/cudnn_frontend/-/merge_requests/2117 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addresses review feedback on PR NVIDIA#248: the prior fix reset prepared=false and varpack_template but left deserialized_tensor_properties, deserialized_pass_by_value, deserialized_workspace_modifications, and tensors_to_dump populated from any earlier deserialize(handle, old_data). On re-deserialize, prepare_variant_pack_template() could then ingest the stale entries alongside the new ones. Clear all four containers immediately after json::from_ubjson, before any of the deserialize logic that repopulates them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cuDNN 9.23.0 added native d=256 SDPA fprop and bprop support in the graph backend, so the OSS (cuteDSL) kernels at `cudnn.experimental.ops.sdpa` are no longer required when the linked backend is recent enough. Add `_cudnn_supports_native_d256()` gated on `cudnn.backend_version() >= 92300` and require it to be `False` before routing fprop/bprop through the SM100 OSS wrappers. The pre-existing SM100+ device check is kept so older cuDNN versions still light up the OSS path on Blackwell. The `test_d256_uses_oss_forward_path` test now skips on cuDNN 9.23+ since the OSS bypass is intentional, and a new `test_d256_uses_graph_path_on_cudnn_9_23_plus` asserts that fprop/bprop populate the cuDNN graph cache (proving the OSS path is bypassed). Also: `_skip_if_unsupported_d256` and `test_d256_uses_oss_forward_path` used `import cudnn.sdpa` inside the function body, which made `cudnn` a local variable and shadowed the module-level import as soon as any earlier line referenced `cudnn` (e.g. the new `cudnn.backend_version()` check). Switch to `importlib.import_module("cudnn.sdpa")` to avoid the binding.

@Anerudhan

- Rename `_CUDNN_NATIVE_D256_VERSION` → `_CUDNN_BACKEND_D256_VERSION` and `_cudnn_supports_native_d256()` → `_cudnn_backend_supports_d256()` per @Anerudhan's request that we say "cuDNN backend" instead of "cuDNN native". Update the surrounding log messages and skip strings to match. - Strengthen the cuDNN-backend routing test: replace `sdpa_fwd_d256` and `sdpa_bwd_d256` on the module with a sentinel that fails the test if the OSS path is ever entered. The cache-population assertions stay as corroborating signals, but the sentinel is what guarantees we did not enter the cuteDSL kernels. Rename the test to `test_d256_uses_cudnn_backend_on_cudnn_9_23_plus`.

Signed-off-by: Ziang Li <ziangli@umich.edu>

@SInCE

…inning (NVIDIA#259) * feat(python): add get_engine_and_knobs_at_index for structured plan pinning get_plan_name_at_index returns a formatted "engN_kT=V" tag built from the engine global index and knob choices. Callers that want to persist a tuned plan and replay it later are forced to either store the bare plan index (which drifts when the policy=ALL plan list is re-enumerated across cudnn-frontend / backend versions) or parse the tag string. Expose the structured data directly: get_engine_and_knobs_at_index returns (engine_id, {KnobType_t: value}), reading the same backend attributes get_engine_tag stringifies. The result feeds straight into create_execution_plan(engine_id, knobs) to rebuild the exact same kernel on a fresh graph without a heuristics query. - detail::get_engine_id_and_knobs (cudnn_frontend_utils.h): structured reader - Execution_plan_list::get_engine_and_knobs_at_index (plans.h) - Graph::get_engine_and_knobs_at_index (graph_interface.h) - PyGraph binding (pygraph.h/.cpp) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * address review: bounds-check index, add cpp unit test, trim comments - get_engine_and_knobs_at_index: reject out-of-range index (mirrors check_support_at_index) instead of indexing engine_configs OOB. - add test/cpp/get_engine_and_knobs.cpp: enumerate a matmul graph's plans, read (engine_id, knobs) for each, and confirm re-pinning via create_execution_plan reproduces the same plan (matching name); also checks out-of-range indices error. - trim the new doc comments to match neighboring style. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * knobs: add SWAP_AB / INPUT_TMA_ENABLE / OUTPUT_TMA_ENABLE to KnobType_t KnobType_t (and the to/from backend converters) stopped at WARP_SPEC_CFG (42), so engines using SWAP_AB (43, cuDNN 9.18), INPUT_TMA_ENABLE (44) or OUTPUT_TMA_ENABLE (45, cuDNN 9.22) had those knobs mapped to NOT_SET by convert_from_backend_knob_type. Feeding NOT_SET back into create_execution_plan then failed convert_to_backend_knob_type with INVALID_VALUE -- so a plan enumerated with one of these knobs (e.g. via get_engine_and_knobs_at_index) could not be pinned. Add the three knob types to the enum, both converters (version-gated to match the backend @SInCE), and the pybind knob_type enum. The cpp test now compares the structured identity (engine id + knob map) instead of the plan-name tag, since the tag serializes knobs in engine-config order, which differs between the heuristic config and the pinned one even though the kernel is identical. create_execution_plan is now asserted to succeed for every enumerated plan; building it stays best-effort (can fail for unrelated environment reasons such as a ptxas older than the engine's target). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * make get_engine_tag deterministic: sort knob choices by type The plan-name tag was built by iterating CUDNN_ATTR_ENGINECFG_KNOB_CHOICES in stored order, which differs between the heuristics path and create_execution_plan (set_knob_choices iterates a std::unordered_map). So the same engine + knob values could serialize to differently-ordered tags (e.g. eng11_k2=29_k27=0...k43=0 vs eng11_k43=0_k38=0...k2=29) -- the kernel is identical but the string isn't a stable id. Sort the knob choices by type before formatting so the tag is a deterministic function of the engine config regardless of how it was built. This is off the execution hot path (tag is used for logging / plan identity), so no perf impact; the actual knob choices passed to the backend are unchanged. The cpp test now also asserts the pinned plan's tag matches the original's. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Yang Xu <yanxu@nvidia.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Anerudhan · 2026-05-29T21:34:54Z

Should we auto direct user to cudnn backend (or) allow user to choose?

* update sdpa benchmark artifacts * update acknowledgement

Anerudhan · 2026-06-02T01:10:32Z

@cudnn-ci-bot run

cudnn-ci-bot · 2026-06-02T01:11:24Z

Backend pipeline not launched

Reason: PR base branch 'main' does not match 'develop'

…ss-oss

coderabbitai · 2026-06-02T20:47:03Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6f871f94-144f-4e0e-8f95-ec3b37935da1

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

vedaanta · 2026-06-02T20:47:54Z

closing for #272 instead

vedaanta and others added 8 commits May 20, 2026 21:12

Fix DSA link in README.md

7ee0491

Updated the link for DSA in the README to point to the correct directory.

Remove stale H200 benchmark artifacts (NVIDIA#252)

b4fb358

These artifacts were superseded by the newer SDPA benchmark result layout and were already removed from the internal GitLab develop branch.

Change profile_pass from 'fwd' to 'both'

cb74feb

Bump the develop to 1.25.0

8ec19b9

vedaanta requested a review from Anerudhan May 28, 2026 17:04

Anerudhan requested changes May 28, 2026

View reviewed changes

Comment thread python/cudnn/experimental/ops/sdpa.py Outdated

Comment thread test/python/test_cudnn_sdpa_op.py Outdated

Comment thread test/python/test_cudnn_sdpa_op.py

vedaanta and others added 4 commits May 28, 2026 11:10

Add row-scale support to grouped GEMM quant

02d2471

Signed-off-by: Ziang Li <ziangli@umich.edu>

Tighten row-scale grouped GEMM quant tests

1f8cde3

Signed-off-by: Ziang Li <ziangli@umich.edu>

Update SDPA Benchmarking Artifacts (NVIDIA#265)

1ad529d

* update sdpa benchmark artifacts * update acknowledgement

Anerudhan added cat-feature Requests for new functionality, APIs, examples, or behavior improvements. mod-backend cuDNN backend API, graph execution, descriptors, engines, or backend integration. orig-nv-eng Reported or requested by NVIDIA engineering. labels Jun 2, 2026

Anerudhan and others added 2 commits June 2, 2026 11:50

Adding coderabbit review guide (initial template)

e851cc8

Merge branch 'NVIDIA:develop' into vedaanta/sdpa-d256-cudnn-9-23-bypa…

31c8558

…ss-oss

vedaanta closed this Jun 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bypass cuteDSL d=256 path on cuDNN 9.23+#257

Bypass cuteDSL d=256 path on cuDNN 9.23+#257
vedaanta wants to merge 15 commits into
NVIDIA:mainfrom
vedaanta:vedaanta/sdpa-d256-cudnn-9-23-bypass-oss

vedaanta commented May 28, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Anerudhan commented May 29, 2026

Uh oh!

Anerudhan commented Jun 2, 2026

Uh oh!

cudnn-ci-bot commented Jun 2, 2026

Uh oh!

coderabbitai Bot commented Jun 2, 2026

Review skipped

Uh oh!

vedaanta commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

Conversation

vedaanta commented May 28, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Anerudhan commented May 29, 2026

Uh oh!

Anerudhan commented Jun 2, 2026

Uh oh!

cudnn-ci-bot commented Jun 2, 2026

Uh oh!

coderabbitai Bot commented Jun 2, 2026

Review skipped

Uh oh!

vedaanta commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants