Skip to content

Bypass cuteDSL d=256 path on cuDNN 9.23+#257

Closed
vedaanta wants to merge 15 commits into
NVIDIA:mainfrom
vedaanta:vedaanta/sdpa-d256-cudnn-9-23-bypass-oss
Closed

Bypass cuteDSL d=256 path on cuDNN 9.23+#257
vedaanta wants to merge 15 commits into
NVIDIA:mainfrom
vedaanta:vedaanta/sdpa-d256-cudnn-9-23-bypass-oss

Conversation

@vedaanta
Copy link
Copy Markdown
Collaborator

cuDNN 9.23.0 added native d=256 SDPA fprop and bprop support in the graph backend, so the OSS (cuteDSL) kernels at
cudnn.experimental.ops.sdpa are no longer required when the linked backend is recent enough.

Add _cudnn_supports_native_d256() gated on
cudnn.backend_version() >= 92300 and require it to be False before routing fprop/bprop through the SM100 OSS wrappers. The pre-existing SM100+ device check is kept so older cuDNN versions still light up the OSS path on Blackwell.

The test_d256_uses_oss_forward_path test now skips on cuDNN 9.23+ since the OSS bypass is intentional, and a new
test_d256_uses_graph_path_on_cudnn_9_23_plus asserts that fprop/bprop populate the cuDNN graph cache (proving the OSS path is bypassed).

Also: _skip_if_unsupported_d256 and test_d256_uses_oss_forward_path used import cudnn.sdpa inside the function body, which made cudnn a local variable and shadowed the module-level import as soon as any earlier line referenced cudnn (e.g. the new cudnn.backend_version() check). Switch to importlib.import_module("cudnn.sdpa") to avoid the binding.

vedaanta and others added 8 commits May 20, 2026 21:12
)

Long pytest-xdist runs (e.g. test_mhas_v2 ~2.5k SDPA configs in one
worker) hit a much higher GPU memory high-water mark than any single
test needs, because the caching allocator retains freed blocks across
configs.

Setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,
garbage_collection_threshold:0.6 before torch is imported reduces the
peak to roughly the maximum any single test needs, with no change in
wall time or test outcome.

Use os.environ.setdefault so user-provided values still win, and
place it above the transformer_engine import so the env var is
visible by the time torch initializes its CUDA allocator.
Updated the link for DSA in the README to point to the correct directory.
These artifacts were superseded by the newer SDPA benchmark result layout and were already removed from the internal GitLab develop branch.
Two pre-existing bugs in the VariantPackTemplate, plus one defensive guard:

1. Graph copy -> dangling host pointers. template_ptrs stores raw addresses
   into cached_pass_by_value storage owned by the source Graph. Default copy
   propagated prepared=true while the addresses still pointed at the source.
   Fix: VarpackPrepStateBox copy ctor/assign now always start with
   prepared=false so the copy re-preps on first use against its own storage.
2. Re-deserialize on the same Graph -> stale template. deserialize(handle,...)
   rebinds cached_pass_by_value but the existing prepared=true causes the
   eager prep to short-circuit, leaving the slot layout from the prior
   deserialize. Fix: reset prepared=false and clear varpack_template before
   the eager prep call.
3. Null device_ptrs in raw-ptr create_variant_pack overloads. Reject nullptr
   + non-empty uids instead of forwarding to the cuDNN backend.

Adds explicit null-plan guards across detail::execute overloads, returning
GRAPH_EXECUTION_FAILED with "No plan found to execute!" instead of
dereferencing plan via plan->getTag().

Ports https://gitlab-master.nvidia.com/cudnn/cudnn_frontend/-/merge_requests/2117

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses review feedback on PR NVIDIA#248: the prior fix reset prepared=false
and varpack_template but left deserialized_tensor_properties,
deserialized_pass_by_value, deserialized_workspace_modifications, and
tensors_to_dump populated from any earlier deserialize(handle, old_data).
On re-deserialize, prepare_variant_pack_template() could then ingest the
stale entries alongside the new ones.

Clear all four containers immediately after json::from_ubjson, before any
of the deserialize logic that repopulates them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cuDNN 9.23.0 added native d=256 SDPA fprop and bprop support in the
graph backend, so the OSS (cuteDSL) kernels at
`cudnn.experimental.ops.sdpa` are no longer required when the linked
backend is recent enough.

Add `_cudnn_supports_native_d256()` gated on
`cudnn.backend_version() >= 92300` and require it to be `False` before
routing fprop/bprop through the SM100 OSS wrappers. The pre-existing
SM100+ device check is kept so older cuDNN versions still light up the
OSS path on Blackwell.

The `test_d256_uses_oss_forward_path` test now skips on cuDNN 9.23+
since the OSS bypass is intentional, and a new
`test_d256_uses_graph_path_on_cudnn_9_23_plus` asserts that fprop/bprop
populate the cuDNN graph cache (proving the OSS path is bypassed).

Also: `_skip_if_unsupported_d256` and `test_d256_uses_oss_forward_path`
used `import cudnn.sdpa` inside the function body, which made `cudnn`
a local variable and shadowed the module-level import as soon as any
earlier line referenced `cudnn` (e.g. the new `cudnn.backend_version()`
check). Switch to `importlib.import_module("cudnn.sdpa")` to avoid the
binding.
@vedaanta vedaanta requested a review from Anerudhan May 28, 2026 17:04
Comment thread python/cudnn/experimental/ops/sdpa.py Outdated
Comment thread test/python/test_cudnn_sdpa_op.py Outdated
Comment thread test/python/test_cudnn_sdpa_op.py
vedaanta and others added 4 commits May 28, 2026 11:10
- Rename `_CUDNN_NATIVE_D256_VERSION` → `_CUDNN_BACKEND_D256_VERSION`
  and `_cudnn_supports_native_d256()` → `_cudnn_backend_supports_d256()`
  per @Anerudhan's request that we say "cuDNN backend" instead of
  "cuDNN native". Update the surrounding log messages and skip strings
  to match.

- Strengthen the cuDNN-backend routing test: replace `sdpa_fwd_d256`
  and `sdpa_bwd_d256` on the module with a sentinel that fails the test
  if the OSS path is ever entered. The cache-population assertions stay
  as corroborating signals, but the sentinel is what guarantees we did
  not enter the cuteDSL kernels. Rename the test to
  `test_d256_uses_cudnn_backend_on_cudnn_9_23_plus`.
Signed-off-by: Ziang Li <ziangli@umich.edu>
Signed-off-by: Ziang Li <ziangli@umich.edu>
…inning (NVIDIA#259)

* feat(python): add get_engine_and_knobs_at_index for structured plan pinning

get_plan_name_at_index returns a formatted "engN_kT=V" tag built from the
engine global index and knob choices. Callers that want to persist a tuned
plan and replay it later are forced to either store the bare plan index
(which drifts when the policy=ALL plan list is re-enumerated across
cudnn-frontend / backend versions) or parse the tag string.

Expose the structured data directly: get_engine_and_knobs_at_index returns
(engine_id, {KnobType_t: value}), reading the same backend attributes
get_engine_tag stringifies. The result feeds straight into
create_execution_plan(engine_id, knobs) to rebuild the exact same kernel on a
fresh graph without a heuristics query.

- detail::get_engine_id_and_knobs (cudnn_frontend_utils.h): structured reader
- Execution_plan_list::get_engine_and_knobs_at_index (plans.h)
- Graph::get_engine_and_knobs_at_index (graph_interface.h)
- PyGraph binding (pygraph.h/.cpp)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* address review: bounds-check index, add cpp unit test, trim comments

- get_engine_and_knobs_at_index: reject out-of-range index (mirrors
  check_support_at_index) instead of indexing engine_configs OOB.
- add test/cpp/get_engine_and_knobs.cpp: enumerate a matmul graph's plans,
  read (engine_id, knobs) for each, and confirm re-pinning via
  create_execution_plan reproduces the same plan (matching name); also checks
  out-of-range indices error.
- trim the new doc comments to match neighboring style.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* knobs: add SWAP_AB / INPUT_TMA_ENABLE / OUTPUT_TMA_ENABLE to KnobType_t

KnobType_t (and the to/from backend converters) stopped at WARP_SPEC_CFG (42),
so engines using SWAP_AB (43, cuDNN 9.18), INPUT_TMA_ENABLE (44) or
OUTPUT_TMA_ENABLE (45, cuDNN 9.22) had those knobs mapped to NOT_SET by
convert_from_backend_knob_type. Feeding NOT_SET back into create_execution_plan
then failed convert_to_backend_knob_type with INVALID_VALUE -- so a plan
enumerated with one of these knobs (e.g. via get_engine_and_knobs_at_index)
could not be pinned.

Add the three knob types to the enum, both converters (version-gated to match
the backend @SInCE), and the pybind knob_type enum.

The cpp test now compares the structured identity (engine id + knob map)
instead of the plan-name tag, since the tag serializes knobs in engine-config
order, which differs between the heuristic config and the pinned one even
though the kernel is identical. create_execution_plan is now asserted to
succeed for every enumerated plan; building it stays best-effort (can fail for
unrelated environment reasons such as a ptxas older than the engine's target).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* make get_engine_tag deterministic: sort knob choices by type

The plan-name tag was built by iterating CUDNN_ATTR_ENGINECFG_KNOB_CHOICES in
stored order, which differs between the heuristics path and
create_execution_plan (set_knob_choices iterates a std::unordered_map). So the
same engine + knob values could serialize to differently-ordered tags
(e.g. eng11_k2=29_k27=0...k43=0 vs eng11_k43=0_k38=0...k2=29) -- the kernel is
identical but the string isn't a stable id.

Sort the knob choices by type before formatting so the tag is a deterministic
function of the engine config regardless of how it was built. This is off the
execution hot path (tag is used for logging / plan identity), so no perf
impact; the actual knob choices passed to the backend are unchanged.

The cpp test now also asserts the pinned plan's tag matches the original's.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Yang Xu <yanxu@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Anerudhan
Copy link
Copy Markdown
Collaborator

Should we auto direct user to cudnn backend (or) allow user to choose?

* update sdpa benchmark artifacts

* update acknowledgement
@Anerudhan Anerudhan added cat-feature Requests for new functionality, APIs, examples, or behavior improvements. mod-backend cuDNN backend API, graph execution, descriptors, engines, or backend integration. orig-nv-eng Reported or requested by NVIDIA engineering. labels Jun 2, 2026
@Anerudhan
Copy link
Copy Markdown
Collaborator

@cudnn-ci-bot run

@cudnn-ci-bot
Copy link
Copy Markdown

Backend pipeline not launched

Reason: PR base branch 'main' does not match 'develop'

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 2, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6f871f94-144f-4e0e-8f95-ec3b37935da1

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@vedaanta
Copy link
Copy Markdown
Collaborator Author

vedaanta commented Jun 2, 2026

closing for #272 instead

@vedaanta vedaanta closed this Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cat-feature Requests for new functionality, APIs, examples, or behavior improvements. mod-backend cuDNN backend API, graph execution, descriptors, engines, or backend integration. orig-nv-eng Reported or requested by NVIDIA engineering.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants