Bypass cuteDSL d=256 path on cuDNN 9.23+#257
Conversation
) Long pytest-xdist runs (e.g. test_mhas_v2 ~2.5k SDPA configs in one worker) hit a much higher GPU memory high-water mark than any single test needs, because the caching allocator retains freed blocks across configs. Setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, garbage_collection_threshold:0.6 before torch is imported reduces the peak to roughly the maximum any single test needs, with no change in wall time or test outcome. Use os.environ.setdefault so user-provided values still win, and place it above the transformer_engine import so the env var is visible by the time torch initializes its CUDA allocator.
Updated the link for DSA in the README to point to the correct directory.
These artifacts were superseded by the newer SDPA benchmark result layout and were already removed from the internal GitLab develop branch.
Two pre-existing bugs in the VariantPackTemplate, plus one defensive guard: 1. Graph copy -> dangling host pointers. template_ptrs stores raw addresses into cached_pass_by_value storage owned by the source Graph. Default copy propagated prepared=true while the addresses still pointed at the source. Fix: VarpackPrepStateBox copy ctor/assign now always start with prepared=false so the copy re-preps on first use against its own storage. 2. Re-deserialize on the same Graph -> stale template. deserialize(handle,...) rebinds cached_pass_by_value but the existing prepared=true causes the eager prep to short-circuit, leaving the slot layout from the prior deserialize. Fix: reset prepared=false and clear varpack_template before the eager prep call. 3. Null device_ptrs in raw-ptr create_variant_pack overloads. Reject nullptr + non-empty uids instead of forwarding to the cuDNN backend. Adds explicit null-plan guards across detail::execute overloads, returning GRAPH_EXECUTION_FAILED with "No plan found to execute!" instead of dereferencing plan via plan->getTag(). Ports https://gitlab-master.nvidia.com/cudnn/cudnn_frontend/-/merge_requests/2117 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses review feedback on PR NVIDIA#248: the prior fix reset prepared=false and varpack_template but left deserialized_tensor_properties, deserialized_pass_by_value, deserialized_workspace_modifications, and tensors_to_dump populated from any earlier deserialize(handle, old_data). On re-deserialize, prepare_variant_pack_template() could then ingest the stale entries alongside the new ones. Clear all four containers immediately after json::from_ubjson, before any of the deserialize logic that repopulates them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cuDNN 9.23.0 added native d=256 SDPA fprop and bprop support in the
graph backend, so the OSS (cuteDSL) kernels at
`cudnn.experimental.ops.sdpa` are no longer required when the linked
backend is recent enough.
Add `_cudnn_supports_native_d256()` gated on
`cudnn.backend_version() >= 92300` and require it to be `False` before
routing fprop/bprop through the SM100 OSS wrappers. The pre-existing
SM100+ device check is kept so older cuDNN versions still light up the
OSS path on Blackwell.
The `test_d256_uses_oss_forward_path` test now skips on cuDNN 9.23+
since the OSS bypass is intentional, and a new
`test_d256_uses_graph_path_on_cudnn_9_23_plus` asserts that fprop/bprop
populate the cuDNN graph cache (proving the OSS path is bypassed).
Also: `_skip_if_unsupported_d256` and `test_d256_uses_oss_forward_path`
used `import cudnn.sdpa` inside the function body, which made `cudnn`
a local variable and shadowed the module-level import as soon as any
earlier line referenced `cudnn` (e.g. the new `cudnn.backend_version()`
check). Switch to `importlib.import_module("cudnn.sdpa")` to avoid the
binding.
- Rename `_CUDNN_NATIVE_D256_VERSION` → `_CUDNN_BACKEND_D256_VERSION` and `_cudnn_supports_native_d256()` → `_cudnn_backend_supports_d256()` per @Anerudhan's request that we say "cuDNN backend" instead of "cuDNN native". Update the surrounding log messages and skip strings to match. - Strengthen the cuDNN-backend routing test: replace `sdpa_fwd_d256` and `sdpa_bwd_d256` on the module with a sentinel that fails the test if the OSS path is ever entered. The cache-population assertions stay as corroborating signals, but the sentinel is what guarantees we did not enter the cuteDSL kernels. Rename the test to `test_d256_uses_cudnn_backend_on_cudnn_9_23_plus`.
Signed-off-by: Ziang Li <ziangli@umich.edu>
Signed-off-by: Ziang Li <ziangli@umich.edu>
…inning (NVIDIA#259) * feat(python): add get_engine_and_knobs_at_index for structured plan pinning get_plan_name_at_index returns a formatted "engN_kT=V" tag built from the engine global index and knob choices. Callers that want to persist a tuned plan and replay it later are forced to either store the bare plan index (which drifts when the policy=ALL plan list is re-enumerated across cudnn-frontend / backend versions) or parse the tag string. Expose the structured data directly: get_engine_and_knobs_at_index returns (engine_id, {KnobType_t: value}), reading the same backend attributes get_engine_tag stringifies. The result feeds straight into create_execution_plan(engine_id, knobs) to rebuild the exact same kernel on a fresh graph without a heuristics query. - detail::get_engine_id_and_knobs (cudnn_frontend_utils.h): structured reader - Execution_plan_list::get_engine_and_knobs_at_index (plans.h) - Graph::get_engine_and_knobs_at_index (graph_interface.h) - PyGraph binding (pygraph.h/.cpp) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * address review: bounds-check index, add cpp unit test, trim comments - get_engine_and_knobs_at_index: reject out-of-range index (mirrors check_support_at_index) instead of indexing engine_configs OOB. - add test/cpp/get_engine_and_knobs.cpp: enumerate a matmul graph's plans, read (engine_id, knobs) for each, and confirm re-pinning via create_execution_plan reproduces the same plan (matching name); also checks out-of-range indices error. - trim the new doc comments to match neighboring style. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * knobs: add SWAP_AB / INPUT_TMA_ENABLE / OUTPUT_TMA_ENABLE to KnobType_t KnobType_t (and the to/from backend converters) stopped at WARP_SPEC_CFG (42), so engines using SWAP_AB (43, cuDNN 9.18), INPUT_TMA_ENABLE (44) or OUTPUT_TMA_ENABLE (45, cuDNN 9.22) had those knobs mapped to NOT_SET by convert_from_backend_knob_type. Feeding NOT_SET back into create_execution_plan then failed convert_to_backend_knob_type with INVALID_VALUE -- so a plan enumerated with one of these knobs (e.g. via get_engine_and_knobs_at_index) could not be pinned. Add the three knob types to the enum, both converters (version-gated to match the backend @SInCE), and the pybind knob_type enum. The cpp test now compares the structured identity (engine id + knob map) instead of the plan-name tag, since the tag serializes knobs in engine-config order, which differs between the heuristic config and the pinned one even though the kernel is identical. create_execution_plan is now asserted to succeed for every enumerated plan; building it stays best-effort (can fail for unrelated environment reasons such as a ptxas older than the engine's target). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * make get_engine_tag deterministic: sort knob choices by type The plan-name tag was built by iterating CUDNN_ATTR_ENGINECFG_KNOB_CHOICES in stored order, which differs between the heuristics path and create_execution_plan (set_knob_choices iterates a std::unordered_map). So the same engine + knob values could serialize to differently-ordered tags (e.g. eng11_k2=29_k27=0...k43=0 vs eng11_k43=0_k38=0...k2=29) -- the kernel is identical but the string isn't a stable id. Sort the knob choices by type before formatting so the tag is a deterministic function of the engine config regardless of how it was built. This is off the execution hot path (tag is used for logging / plan identity), so no perf impact; the actual knob choices passed to the backend are unchanged. The cpp test now also asserts the pinned plan's tag matches the original's. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Yang Xu <yanxu@nvidia.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Should we auto direct user to cudnn backend (or) allow user to choose? |
* update sdpa benchmark artifacts * update acknowledgement
|
@cudnn-ci-bot run |
|
Backend pipeline not launched Reason: PR base branch 'main' does not match 'develop' |
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
closing for #272 instead |
cuDNN 9.23.0 added native d=256 SDPA fprop and bprop support in the graph backend, so the OSS (cuteDSL) kernels at
cudnn.experimental.ops.sdpaare no longer required when the linked backend is recent enough.Add
_cudnn_supports_native_d256()gated oncudnn.backend_version() >= 92300and require it to beFalsebefore routing fprop/bprop through the SM100 OSS wrappers. The pre-existing SM100+ device check is kept so older cuDNN versions still light up the OSS path on Blackwell.The
test_d256_uses_oss_forward_pathtest now skips on cuDNN 9.23+ since the OSS bypass is intentional, and a newtest_d256_uses_graph_path_on_cudnn_9_23_plusasserts that fprop/bprop populate the cuDNN graph cache (proving the OSS path is bypassed).Also:
_skip_if_unsupported_d256andtest_d256_uses_oss_forward_pathusedimport cudnn.sdpainside the function body, which madecudnna local variable and shadowed the module-level import as soon as any earlier line referencedcudnn(e.g. the newcudnn.backend_version()check). Switch toimportlib.import_module("cudnn.sdpa")to avoid the binding.