Change graph JSON serialization for repro tool by hwanseoc · Pull Request #280 · NVIDIA/cudnn-frontend

hwanseoc · 2026-06-04T22:13:27Z

@CodeRabbit ignore

Summary

add graph JSON v2.0 with graph-local gid and tensor-local tid references
serialize tensors as a list while preserving user uid fields on tensors
update cudnn_repro to parse v2 logs and drop legacy name/UID fallback paths
detect ragged repros from explicit RAGGED_OFFSET_* tensor refs as well as serialized tensor metadata

Verification

PYTHONPATH=tools/cudnn_repro uv run --no-project --with pytest --with looseversion python -m pytest -q tools/cudnn_repro/tests (31 passed, 30 skipped)
cmake --build /tmp/cudnn2-fe-codex-build --target tests -j 16
CUDA_VISIBLE_DEVICES=2 /tmp/cudnn2-fe-codex-build/bin/tests "[graph][serialize]"
WAN22 benchmark log from keel parses with d_qk=128, s_kv=7800, h_k=40, h_v=40

Summary by CodeRabbit

Release Notes

New Features
- Enhanced graph serialization with improved JSON format and version validation.
- Better tensor tracking and identification system for debugging.
- Extended logging and diagnostic capabilities for tensor dumps.
Bug Fixes
- Improved support for ragged attention patterns in backward operations.
- Enhanced payload validation and error handling in log parsing.
Tests
- Added comprehensive test coverage for new serialization and ragged attention features.

hwanseoc · 2026-06-05T01:06:27Z

@cudnn-ci-bot run

cudnn-ci-bot · 2026-06-05T01:07:20Z

🚀 Running mirror pipeline

Branch: cudnn-gh/pr-280-2347b21
Pipeline: 53726994

hwanseoc · 2026-06-05T01:20:22Z

@CodeRabbit review

coderabbitai · 2026-06-05T01:20:27Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai · 2026-06-05T01:20:37Z

📝 Walkthrough

Walkthrough

This PR refactors the cuDNN frontend library and repro tool to replace graph_uid with gid identifiers and migrate tensor representation from dict-keyed to tid-indexed lists. Core library changes introduce per-tensor IDs, update JSON serialization format, and modify graph key/logging. The repro tool is comprehensively refactored to support the new tensor list API, ragged detection, and gid-based log parsing with tensor dump injection.

Changes

Core Graph and Tensor ID Infrastructure

Layer / File(s)	Summary
Tensor ID contract and accessors `include/cudnn_frontend/graph_properties.h`	Adds public `tid_t` type alias and `get_tid()` const accessor; introduces private `assign_tid(tid_t)` setter with Graph and INode friend access.
Node virtual tensor ID assignment `include/cudnn_frontend/node_interface.h`	Updates `output_tensor` to assign unique tid via `assign_tid(next_tid++)` to each virtual output tensor, maintaining internal `next_tid` counter.
Graph gid and tensor creation with tid `include/cudnn_frontend/graph_interface.h`	Replaces `graph_uid` with `gid`, updates Graph constructor to use `next_gid` atomic counter, updates `tensor()` and `tensor_like()` overloads to assign tid via `assign_tid(next_tid++)`.
JSON versioning and Tensor_attributes serialization `include/cudnn_frontend/utils/serialize.h`, `include/cudnn_frontend/graph_interface.h`	Adds `GRAPH_JSON_VERSION` constant and `check_graph_json_version()` validator; updates `to_json`/`from_json` to make `name` optional and conditionally emit ragged offset metadata, with tid assignment in parse block.
Graph JSON serialization to tid-indexed format `include/cudnn_frontend/graph_interface.h`	Overhauls Graph serialization: top-level includes `gid`, `json_version`, version strings, and emits tensors via tid-indexed array instead of uid/name dicts; refactors both deserialization paths to read `gid`, validate tensors list, build tid→tensor lookup table, and resolve node input/output references via tid.
Graph keying, logging, and tensor dumps `include/cudnn_frontend/graph_interface.h`	Updates `key()` to erase `"gid"` instead of `"graph_uid"`, changes execution logging to report `gid`, introduces new `log_tensors_to_dump_` overload accepting tensor UID lists and pointer arrays, routes tensor dump logging through new helper.
SDPA backward tensor dump collection `include/cudnn_frontend/node/scaled_dot_product_flash_attention.h`	Adds `CompositeSDPABackwardNode::collect_tensors_to_dump_node` override that deduplicates and collects ragged-offset tensors for SEQ_LEN, input/output tensors Q/K/V/O/dO/Stats/dQ/dK/dV.

Repro Tool Tensor Lookup and Payload Validation

Layer / File(s)	Summary
Repro utils tensor_entry and payload validation `tools/cudnn_repro/cudnn_repro/utils.py`	Refactors `tensor_entry` to lookup by tid from list instead of dict keys/uid/name; adds `validate_payload_version`; introduces `has_ragged_offset_inputs` and `is_ragged_payload` ragged-detection helpers; simplifies `add_ragged_tensor_names` to always set empty list.
Operations detection with payload validation `tools/cudnn_repro/cudnn_repro/operations.py`	Updates `detect_operation_key` to validate payload version before inspecting nodes.
SDPA forward with new tensor entry API `tools/cudnn_repro/cudnn_repro/sdpa_fwd.py`	Updates `build_cfg` to validate payload version, refactors tensor extraction to use `inputs`/`outputs` keys with new `tensor_entry(tensors, hint)` API (no node_name), replaces ragged detection with `is_ragged_payload` helper, includes `is_ragged` in `is_padding` condition.
SDPA backward with forward payload rewrite `tools/cudnn_repro/cudnn_repro/sdpa_bwd.py`	Removes ragged/paged detection from tensor fields; expands `_as_forward_payload` to include ragged offset fields; validates payload version in `build_cfg`; updates `Stats` tensor extraction; delegates `extract_seq_and_ragged` to `sdpa_fwd`.
SDPA FP8 forward with new APIs `tools/cudnn_repro/cudnn_repro/sdpa_fp8_fwd.py`	Updates `build_cfg` to validate payload version, refactors all tensor extractions to use new `tensor_entry(tensors, hint)` pattern, replaces ragged detection with `is_ragged_payload`, includes `is_ragged` in `is_padding`.
SDPA FP8 backward with new APIs `tools/cudnn_repro/cudnn_repro/sdpa_fp8_bwd.py`	Updates `build_cfg` and `extract_seq_and_ragged` to validate payload version, treat tensors as list, use new `tensor_entry(tensors, label)` API without node_name, replace ragged detection with `is_ragged_payload`.
Log parser gid markers and tensor dump injection `tools/cudnn_repro/cudnn_repro/log_parser.py`	Updates `iter_context_entries` to prioritize "Executing gid" markers for execution order; adds tensor dump regex and parsing helpers; injects `pass_by_value` data into matched graph entries; falls back to serialized order when no execution markers.

Test Infrastructure and Payload Updates

Layer / File(s)	Summary
Tensor list helper for payload formatting `tools/cudnn_repro/tests/helpers.py`	Introduces `tensor_list(...)` helper that converts tensor mapping to deterministically ordered list with tid keys.
SDPA backward test payloads and ragged test `tools/cudnn_repro/tests/test_cudnn_repro_bwd.py`	Updates backward config test payloads to json_version/gid/tensor_list format; adds `test_build_bwd_cfg_supports_ragged` to verify ragged+padding behavior with sequence and offset tensors; removes old negative test.
CLI test payload with gid and tensor_list `tools/cudnn_repro/tests/test_cudnn_repro_cli.py`	Updates `fwd_payload` to accept `gid` instead of `graph_uid`, emit json_version/gid in payload, use tensor_list for tensors; updates debug assertions to check `gid`.
Closed loop payload tensor list normalization `tools/cudnn_repro/tests/test_cudnn_repro_closed_loop.py`	Updates `_normalize_payload` to treat tensors as list with tid keys, builds tid-lookup for reference resolution, normalizes tensor array from list.
FP8 test payloads and ragged detection test `tools/cudnn_repro/tests/test_cudnn_repro_fp8.py`	Updates FP8 backward/forward payload generators to include json_version/gid; refactors backward ragged handling to use tensor_list; adds `test_build_fp8_fwd_cfg_detects_ragged_from_offset_inputs` to verify ragged detection from offset inputs.
Log parser test module with gid markers `tools/cudnn_repro/tests/test_cudnn_repro_log_parser.py`	Adds new test module with `payload(...)` helper and four tests validating gid-based execution ordering, fallback sequential association, tensor dump isolation, and most-recent tensor dump preference; replaces old graph_uid tests.
Schema test payload and new version/ragged tests `tools/cudnn_repro/tests/test_cudnn_repro_schema.py`	Updates `fwd_payload` to use `gid` and `tensor_list`; adds `test_select_operation_rejects_old_graph_json`, `test_build_cfg_detects_ragged_from_offset_inputs`; renames and updates ragged-log test; removes graph_uid/legacy variants.

🎯 4 (Complex) | ⏱️ ~60 minutes

🐰 With tid and gid so bright,
The tensors now dance in lists—no more dicts!
From backward to forward, from log to payload,
Each tensor finds home by its ID, well played!
🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 25.26% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Revamp graph JSON serialization for repro tool' is clear, concise, and accurately summarizes the main change: a comprehensive overhaul of graph JSON format with new gid/tid schema and updates to the repro tool.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

tools/cudnn_repro/tests/test_cudnn_repro_fp8.py (1)

179-186: ⚡ Quick win

Replace lambda assignment with a def statement.

Lambda assignments reduce readability and are flagged by Ruff E731. Use a named function instead.

♻️ Refactor to use def

-    payload = lambda tag: {"json_version": "2.0", "gid": 1, "nodes": [{"tag": tag}], "tensors": []}
-
-    assert operations.detect_operation_key(payload("SDPA_FWD")) == "sdpa_fwd"
-    assert operations.detect_operation_key(payload("SDPA_BWD")) == "sdpa_bwd"
-    assert operations.detect_operation_key(payload("SDPA_FP8_FWD")) == "sdpa_fp8_fwd"
-    assert operations.detect_operation_key(payload("SDPA_FP8_BWD")) == "sdpa_fp8_bwd"
-    assert operations.detect_operation_key(payload("SDPA_MXFP8_FWD")) == "sdpa_fp8_fwd"
-    assert operations.detect_operation_key(payload("SDPA_MXFP8_BWD")) == "sdpa_fp8_bwd"
+    def payload(tag):
+        return {"json_version": "2.0", "gid": 1, "nodes": [{"tag": tag}], "tensors": []}
+
+    assert operations.detect_operation_key(payload("SDPA_FWD")) == "sdpa_fwd"
+    assert operations.detect_operation_key(payload("SDPA_BWD")) == "sdpa_bwd"
+    assert operations.detect_operation_key(payload("SDPA_FP8_FWD")) == "sdpa_fp8_fwd"
+    assert operations.detect_operation_key(payload("SDPA_FP8_BWD")) == "sdpa_fp8_bwd"
+    assert operations.detect_operation_key(payload("SDPA_MXFP8_FWD")) == "sdpa_fp8_fwd"
+    assert operations.detect_operation_key(payload("SDPA_MXFP8_BWD")) == "sdpa_fp8_bwd"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/cudnn_repro/tests/test_cudnn_repro_fp8.py` around lines 179 - 186, The
test defines payload as a lambda (payload = lambda tag: {...}) which triggers
Ruff E731; replace it with a normal named function (def payload(tag): return
{...}) keeping the same body and usages so the subsequent asserts calling
payload("SDPA_FWD") etc. and operations.detect_operation_key remain unchanged;
update only the payload definition in tests/test_cudnn_repro_fp8.py to remove
the lambda and use def to satisfy the linter.

tools/cudnn_repro/cudnn_repro/log_parser.py (1)

36-41: 💤 Low value

Minor type hint style inconsistency.

This function uses lowercase tuple and list for generics, while _parse_context_entry at line 25 and _apply_tensor_dumps at line 43 use Tuple from typing. Consider using a consistent style throughout the module—either the imported Tuple/List or the built-in lowercase forms.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/cudnn_repro/cudnn_repro/log_parser.py` around lines 36 - 41, The return
type annotation in _parse_tensor_dump currently mixes built-in generics
(tuple[int, list[int]] | None) while other functions like _parse_context_entry
and _apply_tensor_dumps use typing.Tuple/typing.List; make the module consistent
by changing _parse_tensor_dump's signature to use the same typing forms (e.g.,
Tuple[int, List[int]] | None) or conversely update the other functions to
built-in generics—also update/ensure the appropriate imports (Tuple, List) are
present or remove them if switching all to built-ins so all functions
(_parse_tensor_dump, _parse_context_entry, _apply_tensor_dumps) use the same
style.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/cudnn_repro/cudnn_repro/log_parser.py`:
- Around line 79-86: The tensor-dump accumulator current_dumps should be reset
whenever an execution marker is seen to avoid misattributing dumps; in the
EXECUTE_GRAPH_PATTERN handling block (where gid = int(match.group(1)),
current_entry = graph_entries_by_gid.get(gid) and
execution_entries.append(_apply_tensor_dumps(current_entry, current_dumps)) is
called) move or add the statement current_dumps = {} so it executes
unconditionally when a new execution marker is encountered (i.e., outside the
inner if current_entry is not None), ensuring dumps are cleared regardless of
whether graph_entries_by_gid returned None.

In `@tools/cudnn_repro/cudnn_repro/sdpa_fp8_bwd.py`:
- Around line 144-152: The code returns ragged offsets using
inputs.get("RAGGED_OFFSET_Q") and inputs.get("RAGGED_OFFSET_KV") but misses the
plural key fallback; update the ragged_offset_q and ragged_offset_kv expressions
to try the singular then fall back to the plural keys (e.g. use
inputs.get("RAGGED_OFFSET_Q") or inputs.get("RAGGED_OFFSETS_Q") and similarly
for KV) before passing to utils.tensor_entry/utils.seq_len so behavior matches
sdpa_fp8_fwd.py and sdpa_fwd.extract_seq_and_ragged; keep the other symbols
(payload, tensors, inputs, utils.seq_len, utils.tensor_entry, seed) unchanged.

In `@tools/cudnn_repro/cudnn_repro/sdpa_fp8_fwd.py`:
- Around line 146-154: The function returning seq_len/ragged fields currently
only looks up "RAGGED_OFFSET_Q" and "RAGGED_OFFSET_KV" in the inputs dict and
misses plural keys; update the lookups to fall back to "RAGGED_OFFSETS_Q" and
"RAGGED_OFFSETS_KV" (like sdpa_fwd.extract_seq_and_ragged does) so
ragged_offset_q and ragged_offset_kv use inputs.get("RAGGED_OFFSET_Q") or
inputs.get("RAGGED_OFFSETS_Q") and inputs.get("RAGGED_OFFSET_KV") or
inputs.get("RAGGED_OFFSETS_KV") respectively when calling
utils.tensor_entry/tensor_seq_len, leaving other fields (seq_len_q, seq_len_kv,
rng_data_seed) unchanged.

---

Nitpick comments:
In `@tools/cudnn_repro/cudnn_repro/log_parser.py`:
- Around line 36-41: The return type annotation in _parse_tensor_dump currently
mixes built-in generics (tuple[int, list[int]] | None) while other functions
like _parse_context_entry and _apply_tensor_dumps use typing.Tuple/typing.List;
make the module consistent by changing _parse_tensor_dump's signature to use the
same typing forms (e.g., Tuple[int, List[int]] | None) or conversely update the
other functions to built-in generics—also update/ensure the appropriate imports
(Tuple, List) are present or remove them if switching all to built-ins so all
functions (_parse_tensor_dump, _parse_context_entry, _apply_tensor_dumps) use
the same style.

In `@tools/cudnn_repro/tests/test_cudnn_repro_fp8.py`:
- Around line 179-186: The test defines payload as a lambda (payload = lambda
tag: {...}) which triggers Ruff E731; replace it with a normal named function
(def payload(tag): return {...}) keeping the same body and usages so the
subsequent asserts calling payload("SDPA_FWD") etc. and
operations.detect_operation_key remain unchanged; update only the payload
definition in tests/test_cudnn_repro_fp8.py to remove the lambda and use def to
satisfy the linter.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 24bba5a3-7814-4537-8eb9-f6ea51621510

📥 Commits

Reviewing files that changed from the base of the PR and between 035b520 and 1e0e7f4.

📒 Files selected for processing (20)

include/cudnn_frontend/graph_interface.h
include/cudnn_frontend/graph_properties.h
include/cudnn_frontend/node/scaled_dot_product_flash_attention.h
include/cudnn_frontend/node_interface.h
include/cudnn_frontend/utils/serialize.h
tools/cudnn_repro/cudnn_repro/log_parser.py
tools/cudnn_repro/cudnn_repro/operations.py
tools/cudnn_repro/cudnn_repro/sdpa_bwd.py
tools/cudnn_repro/cudnn_repro/sdpa_fp8_bwd.py
tools/cudnn_repro/cudnn_repro/sdpa_fp8_fwd.py
tools/cudnn_repro/cudnn_repro/sdpa_fwd.py
tools/cudnn_repro/cudnn_repro/utils.py
tools/cudnn_repro/tests/helpers.py
tools/cudnn_repro/tests/test_cudnn_repro_bwd.py
tools/cudnn_repro/tests/test_cudnn_repro_cli.py
tools/cudnn_repro/tests/test_cudnn_repro_closed_loop.py
tools/cudnn_repro/tests/test_cudnn_repro_fp8.py
tools/cudnn_repro/tests/test_cudnn_repro_graph_uid.py
tools/cudnn_repro/tests/test_cudnn_repro_log_parser.py
tools/cudnn_repro/tests/test_cudnn_repro_schema.py

💤 Files with no reviewable changes (1)

tools/cudnn_repro/tests/test_cudnn_repro_graph_uid.py

Anerudhan · 2026-06-05T06:12:16Z

    std::optional<pass_by_values_t> compile_time_constant_value = std::nullopt;

    TensorReordering_t reordering_type = TensorReordering_t::NONE;
+    tid_t tid                          = 0;


What is tid ? What is it used for?

Tensor ID. It's an immutable ID that's generated on creation and cannot be changed.

The reason we do this is when we dump the graph JSON at compilation time, the UIDs actually change from the UIDs that are seen at execution time.

hwanseoc changed the title ~~Fix graph JSON tensor references~~ Revamp graph JSON serialization for repro tool Jun 4, 2026

hwanseoc force-pushed the hwanseoc/graph-json-gid-schema branch from ebe0938 to 4e4d199 Compare June 4, 2026 22:42

Fix graph JSON tensor references

83fcc7b

hwanseoc force-pushed the hwanseoc/graph-json-gid-schema branch from 4e4d199 to 83fcc7b Compare June 4, 2026 22:59

hwanseoc added 4 commits June 4, 2026 17:14

Restore tensor dumps for repro logs

ed9f650

Trim graph JSON validation

31175a0

Dump SDPA backward repro tensors

8e79e99

Keep repro tensor dumps per execution

2347b21

Rename repro log parser tests

1e0e7f4

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

Comment thread tools/cudnn_repro/cudnn_repro/log_parser.py

Comment thread tools/cudnn_repro/cudnn_repro/sdpa_fp8_bwd.py

Comment thread tools/cudnn_repro/cudnn_repro/sdpa_fp8_fwd.py

hwanseoc added 4 commits June 4, 2026 18:42

Address repro review comments

ca21e95

Apply clang-format to graph interface

2abda81

Use tensor tid for repro dumps

df28c9d

Use lowercase tid uid in repro dumps

8a18b07

Anerudhan self-requested a review June 5, 2026 06:11

Anerudhan reviewed Jun 5, 2026

View reviewed changes

hwanseoc changed the title ~~Revamp graph JSON serialization for repro tool~~ Change graph JSON serialization for repro tool Jun 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change graph JSON serialization for repro tool#280

Change graph JSON serialization for repro tool#280
hwanseoc wants to merge 10 commits into
NVIDIA:developfrom
hwanseoc:hwanseoc/graph-json-gid-schema

hwanseoc commented Jun 4, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

hwanseoc commented Jun 5, 2026

Uh oh!

cudnn-ci-bot commented Jun 5, 2026

Uh oh!

hwanseoc commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Anerudhan Jun 5, 2026

Uh oh!

hwanseoc Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

hwanseoc commented Jun 4, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Verification

Summary by CodeRabbit

Release Notes

Uh oh!

hwanseoc commented Jun 5, 2026

Uh oh!

cudnn-ci-bot commented Jun 5, 2026

Uh oh!

hwanseoc commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Anerudhan Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

hwanseoc Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hwanseoc commented Jun 4, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading