Skip to content

Change graph JSON serialization for repro tool#280

Open
hwanseoc wants to merge 10 commits into
NVIDIA:developfrom
hwanseoc:hwanseoc/graph-json-gid-schema
Open

Change graph JSON serialization for repro tool#280
hwanseoc wants to merge 10 commits into
NVIDIA:developfrom
hwanseoc:hwanseoc/graph-json-gid-schema

Conversation

@hwanseoc
Copy link
Copy Markdown
Member

@hwanseoc hwanseoc commented Jun 4, 2026

@CodeRabbit ignore

Summary

  • add graph JSON v2.0 with graph-local gid and tensor-local tid references
  • serialize tensors as a list while preserving user uid fields on tensors
  • update cudnn_repro to parse v2 logs and drop legacy name/UID fallback paths
  • detect ragged repros from explicit RAGGED_OFFSET_* tensor refs as well as serialized tensor metadata

Verification

  • PYTHONPATH=tools/cudnn_repro uv run --no-project --with pytest --with looseversion python -m pytest -q tools/cudnn_repro/tests (31 passed, 30 skipped)
  • cmake --build /tmp/cudnn2-fe-codex-build --target tests -j 16
  • CUDA_VISIBLE_DEVICES=2 /tmp/cudnn2-fe-codex-build/bin/tests "[graph][serialize]"
  • WAN22 benchmark log from keel parses with d_qk=128, s_kv=7800, h_k=40, h_v=40

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced graph serialization with improved JSON format and version validation.
    • Better tensor tracking and identification system for debugging.
    • Extended logging and diagnostic capabilities for tensor dumps.
  • Bug Fixes

    • Improved support for ragged attention patterns in backward operations.
    • Enhanced payload validation and error handling in log parsing.
  • Tests

    • Added comprehensive test coverage for new serialization and ragged attention features.

@hwanseoc hwanseoc changed the title Fix graph JSON tensor references Revamp graph JSON serialization for repro tool Jun 4, 2026
@hwanseoc hwanseoc force-pushed the hwanseoc/graph-json-gid-schema branch from ebe0938 to 4e4d199 Compare June 4, 2026 22:42
@hwanseoc hwanseoc force-pushed the hwanseoc/graph-json-gid-schema branch from 4e4d199 to 83fcc7b Compare June 4, 2026 22:59
@hwanseoc
Copy link
Copy Markdown
Member Author

hwanseoc commented Jun 5, 2026

@cudnn-ci-bot run

@cudnn-ci-bot
Copy link
Copy Markdown

🚀 Running mirror pipeline

Branch: cudnn-gh/pr-280-2347b21
Pipeline: 53726994

@hwanseoc
Copy link
Copy Markdown
Member Author

hwanseoc commented Jun 5, 2026

@CodeRabbit review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 5, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 5, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR refactors the cuDNN frontend library and repro tool to replace graph_uid with gid identifiers and migrate tensor representation from dict-keyed to tid-indexed lists. Core library changes introduce per-tensor IDs, update JSON serialization format, and modify graph key/logging. The repro tool is comprehensively refactored to support the new tensor list API, ragged detection, and gid-based log parsing with tensor dump injection.

Changes

Core Graph and Tensor ID Infrastructure

Layer / File(s) Summary
Tensor ID contract and accessors
include/cudnn_frontend/graph_properties.h
Adds public tid_t type alias and get_tid() const accessor; introduces private assign_tid(tid_t) setter with Graph and INode friend access.
Node virtual tensor ID assignment
include/cudnn_frontend/node_interface.h
Updates output_tensor to assign unique tid via assign_tid(next_tid++) to each virtual output tensor, maintaining internal next_tid counter.
Graph gid and tensor creation with tid
include/cudnn_frontend/graph_interface.h
Replaces graph_uid with gid, updates Graph constructor to use next_gid atomic counter, updates tensor() and tensor_like() overloads to assign tid via assign_tid(next_tid++).
JSON versioning and Tensor_attributes serialization
include/cudnn_frontend/utils/serialize.h, include/cudnn_frontend/graph_interface.h
Adds GRAPH_JSON_VERSION constant and check_graph_json_version() validator; updates to_json/from_json to make name optional and conditionally emit ragged offset metadata, with tid assignment in parse block.
Graph JSON serialization to tid-indexed format
include/cudnn_frontend/graph_interface.h
Overhauls Graph serialization: top-level includes gid, json_version, version strings, and emits tensors via tid-indexed array instead of uid/name dicts; refactors both deserialization paths to read gid, validate tensors list, build tid→tensor lookup table, and resolve node input/output references via tid.
Graph keying, logging, and tensor dumps
include/cudnn_frontend/graph_interface.h
Updates key() to erase "gid" instead of "graph_uid", changes execution logging to report gid, introduces new log_tensors_to_dump_ overload accepting tensor UID lists and pointer arrays, routes tensor dump logging through new helper.
SDPA backward tensor dump collection
include/cudnn_frontend/node/scaled_dot_product_flash_attention.h
Adds CompositeSDPABackwardNode::collect_tensors_to_dump_node override that deduplicates and collects ragged-offset tensors for SEQ_LEN, input/output tensors Q/K/V/O/dO/Stats/dQ/dK/dV.

Repro Tool Tensor Lookup and Payload Validation

Layer / File(s) Summary
Repro utils tensor_entry and payload validation
tools/cudnn_repro/cudnn_repro/utils.py
Refactors tensor_entry to lookup by tid from list instead of dict keys/uid/name; adds validate_payload_version; introduces has_ragged_offset_inputs and is_ragged_payload ragged-detection helpers; simplifies add_ragged_tensor_names to always set empty list.
Operations detection with payload validation
tools/cudnn_repro/cudnn_repro/operations.py
Updates detect_operation_key to validate payload version before inspecting nodes.
SDPA forward with new tensor entry API
tools/cudnn_repro/cudnn_repro/sdpa_fwd.py
Updates build_cfg to validate payload version, refactors tensor extraction to use inputs/outputs keys with new tensor_entry(tensors, hint) API (no node_name), replaces ragged detection with is_ragged_payload helper, includes is_ragged in is_padding condition.
SDPA backward with forward payload rewrite
tools/cudnn_repro/cudnn_repro/sdpa_bwd.py
Removes ragged/paged detection from tensor fields; expands _as_forward_payload to include ragged offset fields; validates payload version in build_cfg; updates Stats tensor extraction; delegates extract_seq_and_ragged to sdpa_fwd.
SDPA FP8 forward with new APIs
tools/cudnn_repro/cudnn_repro/sdpa_fp8_fwd.py
Updates build_cfg to validate payload version, refactors all tensor extractions to use new tensor_entry(tensors, hint) pattern, replaces ragged detection with is_ragged_payload, includes is_ragged in is_padding.
SDPA FP8 backward with new APIs
tools/cudnn_repro/cudnn_repro/sdpa_fp8_bwd.py
Updates build_cfg and extract_seq_and_ragged to validate payload version, treat tensors as list, use new tensor_entry(tensors, label) API without node_name, replace ragged detection with is_ragged_payload.
Log parser gid markers and tensor dump injection
tools/cudnn_repro/cudnn_repro/log_parser.py
Updates iter_context_entries to prioritize "Executing gid" markers for execution order; adds tensor dump regex and parsing helpers; injects pass_by_value data into matched graph entries; falls back to serialized order when no execution markers.

Test Infrastructure and Payload Updates

Layer / File(s) Summary
Tensor list helper for payload formatting
tools/cudnn_repro/tests/helpers.py
Introduces tensor_list(...) helper that converts tensor mapping to deterministically ordered list with tid keys.
SDPA backward test payloads and ragged test
tools/cudnn_repro/tests/test_cudnn_repro_bwd.py
Updates backward config test payloads to json_version/gid/tensor_list format; adds test_build_bwd_cfg_supports_ragged to verify ragged+padding behavior with sequence and offset tensors; removes old negative test.
CLI test payload with gid and tensor_list
tools/cudnn_repro/tests/test_cudnn_repro_cli.py
Updates fwd_payload to accept gid instead of graph_uid, emit json_version/gid in payload, use tensor_list for tensors; updates debug assertions to check gid.
Closed loop payload tensor list normalization
tools/cudnn_repro/tests/test_cudnn_repro_closed_loop.py
Updates _normalize_payload to treat tensors as list with tid keys, builds tid-lookup for reference resolution, normalizes tensor array from list.
FP8 test payloads and ragged detection test
tools/cudnn_repro/tests/test_cudnn_repro_fp8.py
Updates FP8 backward/forward payload generators to include json_version/gid; refactors backward ragged handling to use tensor_list; adds test_build_fp8_fwd_cfg_detects_ragged_from_offset_inputs to verify ragged detection from offset inputs.
Log parser test module with gid markers
tools/cudnn_repro/tests/test_cudnn_repro_log_parser.py
Adds new test module with payload(...) helper and four tests validating gid-based execution ordering, fallback sequential association, tensor dump isolation, and most-recent tensor dump preference; replaces old graph_uid tests.
Schema test payload and new version/ragged tests
tools/cudnn_repro/tests/test_cudnn_repro_schema.py
Updates fwd_payload to use gid and tensor_list; adds test_select_operation_rejects_old_graph_json, test_build_cfg_detects_ragged_from_offset_inputs; renames and updates ragged-log test; removes graph_uid/legacy variants.

🎯 4 (Complex) | ⏱️ ~60 minutes

🐰 With tid and gid so bright,
The tensors now dance in lists—no more dicts!
From backward to forward, from log to payload,
Each tensor finds home by its ID, well played!
🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.26% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Revamp graph JSON serialization for repro tool' is clear, concise, and accurately summarizes the main change: a comprehensive overhaul of graph JSON format with new gid/tid schema and updates to the repro tool.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
tools/cudnn_repro/tests/test_cudnn_repro_fp8.py (1)

179-186: ⚡ Quick win

Replace lambda assignment with a def statement.

Lambda assignments reduce readability and are flagged by Ruff E731. Use a named function instead.

♻️ Refactor to use def
-    payload = lambda tag: {"json_version": "2.0", "gid": 1, "nodes": [{"tag": tag}], "tensors": []}
-
-    assert operations.detect_operation_key(payload("SDPA_FWD")) == "sdpa_fwd"
-    assert operations.detect_operation_key(payload("SDPA_BWD")) == "sdpa_bwd"
-    assert operations.detect_operation_key(payload("SDPA_FP8_FWD")) == "sdpa_fp8_fwd"
-    assert operations.detect_operation_key(payload("SDPA_FP8_BWD")) == "sdpa_fp8_bwd"
-    assert operations.detect_operation_key(payload("SDPA_MXFP8_FWD")) == "sdpa_fp8_fwd"
-    assert operations.detect_operation_key(payload("SDPA_MXFP8_BWD")) == "sdpa_fp8_bwd"
+    def payload(tag):
+        return {"json_version": "2.0", "gid": 1, "nodes": [{"tag": tag}], "tensors": []}
+
+    assert operations.detect_operation_key(payload("SDPA_FWD")) == "sdpa_fwd"
+    assert operations.detect_operation_key(payload("SDPA_BWD")) == "sdpa_bwd"
+    assert operations.detect_operation_key(payload("SDPA_FP8_FWD")) == "sdpa_fp8_fwd"
+    assert operations.detect_operation_key(payload("SDPA_FP8_BWD")) == "sdpa_fp8_bwd"
+    assert operations.detect_operation_key(payload("SDPA_MXFP8_FWD")) == "sdpa_fp8_fwd"
+    assert operations.detect_operation_key(payload("SDPA_MXFP8_BWD")) == "sdpa_fp8_bwd"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/cudnn_repro/tests/test_cudnn_repro_fp8.py` around lines 179 - 186, The
test defines payload as a lambda (payload = lambda tag: {...}) which triggers
Ruff E731; replace it with a normal named function (def payload(tag): return
{...}) keeping the same body and usages so the subsequent asserts calling
payload("SDPA_FWD") etc. and operations.detect_operation_key remain unchanged;
update only the payload definition in tests/test_cudnn_repro_fp8.py to remove
the lambda and use def to satisfy the linter.
tools/cudnn_repro/cudnn_repro/log_parser.py (1)

36-41: 💤 Low value

Minor type hint style inconsistency.

This function uses lowercase tuple and list for generics, while _parse_context_entry at line 25 and _apply_tensor_dumps at line 43 use Tuple from typing. Consider using a consistent style throughout the module—either the imported Tuple/List or the built-in lowercase forms.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/cudnn_repro/cudnn_repro/log_parser.py` around lines 36 - 41, The return
type annotation in _parse_tensor_dump currently mixes built-in generics
(tuple[int, list[int]] | None) while other functions like _parse_context_entry
and _apply_tensor_dumps use typing.Tuple/typing.List; make the module consistent
by changing _parse_tensor_dump's signature to use the same typing forms (e.g.,
Tuple[int, List[int]] | None) or conversely update the other functions to
built-in generics—also update/ensure the appropriate imports (Tuple, List) are
present or remove them if switching all to built-ins so all functions
(_parse_tensor_dump, _parse_context_entry, _apply_tensor_dumps) use the same
style.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/cudnn_repro/cudnn_repro/log_parser.py`:
- Around line 79-86: The tensor-dump accumulator current_dumps should be reset
whenever an execution marker is seen to avoid misattributing dumps; in the
EXECUTE_GRAPH_PATTERN handling block (where gid = int(match.group(1)),
current_entry = graph_entries_by_gid.get(gid) and
execution_entries.append(_apply_tensor_dumps(current_entry, current_dumps)) is
called) move or add the statement current_dumps = {} so it executes
unconditionally when a new execution marker is encountered (i.e., outside the
inner if current_entry is not None), ensuring dumps are cleared regardless of
whether graph_entries_by_gid returned None.

In `@tools/cudnn_repro/cudnn_repro/sdpa_fp8_bwd.py`:
- Around line 144-152: The code returns ragged offsets using
inputs.get("RAGGED_OFFSET_Q") and inputs.get("RAGGED_OFFSET_KV") but misses the
plural key fallback; update the ragged_offset_q and ragged_offset_kv expressions
to try the singular then fall back to the plural keys (e.g. use
inputs.get("RAGGED_OFFSET_Q") or inputs.get("RAGGED_OFFSETS_Q") and similarly
for KV) before passing to utils.tensor_entry/utils.seq_len so behavior matches
sdpa_fp8_fwd.py and sdpa_fwd.extract_seq_and_ragged; keep the other symbols
(payload, tensors, inputs, utils.seq_len, utils.tensor_entry, seed) unchanged.

In `@tools/cudnn_repro/cudnn_repro/sdpa_fp8_fwd.py`:
- Around line 146-154: The function returning seq_len/ragged fields currently
only looks up "RAGGED_OFFSET_Q" and "RAGGED_OFFSET_KV" in the inputs dict and
misses plural keys; update the lookups to fall back to "RAGGED_OFFSETS_Q" and
"RAGGED_OFFSETS_KV" (like sdpa_fwd.extract_seq_and_ragged does) so
ragged_offset_q and ragged_offset_kv use inputs.get("RAGGED_OFFSET_Q") or
inputs.get("RAGGED_OFFSETS_Q") and inputs.get("RAGGED_OFFSET_KV") or
inputs.get("RAGGED_OFFSETS_KV") respectively when calling
utils.tensor_entry/tensor_seq_len, leaving other fields (seq_len_q, seq_len_kv,
rng_data_seed) unchanged.

---

Nitpick comments:
In `@tools/cudnn_repro/cudnn_repro/log_parser.py`:
- Around line 36-41: The return type annotation in _parse_tensor_dump currently
mixes built-in generics (tuple[int, list[int]] | None) while other functions
like _parse_context_entry and _apply_tensor_dumps use typing.Tuple/typing.List;
make the module consistent by changing _parse_tensor_dump's signature to use the
same typing forms (e.g., Tuple[int, List[int]] | None) or conversely update the
other functions to built-in generics—also update/ensure the appropriate imports
(Tuple, List) are present or remove them if switching all to built-ins so all
functions (_parse_tensor_dump, _parse_context_entry, _apply_tensor_dumps) use
the same style.

In `@tools/cudnn_repro/tests/test_cudnn_repro_fp8.py`:
- Around line 179-186: The test defines payload as a lambda (payload = lambda
tag: {...}) which triggers Ruff E731; replace it with a normal named function
(def payload(tag): return {...}) keeping the same body and usages so the
subsequent asserts calling payload("SDPA_FWD") etc. and
operations.detect_operation_key remain unchanged; update only the payload
definition in tests/test_cudnn_repro_fp8.py to remove the lambda and use def to
satisfy the linter.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 24bba5a3-7814-4537-8eb9-f6ea51621510

📥 Commits

Reviewing files that changed from the base of the PR and between 035b520 and 1e0e7f4.

📒 Files selected for processing (20)
  • include/cudnn_frontend/graph_interface.h
  • include/cudnn_frontend/graph_properties.h
  • include/cudnn_frontend/node/scaled_dot_product_flash_attention.h
  • include/cudnn_frontend/node_interface.h
  • include/cudnn_frontend/utils/serialize.h
  • tools/cudnn_repro/cudnn_repro/log_parser.py
  • tools/cudnn_repro/cudnn_repro/operations.py
  • tools/cudnn_repro/cudnn_repro/sdpa_bwd.py
  • tools/cudnn_repro/cudnn_repro/sdpa_fp8_bwd.py
  • tools/cudnn_repro/cudnn_repro/sdpa_fp8_fwd.py
  • tools/cudnn_repro/cudnn_repro/sdpa_fwd.py
  • tools/cudnn_repro/cudnn_repro/utils.py
  • tools/cudnn_repro/tests/helpers.py
  • tools/cudnn_repro/tests/test_cudnn_repro_bwd.py
  • tools/cudnn_repro/tests/test_cudnn_repro_cli.py
  • tools/cudnn_repro/tests/test_cudnn_repro_closed_loop.py
  • tools/cudnn_repro/tests/test_cudnn_repro_fp8.py
  • tools/cudnn_repro/tests/test_cudnn_repro_graph_uid.py
  • tools/cudnn_repro/tests/test_cudnn_repro_log_parser.py
  • tools/cudnn_repro/tests/test_cudnn_repro_schema.py
💤 Files with no reviewable changes (1)
  • tools/cudnn_repro/tests/test_cudnn_repro_graph_uid.py

Comment thread tools/cudnn_repro/cudnn_repro/log_parser.py
Comment thread tools/cudnn_repro/cudnn_repro/sdpa_fp8_bwd.py
Comment thread tools/cudnn_repro/cudnn_repro/sdpa_fp8_fwd.py
@Anerudhan Anerudhan self-requested a review June 5, 2026 06:11
std::optional<pass_by_values_t> compile_time_constant_value = std::nullopt;

TensorReordering_t reordering_type = TensorReordering_t::NONE;
tid_t tid = 0;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is tid ? What is it used for?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tensor ID. It's an immutable ID that's generated on creation and cannot be changed.

The reason we do this is when we dump the graph JSON at compilation time, the UIDs actually change from the UIDs that are seen at execution time.

@hwanseoc hwanseoc changed the title Revamp graph JSON serialization for repro tool Change graph JSON serialization for repro tool Jun 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants