191 changes: 98 additions & 93 deletions README.md
@@ -1,48 +1,61 @@
# ROCmKernelWiki — AMD CDNA / RDNA Kernel Optimization Knowledge Base

> **Knowledge cutoff: 2026-05-15.** All upstream PRs, doc snapshots, and blog
> summaries are anchored to upstream state on or before this date (recorded in
> [`data/refresh-cutoff.yaml`](data/refresh-cutoff.yaml)). Tool versions are
> pinned in [`data/tool-versions.yaml`](data/tool-versions.yaml) (ROCm 7.0.2,
> Composable Kernel 1.1.0, Triton 3.4.0, …). To advance the cutoff, re-run
> `scripts/harvest_prs.py`, regenerate indices, and bump the cutoff file.

A structured knowledge base of **AMD Instinct & Radeon GPU kernel optimization**
for CDNA3 (gfx942 / MI300), CDNA4 (gfx950 / MI350–MI355X), and RDNA4 (gfx1201),
packaged as a Claude Code skill. The repository root **is** the skill directory —
clone it into `~/.claude/skills/` and it works out of the box.
A structured, agent-queryable knowledge base of **AMD Instinct & Radeon GPU kernel
optimization** for CDNA3 (gfx942 / MI300), CDNA4 (gfx950 / MI350–MI355X), and RDNA4
(gfx1201), **packaged as a Claude Code skill**. The repository root **is** the skill
directory — clone it into `~/.claude/skills/` and it works out of the box.

## Acknowledgements & Citation
> **Knowledge cutoff 2026-05-15** (PRs/docs/blogs anchored at or before this date; tool
> versions pinned in [`data/tool-versions.yaml`](data/tool-versions.yaml)). The
> gfx950 hardware/numeric facts and all 12 runnable examples were additionally
> **re-verified on real MI350X silicon (ROCm 7.2)** — see below.

This project is **inspired by and modeled on** the excellent
[**KernelWiki**](https://github.com/mit-han-lab/KernelWiki) from
**MIT Han Lab** — their structured, agent-queryable knowledge base for NVIDIA
Blackwell/Hopper kernel optimization. ROCmKernelWiki adapts the same three-layer
architecture (`sources/` → `wiki/` → `queries/`), the YAML-frontmatter page
schema, and the skill packaging, retargeting all content to the AMD/ROCm
ecosystem. The KernelWiki three-layer design itself follows
[Karpathy's LLM-wiki pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f).
## Hardware Scope

If you use this knowledge base, please cite both:
| Marketing | gfx | Arch | FP8 | Matrix unit | Wave |
|---|---|---|---|---|---|
| MI300A / MI300X / MI325X | `gfx942` | CDNA3 | **FNUZ** | MFMA | wave64 |
| **MI350X / MI355X** | **`gfx950`** | **CDNA4** | **OCP** + FP6/FP4/MX | MFMA | wave64 |
| Radeon AI PRO R9700 | `gfx1201` | RDNA4 | OCP | **WMMA** | wave32/64 |

```bibtex
@misc{rocmkernelwiki2026,
title = {ROCmKernelWiki: An AMD CDNA/RDNA GPU Kernel Optimization Knowledge Base},
author = {ROCmKernelWiki contributors},
year = {2026},
howpublished = {\url{https://github.com/jhinpan/ROCmKernelWiki}},
note = {Inspired by MIT Han Lab's KernelWiki}
}
> The headline portability gotcha: **gfx942 FP8 (FNUZ) is not bit-compatible with
> gfx950 FP8 (OCP)**. See [`wiki/migration/gfx942-to-gfx950.md`](wiki/migration/gfx942-to-gfx950.md).

@misc{kernelwiki2026,
title = {KernelWiki: Blackwell \& Hopper Kernel Optimization Knowledge Base},
author = {MIT Han Lab},
year = {2026},
howpublished = {\url{https://github.com/mit-han-lab/KernelWiki}}
}
```
## Validated on real silicon (MI350X / gfx950)

This is a community project. It is **not** an official AMD or ROCm product.
Unlike a docs-only wiki, the gfx950 claims here were **checked on an actual AMD Instinct
MI350X** (ROCm 7.2) by compiling, running, and disassembling code — each finding re-run by
an adversarial second pass. Full evidence: [`VERIFICATION.md`](VERIFICATION.md) and
[`data/hardware-verified.yaml`](data/hardware-verified.yaml).

- **Hardware facts re-grounded on silicon** and corrected where the GPU disagreed with the
docs: gfx950 cross-lane is `v_permlane16_swap` (not the RDNA selector form); **32 waves/CU**
(not 40); direct-to-LDS is ≤16 B on gfx950 / ≤4 B on gfx942; compute modes SPX/DPX/QPX/CPX,
memory NPS1/NPS2; native `xf32` MFMA *fails to select* on gfx950.
- **All 12 runnable examples** build with `--offload-arch=gfx950` **and execute** on the GPU
(11/12 self-check; `fp8-gemm`'s `main()` only verifies the emitted MFMA, no numeric check).
- **First-party FlyDSL kernel sweep on MI350X** — every major FlyDSL gfx950 kernel was
profiled with rocprofv3 ATT + counters against matched AITER/CK/hipBLASLt baselines.
The detailed verdict table, root-cause notes, and dashboard links live in the
canonical [`ref-flydsl-kernel-profiling`](sources/refs/ref-flydsl-kernel-profiling.md)
source page; synthesized pages link back to it instead of duplicating the full summary.

## What's Here

- **7,400+ PR reference pages** from ROCm/composable_kernel, ROCm/aiter,
ROCm/hipBLASLt, ROCm/Tensile, ROCm/rocBLAS, ROCm/flash-attention, ROCm/FlyDSL,
ROCm/triton, plus ROCm-filtered vllm-project/vllm and sgl-project/sglang
- **~54 synthesized wiki pages** — hardware features, optimization techniques,
kernel case studies, problem patterns, DSL/language guides, migration guides
- **20 doc/blog summaries** (AMD CDNA3/CDNA4 ISA, whitepapers, ROCm blogs) and
**9 reference-repository studies** (FlyDSL, the FlyDSL MI350X profiling sweep,
gcnasm, Composable Kernel, rocWMMA, AITER, hipBLASLt, Tensile, the Matrix Instruction Calculator)
- **9 candidate ledgers** in `candidates/` recording the include/defer/exclude
decision for every scanned PR
- **6 auto-generated cross-reference indices** under `queries/`
- **959 real upstream PR diffs** under `artifacts/prs/<repo>/PR-<N>/` (byte-capped, SHA-256-pinned via `PROVENANCE.yaml`)
- **12 runnable kernel examples** under `examples/` — compiled with hipcc; all 12 build with
`--offload-arch=gfx950` and run on an MI350X (see [`VERIFICATION.md`](VERIFICATION.md))

## Install as a Claude Code Skill

@@ -53,49 +66,16 @@ pip install -r ~/.claude/skills/ROCmKernelWiki/requirements.txt

The skill auto-registers (`SKILL.md` lives at the clone root) and the query
scripts auto-resolve the wiki root to their own directory — no environment
variable required.
variable required. Optional override: `export ROCM_WIKI_ROOT=/path/to/ROCmKernelWiki`.
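
The root-resolution behaviour described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function name `resolve_wiki_root` and its parameters are hypothetical, and it assumes each tool lives in `scripts/` directly under the clone root, as the smoke-test commands suggest.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_wiki_root(script_path: str,
                      env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the wiki root for a tool located at `script_path`.

    Hypothetical helper: the real scripts may resolve the root differently.
    """
    env = os.environ if env is None else env
    override = env.get("ROCM_WIKI_ROOT")
    if override:
        # An explicit override wins, e.g. when the scripts were relocated.
        return Path(override).expanduser().resolve()
    # Assumed layout: <root>/scripts/<tool>.py, so the root is two levels up.
    return Path(script_path).resolve().parent.parent
```

A script would call this as `resolve_wiki_root(__file__)`. Passing `env` explicitly keeps the helper testable without touching the real process environment.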

Smoke test:

```bash
cd ~/.claude/skills/ROCmKernelWiki
python3 scripts/query.py --tag mfma --type hardware --compact
python3 scripts/get_page.py kernel-fp8-gemm --frontmatter-only
```

Optional override for relocating the scripts:

```bash
export ROCM_WIKI_ROOT=/path/to/ROCmKernelWiki
python3 scripts/get_page.py kernel-flydsl-flash-attention --frontmatter-only
```

## What's Here

- **7,400+ PR reference pages** from ROCm/composable_kernel, ROCm/aiter,
ROCm/hipBLASLt, ROCm/Tensile, ROCm/rocBLAS, ROCm/flash-attention, ROCm/FlyDSL,
ROCm/triton, plus ROCm-filtered vllm-project/vllm and sgl-project/sglang
- **~53 synthesized wiki pages** — hardware features, optimization techniques,
kernel case studies, problem patterns, DSL/language guides, migration guides
- **20 doc/blog summaries** (AMD CDNA3/CDNA4 ISA, whitepapers, ROCm blogs) and
**8 reference-repository studies** (FlyDSL, gcnasm, Composable Kernel, rocWMMA,
AITER, hipBLASLt, Tensile, the Matrix Instruction Calculator)
- **9 candidate ledgers** in `candidates/` recording the include/defer/exclude
decision for every scanned PR
- **6 auto-generated cross-reference indices** under `queries/`
- **959 real upstream PR diffs** under `artifacts/prs/<repo>/PR-<N>/` (byte-capped, SHA-256-pinned via `PROVENANCE.yaml`) for the highest-signal kernel PRs
- **12 runnable kernel examples** under `examples/` — compiled with hipcc; **all 12 now build with `--offload-arch=gfx950` and run on an AMD Instinct MI350X (gfx950), ROCm 7.2** (11/12 self-checks pass; `fp8-gemm`'s `main()` only verifies the emitted MFMA instruction, it does not run a numeric GEMM). Portable ones also run + self-check on RDNA4/gfx1201. See [`VERIFICATION.md`](VERIFICATION.md).

## Hardware Scope

| Marketing | gfx | Arch | FP8 | Matrix unit | Wave |
|---|---|---|---|---|---|
| MI300A / MI300X / MI325X | `gfx942` | CDNA3 | **FNUZ** | MFMA | wave64 |
| MI350X / MI355X | `gfx950` | CDNA4 | **OCP** + FP6/FP4/MX | MFMA | wave64 |
| Radeon AI PRO R9700 | `gfx1201` | RDNA4 | OCP | **WMMA** | wave32/64 |

> The headline portability gotcha: **gfx942 FP8 (FNUZ) is not bit-compatible
> with gfx950 FP8 (OCP)**. See [`wiki/migration/gfx942-to-gfx950.md`](wiki/migration/gfx942-to-gfx950.md).
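
To make the incompatibility concrete, here is a minimal sketch that decodes the same E4M3 byte under both conventions. It is based on the published OCP FP8 (E4M3) and FNUZ (E4M3FNUZ) format definitions rather than on anything in this repository, covers only the E4M3 variant, and the function name `decode_fp8_e4m3` is hypothetical.

```python
import math


def decode_fp8_e4m3(byte: int, fmt: str) -> float:
    """Decode one FP8 E4M3 byte as 'ocp' (bias 7) or 'fnuz' (bias 8)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in [0, 255]")
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF   # 4 exponent bits
    mant = byte & 0x7         # 3 mantissa bits
    if fmt == "ocp":
        bias = 7
        if exp == 0xF and mant == 0x7:
            return math.nan   # S.1111.111 is NaN; the format has no infinities
    elif fmt == "fnuz":
        bias = 8
        if byte == 0x80:
            return math.nan   # the would-be negative zero is the only NaN
    else:
        raise ValueError("fmt must be 'ocp' or 'fnuz'")
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * (mant / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - bias)


if __name__ == "__main__":
    for b in (0x38, 0x7F, 0x80):
        print(hex(b), decode_fp8_e4m3(b, "ocp"), decode_fp8_e4m3(b, "fnuz"))
```

Because the exponent bias differs by one, an ordinary value read under the wrong convention is off by a factor of two, and the NaN encodings do not line up either. Reinterpreting a buffer is therefore not enough; the values have to be re-encoded.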

## Query Tools

| Tool | Purpose |
@@ -106,7 +86,6 @@ export ROCM_WIKI_ROOT=/path/to/ROCmKernelWiki

```bash
python3 scripts/query.py "flash attention ck-tile" --limit 5
python3 scripts/query.py --tag XDLOP --type hardware --compact # alias → mfma
python3 scripts/query.py --architecture MI355X --type kernel # alias → gfx950
python3 scripts/get_page.py kernel-flash-attention-ck --follow-sources
python3 scripts/grep_wiki.py "v_mfma_f32_16x16x128_f8f6f4" --only wiki
@@ -116,22 +95,20 @@ python3 scripts/grep_wiki.py "v_mfma_f32_16x16x128_f8f6f4" --only wiki

Three layers (after MIT Han Lab's KernelWiki, in turn after Karpathy's LLM-wiki):

1. **`sources/`** — Raw data. Immutable summaries of PRs, docs, blogs, and
reference repos. Cross-referenced by `id`.
2. **`wiki/`** — Synthesized knowledge pages with YAML frontmatter
(subfolders: `hardware`, `techniques`, `kernels`, `patterns`, `languages`,
`migration`).
<p align="center"><img src="docs/architecture.svg" alt="ROCmKernelWiki three-layer architecture: sources → wiki → queries, gated by data/ and scripts/" width="780"></p>

1. **`sources/`** — Raw data. Immutable summaries of PRs, docs, blogs, and reference
repos. Cross-referenced by `id`.
2. **`wiki/`** — Synthesized knowledge pages with YAML frontmatter (subfolders:
`hardware`, `techniques`, `kernels`, `patterns`, `languages`, `migration`).
3. **`queries/`** — Auto-generated cross-reference indices. Do not edit by hand;
regenerate via `scripts/generate-indices.py`.

Supporting files:
- `data/schemas.yaml` — required/optional fields per page type
- `data/tags.yaml` — controlled vocabulary (validator-enforced)
- `data/aliases.yaml` — canonical → synonym map (MI300→gfx942, XDLOP→mfma, …)
- `data/inclusion-policy.yaml` — PR harvest classification policy
- `data/tool-versions.yaml`, `data/refresh-cutoff.yaml` — version/cutoff anchors
- `candidates/` — per-repo PR candidate ledgers
- `references/` — primer, schema, worked examples
Supporting files: `data/` holds the schema and controlled vocabulary
(`schemas.yaml`, `tags.yaml`, `aliases.yaml`, `inclusion-policy.yaml`,
`tool-versions.yaml`, `refresh-cutoff.yaml`, `hardware-verified.yaml`);
`candidates/` holds per-repo PR ledgers; `references/` holds the primer, schema, and
worked examples.

## Maintenance Tooling

@@ -140,8 +117,7 @@ Supporting files:
| `scripts/harvest_prs.py` | Harvest merged PRs from tracked ROCm repos (gh GraphQL) |
| `scripts/backfill_diffs.py` | Fetch real upstream diffs for top-ranked kernel PRs |
| `scripts/enrich_facets.py` | Infer techniques/hardware_features/kernel_types from paths + diffs |
| `scripts/link_prs.py` | Build the bidirectional PR↔wiki bridge (`implemented_by` / `related`) |
| `scripts/summarize_diffs.py` | Write compact `diff_summary.md` per PR bundle |
| `scripts/link_prs.py` | Build the bidirectional PR↔wiki bridge |
| `scripts/gen_source_anchors.py` | (Re)generate doc/blog/ref source anchor pages |
| `scripts/generate-indices.py` | Regenerate `queries/*.md` from frontmatter |
| `scripts/validate.py` | Validate frontmatter, vocabulary, links, version-claims, freshness |
@@ -155,17 +131,16 @@ python3 scripts/validate.py # schema + vocabulary + link integrity
python3 scripts/generate-indices.py # regenerate query indices
```

## Quality Gates (cutoff 2026-05-15)
### Quality Gates

- 0 validation errors (schema, controlled vocabulary, link integrity)
- Every hardware fact traces to an official AMD ISA doc / whitepaper
- Every technique/kernel/language page has a compilable code snippet
- Every PR page carries `inclusion_reason` and `status: merged`
- `verified` pages carry `evidence_basis` (official-doc + upstream-code/paper)
- 0 dangling internal references (frontmatter ids **and** in-body relative links)
- **gfx950 hardware/numeric claims re-verified on real MI350X silicon (ROCm 7.2),
double-checked by an adversarial re-run** — see [`VERIFICATION.md`](VERIFICATION.md)
and [`data/hardware-verified.yaml`](data/hardware-verified.yaml)
- **gfx950 hardware/numeric claims re-verified on real MI350X silicon (ROCm 7.2)** —
see [`VERIFICATION.md`](VERIFICATION.md) and [`data/hardware-verified.yaml`](data/hardware-verified.yaml)
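
As an illustration of the "0 dangling internal references" gate, the in-body half of that check could look roughly like the sketch below. It is an assumption about the approach, not the logic of `scripts/validate.py`: the helper name `dangling_links` is hypothetical, and it only handles inline Markdown links, not frontmatter ids.

```python
import re
from pathlib import Path
from typing import List

# Inline Markdown links and images: [text](target) or ![alt](target)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Anything with a URL scheme (https:, mailto:, ...) is treated as external.
SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def dangling_links(page: Path) -> List[str]:
    """Return relative link targets in `page` that do not exist on disk."""
    missing = []
    text = page.read_text(encoding="utf-8")
    for target in LINK_RE.findall(text):
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue  # external URL or same-page anchor
        rel = target.split("#", 1)[0]  # drop any fragment
        if not (page.parent / rel).exists():
            missing.append(target)
    return missing
```

Running this over every page under `wiki/` and `sources/` and failing on a non-empty result would approximate the gate; a real validator would also need to skip links that appear inside fenced code blocks.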

## License

Expand All @@ -174,4 +149,34 @@ Wiki synthesis pages are derivative works that cite their upstream sources; PR
summary pages link to and summarize publicly available upstream PR metadata, with
the upstream repositories remaining the authoritative source of truth. AMD,
Instinct, Radeon, CDNA, and ROCm are trademarks of Advanced Micro Devices, Inc.;
this project is unaffiliated with AMD.
this project is unaffiliated with AMD. It is **not** an official AMD or ROCm product.

## Acknowledgements & Citation

This project is **inspired by and modeled on** the excellent
[**KernelWiki**](https://github.com/mit-han-lab/KernelWiki) from **MIT Han Lab** —
their structured, agent-queryable knowledge base for NVIDIA Blackwell/Hopper kernel
optimization. ROCmKernelWiki adapts the same three-layer architecture
(`sources/` → `wiki/` → `queries/`), the YAML-frontmatter page schema, and the skill
packaging, retargeting all content to the AMD/ROCm ecosystem. The KernelWiki three-layer
design itself follows
[Karpathy's LLM-wiki pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f).

If you use this knowledge base, please cite both:

```bibtex
@misc{rocmkernelwiki2026,
title = {ROCmKernelWiki: An AMD CDNA/RDNA GPU Kernel Optimization Knowledge Base},
author = {ROCmKernelWiki contributors},
year = {2026},
howpublished = {\url{https://github.com/jhinpan/ROCmKernelWiki}},
note = {Inspired by MIT Han Lab's KernelWiki}
}

@misc{kernelwiki2026,
title = {KernelWiki: Blackwell \& Hopper Kernel Optimization Knowledge Base},
author = {MIT Han Lab},
year = {2026},
howpublished = {\url{https://github.com/mit-han-lab/KernelWiki}}
}
```
1 change: 1 addition & 0 deletions data/schemas.yaml
@@ -146,6 +146,7 @@ wiki-kernel:
- shape
- utilization
- baseline
- bucket

wiki-pattern:
required:
4 changes: 4 additions & 0 deletions data/tags.yaml
@@ -162,6 +162,10 @@ misc_tags:
- lds
- swizzle
- epilogue-fusion
- profiling # runtime profiling / perf analysis
- rocprofv3 # rocprofv3 ATT + counter collection
- kernel-profiling # per-kernel trace/counter sweep
- register-pressure # occupancy capped by VGPR/AGPR live set

confidence:
- verified
67 changes: 67 additions & 0 deletions docs/architecture.svg