jhinpan · jhinpan · Jun 11, 2026 · Jun 8, 2026 · Jun 8, 2026
diff --git a/README.md b/README.md
@@ -1,48 +1,61 @@
 # ROCmKernelWiki — AMD CDNA / RDNA Kernel Optimization Knowledge Base
 
-> **Knowledge cutoff: 2026-05-15.** All upstream PRs, doc snapshots, and blog
-> summaries are anchored to upstream state on or before this date (recorded in
-> [`data/refresh-cutoff.yaml`](data/refresh-cutoff.yaml)). Tool versions are
-> pinned in [`data/tool-versions.yaml`](data/tool-versions.yaml) (ROCm 7.0.2,
-> Composable Kernel 1.1.0, Triton 3.4.0, …). To advance the cutoff, re-run
-> `scripts/harvest_prs.py`, regenerate indices, and bump the cutoff file.
-
-A structured knowledge base of **AMD Instinct & Radeon GPU kernel optimization**
-for CDNA3 (gfx942 / MI300), CDNA4 (gfx950 / MI350–MI355X), and RDNA4 (gfx1201),
-packaged as a Claude Code skill. The repository root **is** the skill directory —
-clone it into `~/.claude/skills/` and it works out of the box.
+A structured, agent-queryable knowledge base of **AMD Instinct & Radeon GPU kernel
+optimization** for CDNA3 (gfx942 / MI300), CDNA4 (gfx950 / MI350–MI355X), and RDNA4
+(gfx1201), **packaged as a Claude Code skill**. The repository root **is** the skill
+directory — clone it into `~/.claude/skills/` and it works out of the box.
 
-## Acknowledgements & Citation
+> **Knowledge cutoff 2026-05-15** (PRs/docs/blogs anchored at or before this date; tool
+> versions pinned in [`data/tool-versions.yaml`](data/tool-versions.yaml)). The
+> gfx950 hardware/numeric facts and all 12 runnable examples were additionally
+> **re-verified on real MI350X silicon (ROCm 7.2)** — see below.
 
-This project is **inspired by and modeled on** the excellent
-[**KernelWiki**](https://github.com/mit-han-lab/KernelWiki) from
-**MIT Han Lab** — their structured, agent-queryable knowledge base for NVIDIA
-Blackwell/Hopper kernel optimization. ROCmKernelWiki adapts the same three-layer
-architecture (`sources/` → `wiki/` → `queries/`), the YAML-frontmatter page
-schema, and the skill packaging, retargeting all content to the AMD/ROCm
-ecosystem. The KernelWiki three-layer design itself follows
-[Karpathy's LLM-wiki pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f).
+## Hardware Scope
 
-If you use this knowledge base, please cite both:
+| Marketing | gfx | Arch | FP8 | Matrix unit | Wave |
+|---|---|---|---|---|---|
+| MI300A / MI300X / MI325X | `gfx942` | CDNA3 | **FNUZ** | MFMA | wave64 |
+| **MI350X / MI355X** | **`gfx950`** | **CDNA4** | **OCP** + FP6/FP4/MX | MFMA | wave64 |
+| Radeon AI PRO R9700 | `gfx1201` | RDNA4 | OCP | **WMMA** | wave32/64 |
 
-```bibtex
-@misc{rocmkernelwiki2026,
-  title  = {ROCmKernelWiki: An AMD CDNA/RDNA GPU Kernel Optimization Knowledge Base},
-  author = {ROCmKernelWiki contributors},
-  year   = {2026},
-  howpublished = {\url{https://github.com/jhinpan/ROCmKernelWiki}},
-  note   = {Inspired by MIT Han Lab's KernelWiki}
-}
+> The headline portability gotcha: **gfx942 FP8 (FNUZ) is not bit-compatible with
+> gfx950 FP8 (OCP)**. See [`wiki/migration/gfx942-to-gfx950.md`](wiki/migration/gfx942-to-gfx950.md).
 
-@misc{kernelwiki2026,
-  title  = {KernelWiki: Blackwell \& Hopper Kernel Optimization Knowledge Base},
-  author = {MIT Han Lab},
-  year   = {2026},
-  howpublished = {\url{https://github.com/mit-han-lab/KernelWiki}}
-}
-```
+## Validated on real silicon (MI350X / gfx950)
 
-This is a community project. It is **not** an official AMD or ROCm product.
+Unlike a docs-only wiki, the gfx950 claims here were **checked on an actual AMD Instinct
+MI350X** (ROCm 7.2) by compiling, running, and disassembling code — each finding re-run by
+an adversarial second pass. Full evidence: [`VERIFICATION.md`](VERIFICATION.md) and
+[`data/hardware-verified.yaml`](data/hardware-verified.yaml).
+
+- **Hardware facts re-grounded on silicon** and corrected where the GPU disagreed with the
+  docs: gfx950 cross-lane is `v_permlane16_swap` (not the RDNA selector form); **32 waves/CU**
+  (not 40); direct-to-LDS is ≤16 B on gfx950 / ≤4 B on gfx942; compute modes SPX/DPX/QPX/CPX,
+  memory NPS1/NPS2; native `xf32` MFMA *fails to select* on gfx950.
+- **All 12 runnable examples** build with `--offload-arch=gfx950` **and execute** on the GPU
+  (11/12 self-checks pass; `fp8-gemm`'s `main()` only verifies the emitted MFMA instruction and does not run a numeric GEMM).
+- **First-party FlyDSL kernel sweep on MI350X** — every major FlyDSL gfx950 kernel was
+  profiled with rocprofv3 ATT + counters against matched AITER/CK/hipBLASLt baselines.
+  The detailed verdict table, root-cause notes, and dashboard links live in the
+  canonical [`ref-flydsl-kernel-profiling`](sources/refs/ref-flydsl-kernel-profiling.md)
+  source page; synthesized pages link back to it instead of duplicating the full summary.
+
+## What's Here
+
+- **7,400+ PR reference pages** from ROCm/composable_kernel, ROCm/aiter,
+  ROCm/hipBLASLt, ROCm/Tensile, ROCm/rocBLAS, ROCm/flash-attention, ROCm/FlyDSL,
+  ROCm/triton, plus ROCm-filtered vllm-project/vllm and sgl-project/sglang
+- **~54 synthesized wiki pages** — hardware features, optimization techniques,
+  kernel case studies, problem patterns, DSL/language guides, migration guides
+- **20 doc/blog summaries** (AMD CDNA3/CDNA4 ISA, whitepapers, ROCm blogs) and
+  **9 reference-repository studies** (FlyDSL, the FlyDSL MI350X profiling sweep,
+  gcnasm, Composable Kernel, rocWMMA, AITER, hipBLASLt, Tensile, the Matrix Instruction Calculator)
+- **9 candidate ledgers** in `candidates/` recording the include/defer/exclude
+  decision for every scanned PR
+- **6 auto-generated cross-reference indices** under `queries/`
+- **959 real upstream PR diffs** under `artifacts/prs/<repo>/PR-<N>/` (byte-capped, SHA-256-pinned via `PROVENANCE.yaml`)
+- **12 runnable kernel examples** under `examples/` — compiled with hipcc; all 12 build with
+  `--offload-arch=gfx950` and run on an MI350X (see [`VERIFICATION.md`](VERIFICATION.md))
 
 ## Install as a Claude Code Skill
 
@@ -53,49 +66,16 @@ pip install -r ~/.claude/skills/ROCmKernelWiki/requirements.txt
 
 The skill auto-registers (`SKILL.md` lives at the clone root) and the query
 scripts auto-resolve the wiki root to their own directory — no environment
-variable required.
+variable required. Optional override: `export ROCM_WIKI_ROOT=/path/to/ROCmKernelWiki`.
 
 Smoke test:
 
 ```bash
 cd ~/.claude/skills/ROCmKernelWiki
 python3 scripts/query.py --tag mfma --type hardware --compact
-python3 scripts/get_page.py kernel-fp8-gemm --frontmatter-only
-```
-
-Optional override for relocating the scripts:
-
-```bash
-export ROCM_WIKI_ROOT=/path/to/ROCmKernelWiki
+python3 scripts/get_page.py kernel-flydsl-flash-attention --frontmatter-only
 ```
 
-## What's Here
-
-- **7,400+ PR reference pages** from ROCm/composable_kernel, ROCm/aiter,
-  ROCm/hipBLASLt, ROCm/Tensile, ROCm/rocBLAS, ROCm/flash-attention, ROCm/FlyDSL,
-  ROCm/triton, plus ROCm-filtered vllm-project/vllm and sgl-project/sglang
-- **~53 synthesized wiki pages** — hardware features, optimization techniques,
-  kernel case studies, problem patterns, DSL/language guides, migration guides
-- **20 doc/blog summaries** (AMD CDNA3/CDNA4 ISA, whitepapers, ROCm blogs) and
-  **8 reference-repository studies** (FlyDSL, gcnasm, Composable Kernel, rocWMMA,
-  AITER, hipBLASLt, Tensile, the Matrix Instruction Calculator)
-- **9 candidate ledgers** in `candidates/` recording the include/defer/exclude
-  decision for every scanned PR
-- **6 auto-generated cross-reference indices** under `queries/`
-- **959 real upstream PR diffs** under `artifacts/prs/<repo>/PR-<N>/` (byte-capped, SHA-256-pinned via `PROVENANCE.yaml`) for the highest-signal kernel PRs
-- **12 runnable kernel examples** under `examples/` — compiled with hipcc; **all 12 now build with `--offload-arch=gfx950` and run on an AMD Instinct MI350X (gfx950), ROCm 7.2** (11/12 self-checks pass; `fp8-gemm`'s `main()` only verifies the emitted MFMA instruction, it does not run a numeric GEMM). Portable ones also run + self-check on RDNA4/gfx1201. See [`VERIFICATION.md`](VERIFICATION.md).
-
-## Hardware Scope
-
-| Marketing | gfx | Arch | FP8 | Matrix unit | Wave |
-|---|---|---|---|---|---|
-| MI300A / MI300X / MI325X | `gfx942` | CDNA3 | **FNUZ** | MFMA | wave64 |
-| MI350X / MI355X | `gfx950` | CDNA4 | **OCP** + FP6/FP4/MX | MFMA | wave64 |
-| Radeon AI PRO R9700 | `gfx1201` | RDNA4 | OCP | **WMMA** | wave32/64 |
-
-> The headline portability gotcha: **gfx942 FP8 (FNUZ) is not bit-compatible
-> with gfx950 FP8 (OCP)**. See [`wiki/migration/gfx942-to-gfx950.md`](wiki/migration/gfx942-to-gfx950.md).
-
 ## Query Tools
 
 | Tool | Purpose |
@@ -106,7 +86,6 @@ export ROCM_WIKI_ROOT=/path/to/ROCmKernelWiki
 
 ```bash
 python3 scripts/query.py "flash attention ck-tile" --limit 5
-python3 scripts/query.py --tag XDLOP --type hardware --compact     # alias → mfma
 python3 scripts/query.py --architecture MI355X --type kernel       # alias → gfx950
 python3 scripts/get_page.py kernel-flash-attention-ck --follow-sources
 python3 scripts/grep_wiki.py "v_mfma_f32_16x16x128_f8f6f4" --only wiki
@@ -116,22 +95,20 @@ python3 scripts/grep_wiki.py "v_mfma_f32_16x16x128_f8f6f4" --only wiki
 
 Three layers (after MIT Han Lab's KernelWiki, in turn after Karpathy's LLM-wiki):
 
-1. **`sources/`** — Raw data. Immutable summaries of PRs, docs, blogs, and
-   reference repos. Cross-referenced by `id`.
-2. **`wiki/`** — Synthesized knowledge pages with YAML frontmatter
-   (subfolders: `hardware`, `techniques`, `kernels`, `patterns`, `languages`,
-   `migration`).
+<p align="center"><img src="docs/architecture.svg" alt="ROCmKernelWiki three-layer architecture: sources → wiki → queries, gated by data/ and scripts/" width="780"></p>
+
+1. **`sources/`** — Raw data. Immutable summaries of PRs, docs, blogs, and reference
+   repos. Cross-referenced by `id`.
+2. **`wiki/`** — Synthesized knowledge pages with YAML frontmatter (subfolders:
+   `hardware`, `techniques`, `kernels`, `patterns`, `languages`, `migration`).
 3. **`queries/`** — Auto-generated cross-reference indices. Do not edit by hand;
    regenerate via `scripts/generate-indices.py`.
 
-Supporting files:
-- `data/schemas.yaml` — required/optional fields per page type
-- `data/tags.yaml` — controlled vocabulary (validator-enforced)
-- `data/aliases.yaml` — canonical → synonym map (MI300→gfx942, XDLOP→mfma, …)
-- `data/inclusion-policy.yaml` — PR harvest classification policy
-- `data/tool-versions.yaml`, `data/refresh-cutoff.yaml` — version/cutoff anchors
-- `candidates/` — per-repo PR candidate ledgers
-- `references/` — primer, schema, worked examples
+Supporting files: `data/` holds the page schemas (`schemas.yaml`), the controlled
+vocabulary and alias map (`tags.yaml`, `aliases.yaml`), the PR harvest policy
+(`inclusion-policy.yaml`), and the version/cutoff anchors and hardware-verification
+record (`tool-versions.yaml`, `refresh-cutoff.yaml`, `hardware-verified.yaml`);
+`candidates/` holds per-repo PR ledgers; `references/` holds the primer, schema, and worked examples.
 
 ## Maintenance Tooling
 
@@ -140,8 +117,7 @@ Supporting files:
 | `scripts/harvest_prs.py` | Harvest merged PRs from tracked ROCm repos (gh GraphQL) |
 | `scripts/backfill_diffs.py` | Fetch real upstream diffs for top-ranked kernel PRs |
 | `scripts/enrich_facets.py` | Infer techniques/hardware_features/kernel_types from paths + diffs |
-| `scripts/link_prs.py` | Build the bidirectional PR↔wiki bridge (`implemented_by` / `related`) |
-| `scripts/summarize_diffs.py` | Write compact `diff_summary.md` per PR bundle |
+| `scripts/link_prs.py` | Build the bidirectional PR↔wiki bridge |
 | `scripts/gen_source_anchors.py` | (Re)generate doc/blog/ref source anchor pages |
 | `scripts/generate-indices.py` | Regenerate `queries/*.md` from frontmatter |
 | `scripts/validate.py` | Validate frontmatter, vocabulary, links, version-claims, freshness |
@@ -155,17 +131,16 @@ python3 scripts/validate.py            # schema + vocabulary + link integrity
 python3 scripts/generate-indices.py    # regenerate query indices
 ```
 
-## Quality Gates (cutoff 2026-05-15)
+### Quality Gates
 
 - 0 validation errors (schema, controlled vocabulary, link integrity)
 - Every hardware fact traces to an official AMD ISA doc / whitepaper
 - Every technique/kernel/language page has a compilable code snippet
 - Every PR page carries `inclusion_reason` and `status: merged`
 - `verified` pages carry `evidence_basis` (official-doc + upstream-code/paper)
 - 0 dangling internal references (frontmatter ids **and** in-body relative links)
-- **gfx950 hardware/numeric claims re-verified on real MI350X silicon (ROCm 7.2),
-  double-checked by an adversarial re-run** — see [`VERIFICATION.md`](VERIFICATION.md)
-  and [`data/hardware-verified.yaml`](data/hardware-verified.yaml)
+- **gfx950 hardware/numeric claims re-verified on real MI350X silicon (ROCm 7.2)** —
+  see [`VERIFICATION.md`](VERIFICATION.md) and [`data/hardware-verified.yaml`](data/hardware-verified.yaml)
 
 ## License
 
@@ -174,4 +149,34 @@ Wiki synthesis pages are derivative works that cite their upstream sources; PR
 summary pages link to and summarize publicly available upstream PR metadata, with
 the upstream repositories remaining the authoritative source of truth. AMD,
 Instinct, Radeon, CDNA, and ROCm are trademarks of Advanced Micro Devices, Inc.;
-this project is unaffiliated with AMD.
+this project is unaffiliated with AMD. It is **not** an official AMD or ROCm product.
+
+## Acknowledgements & Citation
+
+This project is **inspired by and modeled on** the excellent
+[**KernelWiki**](https://github.com/mit-han-lab/KernelWiki) from **MIT Han Lab** —
+their structured, agent-queryable knowledge base for NVIDIA Blackwell/Hopper kernel
+optimization. ROCmKernelWiki adapts the same three-layer architecture
+(`sources/` → `wiki/` → `queries/`), the YAML-frontmatter page schema, and the skill
+packaging, retargeting all content to the AMD/ROCm ecosystem. The KernelWiki three-layer
+design itself follows
+[Karpathy's LLM-wiki pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f).
+
+If you use this knowledge base, please cite both:
+
+```bibtex
+@misc{rocmkernelwiki2026,
+  title  = {ROCmKernelWiki: An AMD CDNA/RDNA GPU Kernel Optimization Knowledge Base},
+  author = {ROCmKernelWiki contributors},
+  year   = {2026},
+  howpublished = {\url{https://github.com/jhinpan/ROCmKernelWiki}},
+  note   = {Inspired by MIT Han Lab's KernelWiki}
+}
+
+@misc{kernelwiki2026,
+  title  = {KernelWiki: Blackwell \& Hopper Kernel Optimization Knowledge Base},
+  author = {MIT Han Lab},
+  year   = {2026},
+  howpublished = {\url{https://github.com/mit-han-lab/KernelWiki}}
+}
+```
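
The README's Hardware Scope note says gfx942 FP8 (FNUZ) is not bit-compatible with gfx950 FP8 (OCP). A minimal sketch of why, for the E4M3 variant only: the two encodings use different exponent biases (8 for FNUZ, 7 for OCP) and different NaN bit patterns, so the same byte decodes to different values. The helper name `decode_e4m3` is hypothetical and is not part of this repository; it follows the published format definitions rather than any code in the wiki.

```python
import math


def decode_e4m3(byte: int, *, fnuz: bool) -> float:
    """Decode one 8-bit E4M3 value (1 sign, 4 exponent, 3 mantissa bits).

    fnuz=True  -> E4M3 FNUZ: bias 8, only NaN is 0x80, no negative zero, no infinity.
    fnuz=False -> OCP E4M3:  bias 7, NaN is S.1111.111, no infinity.
    Hypothetical helper for illustration only.
    """
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7

    if fnuz:
        if byte == 0x80:  # the "negative zero" pattern is repurposed as NaN
            return math.nan
        bias = 8
    else:
        if exp == 0xF and man == 0x7:  # all-ones exponent and mantissa is NaN
            return math.nan
        bias = 7

    if exp == 0:  # subnormal: no implicit leading 1
        return sign * (man / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - bias)


# The same bytes, read under both conventions.
for b in (0x38, 0x7E, 0x7F, 0x80):
    print(f"0x{b:02X}: OCP={decode_e4m3(b, fnuz=False)!r}  FNUZ={decode_e4m3(b, fnuz=True)!r}")
```

For example, byte `0x38` decodes to 1.0 under OCP but 0.5 under FNUZ, and `0x7F` is NaN under OCP but the largest finite value (240.0) under FNUZ. Data quantized for one target therefore has to be re-encoded, not just copied, when moving to the other.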
diff --git a/data/schemas.yaml b/data/schemas.yaml
@@ -146,6 +146,7 @@ wiki-kernel:
       - shape
       - utilization
       - baseline
+      - bucket
 
 wiki-pattern:
   required:

diff --git a/data/tags.yaml b/data/tags.yaml
@@ -162,6 +162,10 @@ misc_tags:
   - lds
   - swizzle
   - epilogue-fusion
+  - profiling                # runtime profiling / perf analysis
+  - rocprofv3                # rocprofv3 ATT + counter collection
+  - kernel-profiling         # per-kernel trace/counter sweep
+  - register-pressure        # occupancy capped by VGPR/AGPR live set
 
 confidence:
   - verified
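
The new `misc_tags` entries matter because the README describes `data/tags.yaml` as a validator-enforced controlled vocabulary. A rough sketch of what such a check could look like, assuming the vocabulary has already been loaded into a Python set: the function `unknown_tags` and the sample page tags are hypothetical, and the real logic in `scripts/validate.py` is not shown in this diff and may differ.

```python
# Vocabulary entries copied from the misc_tags hunk above (only the visible ones).
MISC_TAGS = {
    "lds",
    "swizzle",
    "epilogue-fusion",
    "profiling",
    "rocprofv3",
    "kernel-profiling",
    "register-pressure",
}


def unknown_tags(page_tags, vocabulary):
    """Return the page's tags that are missing from the vocabulary, sorted."""
    return sorted(tag for tag in set(page_tags) if tag not in vocabulary)


# Hypothetical frontmatter tags for a profiling page.
page_tags = ["rocprofv3", "kernel-profiling", "register-pressure", "rocprof-v3"]
print(unknown_tags(page_tags, MISC_TAGS))  # the misspelled "rocprof-v3" is reported
```

Without the four tags added in this hunk, a page tagged `rocprofv3` or `register-pressure` would be rejected by a check of this kind, which is presumably why the vocabulary change ships in the same commit as the profiling source page.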

diff --git a/docs/architecture.svg b/docs/architecture.svg