From a49e9973f02230411093594faa60265f6d81a557 Mon Sep 17 00:00:00 2001 From: Jin Pan Date: Mon, 8 Jun 2026 15:45:42 +0000 Subject: [PATCH 1/2] docs: lead README with the important stuff, add architecture SVG, surface validated MI350/FlyDSL content MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit README restructure (important-first; Acknowledgements moved to the very end): - Lead with what-it-is → Hardware Scope (MI350/gfx950) → "Validated on real silicon" → What's Here → Install/Query → Architecture → Maintenance/Quality → License → Acknowledgements. - Embed a hand-authored architecture diagram (docs/architecture.svg) in the Architecture section. Validated MI350/FlyDSL content (first-party, MI350X silicon): - New source anchor sources/refs/ref-flydsl-kernel-profiling.md — the rocprofv3 ATT sweep + GitHub Pages dashboard (17 kernels, AITER/CK/hipBLASLt baselines, ROCm 7.2). - New wiki page wiki/kernels/flydsl-flash-attention.md — generic vs gfx950 dual-wave SWP, PR arc #225→#334→#462→#629→#661 (layout MMA-atom API), measured ~0.92x vs CK-tile (register-pressure-capped occupancy). - Augment wiki/languages/flydsl.md with the FA/atom-API note + a "Measured on MI350X" section. - data/tags.yaml: add profiling/rocprofv3/kernel-profiling/register-pressure misc tags. - index.md: list the new page + a silicon-validation pointer. - Regenerated queries/*.md indices. validate.py: 0 errors. tests/test_validate.py: pass. generate-indices.py: clean. Co-Authored-By: Claude Opus 4.8 (1M context) --- README.md | 194 ++++++++++---------- data/tags.yaml | 4 + docs/architecture.svg | 67 +++++++ index.md | 5 + queries/by-hardware-feature.md | 7 + queries/by-kernel-type.md | 4 + queries/by-language.md | 6 + queries/by-technique.md | 7 + sources/refs/ref-flydsl-kernel-profiling.md | 87 +++++++++ wiki/kernels/flydsl-flash-attention.md | 179 ++++++++++++++++++ wiki/languages/flydsl.md | 38 ++++ 11 files changed, 505 insertions(+), 93 deletions(-) create mode 100644 docs/architecture.svg create mode 100644 sources/refs/ref-flydsl-kernel-profiling.md create mode 100644 wiki/kernels/flydsl-flash-attention.md diff --git a/README.md b/README.md index 2160cc6d..0d756ac6 100644 --- a/README.md +++ b/README.md @@ -1,48 +1,64 @@ # ROCmKernelWiki — AMD CDNA / RDNA Kernel Optimization Knowledge Base -> **Knowledge cutoff: 2026-05-15.** All upstream PRs, doc snapshots, and blog -> summaries are anchored to upstream state on or before this date (recorded in -> [`data/refresh-cutoff.yaml`](data/refresh-cutoff.yaml)). Tool versions are -> pinned in [`data/tool-versions.yaml`](data/tool-versions.yaml) (ROCm 7.0.2, -> Composable Kernel 1.1.0, Triton 3.4.0, …). To advance the cutoff, re-run -> `scripts/harvest_prs.py`, regenerate indices, and bump the cutoff file. - -A structured knowledge base of **AMD Instinct & Radeon GPU kernel optimization** -for CDNA3 (gfx942 / MI300), CDNA4 (gfx950 / MI350–MI355X), and RDNA4 (gfx1201), -packaged as a Claude Code skill. The repository root **is** the skill directory — -clone it into `~/.claude/skills/` and it works out of the box. +A structured, agent-queryable knowledge base of **AMD Instinct & Radeon GPU kernel +optimization** for CDNA3 (gfx942 / MI300), CDNA4 (gfx950 / MI350–MI355X), and RDNA4 +(gfx1201), **packaged as a Claude Code skill**. The repository root **is** the skill +directory — clone it into `~/.claude/skills/` and it works out of the box. -## Acknowledgements & Citation +> **Knowledge cutoff 2026-05-15** (PRs/docs/blogs anchored at or before this date; tool +> versions pinned in [`data/tool-versions.yaml`](data/tool-versions.yaml)). The +> gfx950 hardware/numeric facts and all 12 runnable examples were additionally +> **re-verified on real MI350X silicon (ROCm 7.2)** — see below. -This project is **inspired by and modeled on** the excellent -[**KernelWiki**](https://github.com/mit-han-lab/KernelWiki) from -**MIT Han Lab** — their structured, agent-queryable knowledge base for NVIDIA -Blackwell/Hopper kernel optimization. ROCmKernelWiki adapts the same three-layer -architecture (`sources/` → `wiki/` → `queries/`), the YAML-frontmatter page -schema, and the skill packaging, retargeting all content to the AMD/ROCm -ecosystem. The KernelWiki three-layer design itself follows -[Karpathy's LLM-wiki pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f). +## Hardware Scope -If you use this knowledge base, please cite both: +| Marketing | gfx | Arch | FP8 | Matrix unit | Wave | +|---|---|---|---|---|---| +| MI300A / MI300X / MI325X | `gfx942` | CDNA3 | **FNUZ** | MFMA | wave64 | +| **MI350X / MI355X** | **`gfx950`** | **CDNA4** | **OCP** + FP6/FP4/MX | MFMA | wave64 | +| Radeon AI PRO R9700 | `gfx1201` | RDNA4 | OCP | **WMMA** | wave32/64 | -```bibtex -@misc{rocmkernelwiki2026, - title = {ROCmKernelWiki: An AMD CDNA/RDNA GPU Kernel Optimization Knowledge Base}, - author = {ROCmKernelWiki contributors}, - year = {2026}, - howpublished = {\url{https://github.com/jhinpan/ROCmKernelWiki}}, - note = {Inspired by MIT Han Lab's KernelWiki} -} +> The headline portability gotcha: **gfx942 FP8 (FNUZ) is not bit-compatible with +> gfx950 FP8 (OCP)**. See [`wiki/migration/gfx942-to-gfx950.md`](wiki/migration/gfx942-to-gfx950.md). + +## Validated on real silicon (MI350X / gfx950) + +Unlike a docs-only wiki, the gfx950 claims here were **checked on an actual AMD Instinct +MI350X** (ROCm 7.2) by compiling, running, and disassembling code — each finding re-run by +an adversarial second pass. Full evidence: [`VERIFICATION.md`](VERIFICATION.md) and +[`data/hardware-verified.yaml`](data/hardware-verified.yaml). + +- **Hardware facts re-grounded on silicon** and corrected where the GPU disagreed with the + docs: gfx950 cross-lane is `v_permlane16_swap` (not the RDNA selector form); **32 waves/CU** + (not 40); direct-to-LDS is ≤16 B on gfx950 / ≤4 B on gfx942; compute modes SPX/DPX/QPX/CPX, + memory NPS1/NPS2; native `xf32` MFMA *fails to select* on gfx950. +- **All 12 runnable examples** build with `--offload-arch=gfx950` **and execute** on the GPU + (11/12 self-check; `fp8-gemm`'s `main()` only verifies the emitted MFMA, no numeric check). +- **First-party FlyDSL kernel sweep on MI350X** — every major FlyDSL gfx950 kernel profiled + with rocprofv3 ATT + counters against AITER/CK/hipBLASLt baselines + ([dashboard](https://jhinpan.github.io/flydsl-kernel-profiling/) · + [`ref-flydsl-kernel-profiling`](sources/refs/ref-flydsl-kernel-profiling.md) · + [FlyDSL flash-attention page](wiki/kernels/flydsl-flash-attention.md)): + - **Wins:** softmax **2.05×** (vs Triton), hgemm_splitk **1.66×**, moe_gemm **1.11×**. + - **Headroom:** flash_attn 0.92×, paged-attention 0.48×, **topk_gating 0.22×**, **rope 0.17×** + — the attention/GEMM losers are **register-pressure-capped** at 1–2 waves/SIMD. -@misc{kernelwiki2026, - title = {KernelWiki: Blackwell \& Hopper Kernel Optimization Knowledge Base}, - author = {MIT Han Lab}, - year = {2026}, - howpublished = {\url{https://github.com/mit-han-lab/KernelWiki}} -} -``` +## What's Here -This is a community project. It is **not** an official AMD or ROCm product. +- **7,400+ PR reference pages** from ROCm/composable_kernel, ROCm/aiter, + ROCm/hipBLASLt, ROCm/Tensile, ROCm/rocBLAS, ROCm/flash-attention, ROCm/FlyDSL, + ROCm/triton, plus ROCm-filtered vllm-project/vllm and sgl-project/sglang +- **~54 synthesized wiki pages** — hardware features, optimization techniques, + kernel case studies, problem patterns, DSL/language guides, migration guides +- **20 doc/blog summaries** (AMD CDNA3/CDNA4 ISA, whitepapers, ROCm blogs) and + **9 reference-repository studies** (FlyDSL, the FlyDSL MI350X profiling sweep, + gcnasm, Composable Kernel, rocWMMA, AITER, hipBLASLt, Tensile, the Matrix Instruction Calculator) +- **9 candidate ledgers** in `candidates/` recording the include/defer/exclude + decision for every scanned PR +- **6 auto-generated cross-reference indices** under `queries/` +- **959 real upstream PR diffs** under `artifacts/prs//PR-/` (byte-capped, SHA-256-pinned via `PROVENANCE.yaml`) +- **12 runnable kernel examples** under `examples/` — compiled with hipcc; all 12 build with + `--offload-arch=gfx950` and run on an MI350X (see [`VERIFICATION.md`](VERIFICATION.md)) ## Install as a Claude Code Skill @@ -53,49 +69,16 @@ pip install -r ~/.claude/skills/ROCmKernelWiki/requirements.txt The skill auto-registers (`SKILL.md` lives at the clone root) and the query scripts auto-resolve the wiki root to their own directory — no environment -variable required. +variable required. Optional override: `export ROCM_WIKI_ROOT=/path/to/ROCmKernelWiki`. Smoke test: ```bash cd ~/.claude/skills/ROCmKernelWiki python3 scripts/query.py --tag mfma --type hardware --compact -python3 scripts/get_page.py kernel-fp8-gemm --frontmatter-only -``` - -Optional override for relocating the scripts: - -```bash -export ROCM_WIKI_ROOT=/path/to/ROCmKernelWiki +python3 scripts/get_page.py kernel-flydsl-flash-attention --frontmatter-only ``` -## What's Here - -- **7,400+ PR reference pages** from ROCm/composable_kernel, ROCm/aiter, - ROCm/hipBLASLt, ROCm/Tensile, ROCm/rocBLAS, ROCm/flash-attention, ROCm/FlyDSL, - ROCm/triton, plus ROCm-filtered vllm-project/vllm and sgl-project/sglang -- **~53 synthesized wiki pages** — hardware features, optimization techniques, - kernel case studies, problem patterns, DSL/language guides, migration guides -- **20 doc/blog summaries** (AMD CDNA3/CDNA4 ISA, whitepapers, ROCm blogs) and - **8 reference-repository studies** (FlyDSL, gcnasm, Composable Kernel, rocWMMA, - AITER, hipBLASLt, Tensile, the Matrix Instruction Calculator) -- **9 candidate ledgers** in `candidates/` recording the include/defer/exclude - decision for every scanned PR -- **6 auto-generated cross-reference indices** under `queries/` -- **959 real upstream PR diffs** under `artifacts/prs//PR-/` (byte-capped, SHA-256-pinned via `PROVENANCE.yaml`) for the highest-signal kernel PRs -- **12 runnable kernel examples** under `examples/` — compiled with hipcc; **all 12 now build with `--offload-arch=gfx950` and run on an AMD Instinct MI350X (gfx950), ROCm 7.2** (11/12 self-checks pass; `fp8-gemm`'s `main()` only verifies the emitted MFMA instruction, it does not run a numeric GEMM). Portable ones also run + self-check on RDNA4/gfx1201. See [`VERIFICATION.md`](VERIFICATION.md). - -## Hardware Scope - -| Marketing | gfx | Arch | FP8 | Matrix unit | Wave | -|---|---|---|---|---|---| -| MI300A / MI300X / MI325X | `gfx942` | CDNA3 | **FNUZ** | MFMA | wave64 | -| MI350X / MI355X | `gfx950` | CDNA4 | **OCP** + FP6/FP4/MX | MFMA | wave64 | -| Radeon AI PRO R9700 | `gfx1201` | RDNA4 | OCP | **WMMA** | wave32/64 | - -> The headline portability gotcha: **gfx942 FP8 (FNUZ) is not bit-compatible -> with gfx950 FP8 (OCP)**. See [`wiki/migration/gfx942-to-gfx950.md`](wiki/migration/gfx942-to-gfx950.md). - ## Query Tools | Tool | Purpose | @@ -106,7 +89,6 @@ export ROCM_WIKI_ROOT=/path/to/ROCmKernelWiki ```bash python3 scripts/query.py "flash attention ck-tile" --limit 5 -python3 scripts/query.py --tag XDLOP --type hardware --compact # alias → mfma python3 scripts/query.py --architecture MI355X --type kernel # alias → gfx950 python3 scripts/get_page.py kernel-flash-attention-ck --follow-sources python3 scripts/grep_wiki.py "v_mfma_f32_16x16x128_f8f6f4" --only wiki @@ -116,22 +98,20 @@ python3 scripts/grep_wiki.py "v_mfma_f32_16x16x128_f8f6f4" --only wiki Three layers (after MIT Han Lab's KernelWiki, in turn after Karpathy's LLM-wiki): -1. **`sources/`** — Raw data. Immutable summaries of PRs, docs, blogs, and - reference repos. Cross-referenced by `id`. -2. **`wiki/`** — Synthesized knowledge pages with YAML frontmatter - (subfolders: `hardware`, `techniques`, `kernels`, `patterns`, `languages`, - `migration`). +

ROCmKernelWiki three-layer architecture: sources → wiki → queries, gated by data/ and scripts/

+ +1. **`sources/`** — Raw data. Immutable summaries of PRs, docs, blogs, and reference + repos. Cross-referenced by `id`. +2. **`wiki/`** — Synthesized knowledge pages with YAML frontmatter (subfolders: + `hardware`, `techniques`, `kernels`, `patterns`, `languages`, `migration`). 3. **`queries/`** — Auto-generated cross-reference indices. Do not edit by hand; regenerate via `scripts/generate-indices.py`. -Supporting files: -- `data/schemas.yaml` — required/optional fields per page type -- `data/tags.yaml` — controlled vocabulary (validator-enforced) -- `data/aliases.yaml` — canonical → synonym map (MI300→gfx942, XDLOP→mfma, …) -- `data/inclusion-policy.yaml` — PR harvest classification policy -- `data/tool-versions.yaml`, `data/refresh-cutoff.yaml` — version/cutoff anchors -- `candidates/` — per-repo PR candidate ledgers -- `references/` — primer, schema, worked examples +Supporting files: `data/` holds the schema and controlled vocabulary +(`schemas.yaml`, `tags.yaml`, `aliases.yaml`, `inclusion-policy.yaml`, +`tool-versions.yaml`, `refresh-cutoff.yaml`, `hardware-verified.yaml`); +`candidates/` holds per-repo PR ledgers; `references/` holds the primer, schema, and +worked examples. ## Maintenance Tooling @@ -140,8 +120,7 @@ Supporting files: | `scripts/harvest_prs.py` | Harvest merged PRs from tracked ROCm repos (gh GraphQL) | | `scripts/backfill_diffs.py` | Fetch real upstream diffs for top-ranked kernel PRs | | `scripts/enrich_facets.py` | Infer techniques/hardware_features/kernel_types from paths + diffs | -| `scripts/link_prs.py` | Build the bidirectional PR↔wiki bridge (`implemented_by` / `related`) | -| `scripts/summarize_diffs.py` | Write compact `diff_summary.md` per PR bundle | +| `scripts/link_prs.py` | Build the bidirectional PR↔wiki bridge | | `scripts/gen_source_anchors.py` | (Re)generate doc/blog/ref source anchor pages | | `scripts/generate-indices.py` | Regenerate `queries/*.md` from frontmatter | | `scripts/validate.py` | Validate frontmatter, vocabulary, links, version-claims, freshness | @@ -155,7 +134,7 @@ python3 scripts/validate.py # schema + vocabulary + link integrity python3 scripts/generate-indices.py # regenerate query indices ``` -## Quality Gates (cutoff 2026-05-15) +### Quality Gates - 0 validation errors (schema, controlled vocabulary, link integrity) - Every hardware fact traces to an official AMD ISA doc / whitepaper @@ -163,9 +142,8 @@ python3 scripts/generate-indices.py # regenerate query indices - Every PR page carries `inclusion_reason` and `status: merged` - `verified` pages carry `evidence_basis` (official-doc + upstream-code/paper) - 0 dangling internal references (frontmatter ids **and** in-body relative links) -- **gfx950 hardware/numeric claims re-verified on real MI350X silicon (ROCm 7.2), - double-checked by an adversarial re-run** — see [`VERIFICATION.md`](VERIFICATION.md) - and [`data/hardware-verified.yaml`](data/hardware-verified.yaml) +- **gfx950 hardware/numeric claims re-verified on real MI350X silicon (ROCm 7.2)** — + see [`VERIFICATION.md`](VERIFICATION.md) and [`data/hardware-verified.yaml`](data/hardware-verified.yaml) ## License @@ -174,4 +152,34 @@ Wiki synthesis pages are derivative works that cite their upstream sources; PR summary pages link to and summarize publicly available upstream PR metadata, with the upstream repositories remaining the authoritative source of truth. AMD, Instinct, Radeon, CDNA, and ROCm are trademarks of Advanced Micro Devices, Inc.; -this project is unaffiliated with AMD. +this project is unaffiliated with AMD. It is **not** an official AMD or ROCm product. + +## Acknowledgements & Citation + +This project is **inspired by and modeled on** the excellent +[**KernelWiki**](https://github.com/mit-han-lab/KernelWiki) from **MIT Han Lab** — +their structured, agent-queryable knowledge base for NVIDIA Blackwell/Hopper kernel +optimization. ROCmKernelWiki adapts the same three-layer architecture +(`sources/` → `wiki/` → `queries/`), the YAML-frontmatter page schema, and the skill +packaging, retargeting all content to the AMD/ROCm ecosystem. The KernelWiki three-layer +design itself follows +[Karpathy's LLM-wiki pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f). + +If you use this knowledge base, please cite both: + +```bibtex +@misc{rocmkernelwiki2026, + title = {ROCmKernelWiki: An AMD CDNA/RDNA GPU Kernel Optimization Knowledge Base}, + author = {ROCmKernelWiki contributors}, + year = {2026}, + howpublished = {\url{https://github.com/jhinpan/ROCmKernelWiki}}, + note = {Inspired by MIT Han Lab's KernelWiki} +} + +@misc{kernelwiki2026, + title = {KernelWiki: Blackwell \& Hopper Kernel Optimization Knowledge Base}, + author = {MIT Han Lab}, + year = {2026}, + howpublished = {\url{https://github.com/mit-han-lab/KernelWiki}} +} +``` diff --git a/data/tags.yaml b/data/tags.yaml index a9d5ba12..e0f3601b 100644 --- a/data/tags.yaml +++ b/data/tags.yaml @@ -162,6 +162,10 @@ misc_tags: - lds - swizzle - epilogue-fusion + - profiling # runtime profiling / perf analysis + - rocprofv3 # rocprofv3 ATT + counter collection + - kernel-profiling # per-kernel trace/counter sweep + - register-pressure # occupancy capped by VGPR/AGPR live set confidence: - verified diff --git a/docs/architecture.svg b/docs/architecture.svg new file mode 100644 index 00000000..2ce226ae --- /dev/null +++ b/docs/architecture.svg @@ -0,0 +1,67 @@ + + + + + + + + + ROCmKernelWiki — three-layer knowledge base + AMD CDNA3 (gfx942) · CDNA4 (gfx950 / MI350) · RDNA4 (gfx1201) — packaged as a Claude Code skill + + + + + sources/ + raw, immutable summaries + prs/ — 7,400+ merged PRs + docs/ · blogs/ · refs/ + cross-referenced by id + + + + + wiki/ + synthesized pages + frontmatter + hardware · techniques · kernels + patterns · languages · migration + performance_claims · evidence_basis + + + + + queries/ + auto-generated indices + by-problem · by-technique + by-hardware · by-kernel · by-repo + never hand-edited + + + + synthesize + + generate + + + + data/ — controlled vocabulary & schema + schemas.yaml · tags.yaml · aliases.yaml + inclusion-policy · tool-versions · refresh-cutoff + hardware-verified.yaml ← MI350X silicon checks + + + scripts/ — tooling & CI gate + query.py · get_page.py · grep_wiki.py + validate.py (schema · vocab · links) + generate-indices.py · harvest_prs.py + + + + validates → + + regenerates → + + + + SKILL.md @ root → Claude Code skill + diff --git a/index.md b/index.md index d8580c7f..6b4cd8fe 100644 --- a/index.md +++ b/index.md @@ -7,6 +7,10 @@ > > Inspired by and modeled on MIT Han Lab's > [KernelWiki](https://github.com/mit-han-lab/KernelWiki). +> +> **Validated on real MI350X silicon (gfx950, ROCm 7.2)** — see [VERIFICATION.md](VERIFICATION.md) +> and the first-party [FlyDSL kernel profiling sweep](sources/refs/ref-flydsl-kernel-profiling.md) +> ([dashboard](https://jhinpan.github.io/flydsl-kernel-profiling/)). ## Recommended Query Tools (for LLM agents) @@ -65,6 +69,7 @@ See [references/examples.md](references/examples.md) for worked query patterns. - [kernel-ck-hgemm](wiki/kernels/ck-hgemm.md) — FP16 GEMM via CK / MFMA - [kernel-fp8-gemm](wiki/kernels/fp8-gemm.md) — FP8 block-scaled GEMM (gfx950) - [kernel-flash-attention-ck](wiki/kernels/flash-attention-ck.md) — FlashAttention-2 (CK-tile) +- [kernel-flydsl-flash-attention](wiki/kernels/flydsl-flash-attention.md) — FlyDSL flash attention (gfx950 dual-wave) - [kernel-paged-attention](wiki/kernels/paged-attention.md) — Paged attention decode - [kernel-fused-moe](wiki/kernels/fused-moe.md) — Fused MoE - [kernel-mla-decode](wiki/kernels/mla-decode.md) — MLA decode (DeepSeek) diff --git a/queries/by-hardware-feature.md b/queries/by-hardware-feature.md index 934d9378..fb74e0f6 100644 --- a/queries/by-hardware-feature.md +++ b/queries/by-hardware-feature.md @@ -14,6 +14,7 @@ Pages grouped by AMD GPU hardware feature. - [pr-composable_kernel-48](sources/prs/composable_kernel/PR-48.md) — ckProfiler and device-level XDL GEMM operator `[source-pr]` - [pr-flash-attention-179](sources/prs/flash-attention/PR-179.md) — [CK_TILE] Fix NaN for FMHA BWD When seq_q=0 `[source-pr]` - [kernel-flash-attention-ck](wiki/kernels/flash-attention-ck.md) — FlashAttention-2 via CK-tile on CDNA (MI300X) `[wiki-kernel]` +- [kernel-flydsl-flash-attention](wiki/kernels/flydsl-flash-attention.md) — FlyDSL Flash Attention — generic + gfx950 dual-wave fast path `[wiki-kernel]` - [technique-kernel-fusion](wiki/techniques/kernel-fusion.md) — Kernel Fusion — Epilogues and Adjacent-Op Merging `[wiki-technique]` - [technique-lds-swizzling](wiki/techniques/lds-swizzling.md) — LDS Swizzling — Conflict-Free A/B Tile Staging for MFMA `[wiki-technique]` - [technique-mfma-pipelining](wiki/techniques/mfma-pipelining.md) — MFMA Software Pipelining (interleaving loads and matrix issue) `[wiki-technique]` @@ -312,6 +313,7 @@ Pages grouped by AMD GPU hardware feature. - [pr-vllm-43219](sources/prs/vllm/PR-43219.md) — [EPLB] Make async EPLB default `[source-pr]` - [pr-vllm-43303](sources/prs/vllm/PR-43303.md) — [Misc][Refactor][ROCm] Convert MoRI-related envvars to extra config args `[source-pr]` - [kernel-flash-attention-ck](wiki/kernels/flash-attention-ck.md) — FlashAttention-2 via CK-tile on CDNA (MI300X) `[wiki-kernel]` +- [kernel-flydsl-flash-attention](wiki/kernels/flydsl-flash-attention.md) — FlyDSL Flash Attention — generic + gfx950 dual-wave fast path `[wiki-kernel]` - [kernel-vector-add-asm](wiki/kernels/vector-add-asm.md) — Persistent Vector Add in GCN Assembly (async direct-to-LDS, double-buffered) `[wiki-kernel]` - [technique-lds-double-buffering](wiki/techniques/lds-double-buffering.md) — LDS Double / Multi-Buffering (Overlapping HBM Loads with MFMA) `[wiki-technique]` - [technique-mfma-pipelining](wiki/techniques/mfma-pipelining.md) — MFMA Software Pipelining (interleaving loads and matrix issue) `[wiki-technique]` @@ -2342,6 +2344,7 @@ Pages grouped by AMD GPU hardware feature. - [pr-triton-439](sources/prs/triton/PR-439.md) — [Backend] Refactor sharedToDotOperandMFMA lowering `[source-pr]` - [pr-triton-635](sources/prs/triton/PR-635.md) — Move utility tools from triton-mlir to main_perf branch `[source-pr]` - [pr-triton-775](sources/prs/triton/PR-775.md) — plot_layout.py Refactoring `[source-pr]` +- [kernel-flydsl-flash-attention](wiki/kernels/flydsl-flash-attention.md) — FlyDSL Flash Attention — generic + gfx950 dual-wave fast path `[wiki-kernel]` - [technique-bank-conflict-avoidance](wiki/techniques/bank-conflict-avoidance.md) — LDS Bank-Conflict Avoidance (padding, swizzle, ds_read2) `[wiki-technique]` - [technique-lds-double-buffering](wiki/techniques/lds-double-buffering.md) — LDS Double / Multi-Buffering (Overlapping HBM Loads with MFMA) `[wiki-technique]` - [technique-lds-swizzling](wiki/techniques/lds-swizzling.md) — LDS Swizzling — Conflict-Free A/B Tile Staging for MFMA `[wiki-technique]` @@ -7082,6 +7085,7 @@ Pages grouped by AMD GPU hardware feature. - [pr-vllm-43571](sources/prs/vllm/PR-43571.md) — [BugFix][Platform] Fix import vllm.platforms.rocm error on non-CUDA test_gpt_oss.py `[source-pr]` - [pr-vllm-43881](sources/prs/vllm/PR-43881.md) — [ROCm] cmake: support PYTORCH_FOUND_HIP for torch 2.13 native HIP language support `[source-pr]` - [kernel-flash-attention-ck](wiki/kernels/flash-attention-ck.md) — FlashAttention-2 via CK-tile on CDNA (MI300X) `[wiki-kernel]` +- [kernel-flydsl-flash-attention](wiki/kernels/flydsl-flash-attention.md) — FlyDSL Flash Attention — generic + gfx950 dual-wave fast path `[wiki-kernel]` - [kernel-flydsl-preshuffle-gemm](wiki/kernels/flydsl-preshuffle-gemm.md) — FlyDSL Preshuffle GEMM (layout-DSL example) `[wiki-kernel]` - [kernel-rmsnorm](wiki/kernels/rmsnorm.md) — Fused RMSNorm (+ residual / quant) on CDNA `[wiki-kernel]` - [kernel-vector-add-asm](wiki/kernels/vector-add-asm.md) — Persistent Vector Add in GCN Assembly (async direct-to-LDS, double-buffered) `[wiki-kernel]` @@ -7739,6 +7743,7 @@ Pages grouped by AMD GPU hardware feature. - [pr-triton-879](sources/prs/triton/PR-879.md) — Adding more support for machine model cycles per dot/mfma `[source-pr]` - [pr-vllm-40161](sources/prs/vllm/PR-40161.md) — [bugfix] Use only onlines CPUs in lscpu `[source-pr]` - [kernel-flash-attention-ck](wiki/kernels/flash-attention-ck.md) — FlashAttention-2 via CK-tile on CDNA (MI300X) `[wiki-kernel]` +- [kernel-flydsl-flash-attention](wiki/kernels/flydsl-flash-attention.md) — FlyDSL Flash Attention — generic + gfx950 dual-wave fast path `[wiki-kernel]` - [kernel-flydsl-preshuffle-gemm](wiki/kernels/flydsl-preshuffle-gemm.md) — FlyDSL Preshuffle GEMM (layout-DSL example) `[wiki-kernel]` - [technique-fine-grained-quantization](wiki/techniques/fine-grained-quantization.md) — Fine-Grained FP8 Quantization & Block Scaling `[wiki-technique]` - [technique-kernel-fusion](wiki/techniques/kernel-fusion.md) — Kernel Fusion — Epilogues and Adjacent-Op Merging `[wiki-technique]` @@ -8158,6 +8163,7 @@ Pages grouped by AMD GPU hardware feature. ## vgpr - [kernel-bandwidth-microbench](wiki/kernels/bandwidth-microbench.md) — HBM Bandwidth Microbenchmark (float4 non-temporal persistent read) `[wiki-kernel]` +- [kernel-flydsl-flash-attention](wiki/kernels/flydsl-flash-attention.md) — FlyDSL Flash Attention — generic + gfx950 dual-wave fast path `[wiki-kernel]` - [technique-kernel-fusion](wiki/techniques/kernel-fusion.md) — Kernel Fusion — Epilogues and Adjacent-Op Merging `[wiki-technique]` - [technique-occupancy-tuning](wiki/techniques/occupancy-tuning.md) — Occupancy Tuning — Waves per SIMD vs ILP on CDNA `[wiki-technique]` - [technique-profiling-workflow](wiki/techniques/profiling-workflow.md) — Profiling & Roofline Workflow on ROCm (rocprof / Omniperf) `[wiki-technique]` @@ -8256,6 +8262,7 @@ Pages grouped by AMD GPU hardware feature. - [pr-sglang-26208](sources/prs/sglang/PR-26208.md) — [AMD] Dsv4/pr2 compressor opt `[source-pr]` - [pr-triton-153](sources/prs/triton/PR-153.md) — Adding wave64 support to Triton `[source-pr]` - [kernel-bandwidth-microbench](wiki/kernels/bandwidth-microbench.md) — HBM Bandwidth Microbenchmark (float4 non-temporal persistent read) `[wiki-kernel]` +- [kernel-flydsl-flash-attention](wiki/kernels/flydsl-flash-attention.md) — FlyDSL Flash Attention — generic + gfx950 dual-wave fast path `[wiki-kernel]` - [kernel-rmsnorm](wiki/kernels/rmsnorm.md) — Fused RMSNorm (+ residual / quant) on CDNA `[wiki-kernel]` - [kernel-vector-add-asm](wiki/kernels/vector-add-asm.md) — Persistent Vector Add in GCN Assembly (async direct-to-LDS, double-buffered) `[wiki-kernel]` - [technique-occupancy-tuning](wiki/techniques/occupancy-tuning.md) — Occupancy Tuning — Waves per SIMD vs ILP on CDNA `[wiki-technique]` diff --git a/queries/by-kernel-type.md b/queries/by-kernel-type.md index 5aaaebb4..a1123660 100644 --- a/queries/by-kernel-type.md +++ b/queries/by-kernel-type.md @@ -965,8 +965,10 @@ - [pr-vllm-43817](sources/prs/vllm/PR-43817.md) — [ROCm] Add attention sink support to AITer flash attention backend `[source-pr]` - [pr-vllm-43898](sources/prs/vllm/PR-43898.md) — [ROCm][DSv4] Remove device pipeline stall in sparse attention `[source-pr]` - [kernel-flash-attention-ck](wiki/kernels/flash-attention-ck.md) — FlashAttention-2 via CK-tile on CDNA (MI300X) `[wiki-kernel]` +- [kernel-flydsl-flash-attention](wiki/kernels/flydsl-flash-attention.md) — FlyDSL Flash Attention — generic + gfx950 dual-wave fast path `[wiki-kernel]` - [kernel-mla-decode](wiki/kernels/mla-decode.md) — MLA Decode (DeepSeek Multi-Latent Attention) on MI300 `[wiki-kernel]` - [kernel-paged-attention](wiki/kernels/paged-attention.md) — Paged Attention Decode (vLLM / AITER) on MI300 `[wiki-kernel]` +- [lang-flydsl](wiki/languages/flydsl.md) — FlyDSL — Python + MLIR Layout DSL for AMD Kernels `[wiki-language]` ## bandwidth-bench @@ -3689,6 +3691,8 @@ - [pr-vllm-43727](sources/prs/vllm/PR-43727.md) — [MoE] Remove inplace fused experts mechanism `[source-pr]` - [pr-vllm-43817](sources/prs/vllm/PR-43817.md) — [ROCm] Add attention sink support to AITer flash attention backend `[source-pr]` - [kernel-flash-attention-ck](wiki/kernels/flash-attention-ck.md) — FlashAttention-2 via CK-tile on CDNA (MI300X) `[wiki-kernel]` +- [kernel-flydsl-flash-attention](wiki/kernels/flydsl-flash-attention.md) — FlyDSL Flash Attention — generic + gfx950 dual-wave fast path `[wiki-kernel]` +- [lang-flydsl](wiki/languages/flydsl.md) — FlyDSL — Python + MLIR Layout DSL for AMD Kernels `[wiki-language]` - [technique-lds-double-buffering](wiki/techniques/lds-double-buffering.md) — LDS Double / Multi-Buffering (Overlapping HBM Loads with MFMA) `[wiki-technique]` - [technique-vgpr-budgeting](wiki/techniques/vgpr-budgeting.md) — VGPR Budgeting — ArchVGPR + AGPR Pressure vs Occupancy `[wiki-technique]` diff --git a/queries/by-language.md b/queries/by-language.md index ea9401a1..02779d88 100644 --- a/queries/by-language.md +++ b/queries/by-language.md @@ -6968,7 +6968,9 @@ - [pr-FlyDSL-91](sources/prs/FlyDSL/PR-91.md) — fix lds bug `[source-pr]` - [pr-FlyDSL-92](sources/prs/FlyDSL/PR-92.md) — [Bug] Fix missing stream_ptr parameter in MoeGemm2ReduceWrapper `[source-pr]` - [pr-FlyDSL-98](sources/prs/FlyDSL/PR-98.md) — fix a4w4 gemm precision `[source-pr]` +- [ref-flydsl-kernel-profiling](sources/refs/ref-flydsl-kernel-profiling.md) — FlyDSL Kernel Profiling — MI350X rocprofv3 ATT Sweep & Dashboard `[source-ref]` - [ref-flydsl](sources/refs/ref-flydsl.md) — FlyDSL — Flexible Layout DSL for AMD GPUs `[source-ref]` +- [kernel-flydsl-flash-attention](wiki/kernels/flydsl-flash-attention.md) — FlyDSL Flash Attention — generic + gfx950 dual-wave fast path `[wiki-kernel]` - [kernel-flydsl-preshuffle-gemm](wiki/kernels/flydsl-preshuffle-gemm.md) — FlyDSL Preshuffle GEMM (layout-DSL example) `[wiki-kernel]` - [lang-flydsl](wiki/languages/flydsl.md) — FlyDSL — Python + MLIR Layout DSL for AMD Kernels `[wiki-language]` - [technique-preshuffle-layout](wiki/techniques/preshuffle-layout.md) — Pre-shuffle Weight Layout for MFMA `[wiki-technique]` @@ -14492,7 +14494,9 @@ - [pr-triton-95](sources/prs/triton/PR-95.md) — Remove torch import in setup.py to check for ROCm. `[source-pr]` - [pr-triton-96](sources/prs/triton/PR-96.md) — Triton mlir ifu 2023 30 1 `[source-pr]` - [pr-triton-99](sources/prs/triton/PR-99.md) — Fix llir IR restoration from cache to convert to string. `[source-pr]` +- [ref-flydsl-kernel-profiling](sources/refs/ref-flydsl-kernel-profiling.md) — FlyDSL Kernel Profiling — MI350X rocprofv3 ATT Sweep & Dashboard `[source-ref]` - [ref-flydsl](sources/refs/ref-flydsl.md) — FlyDSL — Flexible Layout DSL for AMD GPUs `[source-ref]` +- [kernel-flydsl-flash-attention](wiki/kernels/flydsl-flash-attention.md) — FlyDSL Flash Attention — generic + gfx950 dual-wave fast path `[wiki-kernel]` - [kernel-flydsl-preshuffle-gemm](wiki/kernels/flydsl-preshuffle-gemm.md) — FlyDSL Preshuffle GEMM (layout-DSL example) `[wiki-kernel]` - [lang-flydsl](wiki/languages/flydsl.md) — FlyDSL — Python + MLIR Layout DSL for AMD Kernels `[wiki-language]` - [lang-triton-amd](wiki/languages/triton-amd.md) — Triton on AMD — tl.dot → MFMA and the AMD backend knobs `[wiki-language]` @@ -16949,9 +16953,11 @@ - [pr-vllm-43817](sources/prs/vllm/PR-43817.md) — [ROCm] Add attention sink support to AITer flash attention backend `[source-pr]` - [pr-vllm-43881](sources/prs/vllm/PR-43881.md) — [ROCm] cmake: support PYTORCH_FOUND_HIP for torch 2.13 native HIP language support `[source-pr]` - [pr-vllm-43898](sources/prs/vllm/PR-43898.md) — [ROCm][DSv4] Remove device pipeline stall in sparse attention `[source-pr]` +- [ref-flydsl-kernel-profiling](sources/refs/ref-flydsl-kernel-profiling.md) — FlyDSL Kernel Profiling — MI350X rocprofv3 ATT Sweep & Dashboard `[source-ref]` - [ref-flydsl](sources/refs/ref-flydsl.md) — FlyDSL — Flexible Layout DSL for AMD GPUs `[source-ref]` - [ref-matrix-calculator](sources/refs/ref-matrix-calculator.md) — AMD Matrix Instruction Calculator `[source-ref]` - [ref-tensile](sources/refs/ref-tensile.md) — Tensile — assembly GEMM kernel generator `[source-ref]` +- [kernel-flydsl-flash-attention](wiki/kernels/flydsl-flash-attention.md) — FlyDSL Flash Attention — generic + gfx950 dual-wave fast path `[wiki-kernel]` - [kernel-flydsl-preshuffle-gemm](wiki/kernels/flydsl-preshuffle-gemm.md) — FlyDSL Preshuffle GEMM (layout-DSL example) `[wiki-kernel]` - [lang-flydsl](wiki/languages/flydsl.md) — FlyDSL — Python + MLIR Layout DSL for AMD Kernels `[wiki-language]` - [lang-triton-amd](wiki/languages/triton-amd.md) — Triton on AMD — tl.dot → MFMA and the AMD backend knobs `[wiki-language]` diff --git a/queries/by-technique.md b/queries/by-technique.md index bcf01143..1900c8c7 100644 --- a/queries/by-technique.md +++ b/queries/by-technique.md @@ -103,6 +103,7 @@ Every page (wiki + PR sources) that uses each optimization technique. ## direct-to-lds +- [kernel-flydsl-flash-attention](wiki/kernels/flydsl-flash-attention.md) — FlyDSL Flash Attention — generic + gfx950 dual-wave fast path `[wiki-kernel]` - [kernel-vector-add-asm](wiki/kernels/vector-add-asm.md) — Persistent Vector Add in GCN Assembly (async direct-to-LDS, double-buffered) `[wiki-kernel]` ## epilogue-fusion @@ -2008,6 +2009,7 @@ Every page (wiki + PR sources) that uses each optimization technique. - [pr-triton-872](sources/prs/triton/PR-872.md) — Shared/oai oss bf16mx4 moe launch `[source-pr]` - [pr-triton-905](sources/prs/triton/PR-905.md) — replace copysign to avoid potential LLVM opt `[source-pr]` - [kernel-flash-attention-ck](wiki/kernels/flash-attention-ck.md) — FlashAttention-2 via CK-tile on CDNA (MI300X) `[wiki-kernel]` +- [kernel-flydsl-flash-attention](wiki/kernels/flydsl-flash-attention.md) — FlyDSL Flash Attention — generic + gfx950 dual-wave fast path `[wiki-kernel]` - [kernel-flydsl-preshuffle-gemm](wiki/kernels/flydsl-preshuffle-gemm.md) — FlyDSL Preshuffle GEMM (layout-DSL example) `[wiki-kernel]` - [kernel-vector-add-asm](wiki/kernels/vector-add-asm.md) — Persistent Vector Add in GCN Assembly (async direct-to-LDS, double-buffered) `[wiki-kernel]` @@ -2722,6 +2724,7 @@ Every page (wiki + PR sources) that uses each optimization technique. - [pr-triton-879](sources/prs/triton/PR-879.md) — Adding more support for machine model cycles per dot/mfma `[source-pr]` - [pr-vllm-40161](sources/prs/vllm/PR-40161.md) — [bugfix] Use only onlines CPUs in lscpu `[source-pr]` - [kernel-flash-attention-ck](wiki/kernels/flash-attention-ck.md) — FlashAttention-2 via CK-tile on CDNA (MI300X) `[wiki-kernel]` +- [kernel-flydsl-flash-attention](wiki/kernels/flydsl-flash-attention.md) — FlyDSL Flash Attention — generic + gfx950 dual-wave fast path `[wiki-kernel]` ## nontemporal-loads @@ -2904,6 +2907,7 @@ Every page (wiki + PR sources) that uses each optimization technique. - [pr-triton-668](sources/prs/triton/PR-668.md) — rmsnorm optimization for M = 1 `[source-pr]` - [pr-triton-698](sources/prs/triton/PR-698.md) — Fold fp_to_fp op with zero constant input (#5007) `[source-pr]` - [pr-triton-722](sources/prs/triton/PR-722.md) — Fix pid remapping logic when GRID_MN cannot divide NUM_XCDS `[source-pr]` +- [kernel-flydsl-flash-attention](wiki/kernels/flydsl-flash-attention.md) — FlyDSL Flash Attention — generic + gfx950 dual-wave fast path `[wiki-kernel]` ## persistent-kernel @@ -4772,6 +4776,7 @@ Every page (wiki + PR sources) that uses each optimization technique. - [pr-vllm-41751](sources/prs/vllm/PR-41751.md) — [ROCm] mori: add InterNodeV1LL inter-node kernel selection via VLLM_MORI_INTERNODE_KERNEL `[source-pr]` - [pr-vllm-41901](sources/prs/vllm/PR-41901.md) — [Model] Use AutoWeightsLoader for AXK1 `[source-pr]` - [pr-vllm-43898](sources/prs/vllm/PR-43898.md) — [ROCm][DSv4] Remove device pipeline stall in sparse attention `[source-pr]` +- [kernel-flydsl-flash-attention](wiki/kernels/flydsl-flash-attention.md) — FlyDSL Flash Attention — generic + gfx950 dual-wave fast path `[wiki-kernel]` ## split-k @@ -5728,6 +5733,7 @@ Every page (wiki + PR sources) that uses each optimization technique. - [pr-triton-635](sources/prs/triton/PR-635.md) — Move utility tools from triton-mlir to main_perf branch `[source-pr]` - [pr-triton-733](sources/prs/triton/PR-733.md) — rmsnorm backward optimizations `[source-pr]` - [pr-triton-775](sources/prs/triton/PR-775.md) — plot_layout.py Refactoring `[source-pr]` +- [kernel-flydsl-flash-attention](wiki/kernels/flydsl-flash-attention.md) — FlyDSL Flash Attention — generic + gfx950 dual-wave fast path `[wiki-kernel]` ## wave-reduce @@ -6470,5 +6476,6 @@ Every page (wiki + PR sources) that uses each optimization technique. - [pr-vllm-43717](sources/prs/vllm/PR-43717.md) — [9/n] Migrate attention and cache kernels to torch stable ABI (continued) `[source-pr]` - [pr-vllm-43727](sources/prs/vllm/PR-43727.md) — [MoE] Remove inplace fused experts mechanism `[source-pr]` - [kernel-flash-attention-ck](wiki/kernels/flash-attention-ck.md) — FlashAttention-2 via CK-tile on CDNA (MI300X) `[wiki-kernel]` +- [kernel-flydsl-flash-attention](wiki/kernels/flydsl-flash-attention.md) — FlyDSL Flash Attention — generic + gfx950 dual-wave fast path `[wiki-kernel]` - [kernel-rmsnorm](wiki/kernels/rmsnorm.md) — Fused RMSNorm (+ residual / quant) on CDNA `[wiki-kernel]` diff --git a/sources/refs/ref-flydsl-kernel-profiling.md b/sources/refs/ref-flydsl-kernel-profiling.md new file mode 100644 index 00000000..dc2f834d --- /dev/null +++ b/sources/refs/ref-flydsl-kernel-profiling.md @@ -0,0 +1,87 @@ +--- +id: ref-flydsl-kernel-profiling +title: FlyDSL Kernel Profiling — MI350X rocprofv3 ATT Sweep & Dashboard +repo: jhinpan/flydsl-kernel-profiling +url: https://jhinpan.github.io/flydsl-kernel-profiling/ +author: Jin Pan +source_category: reference-repo +architectures: +- gfx950 +tags: +- flydsl +- profiling +- rocprofv3 +- kernel-profiling +- register-pressure +- mfma +languages: +- flydsl +- python +- mlir +retrieved_at: '2026-06-08' +--- + +# FlyDSL Kernel Profiling — MI350X rocprofv3 ATT Sweep & Dashboard + +A first-party profiling study of **every major FlyDSL gfx950 kernel** captured with +**rocprofv3 ATT (Advanced Thread Trace) + hardware counters on real AMD Instinct +MI350X silicon**, with matched-shape baselines from AITER / Composable Kernel / +hipBLASLt. Each kernel ships a reproducible bundle (REPORT.md + ATT trace + +counters + source) and the results are browsable as an interactive GitHub Pages +dashboard. + +- **Dashboard:** +- **Repo:** + +## Method + +- **GPU:** 8× AMD Instinct MI350X (`gfx950`, CDNA4). **Stack:** ROCm 7.2.0. +- **FlyDSL:** 0.1.9.dev594 @ `18c5a7ed`. +- **Capture:** rocprofv3 ATT (95–100% source-mapped per kernel) + counter pass; + `FLYDSL_DEBUG_ENABLE_DEBUG_INFO=1` for source attribution. +- **17 kernels** ATT-profiled; **15** compared against matched-shape AITER / CK / + hipBLASLt baselines. Ratios below are **FlyDSL throughput ÷ baseline throughput** + (`>1` = FlyDSL faster). + +## Verdicts (MI350X, gfx950) + +| Bucket | Kernel | FlyDSL vs baseline | Baseline | +|---|---|---|---| +| **WIN** | softmax | **2.05×** | Triton | +| **WIN** | hgemm_splitk | **1.66×** | CK / hipBLASLt | +| **WIN** | moe_gemm | **1.11×** (stage2-atomic 1.30×) | AITER | +| PARITY | layernorm, quant, moe_reduce | ~1.0× | AITER | +| HEADROOM | moe_blockscale | 0.82× | tuned-CK | +| HEADROOM | rmsnorm | 0.89× | AITER | +| HEADROOM | mla (decode) | 0.90× | AITER | +| HEADROOM | flash_attn | 0.92× | CK-tile | +| HEADROOM | paged-attention | 0.48× | AITER | +| **HEADROOM** | topk_gating | **0.22×** | AITER | +| **HEADROOM** | rope | **0.17×** | AITER | + +GEMM re-measured at compute-bound shapes: preshuffle **0.77×**, blockscale **0.66×** +vs tuned-CK; an internal v2 path is **1.20×** over v1. + +## Key findings + +- **Register-pressure-capped occupancy** is the dominant headroom on the attention / + GEMM losers (mla, pa, flash_attn, moe_blockscale): only **1–2 waves/SIMD** resident, + **VGPR 175–251** live. Cutting the live VGPR set to admit a 2nd wave is the lever. + (See [register/AGPR budgeting](../../wiki/techniques/vgpr-budgeting.md).) +- **rope / topk_gating** are slow because cross-lane reductions serialize on + `shuffle_xor` / `ds_bpermute` against `LGKMCNT`; the fix is a DPP / `v_permlane16` + [wave reduction](../../wiki/techniques/wave-reduce.md). +- **softmax fast path (`BufferCopy128b`), filed as FlyDSL #627 / fixed in #650:** was + dead-coded behind `False and`. Re-enabling it is **not** a 2× win — measured A/B + (fast vs scalar, MI350X) is **on-par to +7% on large bf16, neutral/slightly-negative + on f32**; both paths already saturate HBM (~5 TB/s) and register-buffer the whole row, + so vectorization only trims instruction count. (The 2.05× headline is FlyDSL-vs-Triton, + not fast-vs-scalar.) +- **`fp8_gemm_4wave` (rowscale) fails to compile** — `flyc.compile(): missing + _reusable_slot_spec` on the fast-dispatch path. A real, config-independent regression. + +## Reference + +- Dashboard: +- Source / bundles: +- Silicon facts cross-check: [`VERIFICATION.md`](../../VERIFICATION.md) diff --git a/wiki/kernels/flydsl-flash-attention.md b/wiki/kernels/flydsl-flash-attention.md new file mode 100644 index 00000000..0836a02e --- /dev/null +++ b/wiki/kernels/flydsl-flash-attention.md @@ -0,0 +1,179 @@ +--- +id: kernel-flydsl-flash-attention +title: FlyDSL Flash Attention — generic + gfx950 dual-wave fast path +type: kernel +architectures: +- gfx942 +- gfx950 +tags: +- flash-attention +- attention +- flydsl +- mfma +- software-pipelining +- lds-double-buffering +- bf16 +- softmax +- register-pressure +- kernel-profiling +confidence: source-reported +reproducibility: snippet +kernel_types: +- flash-attention +- attention +languages: +- flydsl +- mlir +- python +hardware_features: +- mfma +- lds +- async-copy +- ds-instructions +- agpr +- vgpr +- wave64 +techniques: +- mfma-pipelining +- software-pipelining +- lds-double-buffering +- wave-reduce +- direct-to-lds +- vgpr-budgeting +- occupancy-tuning +related: +- lang-flydsl +- kernel-flash-attention-ck +- hw-mfma +- hw-async-copy-lds +- technique-mfma-pipelining +- technique-vgpr-budgeting +- technique-occupancy-tuning +- kernel-flydsl-preshuffle-gemm +sources: +- ref-flydsl +- ref-flydsl-kernel-profiling +- doc-flash-attention-2 +- hw-mfma +- lang-flydsl +performance_claims: +- gpu: MI350X + dtype: bf16 + metric: FlyDSL fwd kernel-time vs CK-tile FlashAttention (throughput ratio, >1 = FlyDSL faster) + value: ~0.92x (HEADROOM) + shape: D=128, causal, seq%256==0 + utilization: register-pressure-capped — 1-2 waves/SIMD resident, VGPR 175-251 live + source_id: ref-flydsl-kernel-profiling +implemented_by: +- pr-FlyDSL-225 +- pr-FlyDSL-334 +- pr-FlyDSL-346 +- pr-FlyDSL-462 +--- +# FlyDSL Flash Attention — generic + gfx950 dual-wave fast path + +## Overview + +FlyDSL ships a forward FlashAttention-2 kernel written in its +[Python + MLIR layout DSL](../languages/flydsl.md) rather than in C++/CK. The same +online-softmax math as the [CK-tile implementation](flash-attention-ck.md) — see +that page for the recurrence — but the kernel is **built** by tracing Python into +the `fly` dialect and lowering to ROCDL/MFMA. It exists as **two kernels behind one +dispatcher**: + +1. **`flash_attn_generic.py`** — the portable, compiler-scheduled kernel. The public + builder `build_flash_attn_func_module_primary(...)` auto-selects `BLOCK_M=128` + (4 waves / 256 threads) or `BLOCK_M=256` (8 waves / 512 threads) by `B·S`, runs + **GEMM-1 `K·Qᵀ`** so scores land in MFMA-32 register layout, keeps **P resident in + registers** (no LDS round-trip), and feeds **GEMM-2 `Vᵀ·P`**. Online softmax (running + max/sum, exp2, causal mask) is in registers. +2. **`flash_attn_gfx950.py`** — `build_flash_attn_dualwave_swp_module(...)`, a + **hand-scheduled dual-wave software-pipelined** fast path. Same math as the + `BLOCK_M=256` generic path, dispatched only when `gpu_arch ≥ gfx950`, + `head_dim == 128`, `dtype ∈ {bf16, f16}`, and at runtime `seq_len % 256 == 0`, + `seq_len ≥ 384`. + +The generic builder imports and routes to the gfx950 module when those gates hold, +falling back gracefully otherwise. + +## The gfx950 dual-wave fast path + +What the hand-written schedule buys over the compiler-scheduled generic kernel: + +- **Dual-wave time-multiplexing.** Two wave-groups co-resident on one EU are + phase-shifted by an extra `s_barrier` in the prologue so group B's compute hides + group A's KV-load latency; they realign at the epilogue. +- **Explicit software pipeline.** Prologue loads K0; an 8-cluster main loop runs with + a **2-tile lookahead** (K/V prefetched two iterations ahead); a 14-cluster epilogue + drains the tail. +- **gfx950 hardware intrinsics.** `ds_read_tr16_b64` (HW-transpose LDS read for V), + `buffer_load…lds` ([direct-to-LDS](../hardware/async-copy-lds.md) DMA, gfx950 widens + to 16 B), `permlane32_swap` wave reductions, and manual `sched_group_barrier` / + `s_setprio` to pin MFMA↔VALU↔EXP issue windows. +- **Lazy rescale.** The O accumulator is rescaled only when `m_new − m_old > 8` + (ballot-gated), skipping the correction on the common path. +- **log2-space softmax.** Q is pre-scaled by `1/√d · log2(e)` so the inner loop uses + `exp2` and the operand stays in the gfx950 `exp2` accuracy window. + +## How it is built (illustrative) + +FlyDSL kernels are *built*, not just called — the builder emits a specialized MLIR +module per config. Conceptually: + +```python +from kernels.flash_attn_generic import build_flash_attn_func_module_primary + +# One entry point; it picks BLOCK_M and, on gfx950 D=128 bf16, the dual-wave +# software-pipelined fast path (flash_attn_gfx950.build_flash_attn_dualwave_swp_module). +launch = build_flash_attn_func_module_primary( + num_heads=32, num_kv_heads=8, # GQA: num_heads % num_kv_heads == 0 + head_dim=128, dtype_str="bf16", + causal=True, # masks upper-triangular KV tiles + gpu_arch="gfx950", # gates the dual-wave path +) +# launch(Q, K, V, O, seq_len) with grid = (B * num_q_tiles * H, 1, 1) +``` + +Most of the kernel body is **compile-time metaprogramming**: `const_expr(...)` branches +and `range_constexpr(...)` loops resolve when the module is built (the IR for one config +is ~40k lines), so feature toggles (`USE_HW_TR`, `USE_K16`, `ENABLE_DMA`, `CAUSAL`) +generate code rather than branch at runtime. + +## Evolution (upstream PRs) + +| PR | What it added | +|---|---| +| [#225](../../sources/prs/FlyDSL/PR-225.md) | Original FMHA kernel (MFMA-32, online softmax) | +| [#334](../../sources/prs/FlyDSL/PR-334.md) | Tile-M tuning + `BLOCK_M=128`/`256` runtime dispatch | +| [#346](../../sources/prs/FlyDSL/PR-346.md) | Dynamic-dispatch refactor of the FA path | +| [#462](../../sources/prs/FlyDSL/PR-462.md) | Clean-up: low-level MLIR → modern `fly.*`/`Vec`/Pythonic control flow | +| #629 | gfx950 dual-wave SWP kernel (`flash_attn_gfx950.py`); rename `flash_attn_func.py` → `flash_attn_generic.py` | +| #661 | Route MFMA through the **layout MMA-atom API** (`make_mma_atom` / `mma_atom_call_ssa`) — same perf, ~1k fewer lines | + +(#629 / #661 post-date this wiki's PR-harvest cutoff and have no source page yet.) + +## Performance (measured on MI350X) + +In our [first-party rocprofv3 ATT sweep](../../sources/refs/ref-flydsl-kernel-profiling.md) +on real gfx950 silicon (ROCm 7.2), the FlyDSL forward kernel runs at **~0.92×** the +throughput of the CK-tile FlashAttention baseline at `D=128` causal — i.e. close but +in the **HEADROOM** bucket, not yet a win. The trace attributes the gap to +**register-pressure-capped occupancy**: only **1–2 waves/SIMD** are resident with +**VGPR 175–251** live, so the matrix unit stalls without a second wave to hide latency. +The actionable lever is [cutting the live VGPR set](../techniques/vgpr-budgeting.md) to +admit a 2nd wave — the dual-wave SWP schedule is the structural attempt at exactly that, +trading hand-scheduled register reuse for occupancy. + +## See also + +- [FlyDSL language guide](../languages/flydsl.md) — layout algebra, `@flyc.kernel`, the MMA-atom API +- [FlashAttention-2 via CK-tile](flash-attention-ck.md) — the C++ baseline + the online-softmax math +- [VGPR / AGPR budgeting](../techniques/vgpr-budgeting.md) · [MFMA pipelining](../techniques/mfma-pipelining.md) +- [MI350X profiling sweep & dashboard](../../sources/refs/ref-flydsl-kernel-profiling.md) + +## Sources + +- [FlyDSL reference repository](https://github.com/ROCm/FlyDSL) +- [FlyDSL kernel profiling dashboard (MI350X)](https://jhinpan.github.io/flydsl-kernel-profiling/) +- [FlashAttention-2 paper](https://arxiv.org/abs/2307.08691) +- [MFMA — AMD Matrix Core Instructions](../hardware/mfma.md) diff --git a/wiki/languages/flydsl.md b/wiki/languages/flydsl.md index f75a61c7..21c16421 100644 --- a/wiki/languages/flydsl.md +++ b/wiki/languages/flydsl.md @@ -26,14 +26,18 @@ kernel_types: - gemm - fp8-gemm - fused-moe +- flash-attention +- attention related: - hw-mfma - technique-preshuffle-layout - kernel-flydsl-preshuffle-gemm +- kernel-flydsl-flash-attention - lang-composable-kernel - lang-triton-amd sources: - ref-flydsl +- ref-flydsl-kernel-profiling - ref-aiter - doc-llvm-amdgpu - blog-amd-matrix-cores @@ -205,6 +209,40 @@ three independent index expressions by hand. The end-to-end worked GEMM — including the pre-shuffle transform and a benchmark — is on the [FlyDSL pre-shuffle GEMM kernel page](../kernels/flydsl-preshuffle-gemm.md). +## Beyond GEMM: flash attention and the MMA-atom API + +FlyDSL is not GEMM-only. Its forward **flash attention** is a two-GEMM + online-softmax +kernel built the same way — and it now has two implementations behind one dispatcher: a +portable compiler-scheduled `flash_attn_generic.py` and a hand-scheduled, dual-wave +software-pipelined `flash_attn_gfx950.py` fast path (gfx950, `D=128`, bf16/f16). The full +walk-through — dispatch logic, the gfx950 hardware schedule, and the upstream PR arc — is +on the [FlyDSL flash-attention kernel page](../kernels/flydsl-flash-attention.md). + +A recent direction is the **layout MMA-atom API**: instead of emitting raw ROCDL +intrinsics (`rocdl.mfma_f32_32x32x16_bf16`, `buffer_load_dwordx4`), kernels construct a +`make_mma_atom(...)` and issue it via `mma_atom_call_ssa(...)` (and `copy_atom_call_ssa` +for loads/stores). The flash-attention kernel was migrated onto this API, cutting ~1k +lines at parity — the same `Layout` algebra that describes tiling now also describes the +matrix-core operand distribution. + +## Measured on MI350X (gfx950) + +We profiled every major FlyDSL gfx950 kernel on real **MI350X silicon (ROCm 7.2)** with +rocprofv3 ATT + counters, against matched-shape AITER / CK / hipBLASLt baselines — see the +[profiling sweep & dashboard](../../sources/refs/ref-flydsl-kernel-profiling.md). Headlines +(throughput ÷ baseline, `>1` = FlyDSL faster): + +- **Wins:** softmax **2.05×** (vs Triton), hgemm_splitk **1.66×**, moe_gemm **1.11×**. +- **Parity:** layernorm, quant, moe_reduce. +- **Headroom:** flash_attn 0.92×, mla 0.90×, rmsnorm 0.89×, paged-attention 0.48×, and the + two big ones — **topk_gating 0.22×** and **rope 0.17×**. + +Two recurring root causes, both actionable: (1) the attention/GEMM losers are +**register-pressure-capped** at 1–2 waves/SIMD (VGPR 175–251) — admit a 2nd wave by +[cutting the live set](../techniques/vgpr-budgeting.md); (2) rope/topk serialize cross-lane +reductions on `LGKMCNT` — replace with a DPP / `v_permlane16` +[wave reduction](../techniques/wave-reduce.md). + ## Limitations and gotchas - **Experimental / moving target.** Dialect op names, pass flags, and the Python From 32683e50979c4b0128f5a899252204f5a9af6947 Mon Sep 17 00:00:00 2001 From: Jin Pan Date: Mon, 8 Jun 2026 16:00:15 +0000 Subject: [PATCH 2/2] docs: address PR review feedback --- README.md | 13 +++++-------- data/schemas.yaml | 1 + docs/architecture.svg | 2 +- wiki/kernels/flydsl-flash-attention.md | 24 +++++++++++------------- wiki/languages/flydsl.md | 21 ++++++--------------- 5 files changed, 24 insertions(+), 37 deletions(-) diff --git a/README.md b/README.md index 0d756ac6..c1a52bb2 100644 --- a/README.md +++ b/README.md @@ -34,14 +34,11 @@ an adversarial second pass. Full evidence: [`VERIFICATION.md`](VERIFICATION.md) memory NPS1/NPS2; native `xf32` MFMA *fails to select* on gfx950. - **All 12 runnable examples** build with `--offload-arch=gfx950` **and execute** on the GPU (11/12 self-check; `fp8-gemm`'s `main()` only verifies the emitted MFMA, no numeric check). -- **First-party FlyDSL kernel sweep on MI350X** — every major FlyDSL gfx950 kernel profiled - with rocprofv3 ATT + counters against AITER/CK/hipBLASLt baselines - ([dashboard](https://jhinpan.github.io/flydsl-kernel-profiling/) · - [`ref-flydsl-kernel-profiling`](sources/refs/ref-flydsl-kernel-profiling.md) · - [FlyDSL flash-attention page](wiki/kernels/flydsl-flash-attention.md)): - - **Wins:** softmax **2.05×** (vs Triton), hgemm_splitk **1.66×**, moe_gemm **1.11×**. - - **Headroom:** flash_attn 0.92×, paged-attention 0.48×, **topk_gating 0.22×**, **rope 0.17×** - — the attention/GEMM losers are **register-pressure-capped** at 1–2 waves/SIMD. +- **First-party FlyDSL kernel sweep on MI350X** — every major FlyDSL gfx950 kernel was + profiled with rocprofv3 ATT + counters against matched AITER/CK/hipBLASLt baselines. + The detailed verdict table, root-cause notes, and dashboard links live in the + canonical [`ref-flydsl-kernel-profiling`](sources/refs/ref-flydsl-kernel-profiling.md) + source page; synthesized pages link back to it instead of duplicating the full summary. ## What's Here diff --git a/data/schemas.yaml b/data/schemas.yaml index 0bd71dee..461be98c 100644 --- a/data/schemas.yaml +++ b/data/schemas.yaml @@ -146,6 +146,7 @@ wiki-kernel: - shape - utilization - baseline + - bucket wiki-pattern: required: diff --git a/docs/architecture.svg b/docs/architecture.svg index 2ce226ae..bd676700 100644 --- a/docs/architecture.svg +++ b/docs/architecture.svg @@ -1,4 +1,4 @@ - + diff --git a/wiki/kernels/flydsl-flash-attention.md b/wiki/kernels/flydsl-flash-attention.md index 0836a02e..30b862b0 100644 --- a/wiki/kernels/flydsl-flash-attention.md +++ b/wiki/kernels/flydsl-flash-attention.md @@ -54,14 +54,14 @@ sources: - ref-flydsl - ref-flydsl-kernel-profiling - doc-flash-attention-2 -- hw-mfma -- lang-flydsl performance_claims: - gpu: MI350X dtype: bf16 metric: FlyDSL fwd kernel-time vs CK-tile FlashAttention (throughput ratio, >1 = FlyDSL faster) - value: ~0.92x (HEADROOM) - shape: D=128, causal, seq%256==0 + value: 0.92 + bucket: HEADROOM + baseline: CK-tile FlashAttention + shape: D=128, causal, seq % 256 == 0 utilization: register-pressure-capped — 1-2 waves/SIMD resident, VGPR 175-251 live source_id: ref-flydsl-kernel-profiling implemented_by: @@ -154,15 +154,13 @@ generate code rather than branch at runtime. ## Performance (measured on MI350X) -In our [first-party rocprofv3 ATT sweep](../../sources/refs/ref-flydsl-kernel-profiling.md) -on real gfx950 silicon (ROCm 7.2), the FlyDSL forward kernel runs at **~0.92×** the -throughput of the CK-tile FlashAttention baseline at `D=128` causal — i.e. close but -in the **HEADROOM** bucket, not yet a win. The trace attributes the gap to -**register-pressure-capped occupancy**: only **1–2 waves/SIMD** are resident with -**VGPR 175–251** live, so the matrix unit stalls without a second wave to hide latency. -The actionable lever is [cutting the live VGPR set](../techniques/vgpr-budgeting.md) to -admit a 2nd wave — the dual-wave SWP schedule is the structural attempt at exactly that, -trading hand-scheduled register reuse for occupancy. +The canonical measurement record is the +[MI350X rocprofv3 ATT sweep](../../sources/refs/ref-flydsl-kernel-profiling.md); +this page only carries the kernel-level interpretation. For `D=128` causal, the +frontmatter records a 0.92 FlyDSL/CK-tile throughput ratio in the **HEADROOM** +bucket. The actionable gap is [VGPR pressure](../techniques/vgpr-budgeting.md): +the trace is capped at 1–2 waves/SIMD, so the dual-wave SWP schedule is the +structural attempt to trade hand-scheduled register reuse for more latency hiding. ## See also diff --git a/wiki/languages/flydsl.md b/wiki/languages/flydsl.md index 21c16421..d5b80926 100644 --- a/wiki/languages/flydsl.md +++ b/wiki/languages/flydsl.md @@ -227,21 +227,12 @@ matrix-core operand distribution. ## Measured on MI350X (gfx950) -We profiled every major FlyDSL gfx950 kernel on real **MI350X silicon (ROCm 7.2)** with -rocprofv3 ATT + counters, against matched-shape AITER / CK / hipBLASLt baselines — see the -[profiling sweep & dashboard](../../sources/refs/ref-flydsl-kernel-profiling.md). Headlines -(throughput ÷ baseline, `>1` = FlyDSL faster): - -- **Wins:** softmax **2.05×** (vs Triton), hgemm_splitk **1.66×**, moe_gemm **1.11×**. -- **Parity:** layernorm, quant, moe_reduce. -- **Headroom:** flash_attn 0.92×, mla 0.90×, rmsnorm 0.89×, paged-attention 0.48×, and the - two big ones — **topk_gating 0.22×** and **rope 0.17×**. - -Two recurring root causes, both actionable: (1) the attention/GEMM losers are -**register-pressure-capped** at 1–2 waves/SIMD (VGPR 175–251) — admit a 2nd wave by -[cutting the live set](../techniques/vgpr-budgeting.md); (2) rope/topk serialize cross-lane -reductions on `LGKMCNT` — replace with a DPP / `v_permlane16` -[wave reduction](../techniques/wave-reduce.md). +FlyDSL has a first-party MI350X profiling sweep with rocprofv3 ATT traces, hardware +counters, matched baselines, and per-kernel bundles. Keep the detailed verdict table +and root-cause list in the canonical +[profiling sweep & dashboard](../../sources/refs/ref-flydsl-kernel-profiling.md); +this language page points to the study as evidence that the same layout DSL can reach +wins, parity, and still-open headroom depending on the kernel. ## Limitations and gotchas