
2026-04-27-the-evolution-of-flashattention #131

Open
emharsha1812 wants to merge 10 commits into iclr-blogposts:main from emharsha1812:main

Conversation

@emharsha1812
Contributor

@emharsha1812 emharsha1812 commented Dec 7, 2025

OpenReview Submission Thread

https://openreview.net/forum?id=4V2kHdynFN

Checklist before opening a PR

  • I am opening a pull request against the main branch of the 2026 repo.

  • My post and all associated references to it are all lowercase, i.e.

      2026-04-27-Sample-Test.md              -> 2026-04-27-sample-test.md
      assets/img/2026-04-27-Sample-Test/     -> assets/img/2026-04-27-sample-test/
    
  • The title of my PR is exactly the name of my markdown file

    • i.e. _posts/2026-04-27-the-evolution-of-flashattention.md requires PR name 2026-04-27-the-evolution-of-flashattention
  • I have anonymized my post: my author list is Anonymous, and there is no content
    which could reveal my or my collaborators' identities.

  • My post matches the formatting requirements, including (but not limited to):

    • I have ONLY MODIFIED files in the following locations (failure to do so will result in
      your PR automatically being closed!):
      • a Markdown file _posts/2026-04-27-the-evolution-of-flashattention.md
      • static image assets added to assets/img/2026-04-27-the-evolution-of-flashattention/
      • citations in a bibtex file assets/bibliography/2026-04-27-the-evolution-of-flashattention.bib
    • I have a short 2-3 sentence abstract in the description field of my front-matter
    • I have a table of contents, formatted using the toc field of my front-matter
    • My bibliography is correctly formatted, using a .bib file as per the sample post

Any other comments

Initial submission for ICLR 2026 blog post track.

Copilot AI review requested due to automatic review settings March 12, 2026 18:45
@github-actions

⚠️ We have detected a problem with your submission! ⚠️

You can only add/change/remove files related to your post, i.e. files that match one of these patterns: <_posts/SLUG.md, assets/img/SLUG/..., assets/html/SLUG/..., assets/bibliography/SLUG.bib>. But we found that you changed the following: <.gitignore>. Also, make sure your PR's title (2026-04-27-the-evolution-of-flashattention) matches your post's slug!

Please make the aforementioned changes and re-submit :)
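The bot's allow-list check can be reproduced locally before pushing. The sketch below is illustrative, not part of the repo's actual tooling; the `disallowed_paths` helper and hard-coded slug are assumptions, and the patterns simply mirror the four path templates the bot lists:

```python
import re

SLUG = "2026-04-27-the-evolution-of-flashattention"

# Mirrors the bot's allow-list: _posts/SLUG.md, assets/img/SLUG/...,
# assets/html/SLUG/..., assets/bibliography/SLUG.bib
ALLOWED = [
    rf"_posts/{re.escape(SLUG)}\.md",
    rf"assets/img/{re.escape(SLUG)}/.+",
    rf"assets/html/{re.escape(SLUG)}/.+",
    rf"assets/bibliography/{re.escape(SLUG)}\.bib",
]

def disallowed_paths(changed_files):
    """Return the changed files that match none of the allowed patterns."""
    return [
        path for path in changed_files
        if not any(re.fullmatch(pat, path) for pat in ALLOWED)
    ]
```

Feeding it the output of `git diff --name-only origin/main...` would have flagged the offending `.gitignore` change before CI did.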


Copilot AI left a comment


Pull request overview

This PR adds a new (future-dated) Distill blog post on FlashAttention’s evolution, along with supporting bibliography and figures.

Changes:

  • Add the post _posts/2026-04-27-the-evolution-of-flashattention.md with detailed technical content and citations.
  • Add a per-post BibTeX file assets/bibliography/2026-04-27-the-evolution-of-flashattention.bib.
  • Add/update figure asset(s) and tweak .gitignore.

Reviewed changes

Copilot reviewed 2 out of 18 changed files in this pull request and generated 7 comments.

File Description
assets/img/2026-04-27-the-evolution-of-flashattention/Figure_8.png Adds/updates an image used in the post’s complexity analysis section.
assets/bibliography/2026-04-27-the-evolution-of-flashattention.bib Adds the bibliography backing the post’s <d-cite> references.
.gitignore Fixes .idea entry formatting and adds .github/skills/ to ignore list.
_posts/2026-04-27-the-evolution-of-flashattention.md Adds the new Distill post content and citation usage.



Finally, numerical precision is a tunable parameter, not a fixed constraint. FlashAttention-3's block quantization and FA4's software-based exponential approximation demonstrate that carefully managed low-precision computation can maintain accuracy while improving throughput. Future algorithms might adaptively select precision per-operation based on numerical sensitivity analysis, potentially using FP8 or FP4 for matmuls while maintaining higher precision only where gradients demand it.
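The software-based exponential approximation mentioned here is in the spirit of Schraudolph's classic trick (cited elsewhere in this thread as schraudolph1999fast): assemble the IEEE-754 bit pattern of the result directly instead of calling exp. The following is a minimal Python sketch of that idea, with constants from the 1999 paper; it is not FA4's actual kernel implementation, which adapts the bit-level idea to Tensor Core data layouts:

```python
import math
import struct

# Schraudolph (1999): exp(y) is approximated by the double whose high
# 32-bit word is A*y + (B - C), where A = 2**20 / ln 2 scales y into
# the exponent field, B = 1023 * 2**20 is the exponent bias, and
# C = 60801 is a correction constant tuning the approximation error.
EXP_A = 2**20 / math.log(2)    # ~1512775.395
EXP_BC = 1023 * 2**20 - 60801  # 1072632447

def fast_exp(y):
    """Approximate exp(y) to within a few percent for moderate |y|."""
    hi = int(EXP_A * y) + EXP_BC            # high 32-bit word of the double
    bits = hi << 32                         # low word zeroed
    return struct.unpack("<d", struct.pack("<q", bits))[0]
```

For y = 1 this yields about 2.77 versus e ≈ 2.718, roughly 2% relative error, which is the kind of trade the post describes: a handful of integer ops in place of a transcendental-function unit.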

However, these principles come with a sobering caveat: the usability wall remains high. Writing a custom attention kernel that achieves even 50% of hardware peak requires deep expertise in GPU memory hierarchies, warp scheduling, Tensor Core constraints, and low-level CUDA or PTX programming. This excludes most ML researchers from contributing to or modifying these kernels directly. Tools like Triton <d-cite key="tillet2019triton"></d-cite>, ThunderKittens <d-cite key="spector2024thunderkittenssimplefastadorable"></d-cite>, and CuTe aim to lower this barrier by providing higher-level abstractions, but a significant gap remains between algorithm design on paper and efficient GPU implementation. Closing this gap—through better DSLs, compilers, or automated tuning—is as important as the algorithmic advances themselves.

Copilot AI Mar 12, 2026


The post cites tillet2019triton, but that citation key is not present in assets/bibliography/2026-04-27-the-evolution-of-flashattention.bib (searching the bibliography yields no matches). This will break citation rendering/build for the post; add a BibTeX entry for Triton (or update the cite key to an existing entry).

Suggested change
However, these principles come with a sobering caveat: the usability wall remains high. Writing a custom attention kernel that achieves even 50% of hardware peak requires deep expertise in GPU memory hierarchies, warp scheduling, Tensor Core constraints, and low-level CUDA or PTX programming. This excludes most ML researchers from contributing to or modifying these kernels directly. Tools like Triton <d-cite key="tillet2019triton"></d-cite>, ThunderKittens <d-cite key="spector2024thunderkittenssimplefastadorable"></d-cite>, and CuTe aim to lower this barrier by providing higher-level abstractions, but a significant gap remains between algorithm design on paper and efficient GPU implementation. Closing this gap—through better DSLs, compilers, or automated tuning—is as important as the algorithmic advances themselves.
However, these principles come with a sobering caveat: the usability wall remains high. Writing a custom attention kernel that achieves even 50% of hardware peak requires deep expertise in GPU memory hierarchies, warp scheduling, Tensor Core constraints, and low-level CUDA or PTX programming. This excludes most ML researchers from contributing to or modifying these kernels directly. Tools like Triton, ThunderKittens <d-cite key="spector2024thunderkittenssimplefastadorable"></d-cite>, and CuTe aim to lower this barrier by providing higher-level abstractions, but a significant gap remains between algorithm design on paper and efficient GPU implementation. Closing this gap—through better DSLs, compilers, or automated tuning—is as important as the algorithmic advances themselves.

Copilot uses AI. Check for mistakes.
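The missing-key failure mode Copilot flags here is easy to catch locally before the build breaks. A hypothetical pre-commit check (the helper name and regexes are illustrative; they assume the post uses `<d-cite key="...">` tags and the bib uses standard `@type{key,` entry headers):

```python
import re

def missing_citations(post_text, bib_text):
    """Return d-cite keys used in the post but defined nowhere in the bib."""
    cited = set(re.findall(r'<d-cite key="([^"]+)"', post_text))
    defined = set(re.findall(r'@\w+\{\s*([^,\s]+)\s*,', bib_text))
    return sorted(cited - defined)
```

Running this over the post and bib files and failing on a non-empty result would surface `tillet2019triton` immediately.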
Comment on lines +181 to +185
eprint={2009.14794},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2009.14794},
}

Copilot AI Mar 12, 2026


This BibTeX entry closes at line 180, but additional fields and a stray closing brace follow (lines 181–185). As written, the bibliography file is syntactically invalid and will fail to parse; remove the dangling lines or merge them correctly into the juravsky2024... entry.

Suggested change
eprint={2009.14794},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2009.14794},
}

Comment on lines +181 to +185
eprint={2009.14794},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2009.14794},
}

Copilot AI Mar 12, 2026


These lines appear to be leftover BibTeX fields (eprint, archivePrefix, etc.) plus an extra closing brace that are not attached to any entry. This makes the bibliography invalid; delete them or move them inside the intended @article{...} block.

Suggested change
eprint={2009.14794},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2009.14794},
}

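Stray fields like these are detectable without a full BibTeX parser: brace depth should return to zero after each entry and never go negative. A small sketch (function name hypothetical; it ignores the rare case of braces inside quoted strings, which is fine for typical .bib files):

```python
def stray_close_braces(bib_text):
    """Return 1-based line numbers where a '}' closes more braces than
    are open -- the signature of a dangling entry fragment."""
    depth, bad = 0, []
    for lineno, line in enumerate(bib_text.splitlines(), start=1):
        for ch in line:
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth < 0:
                    bad.append(lineno)
                    depth = 0  # resynchronize so scanning can continue
    return bad
```

On this file it would point straight at the extra closing brace after the orphaned `eprint`/`archivePrefix` lines.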
Comment on lines +231 to +239
@misc{kwon2023efficientmemorymanagementlarge,
title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
year={2023},
eprint={2309.06180},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2309.06180},
}

Copilot AI Mar 12, 2026


Duplicate BibTeX key kwon2023efficientmemorymanagementlarge is defined earlier in this file (as an @article at lines 168–173) and then redefined here as an @misc. Duplicate keys typically cause BibTeX/Jekyll-Scholar parsing errors or nondeterministic citation resolution; keep a single entry per key (or rename one of them).

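Duplicate keys are just as mechanical to find. A sketch using a simple `@type{key,` header regex (an assumed entry format, not a full parser):

```python
import re
from collections import Counter

def duplicate_keys(bib_text):
    """Return BibTeX keys that are defined more than once in the file."""
    keys = re.findall(r'@\w+\{\s*([^,\s]+)\s*,', bib_text)
    return sorted(key for key, count in Counter(keys).items() if count > 1)
```

A non-empty result from this check would have caught the `@article`/`@misc` double definition of `kwon2023efficientmemorymanagementlarge` before review.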
Comment on lines +241 to +249
@misc{juravsky2024hydragenhighthroughputllminference,
title={Hydragen: High-Throughput LLM Inference with Shared Prefixes},
author={Jordan Juravsky and Bradley Brown and Ryan Ehrlich and Daniel Y. Fu and Christopher R{\'e} and Azalia Mirhoseini},
year={2024},
eprint={2402.05099},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2402.05099},
}

Copilot AI Mar 12, 2026


Duplicate BibTeX key juravsky2024hydragenhighthroughputllminference is defined earlier in this file (as an @article at lines 175–180) and then redefined here as an @misc. Duplicate keys can break bibliography generation; consolidate into a single canonical entry or rename one of the keys.

Comment on lines +292 to +307
@online{gpu-mode-fa4,
title = {How FlashAttention 4 Works},
author = {GPU Mode},
year = {2025},
url = {https://youtu.be/ZIEq-WTquy4},
note = {YouTube video, accessed 2025-12-07}
}


@online{tri-dao-hotchips,
title = {Domain-Specific Languages for GPU Kernels and Automatic Kernel Authoring with LLMs},
author = {Tri Dao},
year = {2024},
url = {https://youtu.be/_sRkawqEMCs},
note = {Hot Chips talk, YouTube, accessed 2025-12-07}
}

Copilot AI Mar 12, 2026


This section redefines several BibTeX keys that already exist earlier in the file (e.g., gpu-mode-fa4, tri-dao-hotchips, modal-fa4, wu-fa4-medium). Duplicate keys usually cause BibTeX/Jekyll-Scholar failures; remove the earlier placeholder entries (the ones with YOUR_VIDEO_ID) or rename keys so each citation key is unique.

Comment on lines +359 to +369
@article{schraudolph1999fast,
title={A fast, compact approximation of the exponential function},
author={Schraudolph, Nicol N},
journal={Neural Computation},
volume={11},
number={4},
pages={853--862},
year={1999},
publisher={MIT Press One Rogers Street, Cambridge, MA 02142-1209, USA journals-info~…}
}


Copilot AI Mar 12, 2026


Duplicate BibTeX key schraudolph1999fast is defined earlier in the file (lines 157–166) and then redefined again here with different publisher text. Duplicate keys can break bibliography generation; keep one entry and delete or rename the other.

Suggested change
@article{schraudolph1999fast,
title={A fast, compact approximation of the exponential function},
author={Schraudolph, Nicol N},
journal={Neural Computation},
volume={11},
number={4},
pages={853--862},
year={1999},
publisher={MIT Press One Rogers Street, Cambridge, MA 02142-1209, USA journals-info~…}
}

@github-actions

⚠️ We have detected a problem with your submission! ⚠️

You can only add/change/remove files related to your post, i.e. files that match one of these patterns: <_posts/SLUG.md, assets/img/SLUG/..., assets/html/SLUG/..., assets/bibliography/SLUG.bib>. But we found that you changed the following: <.gitignore>. Also, make sure your PR's title (2026-04-27-the-evolution-of-flashattention) matches your post's slug!

Please make the aforementioned changes and re-submit :)

@busycalibrating busycalibrating added the accepted PR matches an accepted blog post label Mar 24, 2026

4 participants