Releases · HKUSTDial/flash-sparse-attention
v2.0.1
What's Changed
- Add documentation for MkDocs setup and API reference by @LoserCheems in #257
- Remove Public Package Exports section from API reference documentation by @LoserCheems in #258
- Update docstrings in attention functions for consistency by @LoserCheems in #259
- Add return type annotations for attention functions by @LoserCheems in #260
- Init cute version by @LoserCheems in #261
- Update CuTe namespace and enhance dependencies by @LoserCheems in #262
- Add support for upstream split reference in sync scripts by @LoserCheems in #263
- [BUG FIX] Refactor CuTe namespace and enhance sync scripts by @LoserCheems in #264
- [BUG FIX] Optimize LSE computation in forward combine kernel by @LoserCheems in #265
- Update CuTe namespace and functionality by @LoserCheems in #266
- Enhance sync script with cherry-pick functionality and improve merge conflict handling by @LoserCheems in #267
- Rename triton function by @LoserCheems in #268
- Revert "Rename triton function" by @LoserCheems in #269
- Rename forward combine functions and clarify comments by @LoserCheems in #270
- [FEATURE SUPPORT] Add Triton decode support with KV-cache APIs by @LoserCheems in #271
- Enhance decoding functions with FP8 and quantization support by @LoserCheems in #272
- Fix bug for decode benchmark by @LoserCheems in #273
- Cache optim by @LoserCheems in #274
- [PERFORMANCE OPTIMIZATION] Add compile-time CHECK_NAN toggle to finalize for decode kernel fast-path by @LoserCheems in #275
- [FEATURE SUPPORT] Add HuggingFace Kernel Hub support by @LoserCheems in #276
- Bump version to 2.0.1 by @LoserCheems in #277
Full Changelog: v2.0.0...v2.0.1
v2.0.0
What's Changed
- Improve numerical stability in sparse attention with sink auxiliary logits by @LoserCheems in #220
- [PERFORMANCE OPTIMIZATION] Flash Sparse Attention by @LoserCheems in #221
- [BUG FIX] Refactor block min/max calculations by @LoserCheems in #223
- [BUG FIX] Improve packed GQA handling by @LoserCheems in #224
- Add utility functions for device management and input validation by @LoserCheems in #225
- [PERFORMANCE OPTIMIZATION] Triton Sparse Base Forward Kernel with Gate-Based Sparsity by @LoserCheems in #226
- [FEATURE] Enhance forward combine kernel and split attention by @LoserCheems in #227
- Improves softmax stability with log2 scaling by @LoserCheems in #228
- Renames variables and refactors functions for clarity by @LoserCheems in #229
- Improve performance and configuration for SM90 forward path by @LoserCheems in #231
- Refactor rescaling logic in online_softmax and rescale_o functions by @LoserCheems in #232
- [BUG FIX] Improve forward kernel configuration and validation by @LoserCheems in #233
- Refactor qheads_per_kvhead calculations for clarity by @LoserCheems in #234
- [FEATURE SUPPORT] Add Triton backward support by @LoserCheems in #235
- [FEATURE SUPPORT] Add Configurable Sparse Gate Modes and Adaptive Thresholding in Triton Forward Kernel by @LoserCheems in #236
- Refactor log_sigmoid function for improved performance and accuracy by @LoserCheems in #237
- [FEATURE SUPPORT] Add Configurable Sparse Gate Modes and Adaptive Thresholding in Triton Backward Kernel by @LoserCheems in #238
- Enhance forward kernel for block range and masking logic by @LoserCheems in #239
- Refactor backward kernels for clarity and optimization by @LoserCheems in #240
- [BUG FIX] Update launch configuration for RTX Pro 6000 by @LoserCheems in #241
- Add benchmark functions for Triton attention operations by @LoserCheems in #242
- [FEATURE SUPPORT] Enable Softmax-Threshold Block Skipping in Triton Dense/Sparse Forward Attention by @LoserCheems in #243
- [BUG FIX] Improve clarity and accuracy in gating mechanisms by @LoserCheems in #244
- [BUG FIX] Update stride parameters for consistency by @LoserCheems in #245
- Add softmax threshold parameter for enhanced flexibility by @LoserCheems in #246
- [FEATURE] Implement dense attention with masking support by @LoserCheems in #247
- Enhance sparse attention implementation and documentation by @LoserCheems in #248
- [FEATURE] Implement gated attention mechanism and enhance performance by @LoserCheems in #249
- Update project structure and dependencies by @LoserCheems in #250
- [BUG FIX] Improve error reporting and occupancy in benchmarks by @LoserCheems in #251
- Update repository URLs and improve documentation by @LoserCheems in #252
- Refactor benchmark tests to simplify tensor initialization by @LoserCheems in #253
- Refactor test utilities and add CUDA tensor operation tests by @LoserCheems in #254
- Refactor masking logic in backward kernel functions by @LoserCheems in #255
- Refactor GitHub Actions workflows for package building and publishing by @LoserCheems in #256
Full Changelog: v1.2.4...v2.0.0
v1.2.4
Last attn_mask version
We will adopt a new strategy to alleviate the memory bottleneck caused by attn_mask. This is the last version that accepts attn_mask; future versions will no longer pass it.
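As a rough illustration of that bottleneck (using assumed shapes rather than the library's actual call signature), a fully materialized `[batch, num_heads, seqlen_q, seqlen_k]` mask grows quadratically with sequence length:

```python
# Back-of-the-envelope memory cost of a fully materialized attention mask.
# Shapes are illustrative assumptions, not the library's actual API.
batch, num_heads = 1, 32
for seqlen in (4_096, 32_768, 131_072):
    elements = batch * num_heads * seqlen * seqlen  # [B, H, Sq, Sk] with Sq == Sk
    gib = elements / 2**30  # 1 byte per element for a torch.bool mask
    print(f"seqlen={seqlen:>7}: bool mask ≈ {gib:,.1f} GiB")
```

Even at moderate context lengths the dense mask alone dominates memory, which is why future versions stop passing attn_mask.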
What's Changed
- Chore/sync after move by @LoserCheems in #208
- [BUG FIX] Unify masking utilities and improve performance by @LoserCheems in #209
- Corrects issue links in README guides by @LoserCheems in #212
- [BUG FIX] Correct causal mask handling for longer KV pairs by @LoserCheems in #213
- Add gradient computation for bias and token-level KV sparsity support by @LoserCheems in #214
- Add rotary-aware attention modules for improved inference by @LoserCheems in #215
- Improve code readability and linting workflow by @LoserCheems in #216
- Simplify attention mechanisms by @LoserCheems in #217
- Refactor create_mask function parameters by @LoserCheems in #218
Full Changelog: v1.2.3...v1.2.4
v1.2.3
What's Changed
- Add selectable masking strategies for attention by @LoserCheems in #204
- Refactor attention block smoothing for consistency by @LoserCheems in #205
- Optimize triton version: GQA, mask/bias broadcasting, skip inactive tiles, and stability fixes by @LoserCheems in #200
- [FEATURE SUPPORT] Triton special compact dynamic-mask attention: 1.6× faster fwd+bwd, numerically equivalent by @LoserCheems in #206
- Fix documentation and references for Flash Sparse Attention by @LoserCheems in #207
Full Changelog: v1.2.2...v1.2.3
v1.2.2
What's Changed
- [FEATURE SUPPORT] Robust dBias accumulation for seqlen_q_bias == 1 by @LoserCheems in #194
- [FEATURE SUPPORT] Centralize dynamic mask creation for FDMA by @LoserCheems in #197
- Update documentation to use mask utility in examples by @LoserCheems in #198
- Fix attention bias calculation and dbias handling by @LoserCheems in #199
- Add block-wise smoothing to attention mask by @LoserCheems in #201
- [FEATURE SUPPORT] Move scaling out of streaming loops, bias-initialized acc_s, and fix dQ double-scaling by @LoserCheems in #203
Full Changelog: v1.2.1...v1.2.2
v1.2.1
What's Changed
- Implement variable-length attention with mask and bias support by @LoserCheems in #185
- Add issue/PR templates by @LoserCheems in #186
- [FEATURE SUPPORT] Variable-Length Attention with Padding-Free Execution by @LoserCheems in #188
- [FEATURE SUPPORT] Broadcastable 4D mask/bias, 128‑rounded key length, stride‑0 broadcasting, and dbias reductions by @LoserCheems in #190
- Refactor bias initialization and enhance bias computation in FlashDMAttnFunc by @LoserCheems in #191
- Fix attention_mask and attention_bias shape descriptions and remove redundant checks by @LoserCheems in #192
- Enhance bias gradient accumulation in backward pass by @LoserCheems in #193
Full Changelog: v1.2.0...v1.2.1
v1.2.0
What's Changed
- [BUG FIX] Fix mask/bias memory access and vectorization issues in kernels by @LoserCheems in #182
Full Changelog: v1.1.9...v1.2.0
v1.1.9
What's Changed
- Refactor attention mask and bias handling for efficiency by @LoserCheems in #177
- [BUG FIX] SM80 NaN in bias.grad when both mask and bias are enabled by @LoserCheems in #179
Full Changelog: v1.1.8...v1.1.9
v1.1.8
v1.1.7
What's Changed
- Increase GitHub Actions build timeout to 6 hours by @LoserCheems in #175
Full Changelog: v1.1.6...v1.1.7