Releases · HKUSTDial/flash-sparse-attention
v2.0.1
What's Changed
- Add documentation for MkDocs setup and API reference by @LoserCheems in #257
- Remove Public Package Exports section from API reference documentation by @LoserCheems in #258
- Update docstrings in attention functions for consistency by @LoserCheems in #259
- Add return type annotations for attention functions by @LoserCheems in #260
- Init cute version by @LoserCheems in #261
- Update CuTe namespace and enhance dependencies by @LoserCheems in #262
- Add support for upstream split reference in sync scripts by @LoserCheems in #263
- [BUG FIX] Refactor CuTe namespace and enhance sync scripts by @LoserCheems in #264
- [BUG FIX] Optimize LSE computation in forward combine kernel by @LoserCheems in #265
- Update CuTe namespace and functionality by @LoserCheems in #266
- Enhance sync script with cherry-pick functionality and improve merge conflict handling by @LoserCheems in #267
- Rename triton function by @LoserCheems in #268
- Revert "Rename triton function" by @LoserCheems in #269
- Rename forward combine functions and clarify comments by @LoserCheems in #270
- [FEATURE SUPPORT] Add Triton decode support with KV-cache APIs by @LoserCheems in #271
- Enhance decoding functions with FP8 and quantization support by @LoserCheems in #272
- Fix bug for decode benchmark by @LoserCheems in #273
- Cache optim by @LoserCheems in #274
- [PERFORMANCE OPTIMIZATION] Add compile-time CHECK_NAN toggle to finalize for decode kernel fast-path by @LoserCheems in #275
- [FEATURE SUPPORT] Add HuggingFace Kernel Hub support by @LoserCheems in #276
- Bump version to 2.0.1 by @LoserCheems in #277
Full Changelog: v2.0.0...v2.0.1
v2.0.0
What's Changed
- Improve numerical stability in sparse attention with sink auxiliary logits by @LoserCheems in #220
- [PERFORMANCE OPTIMIZATION] Flash Sparse Attention by @LoserCheems in #221
- [BUG FIX] Refactor block min/max calculations by @LoserCheems in #223
- [BUG FIX] Improve packed GQA handling by @LoserCheems in #224
- Add utility functions for device management and input validation by @LoserCheems in #225
- [PERFORMANCE OPTIMIZATION] Triton Sparse Base Forward Kernel with Gate-Based Sparsity by @LoserCheems in #226
- [FEATURE] Enhance forward combine kernel and split attention by @LoserCheems in #227
- Improves softmax stability with log2 scaling by @LoserCheems in #228
- Renames variables and refactors functions for clarity by @LoserCheems in #229
- Improve performance and configuration for SM90 forward path by @LoserCheems in #231
- Refactor rescaling logic in online_softmax and rescale_o functions by @LoserCheems in #232
- [BUG FIX] Improve forward kernel configuration and validation by @LoserCheems in #233
- Refactor qheads_per_kvhead calculations for clarity by @LoserCheems in #234
- [FEATURE SUPPORT] Add Triton backward support by @LoserCheems in #235
- [FEATURE SUPPORT] Add Configurable Sparse Gate Modes and Adaptive Thresholding in Triton Forward Kernel by @LoserCheems in #236
- Refactor log_sigmoid function for improved performance and accuracy by @LoserCheems in #237
- [FEATURE SUPPORT] Add Configurable Sparse Gate Modes and Adaptive Thresholding in Triton Backward Kernel by @LoserCheems in #238
- Enhance forward kernel for block range and masking logic by @LoserCheems in #239
- Refactor backward kernels for clarity and optimization by @LoserCheems in #240
- [BUG FIX] Update launch configuration for RTX Pro 6000 by @LoserCheems in #241
- Add benchmark functions for Triton attention operations by @LoserCheems in #242
- [FEATURE SUPPORT] Enable Softmax-Threshold Block Skipping in Triton Dense/Sparse Forward Attention by @LoserCheems in #243
- [BUG FIX] Improve clarity and accuracy in gating mechanisms by @LoserCheems in #244
- [BUG FIX] Update stride parameters for consistency by @LoserCheems in #245
- Add softmax threshold parameter for enhanced flexibility by @LoserCheems in #246
- [FEATURE] Implement dense attention with masking support by @LoserCheems in #247
- Enhance sparse attention implementation and documentation by @LoserCheems in #248
- [FEATURE] Implement gated attention mechanism and enhance performance by @LoserCheems in #249
- Update project structure and dependencies by @LoserCheems in #250
- [BUG FIX] Improve error reporting and occupancy in benchmarks by @LoserCheems in #251
- Update repository URLs and improve documentation by @LoserCheems in #252
- Refactor benchmark tests to simplify tensor initialization by @LoserCheems in #253
- Refactor test utilities and add CUDA tensor operation tests by @LoserCheems in #254
- Refactor masking logic in backward kernel functions by @LoserCheems in #255
- Refactor GitHub Actions workflows for package building and publishing by @LoserCheems in #256
Full Changelog: v1.2.4...v2.0.0
v1.2.4
Last attn_mask version
We will adopt a new strategy to alleviate the memory bottleneck caused by attn_mask. This is the last version that accepts attn_mask; future versions will no longer pass it.
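As a rough illustration of that bottleneck (using assumed shapes rather than the library's actual call signature), a fully materialized `[batch, num_heads, seqlen_q, seqlen_k]` mask grows quadratically with sequence length:

```python
# Back-of-the-envelope memory cost of a fully materialized attention mask.
# Shapes are illustrative assumptions, not the library's actual API.
batch, num_heads = 1, 32
for seqlen in (4_096, 32_768, 131_072):
    elements = batch * num_heads * seqlen * seqlen  # [B, H, Sq, Sk] with Sq == Sk
    gib = elements / 2**30  # 1 byte per element for a torch.bool mask
    print(f"seqlen={seqlen:>7}: bool mask ≈ {gib:,.1f} GiB")
```

Even at moderate context lengths the dense mask alone dominates memory, which is why future versions stop passing attn_mask.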
What's Changed
- Chore/sync after move by @LoserCheems in #208
- [BUG FIX] Unify masking utilities and improve performance by @LoserCheems in #209
- Corrects issue links in README guides by @LoserCheems in #212
- [BUG FIX] Correct causal mask handling for longer KV pairs by @LoserCheems in #213
- Add gradient computation for bias and token-level KV sparsity support by @LoserCheems in #214
- Add rotary-aware attention modules for improved inference by @LoserCheems in #215
- Improve code readability and linting workflow by @LoserCheems in #216
- Simplify attention mechanisms by @LoserCheems in #217
- Refactor create_mask function parameters by @LoserCheems in #218
Full Changelog: v1.2.3...v1.2.4
v1.2.3
What's Changed
- Add selectable masking strategies for attention by @LoserCheems in #204
- Refactor attention block smoothing for consistency by @LoserCheems in #205
- Optimize triton version: GQA, mask/bias broadcasting, skip inactive tiles, and stability fixes by @LoserCheems in #200
- [FEATURE SUPPORT] Triton special compact dynamic-mask attention: 1.6× faster fwd+bwd, numerically equivalent by @LoserCheems in #206
- Fix documentation and references for Flash Sparse Attention by @LoserCheems in #207
Full Changelog: v1.2.2...v1.2.3
v1.2.2
What's Changed
- [FEATURE SUPPORT] Robust dBias accumulation for seqlen_q_bias == 1 by @LoserCheems in #194
- [FEATURE SUPPORT] Centralize dynamic mask creation for FDMA by @LoserCheems in #197
- Update documentation to use mask utility in examples by @LoserCheems in #198
- Fix attention bias calculation and dbias handling by @LoserCheems in #199
- Add block-wise smoothing to attention mask by @LoserCheems in #201
- [FEATURE SUPPORT] Move scaling out of streaming loops, bias-initialized acc_s, and fix dQ double-scaling by @LoserCheems in #203
Full Changelog: v1.2.1...v1.2.2
v1.2.1
What's Changed
- Implement variable-length attention with mask and bias support by @LoserCheems in #185
- Add issue/PR templates by @LoserCheems in #186
- [FEATURE SUPPORT] Variable-Length Attention with Padding-Free Execution by @LoserCheems in #188
- [FEATURE SUPPORT] Broadcastable 4D mask/bias, 128‑rounded key length, stride‑0 broadcasting, and dbias reductions by @LoserCheems in #190
- Refactor bias initialization and enhance bias computation in FlashDMAttnFunc by @LoserCheems in #191
- Fix attention_mask and attention_bias shape descriptions and remove redundant checks by @LoserCheems in #192
- Enhance bias gradient accumulation in backward pass by @LoserCheems in #193
Full Changelog: v1.2.0...v1.2.1
v1.2.0
What's Changed
- [BUG FIX] Fix mask/bias memory access and vectorization issues in kernels by @LoserCheems in #182
Full Changelog: v1.1.9...v1.2.0
v1.1.9
What's Changed
- Refactor attention mask and bias handling for efficiency by @LoserCheems in #177
- [BUG FIX] SM80 NaN in bias.grad when both mask and bias are enabled by @LoserCheems in #179
Full Changelog: v1.1.8...v1.1.9
v1.1.8
v1.1.7
What's Changed
- Increase GitHub Actions build timeout to 6 hours by @LoserCheems in #175
Full Changelog: v1.1.6...v1.1.7