The gather-matmul-scatter pattern in `bsr_spmm` is architecturally identical to sparse attention: Longformer's local+global attention, BigBird's random+local+global, or any structured-sparse transformer mask. A `BSRMatrix` with a sliding-window block pattern IS a local-attention mask.
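To make the correspondence concrete, here is a minimal sketch of the block-pattern side of that claim. `BSRMatrix` itself is from trnsparse and is not reproduced here; this only enumerates the `(block_row, block_col)` pairs a sliding-window pattern would contain, which are exactly the blocks a local-attention mask keeps.

```python
def sliding_window_pattern(n_blocks, window=1):
    # Keep block (i, j) iff query block i may attend to key block j,
    # i.e. the blocks within `window` of the diagonal.
    return [(i, j)
            for i in range(n_blocks)
            for j in range(n_blocks)
            if abs(i - j) <= window]

print(sliding_window_pattern(4))  # window=1 keeps the tri-diagonal blocks
```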
Acceptance:

- `docs/sparse_attention.md` — writeup showing how to build `BSRMatrix` patterns for common attention variants (local window, dilated, global tokens)
- `examples/block_sparse_attention.py` — minimal reference: build the pattern, compute `softmax(Q @ K.T) @ V` with `bsr_spmm` for the masked parts, verify against a dense-mask reference
No new kernel — this is framing + example code. The claim is that trnsparse's BSR path already provides the primitive; block-sparse attention is a consumer.
Depends on #18.
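A hedged sketch of what the example file could look like. Since trnsparse's `bsr_spmm` / `BSRMatrix` API isn't pinned down in this issue, the gather-matmul-scatter over the block pattern is written out in plain NumPy; the real example would route those per-block matmuls through `bsr_spmm`. The dense-mask reference applies the same pattern as a `-inf` additive mask before softmax.

```python
import numpy as np

def block_sparse_attention(Q, K, V, blocks, bs):
    """softmax(Q @ K.T / sqrt(d)) @ V, keeping only the listed (bi, bj) blocks."""
    n, d = Q.shape
    scores = np.full((n, n), -np.inf)
    for bi, bj in blocks:
        rs, cs = bi * bs, bj * bs
        q = Q[rs:rs + bs]                                   # gather query block
        k = K[cs:cs + bs]                                   # gather key block
        scores[rs:rs + bs, cs:cs + bs] = q @ k.T / np.sqrt(d)  # matmul + scatter
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs @ V

rng = np.random.default_rng(0)
n, d, bs = 16, 8, 4
Q, K, V = rng.normal(size=(3, n, d))

# Sliding-window (local) block pattern: |block_row - block_col| <= 1.
nb = n // bs
blocks = [(i, j) for i in range(nb) for j in range(nb) if abs(i - j) <= 1]
out = block_sparse_attention(Q, K, V, blocks, bs)

# Dense-mask reference: same pattern as an additive -inf mask.
mask = np.full((n, n), -np.inf)
for bi, bj in blocks:
    mask[bi * bs:(bi + 1) * bs, bj * bs:(bj + 1) * bs] = 0.0
s = Q @ K.T / np.sqrt(d) + mask
p = np.exp(s - s.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)
ref = p @ V

assert np.allclose(out, ref)
```

The only block-sparse-specific machinery is the loop over `blocks`; swapping the per-block `q @ k.T` for a `bsr_spmm` call is the substitution the acceptance example would demonstrate.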