The gather-matmul-scatter pattern in `bsr_spmm` is architecturally identical to sparse attention: Longformer's local+global attention, BigBird's random+local+global, or any structured-sparse transformer mask. A `BSRMatrix` with a sliding-window block pattern IS a local-attention mask.
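To make the correspondence concrete, here is a minimal sketch of the block-pattern side of that claim. `BSRMatrix` itself is from trnsparse and is not reproduced here; this only enumerates the `(block_row, block_col)` pairs a sliding-window pattern would contain, which are exactly the blocks a local-attention mask keeps.

```python
def sliding_window_pattern(n_blocks, window=1):
    # Keep block (i, j) iff query block i may attend to key block j,
    # i.e. the blocks within `window` of the diagonal.
    return [(i, j)
            for i in range(n_blocks)
            for j in range(n_blocks)
            if abs(i - j) <= window]

print(sliding_window_pattern(4))  # window=1 keeps the tri-diagonal blocks
```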
Acceptance:

- `docs/sparse_attention.md` — writeup showing how to build `BSRMatrix` patterns for common attention variants (local window, dilated, global tokens)
- `examples/block_sparse_attention.py` — minimal reference: build the pattern, compute `softmax(Q @ K.T) @ V` with `bsr_spmm` for the masked parts, verify against a dense-mask reference
No new kernel — this is framing + example code. The claim is that trnsparse's BSR path already provides the primitive; block-sparse attention is a consumer.
Depends on #18.
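A hedged sketch of what the example file could look like. Since trnsparse's `bsr_spmm` / `BSRMatrix` API isn't pinned down in this issue, the gather-matmul-scatter over the block pattern is written out in plain NumPy; the real example would route those per-block matmuls through `bsr_spmm`. The dense-mask reference applies the same pattern as a `-inf` additive mask before softmax.

```python
import numpy as np

def block_sparse_attention(Q, K, V, blocks, bs):
    """softmax(Q @ K.T / sqrt(d)) @ V, keeping only the listed (bi, bj) blocks."""
    n, d = Q.shape
    scores = np.full((n, n), -np.inf)
    for bi, bj in blocks:
        rs, cs = bi * bs, bj * bs
        q = Q[rs:rs + bs]                                   # gather query block
        k = K[cs:cs + bs]                                   # gather key block
        scores[rs:rs + bs, cs:cs + bs] = q @ k.T / np.sqrt(d)  # matmul + scatter
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs @ V

rng = np.random.default_rng(0)
n, d, bs = 16, 8, 4
Q, K, V = rng.normal(size=(3, n, d))

# Sliding-window (local) block pattern: |block_row - block_col| <= 1.
nb = n // bs
blocks = [(i, j) for i in range(nb) for j in range(nb) if abs(i - j) <= 1]
out = block_sparse_attention(Q, K, V, blocks, bs)

# Dense-mask reference: same pattern as an additive -inf mask.
mask = np.full((n, n), -np.inf)
for bi, bj in blocks:
    mask[bi * bs:(bi + 1) * bs, bj * bs:(bj + 1) * bs] = 0.0
s = Q @ K.T / np.sqrt(d) + mask
p = np.exp(s - s.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)
ref = p @ V

assert np.allclose(out, ref)
```

The only block-sparse-specific machinery is the loop over `blocks`; swapping the per-block `q @ k.T` for a `bsr_spmm` call is the substitution the acceptance example would demonstrate.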