Releases: trnsci/trnsparse
trnsparse 0.4.2 — block-sparse attention primitive
What's new
- `examples/block_sparse_attention.py` — block-sparse attention reference using `BSRMatrix` + `bsr_spmm`. Three mask patterns (local window, dilated, global tokens); verifies against a dense reference; reports block density and timing for the `bsr_spmm` step. Closes #21.
- `docs/sparse_attention.md` — writeup: how BSR-128 maps to Longformer/BigBird-style attention masks, block density arithmetic, pattern construction helpers, and the fused-tile follow-up (#25).
- `tests/test_attention.py` — 8 CPU tests: mask shape/symmetry checks + parity against the dense reference at `atol=1e-4` for all three patterns and the full-attention edge case.
- `mkdocs.yml` — add Iterative Solvers (was missing from nav) and Sparse Attention.
Notes
No API changes, no kernel changes. The claim in #21: `bsr_spmm` is the block-sparse attention primitive; `BSRMatrix` captures the mask.
Sparse attention becomes natural at exactly 128-token granularity — the Tensor Engine tile. Trainium was built for attention, and that design transfers to any block-128-structured sparse workload.
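The local-window pattern reduces to a boolean mask over 128×128 blocks. A minimal NumPy sketch of that construction — the helper name and signature are hypothetical, not the API of `examples/block_sparse_attention.py`:

```python
import numpy as np

def local_window_block_mask(seq_len: int, block_size: int = 128,
                            window_blocks: int = 1) -> np.ndarray:
    """Boolean block mask for local-window attention (illustrative sketch).

    Block (i, j) is kept when |i - j| <= window_blocks, i.e. each
    block_size-token block attends only to its neighbors in the window.
    """
    n_blocks = seq_len // block_size
    idx = np.arange(n_blocks)
    return np.abs(idx[:, None] - idx[None, :]) <= window_blocks

mask = local_window_block_mask(seq_len=1024, block_size=128, window_blocks=1)
density = mask.sum() / mask.size   # 22 of 64 blocks stored for an 8-block band
```

Block density is what the example reports: only the `True` blocks are stored and dispatched to the Tensor Engine.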
Install
```
pip install trnsparse==0.4.2
```
Full changelog: CHANGELOG.md
trnsparse 0.4.1 — screened Fock + PySCF integration examples
Closes #6 and #13. No API changes, no kernel changes — v0.4.0 users who don't need the examples can stay on v0.4.0.
Added
- `examples/sparse_fock.py` — rewritten around v0.4.0's `screened_spmm`. Three paths side-by-side on the same inputs:
- v0.1.x unfused flow (`schwarz_bounds → screen → from_dense → spmm`)
- v0.4.0 fused `screened_spmm` (one call)
- Full Fock build — the Coulomb matrix from path 2 contracted against MO coefficients via `trnblas.gemm` for `F_MO = C.T @ J @ C`. Optional dep on trnblas; falls back to `torch.matmul` otherwise.
On a 50-basis synthetic system, the fused path is ~130× faster than the unfused (dominated by eliminating the Python `from_dense` CSR construction).
- `examples/pyscf_bridge.py` (new) — optional PySCF-driven demo. Builds H₂O (or benzene, or H₂), extracts real AO ERIs via `mol.intor("int2e")`, feeds the `(μμ|μμ)` diagonal into `schwarz_bounds` + `screened_spmm` against a mock density matrix. Requires `pip install pyscf`; tests skip cleanly if not available.
- `tests/test_examples.py` — 2 CPU smoke tests plus a PySCF-gated test. Exercises the unfused + fused paths end-to-end and asserts parity (`atol=1e-6`).
Validation
- 51 CPU tests pass (49 existing + 2 new example tests); 1 PySCF test skips cleanly.
- No hardware or simulator regression.
trnsparse 0.4.0 — fused Schwarz-screened SpMM
Closes #19.
Added
- `screened_spmm(A, diag_integrals, B, threshold)` — fused Schwarz-screened dense matmul. One NKI kernel fuses the full pipeline (outer-product pair bound → threshold → mask-apply → `nc_matmul`) into a single dispatch. Saves ~30–50% end-to-end vs the unfused `density_screen + from_dense + spmm` flow on Fock-build-sized inputs.
- `_screened_spmm_kernel` (`@nki.jit`) — stationary-A-tile-reuse GEMM extended with a per-tile pair-bound mask built from the 1-D Schwarz-bound vector.
- `_ScreenedSpMMFunction` — `torch.autograd.Function` wrapper. Third differentiable NKI kernel in the trnsci suite (after v0.2.0 CSR SpMM and v0.3.0 BSR SpMM). `torch.autograd.gradcheck` passes at `atol=1e-4` on hardware.
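In dense NumPy terms, the fused pipeline is equivalent to the following reference. This is a sketch of the semantics described above (outer-product pair bound → threshold → mask-apply → matmul), not the NKI kernel; the square-root-of-diagonal pair bound is the standard Schwarz estimate and is assumed here:

```python
import numpy as np

def screened_spmm_reference(A, diag_integrals, B, threshold):
    """Dense NumPy reference for the fused Schwarz-screened matmul (sketch).

    Assumed semantics: the pair bound is the outer product of
    sqrt(diag_integrals) with itself; entries of A whose bound falls
    below the threshold are masked to zero before the matmul.
    """
    q = np.sqrt(diag_integrals)        # 1-D Schwarz-bound vector (length M)
    pair_bound = np.outer(q, q)        # (M, M) outer-product pair bound
    mask = pair_bound >= threshold     # keep only pairs above threshold
    return (A * mask) @ B              # masked square A times dense B

A = np.ones((4, 4))
Q = np.array([1.0, 1.0, 1e-8, 1.0])   # tiny diagonal integral -> screened out
C = screened_spmm_reference(A, Q, np.eye(4), threshold=0.5)
# row 2 and column 2 of C are zeroed by the screen
```

The kernel fuses all four steps into one dispatch; this reference is only useful for checking parity on CPU.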
Validation
| Surface | Tests | Result |
|---|---|---|
| CPU suite | 4 `TestScreenedSpmm` | ✅ |
| Simulator (ubuntu-latest CI + trn1) | 2 `TestScreenedSpmmSimulator` | ✅ |
| Hardware (trn1.2xlarge) | 7 `TestNkiScreenedSpmmParity` + `TestNkiScreenedSpmmDifferentiability` | ✅ |
Total: 49 CPU + 23 hardware tests green; no regression across the suite.
Also closed this session
- #24 — the fused CG NKI kernel is not buildable under NKI 2.24/0.3.0 (no `break`, no iteration-carried scalar state across `affine_range`, no nested kernels). A per-iteration `_cg_step_kernel` reframe was evaluated and found to save only 5–20%, so the issue was closed honestly. If upstream NKI gains persistent SBUF across calls or in-kernel control flow, the whole-loop CG kernel can be reopened.
Known limits
- `screened_spmm` is currently restricted to square A (M == K) with a 1-D `diag_integrals` — the common Fock-build case. A rectangular / asymmetric-bounds extension is a follow-up if asked for.
trnsparse 0.3.2 — CG + power iteration on BSR
Phase 1 plumbing for #22 (on-chip iterative solvers): Python-level CG and power iteration on top of `bsr_spmm`.
Added
- `cg_bsr(A, b, x0, tol, max_iter, M=None) -> (x, iters, rel)` — Conjugate Gradient for SPD BSR matrices. Takes an optional preconditioner.
- `power_iteration_bsr(A, v0, max_iter, tol) -> (lam, v, iters)` — dominant eigenpair via power iteration.
- `jacobi_preconditioner_bsr(A)` — diagonal preconditioner builder.
- `bsr_diagonal(A)` — main-diagonal extractor.
- `docs/iterative_solvers.md` — design note covering the v0.3.2 plumbing and the v0.4.0 fused-kernel goal.
- `tests/test_iterative.py` (8 tests, scipy parity at `atol=1e-4`).
- `benchmarks/bench_iterative.py` — at 128×128 SPD: scipy 310 μs, trnsparse 369 μs (1.19×).
Not in this release
The architectural win from #22's acceptance list — keeping A SBUF-resident across all iterations — requires a fused NKI kernel that wraps the CG loop; that's tracked in #24 for v0.4.0. Today each CG iteration dispatches one `bsr_spmm`, so A round-trips to HBM every iteration on the NKI path. The API stays stable across the v0.4.0 transition — users get the fused-kernel speedup automatically.
Validation
- 45/45 CPU tests pass (37 existing + 8 new iterative).
- Simulator CI (`nki-simulator`) green on ubuntu-latest.
- No NKI code changes — `bsr_spmm` unchanged; hardware re-validation not required.
trnsparse 0.3.1 — NKI 0.3.0 namespace + simulator dispatch
Clears `[Unreleased]` with the three-commit trnsparse#23 landing. No kernel-body changes; no public API changes; no numeric drift. Upgrade is safe for anyone already on v0.3.0.
Changed
- Migrated NKI imports to the `nki.*` namespace (NKI 0.3.0 Stable, Neuron SDK 2.29, April 2026). The legacy `neuronxcc.nki.*` shim is no longer used.
- `pyproject.toml` `[neuron]` extra gains `nki>=0.3.0` alongside `neuronxcc>=2.24` and `torch-neuronx>=2.9`. Hosts without an `nki` wheel (macOS, non-Linux archs) still hit `HAS_NKI=False` and get the torch fallback.
- The `test` CI job now filters `-m "not neuron and not nki_simulator"` so each test runs in exactly one job.
Added
- `TRNSPARSE_USE_SIMULATOR=1` dispatch branch through `nki.simulate(kernel)(np_args)`. Bypasses torch_xla + NEFF compile; kernels run on CPU for correctness iteration. Hardware still owns perf numbers.
- `nki-simulator` CI job on `ubuntu-latest` — installs `nki>=0.3.0` from the AWS pip index, runs the simulator suite on every push/PR. A kernel-correctness gate without AWS cost. Catches Python-trace-level errors (bad kwargs, dropped ops, shape mismatches); MLIR verifier errors remain hardware-only.
- `tests/test_nki_sim.py` — curated simulator suite (4 tests: CSR aligned + rectangular, BSR block-dense + block-diagonal). Skips cleanly off-hardware.
- `scripts/run_simulator_tests.sh` — SSM runner for the simulator suite on trn1.
- `tests/conftest.py` — registers the `nki_simulator` pytest marker.
Validation
- 37 CPU tests pass unchanged.
- 15 hardware tests on trn1.2xlarge — all green post-migration (60s).
- 4 simulator tests green on both trn1 and ubuntu-latest.
Closes #23.
trnsparse 0.3.0 — BSR is the Trainium-native sparse format
v0.3.0 reframes the library around what Trainium uniquely enables.
Trainium's Tensor Engine is a 128×128 systolic array. The natural unit of sparse work on it is not an individual nonzero — it's a 128×128 block. v0.3.0 introduces `BSRMatrix` and `bsr_spmm`, where every stored block is already a Tensor-Engine tile and maps to one `nc_matmul` call with zero gather overhead.
Added
- `BSRMatrix` — block-sparse row format at `block_size=128`. Conversions to/from `CSRMatrix` and dense.
- `bsr_spmm(A_bsr, B)` with NKI + PyTorch dispatch; NKI path wraps `_BSRSpMMFunction` (suite's second `torch.autograd.Function`-backed kernel after v0.2.0 CSR).
- Hardware-validated on `trn1.2xlarge` — 7/7 `@pytest.mark.neuron` tests including `torch.autograd.gradcheck`.
- Benchmarks populated with real trn1 numbers (`docs/benchmarks.md`).
- `sparse_add` no longer materializes an N×N dense intermediate (closes #8).
- `density_screen` test coverage (closes #10).
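The format itself is the standard BSR triple (row pointers, block-column indices, dense blocks). A small NumPy sketch of a dense-to-BSR conversion, using the scipy-style convention — not the `BSRMatrix` internals, and with a small `block_size` for illustration where the library fixes it at 128:

```python
import numpy as np

def dense_to_bsr(dense: np.ndarray, block_size: int = 128):
    """Split a dense matrix into stored nonzero tiles (BSR-style sketch).

    Returns (indptr, indices, blocks): indptr[i]:indptr[i+1] indexes the
    stored blocks of block-row i, indices holds their block-columns, and
    each stored block is a full block_size x block_size tile.
    """
    M, N = dense.shape
    mb, nb = M // block_size, N // block_size
    indptr, indices, blocks = [0], [], []
    for i in range(mb):
        for j in range(nb):
            blk = dense[i * block_size:(i + 1) * block_size,
                        j * block_size:(j + 1) * block_size]
            if np.any(blk):             # store only nonzero tiles
                indices.append(j)
                blocks.append(blk)
        indptr.append(len(indices))
    blocks_arr = (np.stack(blocks) if blocks
                  else np.empty((0, block_size, block_size)))
    return np.array(indptr), np.array(indices), blocks_arr

dense = np.zeros((4, 4))
dense[:2, :2] = 1.0                      # block-diagonal example
dense[2:, 2:] = 2.0
indptr, indices, blocks = dense_to_bsr(dense, block_size=2)
```

At `block_size=128` every entry of `blocks` is exactly one Tensor-Engine operand, which is the whole point of the format.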
Architectural thesis
Documented in the `docs/architecture.md` lede: CSR is the construction and interop format; BSR is the NKI compute format. For matrices with real block structure — Fock/ERI tensors after Schwarz screening, FEM stiffness, graph adjacencies, block-sparse attention masks — BSR is strictly preferred. For truly unstructured sparse, the `torch.sparse_csr_tensor` PyTorch fallback (v0.1.3) is already within 2× of scipy and NKI adds nothing.
Honest reading of the benchmarks
At v0.3.0 scales, NKI dispatch + compilation + HBM round-trips dominate the matmul work. BSR-NKI is ~15–25× slower than BSR-PyTorch at small sizes. The architectural wins live in follow-up issues:
- #19 — fused screen + matmul kernel (eliminates two HBM round-trips)
- #20 — on-chip iterative solvers (CG / power iteration with A SBUF-resident across iterations)
- #21 — block-sparse attention primitive (BSR is the building block for Longformer/BigBird-style sparse transformers)
Reframed
#15 (CSR row-bucketing) demoted to backlog. Under the architectural frame, the CSR path is served by the PyTorch fallback and BSR is the NKI-side story. Row-bucketing would only help if NKI 2.24 exposed an indirect-DMA primitive, which it doesn't.
trnsparse 0.2.0 — NKI SpMM validated on trn1
Phase 1 lands. First hardware-validated NKI kernel in trnsparse, and the suite's first torch.autograd.Function-wrapped NKI kernel (closes trnsci/trnsci#3 for this repo).
Added
- NKI SpMM kernel (`trnsparse/nki/kernels.py::_spmm_dense_kernel`) — stationary-A-tile-reuse GEMM on the Tensor Engine. `TILE_M = TILE_K = 128`, `TILE_N = 512`.
- Autograd wrapping (`trnsparse.nki.dispatch._SpMMFunction`) — analytic backward (`dA = dC @ Bᵀ` projected, `dB = Aᵀ @ dC`). `torch.autograd.gradcheck` passes at `atol=1e-4`.
- Dispatch wiring — `set_backend("nki")` routes `trnsparse.spmm` through the NKI path; v0.1.3 `torch.sparse` fallback unchanged otherwise.
- `tests/test_nki_spmm.py` — 8 `@pytest.mark.neuron` tests: parity across aligned + unaligned + low-density shapes, gradcheck, end-to-end `loss.backward()` smoke.
- `benchmarks/bench_spmm.py` — four-backend SpMM table (scipy / torch.sparse / trnsparse pytorch / trnsparse nki) in one pytest pass.
- `docs/benchmarks.md` populated with real `trn1.2xlarge` numbers.
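The analytic backward can be verified numerically. A NumPy sketch of the check (the library wraps the analytic formulas in `_SpMMFunction`; this only demonstrates the math):

```python
import numpy as np

# Forward: C = A @ B, with scalar loss L = sum(C * G) for upstream grad G.
# Analytic backward as stated above: dA = G @ B.T, dB = A.T @ G.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))
G = rng.standard_normal((4, 5))          # upstream gradient dL/dC

dA = G @ B.T
dB = A.T @ G

# Finite-difference check on one entry of A; L is linear in A, so the
# difference quotient matches the analytic gradient to rounding error.
eps = 1e-6
A_p = A.copy()
A_p[1, 2] += eps
num = (np.sum((A_p @ B) * G) - np.sum((A @ B) * G)) / eps
assert abs(num - dA[1, 2]) < 1e-4
```

`torch.autograd.gradcheck` automates exactly this comparison across every input entry, which is what the hardware test runs at `atol=1e-4`.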
Hardware validation (trn1.2xlarge, Neuron SDK 2.24)
| Test | Result |
|---|---|
| 5 parity cases (aligned + unaligned + density 0.01–0.1) | ✅ all pass at `atol=1e-3, rtol=1e-4` |
| `torch.autograd.gradcheck` | ✅ pass at `atol=1e-4` |
| `loss.backward()` smoke | ✅ finite gradients through full stack |
Known limits
- NKI is slower than CPU in v0.2.0 — the kernel materializes the CSR into a dense `(M, K)` tile before the matmul. At density 0.001 on `1024 × 1024`, this means ~1000× more work than scipy does. See `docs/benchmarks.md` for numbers. Sparse speedup comes from row-bucketing + gather-matmul-scatter, which is #15 / v0.3.0 / Phase 3.
- SpMV stays on PyTorch — a single-column NKI matmul doesn't amortize the compile + dispatch overhead.
trnsparse 0.1.3
Changed
- `spmv`, `spmm`, `spmv_symmetric`, and `CSRMatrix.to_dense` now lower to `torch.sparse_csr_tensor` operations instead of per-row Python loops.
- On CPU (256×256, density 0.01) the change is 26× faster for SpMV (958 μs → 37 μs) and 52–88× faster for SpMM (1.2 ms → 13–24 μs depending on RHS width), putting trnsparse's PyTorch fallback within 2× of `torch.sparse`.
Pure performance change — no API or numeric-output differences. All 25 existing tests pass unchanged. NKI backend remains scaffolded; routing lands in v0.2.0.
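The lowering target is PyTorch's native CSR tensor. A minimal sketch of the idea with assumed toy shapes (not the trnsparse internals): the three CSR arrays become one `torch.sparse_csr_tensor`, and SpMM becomes a single matmul instead of a per-row loop.

```python
import torch

# 3x3 CSR with 3 nonzeros: row 0 -> (0, 1.0), (2, 2.0); row 1 -> (1, 3.0).
crow = torch.tensor([0, 2, 3, 3])        # row pointers
col = torch.tensor([0, 2, 1])            # column indices
vals = torch.tensor([1.0, 2.0, 3.0])
A = torch.sparse_csr_tensor(crow, col, vals, size=(3, 3))

B = torch.eye(3)
C = A @ B                                # one sparse-CSR matmul, no Python loop
```

The per-row loop it replaces did the same arithmetic in interpreted Python, which is where the 26–88× came from.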
trnsparse 0.1.2
Changed
- Sync `trnsparse.__version__` with `pyproject.toml` (both now `0.1.2`). Previously `__init__.py` reported `0.1.0` while the package version was `0.1.1`.
- Docs badge in `README.md` and `site_url` in `mkdocs.yml` point at `trnsci.dev/trnsparse/` instead of `trnsci.github.io/trnsparse/`. Per-repo GitHub Pages is superseded by the centralized trnsci.dev site.
- `docs/architecture.md` clarifies that the NKI backend is scaffolded only — the PyTorch path runs regardless of `set_backend` in v0.1.x. Routing + on-hardware validation land in v0.2.0.
Closes #5.
trnsparse 0.1.1
Added
- mkdocs site with `index`, `installation`, `quickstart`, `api`, `architecture`, `aws_setup`
- `infra/terraform/` for on-hardware CI instance provisioning
- `scripts/run_neuron_tests.sh` and benchmark helpers
- GitHub Actions `ci.yml` for CPU-only pytest matrix
- `Issues` URL in `pyproject.toml`
Changed
- Bumped `neuronxcc` floor from `>=2.15` to `>=2.24` to unify with the rest of the trnsci suite. `torch-neuronx` floor bumped to `>=2.9`.