Releases: trnsci/trnsparse
trnsparse 0.4.2 — block-sparse attention primitive
What's new
- `examples/block_sparse_attention.py` — block-sparse attention reference using `BSRMatrix` + `bsr_spmm`. Three mask patterns (local window, dilated, global tokens); verifies against a dense reference; reports block density and timing for the `bsr_spmm` step. Closes #21.
- `docs/sparse_attention.md` — writeup: how BSR-128 maps to Longformer/BigBird-style attention masks, block density arithmetic, pattern construction helpers, and the fused-tile follow-up (#25).
- `tests/test_attention.py` — 8 CPU tests: mask shape/symmetry checks + parity against the dense reference at `atol=1e-4` for all three patterns and the full-attention edge case.
- `mkdocs.yml` — add Iterative Solvers (was missing from nav) and Sparse Attention.
Notes
No API changes, no kernel changes. The claim in #21: `bsr_spmm` is the block-sparse attention primitive; `BSRMatrix` captures the mask.
Sparse attention becomes natural at exactly 128-token granularity — the Tensor Engine tile. Trainium was built for attention, and that design transfers to any block-128-structured sparse workload.
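The local-window pattern reduces to a boolean mask over 128×128 blocks. A minimal NumPy sketch of that construction — the helper name and signature are hypothetical, not the API of `examples/block_sparse_attention.py`:

```python
import numpy as np

def local_window_block_mask(seq_len: int, block_size: int = 128,
                            window_blocks: int = 1) -> np.ndarray:
    """Boolean block mask for local-window attention (illustrative sketch).

    Block (i, j) is kept when |i - j| <= window_blocks, i.e. each
    block_size-token block attends only to its neighbors in the window.
    """
    n_blocks = seq_len // block_size
    idx = np.arange(n_blocks)
    return np.abs(idx[:, None] - idx[None, :]) <= window_blocks

mask = local_window_block_mask(seq_len=1024, block_size=128, window_blocks=1)
density = mask.sum() / mask.size   # 22 of 64 blocks stored for an 8-block band
```

Block density is what the example reports: only the `True` blocks are stored and dispatched to the Tensor Engine.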
Install
```
pip install trnsparse==0.4.2
```
Full changelog: CHANGELOG.md
trnsparse 0.4.1 — screened Fock + PySCF integration examples
Closes #6 and #13. No API changes, no kernel changes — v0.4.0 users who don't need the examples can stay on v0.4.0.
Added
- `examples/sparse_fock.py` — rewritten around v0.4.0's `screened_spmm`. Three paths side-by-side on the same inputs:
- v0.1.x unfused flow (`schwarz_bounds → screen → from_dense → spmm`)
- v0.4.0 fused `screened_spmm` (one call)
- Full Fock build — the Coulomb matrix from path 2 contracted against MO coefficients via `trnblas.gemm` for `F_MO = C.T @ J @ C`. Optional dep on trnblas; falls back to `torch.matmul` otherwise.
On a 50-basis synthetic system, the fused path is ~130× faster than the unfused (dominated by eliminating the Python `from_dense` CSR construction).
- `examples/pyscf_bridge.py` (new) — optional PySCF-driven demo. Builds H₂O (or benzene, or H₂), extracts real AO ERIs via `mol.intor("int2e")`, feeds the `(μμ|μμ)` diagonal into `schwarz_bounds` + `screened_spmm` against a mock density matrix. Requires `pip install pyscf`; tests skip cleanly if not available.
- `tests/test_examples.py` — 2 CPU smoke tests plus a PySCF-gated test. Exercises the unfused + fused paths end-to-end and asserts parity (`atol=1e-6`).
Validation
- 51 CPU tests pass (49 existing + 2 new example tests); 1 PySCF test skips cleanly.
- No hardware or simulator regression.
trnsparse 0.4.0 — fused Schwarz-screened SpMM
Closes #19.
Added
- `screened_spmm(A, diag_integrals, B, threshold)` — fused Schwarz-screened dense matmul. One NKI kernel fuses the full pipeline (outer-product pair bound → threshold → mask-apply → `nc_matmul`) into a single dispatch. Saves ~30–50% end-to-end vs the unfused `density_screen + from_dense + spmm` flow on Fock-build-sized inputs.
- `_screened_spmm_kernel` (`@nki.jit`) — stationary-A-tile-reuse GEMM extended with a per-tile pair-bound mask built from the 1-D Schwarz-bound vector.
- `_ScreenedSpMMFunction` — `torch.autograd.Function` wrapper. Third differentiable NKI kernel in the trnsci suite (after v0.2.0 CSR SpMM and v0.3.0 BSR SpMM). `torch.autograd.gradcheck` passes at `atol=1e-4` on hardware.
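In dense NumPy terms, the fused pipeline is equivalent to the following reference. This is a sketch of the semantics described above (outer-product pair bound → threshold → mask-apply → matmul), not the NKI kernel; the square-root-of-diagonal pair bound is the standard Schwarz estimate and is assumed here:

```python
import numpy as np

def screened_spmm_reference(A, diag_integrals, B, threshold):
    """Dense NumPy reference for the fused Schwarz-screened matmul (sketch).

    Assumed semantics: the pair bound is the outer product of
    sqrt(diag_integrals) with itself; entries of A whose bound falls
    below the threshold are masked to zero before the matmul.
    """
    q = np.sqrt(diag_integrals)        # 1-D Schwarz-bound vector (length M)
    pair_bound = np.outer(q, q)        # (M, M) outer-product pair bound
    mask = pair_bound >= threshold     # keep only pairs above threshold
    return (A * mask) @ B              # masked square A times dense B

A = np.ones((4, 4))
Q = np.array([1.0, 1.0, 1e-8, 1.0])   # tiny diagonal integral -> screened out
C = screened_spmm_reference(A, Q, np.eye(4), threshold=0.5)
# row 2 and column 2 of C are zeroed by the screen
```

The kernel fuses all four steps into one dispatch; this reference is only useful for checking parity on CPU.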
Validation
| Surface | Tests | Result |
|---|---|---|
| CPU suite | 4 `TestScreenedSpmm` | ✅ |
| Simulator (ubuntu-latest CI + trn1) | 2 `TestScreenedSpmmSimulator` | ✅ |
| Hardware (trn1.2xlarge) | 7 `TestNkiScreenedSpmmParity` + `TestNkiScreenedSpmmDifferentiability` | ✅ |
Total: 49 CPU + 23 hardware tests green; no regression across the suite.
Also closed this session
- #24 — the fused CG NKI kernel is not buildable under NKI 2.24/0.3.0 (no `break`, no iteration-carried scalar state across `affine_range`, no nested kernels). A per-iteration `_cg_step_kernel` reframe was evaluated and found to save only 5–20%, so the issue was closed honestly. If upstream NKI gains persistent SBUF across calls or in-kernel control flow, the whole-loop CG kernel can be reopened.
Known limits
- `screened_spmm` is currently restricted to square A (M == K) with a 1-D `diag_integrals` — the common Fock-build case. A rectangular / asymmetric-bounds extension is a follow-up if asked for.
trnsparse 0.3.2 — CG + power iteration on BSR
Phase 1 plumbing for #22 (on-chip iterative solvers): Python-level CG and power iteration on top of `bsr_spmm`.
Added
- `cg_bsr(A, b, x0, tol, max_iter, M=None) -> (x, iters, rel)` — Conjugate Gradient for SPD BSR matrices. Takes an optional preconditioner.
- `power_iteration_bsr(A, v0, max_iter, tol) -> (lam, v, iters)` — dominant eigenpair via power iteration.
- `jacobi_preconditioner_bsr(A)` — diagonal preconditioner builder.
- `bsr_diagonal(A)` — main-diagonal extractor.
- `docs/iterative_solvers.md` — design note covering the v0.3.2 plumbing and the v0.4.0 fused-kernel goal.
- `tests/test_iterative.py` (8 tests, scipy parity at `atol=1e-4`).
- `benchmarks/bench_iterative.py` — at 128×128 SPD: scipy 310 μs, trnsparse 369 μs (1.19×).
Not in this release
The architectural win from #22's acceptance list — keeping A SBUF-resident across all iterations — requires a fused NKI kernel that wraps the CG loop; that's tracked in #24 for v0.4.0. Today each CG iteration dispatches one `bsr_spmm`, so A round-trips to HBM every iteration on the NKI path. The API stays stable across the v0.4.0 transition — users get the fused-kernel speedup automatically.
Validation
- 45/45 CPU tests pass (37 existing + 8 new iterative).
- Simulator CI (`nki-simulator`) green on ubuntu-latest.
- No NKI code changes — `bsr_spmm` unchanged; hardware re-validation not required.
trnsparse 0.3.1 — NKI 0.3.0 namespace + simulator dispatch
Clears `[Unreleased]` with the three-commit trnsparse#23 landing. No kernel-body changes; no public API changes; no numeric drift. Upgrade is safe for anyone already on v0.3.0.
Changed
- Migrated NKI imports to the `nki.*` namespace (NKI 0.3.0 Stable, Neuron SDK 2.29, April 2026). The legacy `neuronxcc.nki.*` shim is no longer used.
- `pyproject.toml` `[neuron]` extra gains `nki>=0.3.0` alongside `neuronxcc>=2.24` and `torch-neuronx>=2.9`. Hosts without an `nki` wheel (macOS, non-Linux archs) still hit `HAS_NKI=False` and get the torch fallback.
- The `test` CI job now filters `-m "not neuron and not nki_simulator"` so each test runs in exactly one job.
Added
- `TRNSPARSE_USE_SIMULATOR=1` dispatch branch through `nki.simulate(kernel)(np_args)`. Bypasses torch_xla + NEFF compile; kernels run on CPU for correctness iteration. Hardware still owns perf numbers.
- `nki-simulator` CI job on `ubuntu-latest` — installs `nki>=0.3.0` from the AWS pip index, runs the simulator suite on every push/PR. A kernel-correctness gate without AWS cost. Catches Python-trace-level errors (bad kwargs, dropped ops, shape mismatches); MLIR verifier errors remain hardware-only.
- `tests/test_nki_sim.py` — curated simulator suite (4 tests: CSR aligned + rectangular, BSR block-dense + block-diagonal). Skips cleanly off-hardware.
- `scripts/run_simulator_tests.sh` — SSM runner for the simulator suite on trn1.
- `tests/conftest.py` — registers the `nki_simulator` pytest marker.
Validation
- 37 CPU tests pass unchanged.
- 15 hardware tests on trn1.2xlarge — all green post-migration (60s).
- 4 simulator tests green on both trn1 and ubuntu-latest.
Closes #23.
trnsparse 0.3.0 — BSR is the Trainium-native sparse format
v0.3.0 reframes the library around what Trainium uniquely enables.
Trainium's Tensor Engine is a 128×128 systolic array. The natural unit of sparse work on it is not an individual nonzero — it's a 128×128 block. v0.3.0 introduces `BSRMatrix` and `bsr_spmm`, where every stored block is already a Tensor-Engine tile and maps to one `nc_matmul` call with zero gather overhead.
Added
- `BSRMatrix` — block-sparse row format at `block_size=128`. Conversions to/from `CSRMatrix` and dense.
- `bsr_spmm(A_bsr, B)` with NKI + PyTorch dispatch; NKI path wraps `_BSRSpMMFunction` (suite's second `torch.autograd.Function`-backed kernel after v0.2.0 CSR).
- Hardware-validated on `trn1.2xlarge` — 7/7 `@pytest.mark.neuron` tests including `torch.autograd.gradcheck`.
- Benchmarks populated with real trn1 numbers (`docs/benchmarks.md`).
- `sparse_add` no longer materializes an N×N dense intermediate (closes #8).
- `density_screen` test coverage (closes #10).
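The format itself is the standard BSR triple (row pointers, block-column indices, dense blocks). A small NumPy sketch of a dense-to-BSR conversion, using the scipy-style convention — not the `BSRMatrix` internals, and with a small `block_size` for illustration where the library fixes it at 128:

```python
import numpy as np

def dense_to_bsr(dense: np.ndarray, block_size: int = 128):
    """Split a dense matrix into stored nonzero tiles (BSR-style sketch).

    Returns (indptr, indices, blocks): indptr[i]:indptr[i+1] indexes the
    stored blocks of block-row i, indices holds their block-columns, and
    each stored block is a full block_size x block_size tile.
    """
    M, N = dense.shape
    mb, nb = M // block_size, N // block_size
    indptr, indices, blocks = [0], [], []
    for i in range(mb):
        for j in range(nb):
            blk = dense[i * block_size:(i + 1) * block_size,
                        j * block_size:(j + 1) * block_size]
            if np.any(blk):             # store only nonzero tiles
                indices.append(j)
                blocks.append(blk)
        indptr.append(len(indices))
    blocks_arr = (np.stack(blocks) if blocks
                  else np.empty((0, block_size, block_size)))
    return np.array(indptr), np.array(indices), blocks_arr

dense = np.zeros((4, 4))
dense[:2, :2] = 1.0                      # block-diagonal example
dense[2:, 2:] = 2.0
indptr, indices, blocks = dense_to_bsr(dense, block_size=2)
```

At `block_size=128` every entry of `blocks` is exactly one Tensor-Engine operand, which is the whole point of the format.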
Architectural thesis
Documented in the `docs/architecture.md` lede: CSR is the construction and interop format; BSR is the NKI compute format. For matrices with real block structure — Fock/ERI tensors after Schwarz screening, FEM stiffness, graph adjacencies, block-sparse attention masks — BSR is strictly preferred. For truly unstructured sparse, the `torch.sparse_csr_tensor` PyTorch fallback (v0.1.3) is already within 2× of scipy and NKI adds nothing.
Honest reading of the benchmarks
At v0.3.0 scales, NKI dispatch + compilation + HBM round-trips dominate the matmul work. BSR-NKI is ~15–25× slower than BSR-PyTorch at small sizes. The architectural wins live in follow-up issues:
- #19 — fused screen + matmul kernel (eliminates two HBM round-trips)
- #20 — on-chip iterative solvers (CG / power iteration with A SBUF-resident across iterations)
- #21 — block-sparse attention primitive (BSR is the building block for Longformer/BigBird-style sparse transformers)
Reframed
#15 (CSR row-bucketing) demoted to backlog. Under the architectural frame, the CSR path is served by the PyTorch fallback and BSR is the NKI-side story. Row-bucketing would only help if NKI 2.24 exposed an indirect-DMA primitive, which it doesn't.
trnsparse 0.2.0 — NKI SpMM validated on trn1
Phase 1 lands. First hardware-validated NKI kernel in trnsparse, and the suite's first torch.autograd.Function-wrapped NKI kernel (closes trnsci/trnsci#3 for this repo).
Added
- NKI SpMM kernel (`trnsparse/nki/kernels.py::_spmm_dense_kernel`) — stationary-A-tile-reuse GEMM on the Tensor Engine. `TILE_M = TILE_K = 128`, `TILE_N = 512`.
- Autograd wrapping (`trnsparse.nki.dispatch._SpMMFunction`) — analytic backward (`dA = dC @ Bᵀ` projected, `dB = Aᵀ @ dC`). `torch.autograd.gradcheck` passes at `atol=1e-4`.
- Dispatch wiring — `set_backend("nki")` routes `trnsparse.spmm` through the NKI path; v0.1.3 `torch.sparse` fallback unchanged otherwise.
- `tests/test_nki_spmm.py` — 8 `@pytest.mark.neuron` tests: parity across aligned + unaligned + low-density shapes, gradcheck, end-to-end `loss.backward()` smoke.
- `benchmarks/bench_spmm.py` — four-backend SpMM table (scipy / torch.sparse / trnsparse pytorch / trnsparse nki) in one pytest pass.
- `docs/benchmarks.md` populated with real `trn1.2xlarge` numbers.
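The analytic backward can be verified numerically. A NumPy sketch of the check (the library wraps the analytic formulas in `_SpMMFunction`; this only demonstrates the math):

```python
import numpy as np

# Forward: C = A @ B, with scalar loss L = sum(C * G) for upstream grad G.
# Analytic backward as stated above: dA = G @ B.T, dB = A.T @ G.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))
G = rng.standard_normal((4, 5))          # upstream gradient dL/dC

dA = G @ B.T
dB = A.T @ G

# Finite-difference check on one entry of A; L is linear in A, so the
# difference quotient matches the analytic gradient to rounding error.
eps = 1e-6
A_p = A.copy()
A_p[1, 2] += eps
num = (np.sum((A_p @ B) * G) - np.sum((A @ B) * G)) / eps
assert abs(num - dA[1, 2]) < 1e-4
```

`torch.autograd.gradcheck` automates exactly this comparison across every input entry, which is what the hardware test runs at `atol=1e-4`.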
Hardware validation (trn1.2xlarge, Neuron SDK 2.24)
| Test | Result |
|---|---|
| 5 parity cases (aligned + unaligned + density 0.01–0.1) | ✅ all pass at `atol=1e-3, rtol=1e-4` |
| `torch.autograd.gradcheck` | ✅ pass at `atol=1e-4` |
| `loss.backward()` smoke | ✅ finite gradients through full stack |
Known limits
- NKI is slower than CPU in v0.2.0 — the kernel materializes the CSR into a dense `(M, K)` tile before the matmul. At density 0.001 on `1024 × 1024`, this means ~1000× more work than scipy does. See `docs/benchmarks.md` for numbers. Sparse speedup comes from row-bucketing + gather-matmul-scatter, which is #15 / v0.3.0 / Phase 3.
- SpMV stays on PyTorch — a single-column NKI matmul doesn't amortize the compile + dispatch overhead.
trnsparse 0.1.3
Changed
- `spmv`, `spmm`, `spmv_symmetric`, and `CSRMatrix.to_dense` now lower to `torch.sparse_csr_tensor` operations instead of per-row Python loops.
- On CPU (256×256, density 0.01) the change is 26× faster for SpMV (958 μs → 37 μs) and 52–88× faster for SpMM (1.2 ms → 13–24 μs depending on RHS width), putting trnsparse's PyTorch fallback within 2× of `torch.sparse`.
Pure performance change — no API or numeric-output differences. All 25 existing tests pass unchanged. NKI backend remains scaffolded; routing lands in v0.2.0.
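The lowering target is PyTorch's native CSR tensor. A minimal sketch of the idea with assumed toy shapes (not the trnsparse internals): the three CSR arrays become one `torch.sparse_csr_tensor`, and SpMM becomes a single matmul instead of a per-row loop.

```python
import torch

# 3x3 CSR with 3 nonzeros: row 0 -> (0, 1.0), (2, 2.0); row 1 -> (1, 3.0).
crow = torch.tensor([0, 2, 3, 3])        # row pointers
col = torch.tensor([0, 2, 1])            # column indices
vals = torch.tensor([1.0, 2.0, 3.0])
A = torch.sparse_csr_tensor(crow, col, vals, size=(3, 3))

B = torch.eye(3)
C = A @ B                                # one sparse-CSR matmul, no Python loop
```

The per-row loop it replaces did the same arithmetic in interpreted Python, which is where the 26–88× came from.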
trnsparse 0.1.2
Changed
- Sync `trnsparse.__version__` with `pyproject.toml` (both now `0.1.2`). Previously `__init__.py` reported `0.1.0` while the package version was `0.1.1`.
- Docs badge in `README.md` and `site_url` in `mkdocs.yml` point at `trnsci.dev/trnsparse/` instead of `trnsci.github.io/trnsparse/`. Per-repo GitHub Pages is superseded by the centralized trnsci.dev site.
- `docs/architecture.md` clarifies that the NKI backend is scaffolded only — the PyTorch path runs regardless of `set_backend` in v0.1.x. Routing + on-hardware validation land in v0.2.0.
Closes #5.
trnsparse 0.1.1
Added
- mkdocs site with `index`, `installation`, `quickstart`, `api`, `architecture`, `aws_setup`
- `infra/terraform/` for on-hardware CI instance provisioning
- `scripts/run_neuron_tests.sh` and benchmark helpers
- GitHub Actions `ci.yml` for CPU-only pytest matrix
- `Issues` URL in `pyproject.toml`
Changed
- Bumped `neuronxcc` floor from `>=2.15` to `>=2.24` to unify with the rest of the trnsci suite. `torch-neuronx` floor bumped to `>=2.9`.