
Releases: trnsci/trnsparse

trnsparse 0.4.2 — block-sparse attention primitive

16 Apr 20:29


What's new

  • `examples/block_sparse_attention.py` — block-sparse attention reference using `BSRMatrix` + `bsr_spmm`. Three mask patterns (local window, dilated, global tokens); verifies against a dense reference; reports block density and timing for the `bsr_spmm` step. Closes #21.
  • `docs/sparse_attention.md` — writeup: how BSR-128 maps to Longformer/BigBird-style attention masks, block density arithmetic, pattern construction helpers, and the fused-tile follow-up (#25).
  • `tests/test_attention.py` — 8 CPU tests: mask shape/symmetry checks + parity against a dense reference at `atol=1e-4` for all three patterns and the full-attention edge case.
  • `mkdocs.yml`: add Iterative Solvers (was missing from nav) and Sparse Attention.
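The block density arithmetic for a local-window pattern is easy to sketch. A minimal illustration (the helper name is illustrative, not trnsparse API): one `True` entry per stored 128×128 BSR block, i.e. one Tensor Engine tile of attention work.

```python
import numpy as np

def local_window_block_mask(n_blocks, window=1):
    """Hypothetical helper: True where block |i - j| <= window.

    Each True entry corresponds to one stored 128x128 BSR block.
    """
    idx = np.arange(n_blocks)
    return np.abs(idx[:, None] - idx[None, :]) <= window

# A 4096-token sequence at BSR-128 granularity is a 32x32 block grid.
mask = local_window_block_mask(4096 // 128, window=1)
density = mask.sum() / mask.size  # 94 of 1024 blocks stored (~9.2%)
```

The dilated and global-token patterns differ only in which block coordinates are set; the density bookkeeping is identical.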

Notes

No API changes, no kernel changes. This release backs the claim made in #21: `bsr_spmm` is the block-sparse attention primitive; `BSRMatrix` captures the mask.

The 128-token granularity at which sparse attention is naturally expressed is exactly the Tensor Engine tile. Trainium was built for attention, and that design transfers to any block-128-structured sparse workload.

Install

pip install trnsparse==0.4.2

Full changelog: CHANGELOG.md

trnsparse 0.4.1 — screened Fock + PySCF integration examples

15 Apr 01:35


Closes #6 and #13. No API changes, no kernel changes — v0.4.0 users who don't need the examples can stay on v0.4.0.

Added

  • `examples/sparse_fock.py` — rewritten around v0.4.0's `screened_spmm`. Three paths side-by-side on the same inputs:

    1. v0.1.x unfused flow (`schwarz_bounds → screen → from_dense → spmm`)
    2. v0.4.0 fused `screened_spmm` (one call)
    3. Full Fock build — the Coulomb matrix from path 2 contracted against MO coefficients via `trnblas.gemm` for `F_MO = C.T @ J @ C`. Optional dep on trnblas; falls back to `torch.matmul` otherwise.

    On a 50-basis synthetic system, the fused path is ~130× faster than the unfused (dominated by eliminating the Python `from_dense` CSR construction).

  • `examples/pyscf_bridge.py` (new) — optional PySCF-driven demo. Builds H₂O (or benzene, or H₂), extracts real AO ERIs via `mol.intor("int2e")`, feeds the `(μμ|μμ)` diagonal into `schwarz_bounds` + `screened_spmm` against a mock density matrix. Requires `pip install pyscf`; tests skip cleanly if not available.

  • `tests/test_examples.py` — 2 CPU smoke tests plus a PySCF-gated test. Exercises the unfused + fused paths end-to-end and asserts parity (`atol=1e-6`).
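The optional-dependency pattern in path 3 is worth a sketch. This assumes `trnblas.gemm(A, B)` has plain matmul semantics (an assumption — only the call site is named in these notes); without trnblas installed, the fallback is numerically identical.

```python
import torch

try:
    import trnblas  # optional dep; gemm(A, B) matmul semantics assumed
    _gemm = trnblas.gemm
except ImportError:
    _gemm = torch.matmul  # portable fallback

def fock_mo(C, J):
    """Contract the Coulomb matrix into the MO basis: F_MO = C.T @ J @ C."""
    return _gemm(_gemm(C.t(), J), C)
```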

Validation

  • 51 CPU tests pass (49 existing + 2 new example tests); 1 PySCF test skips cleanly.
  • No hardware or simulator regression.

trnsparse 0.4.0 — fused Schwarz-screened SpMM

15 Apr 00:34


Closes #19.

Added

  • `screened_spmm(A, diag_integrals, B, threshold)` — fused Schwarz-screened dense matmul. One NKI kernel fuses the full pipeline (outer-product pair bound → threshold → mask-apply → `nc_matmul`) into a single dispatch. Saves ~30–50% end-to-end vs the unfused `density_screen + from_dense + spmm` flow on Fock-build-sized inputs.
  • `_screened_spmm_kernel` (`@nki.jit`) — stationary-A-tile-reuse GEMM extended with a per-tile pair-bound mask built from the 1-D Schwarz-bound vector.
  • `_ScreenedSpMMFunction` — `torch.autograd.Function` wrapper. Third differentiable NKI kernel in the trnsci suite (after v0.2.0 CSR SpMM and v0.3.0 BSR SpMM). `torch.autograd.gradcheck` passes at `atol=1e-4` on hardware.
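The pipeline the kernel fuses can be written out unfused in a few lines. A dense reference sketch, assuming the pair bound is the outer product of the 1-D Schwarz vector (the function name here is illustrative, not the library's):

```python
import torch

def screened_spmm_reference(A, diag_integrals, B, threshold):
    """Unfused reference for the screened matmul.

    Assumes the pair bound is diag_integrals[i] * diag_integrals[j];
    entries of A whose bound falls below `threshold` are zeroed
    before the matmul. The NKI kernel fuses bound -> threshold ->
    mask-apply -> matmul into a single dispatch.
    """
    bound = torch.outer(diag_integrals, diag_integrals)  # (M, M); needs M == K
    mask = (bound >= threshold).to(A.dtype)
    return (A * mask) @ B
```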

Validation

| Surface | Tests | Result |
| --- | --- | --- |
| CPU suite | 4 (`TestScreenedSpmm`) | ✅ |
| Simulator (ubuntu-latest CI + trn1) | 2 (`TestScreenedSpmmSimulator`) | ✅ |
| Hardware (trn1.2xlarge) | 7 (`TestNkiScreenedSpmmParity` + `TestNkiScreenedSpmmDifferentiability`) | ✅ |

Total: 49 CPU + 23 hardware tests green; no regression across the suite.

Also closed this session

  • #24 — fused CG NKI kernel not buildable under NKI 2.24/0.3.0 (no `break`, no iteration-carried scalar state across `affine_range`, no nested kernels). Per-iteration `_cg_step_kernel` reframe evaluated and found to save only 5–20% — closed honestly. If upstream NKI gains persistent-SBUF-across-calls or in-kernel control flow, the whole-loop CG kernel can be reopened.

Known limits

  • `screened_spmm` is currently restricted to square A (M == K) with 1-D `diag_integrals` — the common Fock-build case. A rectangular / asymmetric-bounds extension is a follow-up if asked for.

trnsparse 0.3.2 — CG + power iteration on BSR

14 Apr 23:18


Phase 1 plumbing for the #22 on-chip iterative solvers: Python-level CG and power iteration on top of `bsr_spmm`.

Added

  • `cg_bsr(A, b, x0, tol, max_iter, M=None) -> (x, iters, rel)` — Conjugate Gradient for SPD BSR matrices. Takes an optional preconditioner.
  • `power_iteration_bsr(A, v0, max_iter, tol) -> (lam, v, iters)` — dominant eigenpair via power iteration.
  • `jacobi_preconditioner_bsr(A)` — diagonal preconditioner builder.
  • `bsr_diagonal(A)` — main-diagonal extractor.
  • `docs/iterative_solvers.md` — design note covering the v0.3.2 plumbing and the v0.4.0 fused-kernel goal.
  • `tests/test_iterative.py` (8 tests, scipy parity at `atol=1e-4`).
  • `benchmarks/bench_iterative.py` — at 128×128 SPD: scipy 310 μs, trnsparse 369 μs (1.19×).
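For readers unfamiliar with the algorithm, a plain-NumPy preconditioned CG mirroring `cg_bsr`'s signature shape — a reference sketch over a matvec callable, not the library's implementation:

```python
import numpy as np

def cg(matvec, b, x0=None, tol=1e-8, max_iter=200, M=None):
    """Preconditioned Conjugate Gradient for SPD operators.

    matvec: x -> A @ x; M: optional preconditioner r -> M^{-1} r.
    Returns (x, iters, rel_residual), mirroring cg_bsr's return shape.
    """
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - matvec(x)
    z = M(r) if M else r
    p = z.copy()
    rz = r @ z
    b_norm = np.linalg.norm(b)
    rel = np.linalg.norm(r) / b_norm
    for k in range(1, max_iter + 1):
        Ap = matvec(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rel = np.linalg.norm(r) / b_norm
        if rel < tol:
            return x, k, rel
        z = M(r) if M else r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter, rel
```

Swapping `matvec` for a closure over `bsr_spmm` gives the shape of the v0.3.2 plumbing: one SpMM dispatch per iteration.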

Not in this release

The architectural win from #22's acceptance list — keeping A SBUF-resident across all iterations — requires a fused NKI kernel that wraps the CG loop. That's tracked in #24 for v0.4.0. Today each CG iteration dispatches one `bsr_spmm`, so A round-trips to HBM per iteration on the NKI path. The API stays stable across the v0.4.0 transition, so users get the fused-kernel speedup automatically.

Validation

  • 45/45 CPU tests pass (37 existing + 8 new iterative).
  • Simulator CI (`nki-simulator`) green on ubuntu-latest.
  • No NKI code changes — `bsr_spmm` unchanged; hardware re-validation not required.

Closes #22 Phase 1. v0.4.0 tracker: #24.

trnsparse 0.3.1 — NKI 0.3.0 namespace + simulator dispatch

14 Apr 21:08


Clears `[Unreleased]` with the three-commit trnsparse#23 landing. No kernel-body changes; no public API changes; no numeric drift. Upgrade is safe for anyone already on v0.3.0.

Changed

  • Migrated NKI imports to the `nki.*` namespace (NKI 0.3.0 Stable, Neuron SDK 2.29, April 2026). The legacy `neuronxcc.nki.*` shim is no longer used. The `pyproject.toml` `[neuron]` extra gains `nki>=0.3.0` alongside `neuronxcc>=2.24` and `torch-neuronx>=2.9`. Hosts without an `nki` wheel (macOS, non-Linux archs) still hit `HAS_NKI=False` and get the torch fallback.
  • The `test` CI job now filters with `-m "not neuron and not nki_simulator"` so each test runs in exactly one job.

Added

  • `TRNSPARSE_USE_SIMULATOR=1` dispatch branch through `nki.simulate(kernel)(np_args)`. Bypasses torch_xla + NEFF compile; kernels run on CPU for correctness iteration. Hardware still owns perf numbers.
  • `nki-simulator` CI job on ubuntu-latest — installs `nki>=0.3.0` from the AWS pip index, runs the simulator suite on every push/PR. Kernel correctness gate without AWS cost. Catches Python-trace-level errors (bad kwargs, dropped ops, shape mismatches); MLIR verifier errors remain hardware-only.
  • `tests/test_nki_sim.py` — curated simulator suite (4 tests: CSR aligned + rectangular, BSR block-dense + block-diagonal). Skips cleanly off-hardware.
  • `scripts/run_simulator_tests.sh` — SSM runner for the simulator suite on trn1.
  • `tests/conftest.py` — registers the `nki_simulator` pytest marker.
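The shape of the dispatch branch, sketched. This is illustrative — the real dispatch lives inside `trnsparse.nki.dispatch`, and the function name here is hypothetical:

```python
import os

def run_kernel(kernel, *np_args):
    """Route through the NKI simulator when TRNSPARSE_USE_SIMULATOR=1.

    The simulator path imports nki lazily and runs the kernel on CPU,
    bypassing torch_xla + NEFF compilation entirely.
    """
    if os.environ.get("TRNSPARSE_USE_SIMULATOR") == "1":
        import nki  # only required on the simulator path
        return nki.simulate(kernel)(*np_args)
    return kernel(*np_args)  # hardware path (compiled dispatch, elided here)
```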

Validation

  • 37 CPU tests pass unchanged.
  • 15 hardware tests on trn1.2xlarge — all green post-migration (60s).
  • 4 simulator tests green on both trn1 and ubuntu-latest.

Closes #23.

trnsparse 0.3.0 — BSR is the Trainium-native sparse format

13 Apr 21:43


v0.3.0 reframes the library around what Trainium uniquely enables.

Trainium's Tensor Engine is a 128×128 systolic array. The natural unit of sparse work on it is not an individual nonzero — it's a 128×128 block. v0.3.0 introduces `BSRMatrix` and `bsr_spmm`, where every stored block is already a Tensor-Engine tile and maps to one `nc_matmul` call with zero gather overhead.
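A minimal NumPy sketch of the layout and why it eliminates gathers — each stored block is one contiguous tile, and SpMM is just one dense matmul per block (the names here are illustrative, not the library's `BSRMatrix`):

```python
import numpy as np

BLOCK = 128  # one Tensor Engine tile

def bsr_from_dense(A):
    """Store only the nonzero 128x128 blocks of A in block-sparse-row layout."""
    nbr, nbc = A.shape[0] // BLOCK, A.shape[1] // BLOCK
    indptr, indices, blocks = [0], [], []
    for i in range(nbr):
        for j in range(nbc):
            blk = A[i*BLOCK:(i+1)*BLOCK, j*BLOCK:(j+1)*BLOCK]
            if np.any(blk):
                indices.append(j)
                blocks.append(blk)
        indptr.append(len(indices))
    return np.array(indptr), np.array(indices), np.stack(blocks)

def bsr_spmm_ref(indptr, indices, blocks, B):
    """Each stored block contributes one dense (128 x N) matmul, no gather."""
    C = np.zeros((BLOCK * (len(indptr) - 1), B.shape[1]))
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            C[i*BLOCK:(i+1)*BLOCK] += blocks[k] @ B[j*BLOCK:(j+1)*BLOCK]
    return C
```

On hardware, each `blocks[k] @ B[...]` in the inner loop becomes one `nc_matmul` dispatch.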

Added

  • `BSRMatrix` — block-sparse row format at `block_size=128`. Conversions to/from `CSRMatrix` and dense.
  • `bsr_spmm(A_bsr, B)` with NKI + PyTorch dispatch; the NKI path wraps `_BSRSpMMFunction` (the suite's second `torch.autograd.Function`-backed kernel after v0.2.0 CSR).
  • Hardware-validated on trn1.2xlarge — 7/7 `@pytest.mark.neuron` tests including `torch.autograd.gradcheck`.
  • Benchmarks populated with real trn1 numbers (`docs/benchmarks.md`).
  • `sparse_add` no longer materializes an N×N dense intermediate (closes #8).
  • `density_screen` test coverage (closes #10).

Architectural thesis

Documented in the `docs/architecture.md` lede: CSR is the construction and interop format; BSR is the NKI compute format. For matrices with real block structure — Fock/ERI tensors after Schwarz screening, FEM stiffness, graph adjacencies, block-sparse attention masks — BSR is strictly preferred. For truly unstructured sparse, the `torch.sparse_csr_tensor` PyTorch fallback (v0.1.3) is already within 2× of scipy, and NKI adds nothing.

Honest reading of the benchmarks

At v0.3.0 scales, NKI dispatch + compilation + HBM round-trips dominate the matmul work. BSR-NKI is ~15–25× slower than BSR-PyTorch at small sizes. The architectural wins live in follow-up issues:

  • #19 — fused screen + matmul kernel (eliminates two HBM round-trips)
  • #20 — on-chip iterative solvers (CG / power iteration with A SBUF-resident across iterations)
  • #21 — block-sparse attention primitive (BSR is the building block for Longformer/BigBird-style sparse transformers)

Closed

#8, #9, #10, #12, #18.

Reframed

#15 (CSR row-bucketing) demoted to backlog. Under the architectural frame, the CSR path is served by the PyTorch fallback and BSR is the NKI-side story. Row-bucketing would only help if NKI 2.24 exposed an indirect-DMA primitive, which it doesn't.

trnsparse 0.2.0 — NKI SpMM validated on trn1

13 Apr 20:51


Phase 1 lands. First hardware-validated NKI kernel in trnsparse, and the suite's first torch.autograd.Function-wrapped NKI kernel (closes trnsci/trnsci#3 for this repo).

Added

  • NKI SpMM kernel (`trnsparse/nki/kernels.py::_spmm_dense_kernel`) — stationary A-tile-reuse GEMM on the Tensor Engine. `TILE_M = TILE_K = 128`, `TILE_N = 512`.
  • Autograd wrapping (`trnsparse.nki.dispatch._SpMMFunction`) — analytic backward (`dA = dC @ Bᵀ` projected, `dB = Aᵀ @ dC`). `torch.autograd.gradcheck` passes at `atol=1e-4`.
  • Dispatch wiring — `set_backend("nki")` routes `trnsparse.spmm` through the NKI path; the v0.1.3 torch.sparse fallback is unchanged otherwise.
  • `tests/test_nki_spmm.py` — 8 `@pytest.mark.neuron` tests: parity across aligned + unaligned + low-density shapes, gradcheck, end-to-end `loss.backward()` smoke.
  • `benchmarks/bench_spmm.py` — four-backend SpMM table (scipy / torch.sparse / trnsparse pytorch / trnsparse nki) in one pytest pass.
  • `docs/benchmarks.md` populated with real trn1.2xlarge numbers.
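The autograd-wrapping pattern with that analytic backward can be sketched on a dense stand-in. This is not the NKI kernel — just the `torch.autograd.Function` shape the release describes, with the matmul standing in for the kernel dispatch:

```python
import torch

class SpMMFunction(torch.autograd.Function):
    """Dense stand-in for the NKI-backed wrapper; same analytic backward."""

    @staticmethod
    def forward(ctx, A, B):
        ctx.save_for_backward(A, B)
        return A @ B  # the real version dispatches the NKI kernel here

    @staticmethod
    def backward(ctx, dC):
        A, B = ctx.saved_tensors
        return dC @ B.t(), A.t() @ dC  # dA, dB
```

`torch.autograd.gradcheck` verifies the analytic backward against finite differences, which is the same check the release runs on hardware.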

Hardware validation (trn1.2xlarge, Neuron SDK 2.24)

| Test | Result |
| --- | --- |
| 5 parity cases (aligned + unaligned + density 0.01–0.1) | ✅ all pass at `atol=1e-3`, `rtol=1e-4` |
| `torch.autograd.gradcheck` | ✅ pass at `atol=1e-4` |
| `loss.backward()` smoke | ✅ finite gradients through full stack |

Known limits

  • NKI is slower than CPU in v0.2.0 — the kernel materializes the CSR into a dense (M, K) tile before the matmul. At density 0.001 on 1024 × 1024, this means ~1000× more work than scipy does. See docs/benchmarks.md for numbers. Sparse speedup comes from row-bucketing + gather-matmul-scatter, which is #15 / v0.3.0 / Phase 3.
  • SpMV stays on PyTorch — single-column NKI matmul doesn't amortize the compile + dispatch overhead.

Closes #14. Closes #4.

trnsparse 0.1.3

13 Apr 03:00


Changed

  • spmv, spmm, spmv_symmetric, and CSRMatrix.to_dense now lower to torch.sparse_csr_tensor operations instead of per-row Python loops.
  • On CPU (256×256, density 0.01), SpMV is 26× faster (958 μs → 37 μs) and SpMM is 52–88× faster (1.2 ms → 13–24 μs depending on RHS width), putting trnsparse's PyTorch fallback within 2× of torch.sparse.
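The lowering itself is a one-liner once the CSR arrays exist — the speedup comes from replacing a per-row Python loop with a single sparse matmul. A sketch (function name illustrative):

```python
import torch

def spmm_csr(crow, col, vals, shape, B):
    """Lower CSR SpMM to one torch.sparse_csr_tensor matmul."""
    A = torch.sparse_csr_tensor(crow, col, vals, size=shape)
    return A @ B  # single fused sparse matmul, no Python-level row loop
```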

Pure performance change — no API or numeric-output differences. All 25 existing tests pass unchanged. NKI backend remains scaffolded; routing lands in v0.2.0.

Added

  • benchmarks/ directory (conftest.py, bench_spmv.py, bench_spmm.py, bench_screening.py) running trnsparse vs scipy.sparse vs torch.sparse on the same numeric inputs. Closes #11; partial #4.

trnsparse 0.1.2

13 Apr 02:28


Changed

  • Sync `trnsparse.__version__` with `pyproject.toml` (both now 0.1.2). Previously `__init__.py` reported 0.1.0 while the package version was 0.1.1.
  • Docs badge in `README.md` and `site_url` in `mkdocs.yml` point at trnsci.dev/trnsparse/ instead of trnsci.github.io/trnsparse/. Per-repo GitHub Pages is superseded by the centralized trnsci.dev site.
  • `docs/architecture.md` clarifies that the NKI backend is scaffolded only — the PyTorch path runs regardless of `set_backend` in v0.1.x. Routing + on-hardware validation land in v0.2.0.

Closes #5.

trnsparse 0.1.1

13 Apr 00:42


Added

  • mkdocs site with index, installation, quickstart, api, architecture, aws_setup
  • infra/terraform/ for on-hardware CI instance provisioning
  • scripts/run_neuron_tests.sh and benchmark helpers
  • GitHub Actions ci.yml for CPU-only pytest matrix
  • Issues URL in pyproject.toml

Changed

  • Bumped neuronxcc floor from >=2.15 to >=2.24 to unify with the rest of the trnsci suite. torch-neuronx floor bumped to >=2.9.