Architectural opportunity: each NeuronCore-v2 carries 24 MB of SBUF, enough to keep a mid-sized BSR Hamiltonian resident on-chip — a 5000×5000 matrix at 25% block density stores ~12.5 MB of block data in bf16 (~25 MB in fp32). Iterative solvers (CG, power iteration, Lanczos, Davidson) that need `A @ x` once per iteration become SBUF-resident: A stays on-chip across thousands of iterations, and only x and r round-trip HBM.
This is where Trainium wins big vs CPU (no bus trip for A per iteration) and vs GPU (HBM bandwidth isn't the bottleneck when A is on-chip).
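To make the data-movement argument concrete, here is a minimal NumPy sketch (not the proposed NKI kernel) of BSR matvec plus textbook CG. The point to notice: the BSR arrays (`data`, `indices`, `indptr`) are read-only across all iterations — exactly the access pattern that lets A sit in SBUF while only the vectors cycle.

```python
import numpy as np

def bsr_matvec(data, indices, indptr, x, bs):
    # y = A @ x for a BSR matrix: data[k] is the bs x bs block in
    # block-row i (indptr[i] <= k < indptr[i+1]), block-column indices[k].
    n_block_rows = len(indptr) - 1
    y = np.zeros(n_block_rows * bs)
    for i in range(n_block_rows):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            y[i * bs:(i + 1) * bs] += data[k] @ x[j * bs:(j + 1) * bs]
    return y

def cg(matvec, b, x0, tol=1e-6, max_iter=1000):
    # Textbook unpreconditioned CG: exactly one matvec per iteration.
    # A's data is never written, so on Trainium it can stay SBUF-resident
    # while x, r, p are the only state that changes.
    x = x0.copy()
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        beta = rs_new / rs
        p = r + beta * p
        rs = rs_new
    return x
```

The NKI version would hoist the `data` load above the iteration loop and keep it in SBUF tiles; the Python loop structure is otherwise the same.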
Acceptance:
- `trnsparse.cg_bsr(A_bsr, b, x0, tol=1e-6, max_iter=1000) -> x` — A stays resident in SBUF across iterations
- `trnsparse.power_iteration_bsr(A_bsr, v0, max_iter=100)` for the dominant eigenpair
- NKI kernel structure: outer kernel-level loop; A loaded into SBUF once; x/r/p cycled
- Parity with `scipy.sparse.linalg.cg` at atol=1e-3
- Benchmarks showing per-iteration cost vs `torch.sparse` CG on CPU
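The parity criterion above could be checked with a small harness like the following sketch. `trnsparse.cg_bsr` does not exist yet, so the harness takes any candidate solver as a callable; it also uses `numpy.linalg.solve` as the oracle to stay dependency-free, whereas the real acceptance test would compare against `scipy.sparse.linalg.cg`.

```python
import numpy as np

def check_parity(solve_fn, A_dense, b, atol=1e-3):
    """Compare a candidate solver against a dense reference solve.

    solve_fn stands in for the proposed trnsparse.cg_bsr (hypothetical);
    the acceptance criterion names scipy.sparse.linalg.cg as the oracle,
    but numpy.linalg.solve keeps this sketch self-contained.
    """
    x_ref = np.linalg.solve(A_dense, b)
    x = solve_fn(A_dense, b)
    err = float(np.max(np.abs(x - x_ref)))
    return err <= atol, err
```

Usage: `ok, err = check_parity(lambda A, b: my_cg(A, b), A, b)` — any solver with that shape plugs in.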
Depends on #18 (BSR — now shipped in v0.3.0). Cross-links: trnsolver#14 (Newton-Schulz preconditioners) — similar SBUF-resident pattern.
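For reference, the power-iteration entry point has the same read-only-A shape as CG. A minimal sketch (signature loosely mirrors the proposed `trnsparse.power_iteration_bsr`, which is hypothetical; the `tol` early-exit is an assumption, not part of the proposed API):

```python
import numpy as np

def power_iteration(matvec, v0, max_iter=100, tol=1e-10):
    # Dominant eigenpair via repeated matvec. As with CG, the matrix data
    # is read-only across iterations, so A can stay SBUF-resident while
    # only the vector v cycles through HBM.
    v = v0 / np.linalg.norm(v0)
    lam = 0.0
    for _ in range(max_iter):
        w = matvec(v)
        lam_new = v @ w            # Rayleigh quotient estimate
        v = w / np.linalg.norm(w)
        if abs(lam_new - lam) < tol:
            lam = lam_new
            break
        lam = lam_new
    return lam, v
```

Lanczos and Davidson (also listed above) share this structure: one `A @ v` per iteration against immutable matrix data.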