Architectural opportunity: each NeuronCore-v2 carries 24 MB of SBUF, enough to keep a mid-sized BSR Hamiltonian resident on-chip — a 5000×5000 matrix at 25% block density stores ~12.5 MB of block data in bf16 (~25 MB in fp32). Iterative solvers (CG, power iteration, Lanczos, Davidson) that need `A @ x` once per iteration become SBUF-resident: A stays on-chip across thousands of iterations, and only x and r round-trip HBM.
This is where Trainium wins big vs CPU (no bus trip for A per iteration) and vs GPU (HBM bandwidth isn't the bottleneck when A is on-chip).
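To make the data-movement argument concrete, here is a minimal NumPy sketch (not the proposed NKI kernel) of BSR matvec plus textbook CG. The point to notice: the BSR arrays (`data`, `indices`, `indptr`) are read-only across all iterations — exactly the access pattern that lets A sit in SBUF while only the vectors cycle.

```python
import numpy as np

def bsr_matvec(data, indices, indptr, x, bs):
    # y = A @ x for a BSR matrix: data[k] is the bs x bs block in
    # block-row i (indptr[i] <= k < indptr[i+1]), block-column indices[k].
    n_block_rows = len(indptr) - 1
    y = np.zeros(n_block_rows * bs)
    for i in range(n_block_rows):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            y[i * bs:(i + 1) * bs] += data[k] @ x[j * bs:(j + 1) * bs]
    return y

def cg(matvec, b, x0, tol=1e-6, max_iter=1000):
    # Textbook unpreconditioned CG: exactly one matvec per iteration.
    # A's data is never written, so on Trainium it can stay SBUF-resident
    # while x, r, p are the only state that changes.
    x = x0.copy()
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        beta = rs_new / rs
        p = r + beta * p
        rs = rs_new
    return x
```

The NKI version would hoist the `data` load above the iteration loop and keep it in SBUF tiles; the Python loop structure is otherwise the same.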
Acceptance:
- `trnsparse.cg_bsr(A_bsr, b, x0, tol=1e-6, max_iter=1000) -> x` — A stays resident in SBUF across iterations
- `trnsparse.power_iteration_bsr(A_bsr, v0, max_iter=100)` for the dominant eigenpair
- NKI kernel structure: outer kernel-level loop; A loaded into SBUF once; x/r/p cycled
- Parity with `scipy.sparse.linalg.cg` at atol=1e-3
- Benchmarks showing per-iteration cost vs `torch.sparse` CG on CPU
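The parity criterion above could be checked with a small harness like the following sketch. `trnsparse.cg_bsr` does not exist yet, so the harness takes any candidate solver as a callable; it also uses `numpy.linalg.solve` as the oracle to stay dependency-free, whereas the real acceptance test would compare against `scipy.sparse.linalg.cg`.

```python
import numpy as np

def check_parity(solve_fn, A_dense, b, atol=1e-3):
    """Compare a candidate solver against a dense reference solve.

    solve_fn stands in for the proposed trnsparse.cg_bsr (hypothetical);
    the acceptance criterion names scipy.sparse.linalg.cg as the oracle,
    but numpy.linalg.solve keeps this sketch self-contained.
    """
    x_ref = np.linalg.solve(A_dense, b)
    x = solve_fn(A_dense, b)
    err = float(np.max(np.abs(x - x_ref)))
    return err <= atol, err
```

Usage: `ok, err = check_parity(lambda A, b: my_cg(A, b), A, b)` — any solver with that shape plugs in.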
Depends on #18 (BSR — now shipped in v0.3.0). Cross-links: trnsolver#14 (Newton-Schulz preconditioners) — similar SBUF-resident pattern.
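For reference, the power-iteration entry point has the same read-only-A shape as CG. A minimal sketch (signature loosely mirrors the proposed `trnsparse.power_iteration_bsr`, which is hypothetical; the `tol` early-exit is an assumption, not part of the proposed API):

```python
import numpy as np

def power_iteration(matvec, v0, max_iter=100, tol=1e-10):
    # Dominant eigenpair via repeated matvec. As with CG, the matrix data
    # is read-only across iterations, so A can stay SBUF-resident while
    # only the vector v cycles through HBM.
    v = v0 / np.linalg.norm(v0)
    lam = 0.0
    for _ in range(max_iter):
        w = matvec(v)
        lam_new = v @ w            # Rayleigh quotient estimate
        v = w / np.linalg.norm(w)
        if abs(lam_new - lam) < tol:
            lam = lam_new
            break
        lam = lam_new
    return lam, v
```

Lanczos and Davidson (also listed above) share this structure: one `A @ v` per iteration against immutable matrix data.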