
On-chip iterative solvers over BSR (SBUF-resident A) #22

@scttfrdmn

Description


Architectural opportunity: the 32 GB SBUF per NeuronCore v2 means a 5000×5000 BSR Hamiltonian (~100 MB at 25% block density) fits on-chip entirely. Iterative solvers (CG, power iteration, Lanczos, Davidson) that need A @ x on each iteration become SBUF-resident: A stays on-chip across thousands of iterations, and only x and r round-trip to HBM.

This is where Trainium wins big vs CPU (no bus trip for A on each iteration) and vs GPU (HBM bandwidth stops being the bottleneck once A is on-chip).
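As a sanity check on the ~100 MB figure, the arithmetic works out if we assume complex128 values and 50×50 blocks (neither is pinned down in this issue, so treat both as illustrative assumptions):

```python
# Back-of-envelope SBUF footprint for the 5000x5000 BSR example above.
# Block size (50) and dtype (complex128) are assumptions chosen to be
# consistent with the ~100 MB figure; the issue does not specify them.
n, bs, density, itemsize = 5000, 50, 0.25, 16  # complex128 = 16 bytes

n_blocks = (n // bs) ** 2 * density               # stored blocks at 25% block density
values_mb = n_blocks * bs * bs * itemsize / 1e6   # block values
index_mb = (n_blocks * 4 + (n // bs + 1) * 4) / 1e6  # int32 indices + indptr

print(values_mb, index_mb)  # → 100.0 0.010404
```

The index metadata is negligible next to the block values, so the SBUF budget is essentially the value array.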

Acceptance:

  • trnsparse.cg_bsr(A_bsr, b, x0, tol=1e-6, max_iter=1000) -> x — A stays resident in SBUF across iterations
  • trnsparse.power_iteration_bsr(A_bsr, v0, max_iter=100) for dominant eigenpair
  • NKI kernel structure: outer kernel-level loop; A loaded into SBUF once; x/r/p cycled
  • Parity with scipy.sparse.linalg.cg at atol=1e-3
  • Benchmarks showing per-iteration cost vs torch.sparse CG on CPU
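For the parity acceptance criterion, a host-side NumPy reference for the proposed `cg_bsr` semantics could look like the sketch below. This is an illustration, not the NKI kernel: the BSR layout (`data`/`indices`/`indptr`, square blocks) follows `scipy.sparse.bsr_matrix` conventions, and A is touched only through the block matvec, mirroring the SBUF-resident pattern where A is loaded once and reused on every iteration.

```python
import numpy as np

def bsr_matvec(data, indices, indptr, x, bs):
    """y = A @ x for a BSR matrix with square bs x bs blocks.

    data:    (nnzb, bs, bs) block values
    indices: block-column index per stored block
    indptr:  block-row pointers (scipy.sparse.bsr_matrix layout)
    """
    n_brows = len(indptr) - 1
    y = np.zeros(n_brows * bs, dtype=data.dtype)
    for i in range(n_brows):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            y[i * bs:(i + 1) * bs] += data[k] @ x[j * bs:(j + 1) * bs]
    return y

def cg_bsr(data, indices, indptr, b, x0, bs, tol=1e-6, max_iter=1000):
    """Conjugate gradient for SPD A in BSR form (reference sketch).

    A enters only via bsr_matvec — the analogue of keeping A resident
    in SBUF while x, r, p cycle each iteration.
    """
    x = x0.astype(data.dtype).copy()
    r = b - bsr_matvec(data, indices, indptr, x, bs)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = bsr_matvec(data, indices, indptr, p, bs)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

Comparing this against `scipy.sparse.linalg.cg` on the same system is one way to realize the atol=1e-3 parity check before any NKI code exists.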

Depends on #18 (BSR — now shipped in v0.3.0). Cross-links: trnsolver#14 (Newton-Schulz preconditioners) — similar SBUF-resident pattern.

Metadata

Assignees: none

Labels: chemistry (Quantum-chemistry / scientific-computing use case), enhancement (New feature or request), neuron (Requires AWS Neuron / Trainium hardware)
