CORING enforces N:M fine-grained structured sparsity on neural network weight tensors. For a given N:M pattern (e.g., 2:4), the weight tensor is partitioned into contiguous groups of M elements; within each group, exactly N elements are retained (non-zero) and M−N are pruned (zeroed).
This structured pattern maps directly to the NVIDIA Ampere Sparse Tensor Core
instruction set (mma.sp), delivering up to 2× matrix-multiply throughput on A100
and H100 GPUs.
Reference implementation: include/tensorbit/core/coring.hpp
Given weight vector w ∈ R^N with per-element importance scores s ∈ R^N
(typically from an upstream EHAP pruner) and sparsity pattern N:M:
Objective: Partition into G = N/M groups of size M. Within each group g,
select exactly N weights to retain such that the sum of retained importance
is maximised:
where b_g = g · M is the group base index.
For each group, sort or partial-sort the M importance values and select the top N.
Uses std::nth_element (Hoare 1961, quickselect) for O(M) partial sorting.
Mathematical property: Top-N is provably optimal when importance scores are drawn i.i.d. from a continuous distribution (Hubara et al. 2022, Theorem 1).
Enumerate all C(M,N) = M! / (N! · (M−N)!) possible mask patterns per group
using Gosper's hack (Knuth 2011, Sec. 7.2.1.3) for bitwise iteration.
Selects the pattern with maximum total importance.
| N:M | C(M,N) | Feasibility |
|---|---|---|
| 2:4 | 6 | Always |
| 4:8 | 70 | Always |
| 8:16 | 12,870 | Always |
| >16 | — | Falls back to kTopN |
- Initialise with top-N mask.
- For each round (up to
iterative_rounds, default 3):- For each group, attempt to swap one pruned element with one kept element.
- Accept the swap if
s_pruned > s_kept(improves total importance). - Stop early if no swaps improve any group (convergence).
This is a local search heuristic (2-opt style). It can escape locally-optimal top-N solutions and approaches the optimal mask with high probability.
CORING operates on per-weight importance scores. These can come from:
| Source | Formula | When to use |
|---|---|---|
| EHAP OBD | s_i = w_i² · (F_ii + λ) |
Fisher diagonal available |
| EHAP OBS | s_i = w_i² / (F_ii + λ) |
Inverse-Hessian diagonal available |
| Magnitude | `s_i = | w_i |
When coupled with EHAP's Fisher accumulation and store_gradient(), the
importance scores passed to CORING are curvature-aware: they encode
both the weight magnitude AND the local loss-landscape curvature.
This means CORING's mask selection naturally preserves "load-bearing" weights (high Fisher) even when their magnitude is moderate — the defining advantage of second-order pruning.
After N:M pruning, the remaining weights can be adjusted to compensate for removed parameters.
Earlier implementations used raw weight sums for redistribution:
Δ_g = Σ_{pruned} w_i. This suffers from signed-weight cancellation:
a group with w = [+5, −4] (both pruned) would have Δ_g = +1, severely
underestimating the actual removed magnitude (9).
Fix: Use absolute magnitude:
Δ_g = Σ_{pruned} |w_i|
Distribute pruned magnitude to kept weights proportionally to their importance scores (OBS-inspired):
The sign(w_i) factor ensures the redistribution preserves the direction of
each weight, avoiding sign flips.
Mean-preserving: the group's total absolute magnitude is conserved.
When permute_weights = true, weights within each group are sorted by
absolute magnitude in descending order before mask selection:
This concentrates large weights in early group positions, making them more likely to survive the N:M constraint. The caller is responsible for tracking the index permutation to reverse it after pruning, maintaining the original weight ordering.
Reference: Pool & Yu (2021) showed that channel-wise permutation of weight columns before 2:4 sparsity application improves accuracy by 1–3 percentage points at the same sparsity ratio.
When hardware_aware_layout = true and the pattern is 2:4, the mask buffer
is structured to match NVIDIA's Ampere Sparse Tensor Core requirements:
- Each group of 4 weights maps to one mask byte (4 bits used)
- Groups are contiguous along the inner GEMM K-dimension
- The
mma.spinstruction expects masks in this exact organisation
In row-major weight layout, groups are naturally contiguous along the K
dimension, so no reordering is needed. The apply_ampere_2_4_layout()
hook exists for future transpose/reshape support.
All CORING strategies use a packed bitmask format:
Byte g: bit 0 → weight[g·M + 0] is kept (1) or pruned (0)
bit 1 → weight[g·M + 1] is kept (1) or pruned (0)
...
bit (M−1) → weight[g·M + M−1] is kept (1) or pruned (0)
One byte per group. Groups are stored consecutively: mask_data[g] for group g.
Analytical pruned count (deterministic, no runtime counting needed):
Valid when N_elements is divisible by M (enforced by validate_config).
| Kernel | Pattern | Thread Model | Shared Memory | Ops/Element |
|---|---|---|---|---|
nm_mask_2_4_kernel |
2:4 | 1 thread/group | 0 B | O(1) |
nm_mask_generic_kernel |
Any N:M, M ≤ 32 | M threads/group | 256 B | O(M²) |
apply_mask_kernel |
Any M | 1 thread/element | 0 B | O(1) |
GPU strategy support: kTopN is available on GPU via these kernels.
kOptimal and kIterative run on CPU (they require combinatorial/heuristic
search not yet implemented in CUDA). When GPU is enabled, kTopN automatically
dispatches; other strategies fall back to CPU with automatic host/device transfers.
include/tensorbit/core/coring.hpp
| Type | Purpose |
|---|---|
CORINGConfig |
N/M, CUDA toggle, mask strategy, redistribution mode, iterative rounds, permutation toggle, hardware layout toggle |
CORINGPruner<F> |
Template class for float/double |
MaskStrategy |
kTopN, kOptimal, kIterative |
RedistMode |
kNone, kProportional, kUniform |
| Method | Algorithm | Complexity |
|---|---|---|
generate_nm_mask(s, mask) |
Mask selection per strategy + optional permutation + HW layout | O(G·C(M,N)) worst |
apply_mask(w, mask) |
Element-wise bit-test + zero | O(N) |
redistribute(w, mask, s) |
Group-local absolute-magnitude redistribution | O(N) |
prune(s, w) |
Full pipeline: mask → apply → redistribute | O(N) |
| Setting | Effect |
|---|---|
mask_strategy = kTopN |
Fast partial-sort, GPU-accelerated |
mask_strategy = kOptimal |
Exact C(M,N) enumeration, CPU only |
mask_strategy = kIterative |
Locally-optimal swap-refine, CPU only |
iterative_rounds = R |
Max refinement rounds for kIterative |
redist_mode = kProportional |
Importance-weighted redistribution |
redist_mode = kUniform |
Equal-share redistribution |
permute_weights = true |
Group-local magnitude sort before masking |
hardware_aware_layout = true |
2:4 mask alignment for Ampere GEMM |
- GPU parity: Only
kTopNruns on GPU.kOptimalandkIterativeare CPU-only. A future release could implement warp-level combinatorial enumeration in CUDA for kOptimal. - Permutation reversal:
permute_weightsreorders weights but does not track the index mapping for reversal. Caller must handle this. - M > 32: The generic GPU kernel uses shared-memory ranking with
__shared__ float[32]. For M > 32, the CPU path processes groups serially. This is acceptable since practical N:M patterns (2:4, 1:4, 2:8, 4:8) all have M ≤ 8. - Cross-group coupling: Groups are processed independently. Blockwise coupling (sharing sparsity budget across groups) is deferred to a future Phase.
-
Mishra, A., Latorre, J. A., Pool, J., Stosic, D., Stosic, D., Venkatesh, G., Yu, C., & Micikevicius, P. (2021). "Accelerating Sparse Deep Neural Networks." arXiv:2104.08378.
-
Hubara, I., Chmiel, B., Island, M., Banner, R., Naor, J., & Soudry, D. (2022). "Accelerated Sparse Neural Training." NeurIPS 35.
-
Pool, J., & Yu, C. (2021). "Channel Permutations for N:M Sparsity." NeurIPS 34.
-
Frantar, E., & Alistarh, D. (2023). "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot." ICML 2023.
-
NVIDIA Corporation (2021). "Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture." NVIDIA Technical Blog.
-
Hoare, C. A. R. (1961). "Algorithm 65: Find." Communications of the ACM, 4(7), pp. 321–322.
-
Knuth, D. E. (2011). The Art of Computer Programming, Volume 4A. Addison-Wesley.