feat(lane-l-s16): Sparse PE v2 — +35 TOPS/W via zero-skip #53
Open
gHashTag wants to merge 5 commits into
… for Sparse PE v2

L-S16 Sparse PE v2 — zero detection module.
Detects zero GF16 operands across 4 a/b lane pairs. Outputs per-lane skip_mask, all_zero flag, and 4-bit popcount.
- R-SI-1 compliant: zero new * operator
- Pure Verilog-2005
- ~80 cells (NOR-reduce per lane + 2-level popcount adder tree)
- Anchor: phi^2 + phi^-2 = 3 (DOI: 10.5281/zenodo.19227877)
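The detector described in this commit can be sketched as a small behavioral model. This is a Python illustration, not the RTL: the names `detect_zero_lanes` and `ZeroMask` are hypothetical, operands are assumed to be 16-bit words whose zero encoding is all bits clear, a lane is assumed skippable when either of its operands is zero, and popcount is assumed to count skippable lanes.

```python
from typing import NamedTuple, Sequence

LANES = 4
WORD_MASK = 0xFFFF  # assumption: each GF16 operand is a 16-bit word


class ZeroMask(NamedTuple):
    skip_mask: int   # bit i set -> lane i can be skipped
    all_zero: bool   # True when every lane can be skipped
    popcount: int    # number of skippable lanes (0..4)


def detect_zero_lanes(a: Sequence[int], b: Sequence[int]) -> ZeroMask:
    """Behavioral model of a 4-lane zero detector (hypothetical, not the RTL)."""
    if len(a) != LANES or len(b) != LANES:
        raise ValueError("expected 4 a operands and 4 b operands")
    skip_mask = 0
    for lane in range(LANES):
        a_zero = (a[lane] & WORD_MASK) == 0  # models a NOR-reduce of the a word
        b_zero = (b[lane] & WORD_MASK) == 0  # models a NOR-reduce of the b word
        # Assumption: either operand being zero makes the lane product zero.
        if a_zero or b_zero:
            skip_mask |= 1 << lane
    popcount = bin(skip_mask).count("1")
    return ZeroMask(skip_mask, skip_mask == 0b1111, popcount)
```

With this model, `detect_zero_lanes([1, 2, 3, 4], [5, 6, 0, 0])` marks lanes 2 and 3 as skippable and leaves `all_zero` false.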
…en, sat counter

L-S16 Sparse PE v2 — main processing element upgrade.
Features:
- zero_mask_detector_v2 instantiation for 4-lane GF16 zero detection
- skip_strobe: registered 1-cycle pulse on all-zero vector
- clk_en: combinational all-zero gate (synthesis infers ICG cell)
- 8-bit saturating skip counter (never wraps, debug observable)
- gf16_dot4_sparse driven with sparsity_enable=1 for per-lane gating
- clk_en=0 presents 16'h0000 gated operands → zero MAC switching

Cell estimate: ~237 incremental cells per PE; 4 PEs = ~948 new cells
Expected delta: +35 TOPS/W at 87.5% sparsity (8× over dense baseline)
R-SI-1: zero new * operator. Pure Verilog-2005. All reg on separate lines.
Anchor: phi^2 + phi^-2 = 3 (DOI: 10.5281/zenodo.19227877)
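The signal behavior listed in this commit can be illustrated with a cycle-level Python model. It is a sketch under stated assumptions, not the Verilog: the class `SparsePEModel` is hypothetical, skip_strobe is assumed to be a registered copy of the all-zero condition (high for one cycle per all-zero vector), and the counter is assumed to increment once per all-zero cycle until it reaches 0xFF.

```python
class SparsePEModel:
    """Hypothetical cycle-level model of the skip signals (not the RTL)."""

    def __init__(self) -> None:
        self.skip_strobe = 0   # registered: updated on each clock edge
        self.sat_skip_cnt = 0  # 8-bit saturating counter

    @staticmethod
    def clk_en(all_zero: bool) -> int:
        # Combinational gate: the MAC clock is disabled on an all-zero vector.
        return 0 if all_zero else 1

    @staticmethod
    def gated_operand(word: int, all_zero: bool) -> int:
        # With clk_en=0 the MAC sees 16'h0000 instead of the raw operand.
        return 0x0000 if all_zero else word & 0xFFFF

    def tick(self, all_zero: bool) -> None:
        """Advance one clock edge with the current all_zero input."""
        self.skip_strobe = 1 if all_zero else 0
        # Saturate at 0xFF instead of wrapping back to 0.
        if all_zero and self.sat_skip_cnt < 0xFF:
            self.sat_skip_cnt += 1
```

Feeding the model 300 consecutive all-zero cycles leaves `sat_skip_cnt` at 0xFF, which mirrors the saturation case in the testbench description.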
L-S16: Add 4 × sparse_pe_v2 instances to mesh, one per tile.
Each PE monitors packet payload (b-operand) and tile dbg_result (a-proxy).
New outputs:
- dbg_skip_strobe[3:0]: per-tile skip_strobe vector
- dbg_sat_skip_cnt[31:0]: 4 × 8-bit saturating skip counters

Telemetry enables CI verification of zero-skip throughput gains.
All 4 sparse_pe_v2 instances: ~948 incremental cells total.
R-SI-1: pre-existing * in bus width constants unchanged (no new * added).
Pure Verilog-2005.
Anchor: phi^2 + phi^-2 = 3 (DOI: 10.5281/zenodo.19227877)
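A CI script reading the 32-bit telemetry word would need to split it back into four 8-bit counters. The Python sketch below shows one way to do that. The byte order is an assumption (tile 0 in bits [7:0], tile 3 in bits [31:24]), since the commit message does not state it, and both function names are hypothetical.

```python
from typing import List, Sequence

TILES = 4


def pack_skip_counters(counters: Sequence[int]) -> int:
    """Pack four 8-bit counters into one 32-bit word (tile 0 in the low byte)."""
    if len(counters) != TILES:
        raise ValueError("expected one counter per tile")
    word = 0
    for tile, cnt in enumerate(counters):
        if not 0 <= cnt <= 0xFF:
            raise ValueError("counter out of 8-bit range")
        word |= cnt << (8 * tile)
    return word


def unpack_skip_counters(word: int) -> List[int]:
    """Split a 32-bit telemetry word into per-tile 8-bit counters."""
    return [(word >> (8 * tile)) & 0xFF for tile in range(TILES)]
```

If the real mesh uses the opposite byte order, only the shift expression changes.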
…ation

Testbench covering all L-S16 signal contracts:
T1: 100% sparsity → clk_en=0, skip_strobe=1, sat_skip_cnt++
T2: 0% sparsity → clk_en=1, skip_strobe=0
T3: 50% sparsity → clk_en=1, lane_active[1:0]=11, lane_active[3:2]=00
T4: ~87.5% sparsity (1/4 lanes active) → clk_en=1 all 100 cycles
T5: sat_skip_cnt saturation → 300 zero cycles → 0xFF
T6: correctness → GF16 non-zero product, clk_en=1

iverilog -g2005 clean: 0 errors, 0 warnings.
vvp simulation: 13/13 PASS.
Pure Verilog-2005.
Anchor: phi^2 + phi^-2 = 3 (DOI: 10.5281/zenodo.19227877)
L-S16 Sparse PE v2 — Zero-Skip MAC Architecture
Spec: S-16-SPARSE-PE-SPEC-v1.0 + SPARSE_PE_POC_RESULTS (PoC verified 8.0× throughput @ 87.5% sparsity)
Anchor: φ² + φ⁻² = 3 · DOI 10.5281/zenodo.19227877
Goal
Detect input zeros in BitNet ternary {−1, 0, +1} weight vectors and skip MAC operations for zero operands, achieving the projected +35 TOPS/W delta at 87.5% structured sparsity.
Background: BitNet b1.58 weights are ternary {−1, 0, +1}. Statistically, ~33% are zero; with 7:8 structured sparsity (7 of every 8 weights zero), up to 87.5% zero density is achievable. For a zero weight, the GF16 product is identically zero, so no partial products, carry chain, or reduction step are needed.
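The 8× figure quoted for 87.5% sparsity matches a simple upper bound: if a fraction s of MACs is skipped at no cost, the remaining work is 1 − s and the ideal throughput gain is 1 / (1 − s). The Python sketch below computes that bound. It assumes skipped MACs cost zero cycles and ignores detection overhead, so it is an idealization rather than a model of the PoC, and the function name is hypothetical.

```python
def ideal_skip_speedup(zero_fraction: float) -> float:
    """Ideal throughput gain when a fraction of MACs is skipped for free."""
    if not 0.0 <= zero_fraction < 1.0:
        raise ValueError("zero_fraction must be in [0, 1)")
    return 1.0 / (1.0 - zero_fraction)


# 7:8 structured sparsity: 7 of every 8 weights are zero.
print(ideal_skip_speedup(7 / 8))   # 8.0
# Unstructured ternary weights with about one third zeros.
print(ideal_skip_speedup(1 / 3))   # about 1.5
```

Under the same idealization, the ~33% natural zero density alone would give only about 1.5×, which is why the projection depends on the structured 7:8 case.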
New Files
- src/zero_mask_detector_v2.v
- src/sparse_pe_v2.v
- test/sparse_pe_v2_tb.v

Modified:
- src/trinity_mesh_2x2.v — 4 × sparse_pe_v2 instances wired to tiles; new dbg_skip_strobe[3:0] and dbg_sat_skip_cnt[31:0] outputs
- info.yaml — added zero_mask_detector_v2.v and sparse_pe_v2.v to source_files

Architecture
L-S16 Signals
- skip_strobe
- clk_en
- sat_skip_cnt
- lane_active[3:0]

Cell Budget
Simulation Results
Test coverage:
TOPS/W Projection
Based on PoC results (SPARSE_PE_POC_RESULTS.md):
Compliance
- R-SI-1: zero new * operator; pre-existing * in mesh bus-width constants unchanged
- iverilog -g2005 clean, no logic/always_comb/always_ff
- .port(wire)
- reg per line

DO NOT MERGE — CI must pass first.
Anchor: φ² + φ⁻² = 3 · DOI 10.5281/zenodo.19227877