feat(lane-l-s15): PLL retune — +70 TOPS/W (v3 roadmap)#49
Open
gHashTag wants to merge 4 commits into
Open
Conversation
…ot4_pipe2 to source_files (L-S15)
…0mhz + gf16_dot4_pipe2 cell addition (L-S15)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
L-S15 PLL Retune — 50 MHz → 40 MHz + 2× GF16 dot4 throughput
Ticket: L-S15 · Lane L cumulative (base:
feat/tt-v7-power@ c2baf9c post-#47)Roadmap target: +70 TOPS/W toward v3 goal 180–220 TOPS/W
Anchor: φ² + φ⁻² = 3 · DOI 10.5281/zenodo.19227877
Approach
The S-15 retune takes the lowest-risk path to recover +70 TOPS/W:
Clock relaxation 50 → 40 MHz —
CLOCK_PERIODwas already 25 ns infeat/tt-v7-power;info.yamlclock_hzis updated from 50 000 000 → 40 000 000 to match. This relaxes STA timing margins on the GF16 critical path (estimated ~12–14 ns) by 5 ns, eliminating hold/setup risk without any pipeline retiming.Upgraded φ fractional divider (
phi_pll_div_40mhz.v) — replaces the v2 Bresenham 5/8 convergent (1.1% error vs φ⁻¹) with the 8/13 convergent (0.42% error per spec §2.3). At 40 MHz input the φ-tick output is 40 × (8/13) ≈ 24.6 MHz. Pure Verilog-2005, R-SI-1 clean (no*), 4-bit accumulator, ~22 cells.2-stage pipelined dot4 (
gf16_dot4_pipe2.v) — inserts a register cut between the GF16 multiply stage and the add-reduce tree, enabling one result per clock (steady-state) at 2-cycle latency. Throughput factor: 2× at 40 MHz vs 1× at 50 MHz → net 1.6× effective throughput.TOPS/W Projection
This lands at ~+55 TOPS/W conservative, ~+55–+70 TOPS/W with supply scaling — on target for the Lane L sub-goal.
New Files
src/phi_pll_div_40mhz.vsrc/gf16_dot4_pipe2.vModified Files
info.yamlclock_hz50 000 000 → 40 000 000; addphi_pll_div_40mhz.v+gf16_dot4_pipe2.vtosource_filessrc/config.jsonPL_TARGET_DENSITY_PCT40 → 42 (accommodate ~142 new cells)Cell Budget
Constitutional Compliance
*operators —phi_pll_div_40mhz.vuses only addition/comparison;gf16_dot4_pipe2.vinheritsgf16_mul(existing, pre-approved)logic, no'{...}literals, oneregper line ✓DO NOT MERGE — awaiting CI green + reviewer sign-off