feat: PyCircuit V5 - cycle-aware grammar, design migration, simulation speedup & cycle-balance optimization #45
Merged
hengliao1972 merged 27 commits into LinxISA:main from Mar 27, 2026
Direct-form FIR filter: y[n] = c0·x[n] + c1·x[n-1] + c2·x[n-2] + c3·x[n-3]
with 16-bit signed input, 16-bit coefficients, 34-bit accumulator.

- digital_filter.py: pyCircuit RTL (shift register + parallel MAC)
- filter_capi.cpp: C API wrapper for compiled RTL
- emulate_filter.py: terminal UI with delay line, waveform display, 5 test scenarios (impulse, step, ramp, alternating, large values)
- All tests verified against true RTL simulation via ctypes

Co-authored-by: Cursor <cursoragent@cursor.com>
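A software reference model can make the filter equation above concrete. The sketch below is not taken from digital_filter.py; the names `to_signed` and `fir4`, the zero initial state, and the same-sample output are assumptions for illustration only.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def fir4(samples, coeffs):
    """y[n] = c0*x[n] + c1*x[n-1] + c2*x[n-2] + c3*x[n-3], zero initial state."""
    assert len(coeffs) == 4
    delay = [0, 0, 0, 0]  # x[n], x[n-1], x[n-2], x[n-3]
    out = []
    for x in samples:
        # Shift the new 16-bit sample into the delay line.
        delay = [to_signed(x, 16)] + delay[:3]
        # Four parallel multiplies summed into a 34-bit signed accumulator.
        acc = sum(to_signed(c, 16) * d for c, d in zip(coeffs, delay))
        out.append(to_signed(acc, 34))
    return out


impulse = fir4([1, 0, 0, 0, 0], [1, 2, 3, 4])  # replays the coefficients
step = fir4([1, 1, 1, 1, 1], [1, 2, 3, 4])     # running sum of coefficients
worst = fir4([-32768] * 4, [-32768] * 4)[-1]   # largest possible magnitude
```

The worst case is four products of magnitude 2^30, which sum to 2^32 and fit in a 34-bit signed accumulator without wrapping.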
Sync pyCircuit cycle-aware additions with Janus Core design
Co-authored-by: Cursor <cursoragent@cursor.com>
…spec

Add the Tile Management Unit (TMU) with 8-station bidirectional ring interconnect, SPB/MGB buffering, configurable 1MB TileReg, and cycle-accurate C++/SV testbenches. Include architecture spec document.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add run/build scripts for C++ and Verilator simulation, RTL generation script, and trace visualization tools (SVG timeline, ring animation, VCD-based ring animation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
janus/tmu: add TMU ring interconnect implementation and spec
Add 16x16 systolic array matrix multiplication accelerator (Cube module)
Port Verilog design to pyCircuit (traffic lights + dodgeball game)
BF16 fused multiply-accumulate: acc(FP32) += a(BF16) × b(BF16)

Built from first principles using HA, FA, RCA, CSA, Wallace tree, barrel shifters, and LZC, all from primitive_standard_cells.py.

4-stage pipeline with critical path analysis:
  Stage 1: Unpack + Exp Add         depth=8
  Stage 2: 8×8 Multiply (Wallace)   depth=46
  Stage 3: Align + Add              depth=21
  Stage 4: Normalize + Pack         depth=31

100/100 test cases pass (true RTL simulation via ctypes).
Max relative error: 5.36e-04 (limited by BF16 7-bit mantissa).

Co-authored-by: Cursor <cursoragent@cursor.com>
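For checking an FMAC like this one, a software golden model is handy. The sketch below is an assumption about how such a model might look, not the test harness used in the commit: it treats BF16 as the upper 16 bits of an IEEE-754 float32 and rounds the result to FP32 once. The helper names are hypothetical, and the RTL's rounding mode is not stated in the source.

```python
import struct


def bf16_to_float(bits):
    """Decode a 16-bit BF16 pattern (the upper half of an IEEE-754 float32)."""
    return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]


def float_to_bf16(x):
    """Encode by truncating a float32 to its top 16 bits (no rounding)."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16


def to_fp32(x):
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


def fmac_ref(acc, a_bits, b_bits):
    """acc(FP32) + a(BF16) * b(BF16), computed in double then rounded to FP32."""
    product = bf16_to_float(a_bits) * bf16_to_float(b_bits)  # exact in double
    return to_fp32(acc + product)


one = bf16_to_float(0x3F80)              # 1.0
result = fmac_ref(1.0, 0x4000, 0x4040)   # 1.0 + 2.0 * 3.0
pi_bf16 = bf16_to_float(float_to_bf16(3.14159))
rel_err = abs(pi_bf16 - 3.14159) / 3.14159
```

With 7 explicit mantissa bits, truncating a value to BF16 loses less than 2^-7 in relative terms, which bounds the kind of error the commit message attributes to the BF16 mantissa.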
- Add carry-select adder to primitive_standard_cells.py: splits N-bit addition into parallel halves, depth N+2 instead of 2N
- Fix Wallace tree depth tracking: parallel CSAs share same depth level
- Use carry-select adder for multiplier final addition
- Pipeline now balanced: S1=8, S2=28, S3=21, S4=31 (critical path=31)
- 100/100 tests still pass

Co-authored-by: Cursor <cursoragent@cursor.com>
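The carry-select idea can be shown with a bit-level model: compute the upper half twice, once per possible carry-in, and let the lower half's carry-out pick the result. This is an illustrative sketch, not the code in primitive_standard_cells.py; the function names are hypothetical.

```python
def ripple_add(a, b, cin, bits):
    """Bit-serial ripple-carry add; returns (sum, carry_out)."""
    total = 0
    carry = cin
    for i in range(bits):
        x = (a >> i) & 1
        y = (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return total, carry


def carry_select_add(a, b, bits):
    """Add two `bits`-wide values using two halves evaluated in parallel."""
    half = bits // 2
    hi_bits = bits - half
    mask_lo = (1 << half) - 1
    # Lower half: ordinary ripple add with carry-in 0.
    lo_sum, lo_carry = ripple_add(a & mask_lo, b & mask_lo, 0, half)
    # Upper half: both carry-in cases are computed without waiting for lo_carry.
    hi_if_0 = ripple_add(a >> half, b >> half, 0, hi_bits)
    hi_if_1 = ripple_add(a >> half, b >> half, 1, hi_bits)
    # A final mux selects the correct upper result.
    hi_sum, cout = hi_if_1 if lo_carry else hi_if_0
    return lo_sum | (hi_sum << half), cout


example = carry_select_add(0xFF, 0x01, 8)  # (0, 1): wraps with carry-out
```

The quoted depths are consistent with a model in which each ripple stage costs two gate levels and the select mux costs two more: a full N-bit ripple is 2N deep, while two N/2-bit halves plus the mux come to N+2. That cost model is an assumption, not something the commit states.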
Move partial product generation + 2 CSA compression rounds into Stage 1 (alongside unpack/exponent). Stage 2 now only completes remaining CSA rounds + carry-select final addition.

Pipeline depth: S1=13, S2=22, S3=21, S4=31 (was S1=8, S2=28)
Critical path unchanged at 31 (Stage 4), but S1/S2 gap reduced from 20 to 9 for better balance.

100/100 tests pass.

Co-authored-by: Cursor <cursoragent@cursor.com>
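The balance figures quoted across these three FMAC commits can be recomputed directly from the per-stage depths. The revision labels below are made up for this sketch; the depths are the ones given in the commit messages.

```python
# Per-stage logic depths (S1..S4) for each FMAC revision; labels are illustrative.
stage_depths = {
    "initial": [8, 46, 21, 31],
    "carry_select": [8, 28, 21, 31],
    "rebalanced": [13, 22, 21, 31],
}


def critical_path(depths):
    """The slowest stage sets the achievable clock period."""
    return max(depths)


def s1_s2_gap(depths):
    """Depth difference between the first two stages."""
    return abs(depths[1] - depths[0])


summary = {
    name: (critical_path(d), s1_s2_gap(d)) for name, d in stage_depths.items()
}
```

Here `summary` maps each revision to its critical path and S1/S2 gap, showing the critical path dropping from 46 to 31 with the carry-select adder and the gap shrinking from 20 to 9 after rebalancing.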
- npu_node.py: simplified NPU pyCircuit RTL (HBM inject + UB ports + FIFO)
- sw5809s.py: simplified SW5809s pyCircuit RTL (VOQ + crossbar + RR)
- fm16_system.py: behavioral system simulator with real-time visualization
  16 NPU full-mesh, all-to-all 512B traffic, BW + latency stats
- Results: 12.8 Tbps aggregate BW, Avg lat=3.2, P95=4, P99=5 cycles

Co-authored-by: Cursor <cursoragent@cursor.com>
Rewrote fm16_system.py to simulate both topologies in parallel:
  FM16: 16 NPU full mesh (4 links/pair, direct)
  SW16: 16 NPU star via SW5809s (32 links/NPU, VOQ+crossbar+RR)

Side-by-side real-time visualization: bandwidth, per-NPU bars, latency stats (avg/P50/P95/P99/max), latency histograms.

Results (3000 cycles, 4Tbps HBM, all-to-all):
  FM16: 14.3 Tbps BW, avg lat 3.2, P99=5
  SW16: 1.8 Tbps BW, avg lat 439, P99=485
  (SW16 bottlenecked at crossbar: 1 pkt/output/cycle)

Co-authored-by: Cursor <cursoragent@cursor.com>
- BW statistics now show per-NPU and aggregate separately
- Added bottleneck explanation in final summary:
  FM16: 60 direct links per NPU = 6720 Gbps capacity
  SW16: 1 pkt/output/cycle per NPU = 112 Gbps (1.7% of FM16)
  Crossbar is the bottleneck, not the NPU→switch links

Co-authored-by: Cursor <cursoragent@cursor.com>
SW5809s now correctly modeled:
- 512×512 physical links (112Gbps each)
- 4 links bundled per logical port → 128×128 port crossbar
- Each port independently arbitrated, serves 4 pkt/cycle
- Each NPU uses 8 logical ports (32 links) to the switch
- ECMP: round-robin across dest NPU's 8 output ports
- VOQ per (input_port, output_port)

Results (both HBM-limited at 4Tbps):
  FM16: 895 Gbps/NPU, avg lat 3.2, 1-hop direct
  SW16: 895 Gbps/NPU, avg lat 5.0, 2-hop via switch
  Switch capacity: 57.3 Tbps (53% of FM16 mesh)

Co-authored-by: Cursor <cursoragent@cursor.com>
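The capacity figures in the last few commits follow from a handful of constants. The sketch below recomputes them; the variable names are made up, and reading "53% of FM16 mesh" as switch capacity divided by 16 NPUs × 6720 Gbps is one interpretation that happens to reproduce the quoted number.

```python
LINK_GBPS = 112        # per physical link
N_NPU = 16
LINKS_PER_PAIR = 4     # FM16: direct links between each NPU pair
SWITCH_LINKS = 512     # SW5809s physical links
LINKS_PER_PORT = 4     # links bundled into one logical port
PORTS_PER_NPU = 8      # logical switch ports used by each NPU

# FM16: each NPU connects to 15 peers with 4 links each.
fm16_links_per_npu = (N_NPU - 1) * LINKS_PER_PAIR
fm16_gbps_per_npu = fm16_links_per_npu * LINK_GBPS
fm16_mesh_gbps = N_NPU * fm16_gbps_per_npu

# SW16: switch geometry and total capacity.
logical_ports = SWITCH_LINKS // LINKS_PER_PORT
links_per_npu_to_switch = PORTS_PER_NPU * LINKS_PER_PORT
switch_gbps = SWITCH_LINKS * LINK_GBPS

# Ratios quoted in the commit messages.
single_link_share = LINK_GBPS / fm16_gbps_per_npu   # earlier 1 pkt/output model
switch_vs_mesh = switch_gbps / fm16_mesh_gbps
```

The 128 logical ports are fully used by 16 NPUs at 8 ports each, so the corrected switch model has no spare ports in this configuration.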
SW5809s now correctly models:
- Each of 128 input ports has its OWN independent RR pointer per dest NPU
- When multiple input ports independently pick same egress port → VOQ collision
- Compare 'independent' (real HW) vs 'coordinated' (ideal) ECMP modes

3-way comparison: FM16, SW16-independent, SW16-coordinated

Under high load (INJECT_BATCH=32):
  P99: FM16=8, SW16-indep=45, SW16-coord=35 (+29% from collision)
  Max: FM16=16, SW16-indep=506, SW16-coord=452
  Port load imbalance: independent 1.00x (subtle but impactful on tail)

Co-authored-by: Cursor <cursoragent@cursor.com>
Each of 128 egress ports independently arbitrates to pick exactly 1 packet per cycle from all input VOQs. Total switch: 128 pkt/cycle.

INJECT_BATCH=8 to match switch capacity point.

VOQ collision now clearly visible:
  Independent RR: P99=168, Max=768
  Coordinated RR: P99=89, Max=364
  Collision adds +89% P99, +111% max latency
  Port load imbalance: 1.02x (small but tail-impactful)

Co-authored-by: Cursor <cursoragent@cursor.com>
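The collision mechanism can be illustrated with a toy model that is much simpler than fm16_system.py: 128 inputs each send one packet per cycle toward the same destination NPU, choosing among its 8 egress ports by round-robin pointer. Everything here (function names, lockstep pointer advance, one packet per input per cycle) is an assumption made for illustration.

```python
import random


def per_port_load(pointers, n_ports):
    """Count how many inputs picked each egress port this cycle."""
    load = [0] * n_ports
    for p in pointers:
        load[p % n_ports] += 1
    return load


def worst_collision(start_offsets, n_ports, cycles):
    """Largest number of packets landing on one egress port in any cycle."""
    ptrs = list(start_offsets)
    worst = 0
    for _ in range(cycles):
        worst = max(worst, max(per_port_load(ptrs, n_ports)))
        ptrs = [p + 1 for p in ptrs]  # every input advances its own pointer
    return worst


N_INPUTS, N_PORTS, CYCLES = 128, 8, 8

# Coordinated: pointers are staggered so picks spread evenly over the ports.
coordinated = worst_collision(range(N_INPUTS), N_PORTS, CYCLES)

# Independent, worst case: every pointer happens to be aligned.
aligned = worst_collision([0] * N_INPUTS, N_PORTS, CYCLES)

# Independent, unsynchronized: pointers start at unrelated positions.
rng = random.Random(0)
unsynced = worst_collision(
    [rng.randrange(N_PORTS) for _ in range(N_INPUTS)], N_PORTS, CYCLES
)
```

Coordinated pointers give the even split of 16 packets per port, while independent pointers land somewhere between that and the fully aligned case. Since each egress port drains one packet per cycle, any excess accumulates in the VOQs and shows up as tail latency.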
Track per-egress-port VOQ depth every cycle (snapshot before schedule). Report avg/peak/max-peak depth alongside cumulative enqueue imbalance.

VOQ collision effect now clearly quantified:
  Independent RR: avg depth 21.8, peak 101
  Coordinated RR: avg depth 12.0, peak 60
  Independent VOQ is 1.8× deeper on average, 1.7× worse at peak
  → directly explains the P99 latency gap (168 vs 89 cycles)

Co-authored-by: Cursor <cursoragent@cursor.com>
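The percentages and ratios quoted in these ECMP commits can be reproduced from the raw numbers. A short check, with hypothetical helper names:

```python
def pct_increase(worse, better):
    """Percentage by which `worse` exceeds `better`, rounded to an integer."""
    return round(100 * (worse / better - 1))


def ratio(worse, better):
    """Ratio rounded to one decimal place."""
    return round(worse / better, 1)


# INJECT_BATCH=32 run: independent vs coordinated P99.
p99_high_load = pct_increase(45, 35)

# INJECT_BATCH=8 run: independent vs coordinated P99 and max latency.
p99_capacity = pct_increase(168, 89)
max_capacity = pct_increase(768, 364)

# VOQ depth: independent vs coordinated, average and peak.
avg_depth_ratio = ratio(21.8, 12.0)
peak_depth_ratio = ratio(101, 60)
```

The average VOQ depth ratio (about 1.8×) and the P99 latency ratio (168/89, about 1.9×) are close, which fits the commit's reading that deeper queues account for the tail-latency gap.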
Co-authored-by: Cursor <cursoragent@cursor.com>
Removes the SyntaxError caused by a misplaced `from __future__ import annotations` and drops the unused pycircuit import in the calculator emulator.

Made-with: Cursor
… syntax

- Add CycleAwareCircuit/CycleAwareDomain/CycleAwareSignal/StateSignal V5 frontend (v5.py)
- Add CycleAwareTb wrapper for testbenches with .next() cycle advancement
- Migrate 35 designs to V5: use cas(), mux(), domain.state(), domain.next()
- Migrate 32 testbenches to V5: replace at=N with CycleAwareTb.next()
- Add V5 programming tutorial and cycle-aware API documentation
- Move examples (fm16, fmac, digital_filter, etc.) into designs/examples/
- Add iplib with V5-compatible IP blocks

Made-with: Cursor
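The testbench change from explicit `at=N` arguments to a `.next()` cursor can be pictured with a toy model. The class below is a hypothetical stand-in written for this note; it is not pyCircuit's `CycleAwareTb`, and its method signatures are assumptions.

```python
class ToyTb:
    """Toy testbench recorder with an implicit cycle cursor (not pyCircuit)."""

    def __init__(self):
        self.cycle = 0
        self.events = []

    def drive(self, name, value):
        # Stimulus is stamped with the current cycle.
        self.events.append((self.cycle, "drive", name, value))

    def expect(self, name, value):
        # Checks are stamped the same way.
        self.events.append((self.cycle, "expect", name, value))

    def next(self):
        # Advance the cursor; later calls belong to the following cycle.
        self.cycle += 1


def explicit_schedule():
    """Old style: every event carries its own cycle number."""
    return [
        (0, "drive", "x", 1),
        (1, "drive", "x", 0),
        (1, "expect", "y", 1),
    ]


def cursor_schedule():
    """New style: cycle boundaries are marked once, by next()."""
    tb = ToyTb()
    tb.drive("x", 1)
    tb.next()
    tb.drive("x", 0)
    tb.expect("y", 1)
    return tb.events


same = explicit_schedule() == cursor_schedule()
```

Both styles describe the same schedule; the cursor form keeps the cycle structure visible in the order of statements instead of repeating cycle numbers on every call.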
Summary
This PR introduces PyCircuit V5, a major upgrade encompassing a new cycle-aware programming model, complete design/testbench migration, simulation performance improvements, and compiler optimizations.
1. PyCircuit V5 Cycle-Aware Grammar

- … (`compiler/frontend/pycircuit/v5.py`): `CycleAwareCircuit`, `CycleAwareDomain`, `CycleAwareSignal`, `StateSignal`
- `domain.next()`: advances the logical cycle index, clearly demarcating pipeline stages and sequential logic phases
- `domain.state()`: declares feedback registers (`StateSignal`) with `.set()` for next-cycle values and conditional `when=` writes
- `cas()`: wraps raw `Wire` into `CycleAwareSignal` at a specific cycle
- `mux()`: V5 conditional selection with automatic cycle balancing across branches
- `compile_cycle_aware()`: new compilation entry point supporting both JIT and eager modes
- `CycleAwareTb`: V5 testbench wrapper with `.next()` for cycle advancement (replaces `at=N` parameters)
- `docs/PyCircuit V5 Programming Tutorial.md`, `docs/PyCurcit V5_CYCLE_AWARE_API.md`

2. All Designs Migrated to V5

- `designs/` fully migrated to V5 cycle-aware syntax
- `m.out()` registers → `domain.state()`, `if Wire else` → `mux()`, `@function` helpers → plain functions
- … `m.new()`, `m.array()`, `m.state(spec)`) kept in JIT mode with `compile_cycle_aware()` compatibility fixes
- `RegisterFile`, `BypassUnit`, `IssueQueue`, `digital_filter`, `fm16` NPU system, etc.

3. All Testbenches Migrated to V5

- `CycleAwareTb` with `.next()` cycle advancement
- `at=N` parameters from `drive()`/`expect()` calls
- `tb.next()` between cycles for clear temporal structure

4. Simulation Speedup (`docs/simulation.md`)

- `eval()`/`tick()`
- `Wire<N>` bit-vector operations use `__uint128_t` and SIMD intrinsics
- `pyc_change_detect.hpp` tracks dirty signals to skip unchanged eval paths
- `llvm-profdata` for hot-path optimization

5. Cycle-Balance DFF Optimization (`docs/cycle_balance_improvement.md`)

- `(value, clock, reset, delay_depth)` delay chains across multiple fanout paths
- `pyc-eliminate-dead-state` removes unused registers after cycle balancing

Test Plan
Made with Cursor