feat: PyCircuit V5 — cycle-aware grammar, design migration, simulation speedup & cycle-balance optimization#45

Merged
hengliao1972 merged 27 commits into LinxISA:main from hengliao1972:main on Mar 27, 2026

@hengliao1972
Collaborator

Summary

This PR introduces PyCircuit V5, a major upgrade encompassing a new cycle-aware programming model, complete design/testbench migration, simulation performance improvements, and compiler optimizations.

1. PyCircuit V5 Cycle-Aware Grammar

  • New V5 frontend (compiler/frontend/pycircuit/v5.py): CycleAwareCircuit, CycleAwareDomain, CycleAwareSignal, StateSignal
  • domain.next() — advances the logical cycle index, clearly demarcating pipeline stages and sequential logic phases
  • domain.state() — declares feedback registers (StateSignal) with .set() for next-cycle values and conditional when= writes
  • cas() — wraps raw Wire into CycleAwareSignal at a specific cycle
  • mux() — V5 conditional selection with automatic cycle balancing across branches
  • compile_cycle_aware() — new compilation entry point supporting both JIT and eager modes
  • CycleAwareTb — V5 testbench wrapper with .next() for cycle advancement (replaces at=N parameters)
  • Documentation: docs/PyCircuit V5 Programming Tutorial.md, docs/PyCurcit V5_CYCLE_AWARE_API.md
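
To make the cycle-index idea concrete, here is a toy model in plain Python. It is not the pycircuit V5 API: `ToySignal`, `ToyDomain`, and their methods are hypothetical names invented for this sketch, and the balancing rule (delay every operand to the latest operand's cycle) is an assumption about what "automatic cycle balancing" means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySignal:
    """A value tagged with the logical cycle at which it is valid (hypothetical)."""
    name: str
    cycle: int


class ToyDomain:
    """Tracks the current logical cycle and counts inserted delay registers."""

    def __init__(self):
        self.cycle = 0
        self.dffs_inserted = 0

    def next(self):
        # Advance the logical cycle index, i.e. start a new pipeline stage.
        self.cycle += 1
        return self.cycle

    def signal(self, name):
        # New signals are stamped with the cycle in which they are created.
        return ToySignal(name, self.cycle)

    def align(self, sig, target_cycle):
        # Delay a signal to target_cycle by inserting one register per cycle.
        if sig.cycle > target_cycle:
            raise ValueError("cannot move a signal back in time")
        depth = target_cycle - sig.cycle
        if depth == 0:
            return sig
        self.dffs_inserted += depth
        return ToySignal(f"{sig.name}@+{depth}", target_cycle)

    def mux(self, sel, a, b):
        # Balance all three operands to the latest cycle before selecting.
        target = max(sel.cycle, a.cycle, b.cycle)
        sel, a, b = (self.align(s, target) for s in (sel, a, b))
        return ToySignal(f"mux({sel.name}, {a.name}, {b.name})", target)


dom = ToyDomain()
sel = dom.signal("sel")   # valid at cycle 0
a = dom.signal("a")       # valid at cycle 0
dom.next()
dom.next()
b = dom.signal("b")       # valid at cycle 2
out = dom.mux(sel, a, b)  # sel and a are each delayed by two cycles
```

In this sketch the mux output lands at cycle 2 and four delay registers are inserted, two for `sel` and two for `a`.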

2. All Designs Migrated to V5

  • 35 designs under designs/ fully migrated to V5 cycle-aware syntax
  • Replaced m.out() registers → domain.state(), if Wire else → mux(), @function helpers → plain functions
  • JIT-dependent designs (e.g., those using m.new(), m.array(), m.state(spec)) kept in JIT mode with compile_cycle_aware() compatibility fixes
  • New designs added: RegisterFile, BypassUnit, IssueQueue, digital_filter, fm16 NPU system, etc.

3. All Testbenches Migrated to V5

  • 32 testbenches rewritten using CycleAwareTb with .next() cycle advancement
  • Removed all at=N parameters from drive()/expect() calls
  • Multi-cycle testbenches use explicit tb.next() between cycles for clear temporal structure
  • All 32 TBs pass compilation verification

4. Simulation Speedup (docs/simulation.md)

  • Compiled-code simulation model: RTL compiled to native C++ struct with eval()/tick()
  • SIMD acceleration: Wire<N> bit-vector operations use __uint128_t and SIMD intrinsics
  • Signal change propagation: pyc_change_detect.hpp tracks dirty signals to skip unchanged eval paths
  • PGO (Profile-Guided Optimization): 2-pass build workflow with llvm-profdata for hot-path optimization
  • RegisterFile benchmark: 100K cycles in ~27ms compiled simulation
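
The eval()/tick() split and the dirty-signal skip can be illustrated with a small Python model. This is only a sketch of the general technique, not the generated C++ struct or the contents of pyc_change_detect.hpp; the class `ToyAdderModel` and its fields are hypothetical.

```python
class ToyAdderModel:
    """Toy two-phase model: eval() settles combinational logic, tick() latches registers."""

    def __init__(self):
        self.a = 0
        self.b = 0
        self.sum_comb = 0          # combinational output
        self.acc = 0               # registered accumulator
        self._last_inputs = None   # snapshot used for change detection
        self.evals_run = 0
        self.evals_skipped = 0

    def eval(self):
        # Skip the combinational path when no input changed since the last eval.
        inputs = (self.a, self.b)
        if inputs == self._last_inputs:
            self.evals_skipped += 1
            return
        self._last_inputs = inputs
        self.sum_comb = (self.a + self.b) & 0xFFFF
        self.evals_run += 1

    def tick(self):
        # Clock edge: the register captures the settled combinational value.
        self.acc = (self.acc + self.sum_comb) & 0xFFFF


model = ToyAdderModel()
for a, b in [(1, 2), (1, 2), (1, 2), (4, 5)]:
    model.a, model.b = a, b
    model.eval()
    model.tick()
```

With this stimulus only two of the four eval() calls do any work, while tick() still advances the accumulator every cycle.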

5. Cycle-Balance DFF Optimization (docs/cycle_balance_improvement.md)

  • Shared delay-chain interning: compiler reuses (value, clock, reset, delay_depth) delay chains across multiple fanout paths
  • Redundant DFF elimination: pyc-eliminate-dead-state removes unused registers after cycle balancing
  • Result: reduced gate count and area by avoiding per-fanout duplicate delay chains
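
A minimal sketch of delay-chain interning, assuming chains are keyed exactly by (value, clock, reset, delay_depth) as described above. The `DelayChainPool` class is hypothetical, and building a depth-n chain on top of the depth n-1 chain is an assumption of this sketch rather than something the PR states.

```python
class DelayChainPool:
    """Interns delay chains so that equal (value, clock, reset, depth) requests share DFFs."""

    def __init__(self):
        self._chains = {}      # (value, clock, reset, depth) -> DFF node name
        self.dff_inputs = {}   # DFF node name -> name of the node feeding it

    def delayed(self, value, clock, reset, depth):
        if depth < 0:
            raise ValueError("depth must be non-negative")
        if depth == 0:
            return value
        key = (value, clock, reset, depth)
        if key not in self._chains:
            # Extend the shorter chain by one register instead of starting over.
            prev = self.delayed(value, clock, reset, depth - 1)
            node = f"dff{len(self.dff_inputs)}"
            self.dff_inputs[node] = prev
            self._chains[key] = node
        return self._chains[key]

    @property
    def dff_count(self):
        return len(self.dff_inputs)


pool = DelayChainPool()
tap_a = pool.delayed("x", "clk", "rst", 3)  # creates three DFFs
tap_b = pool.delayed("x", "clk", "rst", 3)  # second fanout reuses the same chain
tap_c = pool.delayed("x", "clk", "rst", 2)  # taps the middle of the existing chain
```

Three fanout paths end up sharing three registers, where per-fanout chains would have needed eight.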

Test Plan

  • All 35 designs compile successfully (MLIR emission verified)
  • All 32 testbenches compile successfully
  • RegisterFile 100K-cycle simulation produces correct results
  • Bypass unit stress test with SVA assertions passes
  • IssueQueue multi-stream enqueue/issue drain test passes

Made with Cursor

Mac and others added 27 commits February 10, 2026 19:14
Direct-form FIR filter: y[n] = c0·x[n] + c1·x[n-1] + c2·x[n-2] + c3·x[n-3]
with 16-bit signed input, 16-bit coefficients, 34-bit accumulator.

- digital_filter.py: pyCircuit RTL (shift register + parallel MAC)
- filter_capi.cpp: C API wrapper for compiled RTL
- emulate_filter.py: terminal UI with delay line, waveform display,
  5 test scenarios (impulse, step, ramp, alternating, large values)
- All tests verified against true RTL simulation via ctypes
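
For reference, the direct-form equation above can be written as a short software model. This is a sketch, not the code in digital_filter.py or emulate_filter.py; the function name `fir4` is hypothetical, and signed coefficients are an assumption (the commit message only says "16-bit").

```python
def fir4(samples, coeffs):
    """Direct-form 4-tap FIR: y[n] = c0*x[n] + c1*x[n-1] + c2*x[n-2] + c3*x[n-3]."""
    if len(coeffs) != 4:
        raise ValueError("expected exactly four coefficients")
    lo, hi = -(1 << 15), (1 << 15)
    if not all(lo <= c < hi for c in coeffs):
        raise ValueError("coefficient does not fit in 16 signed bits")
    delay = [0, 0, 0, 0]  # x[n], x[n-1], x[n-2], x[n-3]
    outputs = []
    for x in samples:
        if not lo <= x < hi:
            raise ValueError("sample does not fit in 16 signed bits")
        delay = [x] + delay[:3]  # shift register
        acc = sum(c * d for c, d in zip(coeffs, delay))  # parallel MAC
        # Four 16x16 signed products sum to at most 2**32 in magnitude,
        # so the result always fits in a 34-bit signed accumulator.
        assert -(1 << 33) <= acc < (1 << 33)
        outputs.append(acc)
    return outputs


impulse = fir4([1, 0, 0, 0, 0], [1, 2, 3, 4])
step = fir4([1, 1, 1, 1, 1], [1, 2, 3, 4])
```

The impulse response reproduces the coefficients in order, and the step response settles at their sum once the delay line is full.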

Co-authored-by: Cursor <cursoragent@cursor.com>
Sync pyCircuit cycle-aware additions with Janus Core design
Co-authored-by: Cursor <cursoragent@cursor.com>
…spec

Add the Tile Management Unit (TMU) with 8-station bidirectional ring
interconnect, SPB/MGB buffering, configurable 1MB TileReg, and
cycle-accurate C++/SV testbenches. Include architecture spec document.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add run/build scripts for C++ and Verilator simulation, RTL generation
script, and trace visualization tools (SVG timeline, ring animation,
VCD-based ring animation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
janus/tmu: add TMU ring interconnect implementation and spec
Add 16x16 systolic array matrix multiplication accelerator (Cube module)
Port Verilog design to pyCircuit (traffic lights + dodgeball game)
BF16 fused multiply-accumulate: acc(FP32) += a(BF16) × b(BF16)
Built from first principles using HA, FA, RCA, CSA, Wallace tree,
barrel shifters, and LZC — all from primitive_standard_cells.py.

4-stage pipeline with critical path analysis:
  Stage 1: Unpack + Exp Add        depth=8
  Stage 2: 8×8 Multiply (Wallace)  depth=46
  Stage 3: Align + Add             depth=21
  Stage 4: Normalize + Pack        depth=31

100/100 test cases pass (true RTL simulation via ctypes).
Max relative error: 5.36e-04 (limited by BF16 7-bit mantissa).
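
A software golden model for acc(FP32) += a(BF16) × b(BF16) might look like the sketch below. It is not the test harness from this commit; the function names are hypothetical, and it assumes BF16 operands are passed as raw 16-bit patterns. It relies on the fact that a BF16 value is the upper 16 bits of the corresponding FP32 encoding.

```python
import struct


def bf16_bits_to_float(bits):
    """Interpret a 16-bit BF16 pattern as the top half of an FP32 word."""
    return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]


def to_fp32(x):
    """Round a Python float (double) to the nearest FP32 value."""
    # struct raises OverflowError if x is finite but too large for FP32.
    return struct.unpack(">f", struct.pack(">f", x))[0]


def fmac_reference(acc, a_bits, b_bits):
    """Return acc + a*b, with a and b given as BF16 bit patterns and acc as FP32."""
    # The product of two 8-bit significands is exact in double precision.
    product = bf16_bits_to_float(a_bits) * bf16_bits_to_float(b_bits)
    return to_fp32(acc + product)


result = fmac_reference(1.0, 0x3FC0, 0x4000)  # 1.0 + 1.5 * 2.0
```

A model like this gives a correctly rounded reference value, against which a relative error such as the one reported above can be measured.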

Co-authored-by: Cursor <cursoragent@cursor.com>
- Add carry-select adder to primitive_standard_cells.py: splits N-bit
  addition into parallel halves, depth N+2 instead of 2N
- Fix Wallace tree depth tracking: parallel CSAs share same depth level
- Use carry-select adder for multiplier final addition
- Pipeline now balanced: S1=8, S2=28, S3=21, S4=31 (critical path=31)
- 100/100 tests still pass
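
The carry-select idea can be checked with a bit-level Python model. This is a sketch and not the code in primitive_standard_cells.py; the function names are hypothetical. One way to arrive at the N+2 versus 2N figures is to assume two gate levels per ripple-carry bit and two for the final select mux, so each half costs N levels and the mux adds 2; that accounting is an assumption, not taken from the source.

```python
def ripple_add(a, b, cin, width):
    """Bit-by-bit ripple-carry addition; returns (sum, carry_out)."""
    total = 0
    carry = cin
    for i in range(width):
        abit = (a >> i) & 1
        bbit = (b >> i) & 1
        total |= (abit ^ bbit ^ carry) << i
        carry = (abit & bbit) | (carry & (abit ^ bbit))
    return total, carry


def carry_select_add(a, b, width):
    """Add two width-bit numbers by splitting them into halves computed in parallel."""
    half = width // 2
    hi_width = width - half
    mask_lo = (1 << half) - 1
    a_lo, b_lo = a & mask_lo, b & mask_lo
    a_hi, b_hi = a >> half, b >> half
    lo_sum, lo_carry = ripple_add(a_lo, b_lo, 0, half)
    # The upper half is computed for both possible carry-ins ...
    hi0, cout0 = ripple_add(a_hi, b_hi, 0, hi_width)
    hi1, cout1 = ripple_add(a_hi, b_hi, 1, hi_width)
    # ... and the lower half's carry-out selects the right one.
    hi_sum, cout = (hi1, cout1) if lo_carry else (hi0, cout0)
    return lo_sum | (hi_sum << half), cout


total, carry_out = carry_select_add(0x1234, 0x0FFF, 16)
```

In hardware the three ripple adders run concurrently, which is where the depth saving comes from; the Python model only confirms the result is the same as a plain addition.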

Co-authored-by: Cursor <cursoragent@cursor.com>
Move partial product generation + 2 CSA compression rounds into Stage 1
(alongside unpack/exponent). Stage 2 now only completes remaining CSA
rounds + carry-select final addition.

Pipeline depth: S1=13, S2=22, S3=21, S4=31 (was S1=8, S2=28)
Critical path unchanged at 31 (Stage 4), but S1/S2 gap reduced from
20 to 9 for better balance. 100/100 tests pass.

Co-authored-by: Cursor <cursoragent@cursor.com>
- npu_node.py: simplified NPU pyCircuit RTL (HBM inject + UB ports + FIFO)
- sw5809s.py: simplified SW5809s pyCircuit RTL (VOQ + crossbar + RR)
- fm16_system.py: behavioral system simulator with real-time visualization
  16 NPU full-mesh, all-to-all 512B traffic, BW + latency stats
- Results: 12.8 Tbps aggregate BW, Avg lat=3.2, P95=4, P99=5 cycles

Co-authored-by: Cursor <cursoragent@cursor.com>
Rewrote fm16_system.py to simulate both topologies in parallel:
  FM16: 16 NPU full mesh (4 links/pair, direct)
  SW16: 16 NPU star via SW5809s (32 links/NPU, VOQ+crossbar+RR)

Side-by-side real-time visualization: bandwidth, per-NPU bars,
latency stats (avg/P50/P95/P99/max), latency histograms.

Results (3000 cycles, 4Tbps HBM, all-to-all):
  FM16: 14.3 Tbps BW, avg lat 3.2, P99=5
  SW16: 1.8 Tbps BW, avg lat 439, P99=485
  (SW16 bottlenecked at crossbar: 1 pkt/output/cycle)

Co-authored-by: Cursor <cursoragent@cursor.com>
- BW statistics now show per-NPU and aggregate separately
- Added bottleneck explanation in final summary:
  FM16: 60 direct links per NPU = 6720 Gbps capacity
  SW16: 1 pkt/output/cycle per NPU = 112 Gbps (1.7% of FM16)
  Crossbar is the bottleneck, not the NPU→switch links

Co-authored-by: Cursor <cursoragent@cursor.com>
SW5809s now correctly modeled:
- 512×512 physical links (112Gbps each)
- 4 links bundled per logical port → 128×128 port crossbar
- Each port independently arbitrated, serves 4 pkt/cycle
- Each NPU uses 8 logical ports (32 links) to the switch
- ECMP: round-robin across dest NPU's 8 output ports
- VOQ per (input_port, output_port)

Results (both HBM-limited at 4Tbps):
  FM16: 895 Gbps/NPU, avg lat 3.2, 1-hop direct
  SW16: 895 Gbps/NPU, avg lat 5.0, 2-hop via switch
  Switch capacity: 57.3 Tbps (53% of FM16 mesh)

Co-authored-by: Cursor <cursoragent@cursor.com>
SW5809s now correctly models:
- Each of 128 input ports has its OWN independent RR pointer per dest NPU
- When multiple input ports independently pick same egress port → VOQ collision
- Compare 'independent' (real HW) vs 'coordinated' (ideal) ECMP modes

3-way comparison: FM16, SW16-independent, SW16-coordinated
Under high load (INJECT_BATCH=32):
  P99: FM16=8, SW16-indep=45, SW16-coord=35 (+29% from collision)
  Max: FM16=16, SW16-indep=506, SW16-coord=452
Port load imbalance: independent 1.00x (subtle but impactful on tail)
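
The reason cumulative port load can look balanced while the tail still suffers can be shown with a toy model. This is not the simulator in fm16_system.py: `simulate` is a hypothetical function, and starting every independent pointer at 0 is a deliberately worst-case assumption chosen to keep the example deterministic.

```python
from collections import Counter


def simulate(num_inputs, num_egress, cycles, coordinated):
    """Return (worst per-cycle picks on one egress port, cumulative picks per port)."""
    # Coordinated mode staggers the starting pointers so inputs never coincide;
    # independent mode lets every input start from the same position.
    pointers = [(i % num_egress) if coordinated else 0 for i in range(num_inputs)]
    cumulative = Counter()
    worst_per_cycle = 0
    for _ in range(cycles):
        picks = Counter()
        for i in range(num_inputs):
            port = pointers[i]
            picks[port] += 1
            pointers[i] = (port + 1) % num_egress  # per-input round-robin advance
        cumulative.update(picks)
        worst_per_cycle = max(worst_per_cycle, max(picks.values()))
    return worst_per_cycle, dict(cumulative)


indep_worst, indep_load = simulate(8, 8, 16, coordinated=False)
coord_worst, coord_load = simulate(8, 8, 16, coordinated=True)
```

Both modes deliver the same cumulative load to every egress port, but in the independent case all eight inputs hit the same port in each cycle, which is the kind of VOQ collision that shows up in tail latency rather than in average load.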

Co-authored-by: Cursor <cursoragent@cursor.com>
Each of 128 egress ports independently arbitrates to pick exactly 1
packet per cycle from all input VOQs. Total switch: 128 pkt/cycle.
INJECT_BATCH=8 to match switch capacity point.

VOQ collision now clearly visible:
  Independent RR: P99=168, Max=768
  Coordinated RR: P99=89,  Max=364
  Collision adds +89% P99, +111% max latency
  Port load imbalance: 1.02x (small but tail-impactful)

Co-authored-by: Cursor <cursoragent@cursor.com>
Track per-egress-port VOQ depth every cycle (snapshot before schedule).
Report avg/peak/max-peak depth alongside cumulative enqueue imbalance.

VOQ collision effect now clearly quantified:
  Independent RR: avg depth 21.8, peak 101
  Coordinated RR: avg depth 12.0, peak 60
  Independent VOQ is 1.8× deeper on average, 1.7× worse at peak
  → directly explains the P99 latency gap (168 vs 89 cycles)

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Removes the SyntaxError caused by a misplaced from __future__ import annotations
and drops the unused pycircuit import in the calculator emulator.

Made-with: Cursor
… syntax

- Add CycleAwareCircuit/CycleAwareDomain/CycleAwareSignal/StateSignal V5 frontend (v5.py)
- Add CycleAwareTb wrapper for testbenches with .next() cycle advancement
- Migrate 35 designs to V5: use cas(), mux(), domain.state(), domain.next()
- Migrate 32 testbenches to V5: replace at=N with CycleAwareTb.next()
- Add V5 programming tutorial and cycle-aware API documentation
- Move examples (fm16, fmac, digital_filter, etc.) into designs/examples/
- Add iplib with V5-compatible IP blocks

Made-with: Cursor
@hengliao1972 hengliao1972 merged commit 1173c8c into LinxISA:main Mar 27, 2026
3 of 4 checks passed