feat: PyCircuit V5 — cycle-aware grammar, design migration, simulation speedup & cycle-balance optimization by hengliao1972 · Pull Request #45 · LinxISA/pyCircuit

hengliao1972 · 2026-03-27T03:06:05Z

Summary

This PR introduces PyCircuit V5, a major upgrade encompassing a new cycle-aware programming model, complete design/testbench migration, simulation performance improvements, and compiler optimizations.

1. PyCircuit V5 Cycle-Aware Grammar

New V5 frontend (compiler/frontend/pycircuit/v5.py): CycleAwareCircuit, CycleAwareDomain, CycleAwareSignal, StateSignal
domain.next() — advances the logical cycle index, clearly demarcating pipeline stages and sequential logic phases
domain.state() — declares feedback registers (StateSignal) with .set() for next-cycle values and conditional when= writes
cas() — wraps raw Wire into CycleAwareSignal at a specific cycle
mux() — V5 conditional selection with automatic cycle balancing across branches
compile_cycle_aware() — new compilation entry point supporting both JIT and eager modes
CycleAwareTb — V5 testbench wrapper with .next() for cycle advancement (replaces at=N parameters)
Documentation: docs/PyCircuit V5 Programming Tutorial.md, docs/PyCurcit V5_CYCLE_AWARE_API.md
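
The idea behind cycle-tagged signals and automatic balancing in mux() can be illustrated with a small standalone model. This is only a sketch under stated assumptions: ToySignal, ToyDomain, delay_to and toy_mux are hypothetical names invented for illustration, not the pyCircuit V5 API, and the real frontend may balance cycles differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySignal:
    """Hypothetical stand-in: a value tagged with the logical cycle it is valid in."""
    name: str
    cycle: int
    delays: int = 0  # number of registers modelled as inserted on this signal


class ToyDomain:
    """Hypothetical stand-in for a clock domain that tracks a logical cycle index."""

    def __init__(self):
        self.cycle = 0

    def next(self):
        # Advance the logical cycle index (one pipeline stage later).
        self.cycle += 1
        return self.cycle

    def signal(self, name):
        # Create a signal that is valid at the current logical cycle.
        return ToySignal(name, self.cycle)


def delay_to(sig, cycle):
    """Model inserting (cycle - sig.cycle) registers so `sig` is valid at `cycle`."""
    if cycle < sig.cycle:
        raise ValueError("cannot move a signal backwards in time")
    return ToySignal(sig.name, cycle, sig.delays + (cycle - sig.cycle))


def toy_mux(sel, a, b):
    """Align all three operands to the latest cycle, then select."""
    target = max(sel.cycle, a.cycle, b.cycle)
    aligned = tuple(delay_to(s, target) for s in (sel, a, b))
    out = ToySignal(f"mux({sel.name}, {a.name}, {b.name})", target)
    return out, aligned


dom = ToyDomain()
sel = dom.signal("sel")   # valid at cycle 0
a = dom.signal("a")       # valid at cycle 0
dom.next()
dom.next()
b = dom.signal("b")       # valid at cycle 2
out, aligned = toy_mux(sel, a, b)
```

In this toy, `sel` and `a` each pick up two modelled registers so that all operands meet at cycle 2, while `b` needs none.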

2. All Designs Migrated to V5

35 designs under designs/ fully migrated to V5 cycle-aware syntax
Replaced m.out() registers → domain.state(), if Wire else → mux(), @function helpers → plain functions
JIT-dependent designs (e.g., m.new(), m.array(), m.state(spec)) kept in JIT mode with compile_cycle_aware() compatibility fixes
New designs added: RegisterFile, BypassUnit, IssueQueue, digital_filter, fm16 NPU system, etc.

3. All Testbenches Migrated to V5

32 testbenches rewritten using CycleAwareTb with .next() cycle advancement
Removed all at=N parameters from drive()/expect() calls
Multi-cycle testbenches use explicit tb.next() between cycles for clear temporal structure
All 32 TBs pass compilation verification
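
The difference between per-call at=N arguments and explicit .next() advancement can be sketched with a tiny recorder. ToyTb below is a hypothetical class written for illustration; it is not CycleAwareTb, and the real drive()/expect() signatures are not reproduced here.

```python
class ToyTb:
    """Hypothetical testbench recorder: the current cycle is implicit state."""

    def __init__(self):
        self.cycle = 0
        self.events = []  # (cycle, kind, port, value)

    def drive(self, port, value):
        # Record a stimulus at the current cycle; no at=N argument is needed.
        self.events.append((self.cycle, "drive", port, value))

    def expect(self, port, value):
        # Record an expected output at the current cycle.
        self.events.append((self.cycle, "expect", port, value))

    def next(self):
        # Advance to the following cycle.
        self.cycle += 1


tb = ToyTb()
tb.drive("en", 1)
tb.drive("din", 5)
tb.next()                 # cycle 1
tb.drive("en", 0)
tb.expect("dout", 5)
tb.next()                 # cycle 2
tb.expect("dout", 5)
```

The temporal structure is then visible from the position of the tb.next() calls rather than from scattered cycle numbers.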

4. Simulation Speedup (docs/simulation.md)

Compiled-code simulation model: RTL compiled to native C++ struct with eval()/tick()
SIMD acceleration: Wire<N> bit-vector operations use __uint128_t and SIMD intrinsics
Signal change propagation: pyc_change_detect.hpp tracks dirty signals to skip unchanged eval paths
PGO (Profile-Guided Optimization): 2-pass build workflow with llvm-profdata for hot-path optimization
RegisterFile benchmark: 100K cycles in ~27ms compiled simulation
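
The dirty-signal idea can be sketched in a few lines of Python. This is an illustrative model only: ToySim and its methods are hypothetical, the node list is assumed to be in topological order, and nothing here reflects the actual contents of pyc_change_detect.hpp or the generated C++.

```python
class ToySim:
    """Hypothetical evaluator that skips nodes whose inputs did not change."""

    def __init__(self):
        self.values = {}    # signal name -> current value
        self.nodes = []     # (output, function, input names), topologically ordered
        self.dirty = set()  # signals changed since the last eval()
        self.evals = 0      # how many node functions were actually run

    def add_input(self, name, value=0):
        self.values[name] = value
        self.dirty.add(name)

    def add_node(self, out, fn, ins):
        self.nodes.append((out, fn, tuple(ins)))
        self.values[out] = None

    def poke(self, name, value):
        # Mark the input dirty only if its value really changes.
        if self.values[name] != value:
            self.values[name] = value
            self.dirty.add(name)

    def eval(self):
        for out, fn, ins in self.nodes:
            if not self.dirty.intersection(ins):
                continue  # no input changed: skip this eval path
            self.evals += 1
            new = fn(*(self.values[i] for i in ins))
            if new != self.values[out]:
                self.values[out] = new
                self.dirty.add(out)
        self.dirty.clear()


sim = ToySim()
sim.add_input("a", 1)
sim.add_input("b", 2)
sim.add_input("c", 3)
sim.add_node("s", lambda a, b: a + b, ["a", "b"])
sim.add_node("p", lambda s: s * 2, ["s"])
sim.add_node("q", lambda c: c + 1, ["c"])

sim.eval()                # first pass evaluates all three nodes
first_pass = sim.evals
sim.poke("c", 10)
sim.eval()                # only q depends on c
after_c = sim.evals
sim.poke("a", 1)          # same value: nothing becomes dirty
sim.eval()
after_noop = sim.evals
```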

5. Cycle-Balance DFF Optimization (docs/cycle_balance_improvement.md)

Shared delay-chain interning: compiler reuses (value, clock, reset, delay_depth) delay chains across multiple fanout paths
Redundant DFF elimination: pyc-eliminate-dead-state removes unused registers after cycle balancing
Result: reduced gate count and area by avoiding per-fanout duplicate delay chains
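
A minimal sketch of the interning idea, assuming delay chains are keyed exactly by the (value, clock, reset, delay_depth) tuple named above. DelayChainPool is a hypothetical name and signals are modelled as plain strings; the compiler's actual data structures are not shown in this PR text.

```python
class DelayChainPool:
    """Hypothetical pool that reuses delay registers across fanout paths."""

    def __init__(self):
        self._interned = {}  # (value, clock, reset, depth) -> delayed signal name
        self.dff_count = 0   # registers actually created

    def delayed(self, value, clock, reset, depth):
        if depth == 0:
            return value
        key = (value, clock, reset, depth)
        if key not in self._interned:
            # Build depth d from the (possibly shared) chain of depth d - 1.
            prev = self.delayed(value, clock, reset, depth - 1)
            self._interned[key] = f"dff({prev})"
            self.dff_count += 1
        return self._interned[key]


pool = DelayChainPool()
# Three fanout paths need x delayed by 2, 3 and 3 cycles.
p1 = pool.delayed("x", "clk", "rst", 2)
p2 = pool.delayed("x", "clk", "rst", 3)
p3 = pool.delayed("x", "clk", "rst", 3)
naive_dffs = 2 + 3 + 3      # one private chain per fanout
shared_dffs = pool.dff_count
```

With private chains the three paths would cost 8 registers in this toy; with interning they share a single 3-deep chain.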

Test Plan

All 35 designs compile successfully (MLIR emission verified)
All 32 testbenches compile successfully
RegisterFile 100K-cycle simulation produces correct results
Bypass unit stress test with SVA assertions passes
IssueQueue multi-stream enqueue/issue drain test passes

Made with Cursor

Direct-form FIR filter: y[n] = c0·x[n] + c1·x[n-1] + c2·x[n-2] + c3·x[n-3]
with 16-bit signed input, 16-bit coefficients, 34-bit accumulator.

- digital_filter.py: pyCircuit RTL (shift register + parallel MAC)
- filter_capi.cpp: C API wrapper for compiled RTL
- emulate_filter.py: terminal UI with delay line, waveform display, 5 test scenarios (impulse, step, ramp, alternating, large values)
- All tests verified against true RTL simulation via ctypes

Co-authored-by: Cursor <cursoragent@cursor.com>
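
A software reference model of the 4-tap equation above, written as a sketch rather than taken from digital_filter.py; fir4 and fits_signed are hypothetical helper names. It also checks the accumulator sizing: four products of two 16-bit signed values can reach 2^32, which needs 34 signed bits.

```python
def fir4(samples, coeffs):
    """y[n] = c0*x[n] + c1*x[n-1] + c2*x[n-2] + c3*x[n-3], zero initial state."""
    if len(coeffs) != 4:
        raise ValueError("expected exactly four coefficients")
    delay = [0, 0, 0]  # x[n-1], x[n-2], x[n-3]
    out = []
    for x in samples:
        taps = [x] + delay
        out.append(sum(c * t for c, t in zip(coeffs, taps)))
        delay = taps[:3]  # shift register: newest sample enters at the front
    return out


def fits_signed(value, bits):
    """True if `value` is representable as a two's-complement number of `bits` bits."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


impulse = fir4([1, 0, 0, 0, 0], [1, 2, 3, 4])       # impulse response equals the taps
worst = fir4([-32768] * 4, [-32768] * 4)[-1]        # all four taps at the extreme
```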

Sync pyCircuit cycle-aware additions with Janus Core design

Co-authored-by: Cursor <cursoragent@cursor.com>

…spec

Add the Tile Management Unit (TMU) with 8-station bidirectional ring interconnect, SPB/MGB buffering, configurable 1MB TileReg, and cycle-accurate C++/SV testbenches. Include architecture spec document.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add run/build scripts for C++ and Verilator simulation, RTL generation script, and trace visualization tools (SVG timeline, ring animation, VCD-based ring animation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

janus/tmu: add TMU ring interconnect implementation and spec

Add 16x16 systolic array matrix multiplication accelerator (Cube module)

Port Verilog design to pyCircuit (traffic lights + dodgeball game)

BF16 fused multiply-accumulate: acc(FP32) += a(BF16) × b(BF16)

Built from first principles using HA, FA, RCA, CSA, Wallace tree, barrel shifters, and LZC, all from primitive_standard_cells.py.

4-stage pipeline with critical path analysis:
  Stage 1: Unpack + Exp Add         depth=8
  Stage 2: 8×8 Multiply (Wallace)   depth=46
  Stage 3: Align + Add              depth=21
  Stage 4: Normalize + Pack         depth=31

100/100 test cases pass (true RTL simulation via ctypes).
Max relative error: 5.36e-04 (limited by BF16 7-bit mantissa).

Co-authored-by: Cursor <cursoragent@cursor.com>
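
A software reference for acc(FP32) += a(BF16) × b(BF16) can be sketched with the standard struct module. The helper names are hypothetical and this is not the RTL's algorithm: the sum is formed in a Python double and then rounded to FP32, so it may differ in the last bit from a single-rounding hardware FMA, and float_to_bf16 truncates rather than rounds.

```python
import struct


def bf16_to_float(bits):
    """Interpret a 16-bit BF16 pattern as the upper half of an FP32 word."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def float_to_bf16(x):
    """Convert to BF16 by truncating the low 16 bits of the FP32 encoding."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16


def to_f32(x):
    """Round a Python float (double) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fmac_ref(acc, a_bits, b_bits):
    """Reference model: the BF16 product is exact in double; the sum is rounded to FP32."""
    return to_f32(acc + bf16_to_float(a_bits) * bf16_to_float(b_bits))


acc = fmac_ref(0.5, 0x3FC0, 0x4000)   # 0.5 + 1.5 * 2.0
acc = fmac_ref(acc, 0xC000, 0x3F80)   # + (-2.0) * 1.0
```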

- Add carry-select adder to primitive_standard_cells.py: splits N-bit addition into parallel halves, depth N+2 instead of 2N
- Fix Wallace tree depth tracking: parallel CSAs share same depth level
- Use carry-select adder for multiplier final addition
- Pipeline now balanced: S1=8, S2=28, S3=21, S4=31 (critical path=31)
- 100/100 tests still pass

Co-authored-by: Cursor <cursoragent@cursor.com>
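
The carry-select technique can be sketched at the bit level as follows. This is not the code in primitive_standard_cells.py; the function names are hypothetical, and the depth helpers assume two gate levels per ripple-carry bit and two levels for the final mux, which is one model that reproduces the quoted N+2 versus 2N.

```python
def ripple_add(a, b, width, cin=0):
    """Ripple-carry addition built from per-bit full-adder equations."""
    total, carry = 0, cin
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return total, carry


def carry_select_add(a, b, width):
    """Add the low half once and the high half twice (cin=0 and cin=1), then select."""
    half = width // 2
    hi_width = width - half
    mask = (1 << half) - 1
    lo, c_lo = ripple_add(a & mask, b & mask, half)
    hi0, c0 = ripple_add(a >> half, b >> half, hi_width, 0)  # speculative: no carry in
    hi1, c1 = ripple_add(a >> half, b >> half, hi_width, 1)  # speculative: carry in
    hi, cout = (hi1, c1) if c_lo else (hi0, c0)              # mux on the low-half carry
    return lo | (hi << half), cout


def ripple_depth(width):
    # Assumed model: two gate levels per bit on the carry chain.
    return 2 * width


def carry_select_depth(width):
    # Assumed model: both halves ripple in parallel, plus two levels for the mux.
    return 2 * (width - width // 2) + 2


result = carry_select_add(0xFF, 0x01, 8)   # (0x00, carry out 1)
```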

Move partial product generation + 2 CSA compression rounds into Stage 1 (alongside unpack/exponent). Stage 2 now only completes remaining CSA rounds + carry-select final addition.

Pipeline depth: S1=13, S2=22, S3=21, S4=31 (was S1=8, S2=28)
Critical path unchanged at 31 (Stage 4), but S1/S2 gap reduced from 20 to 9 for better balance.

100/100 tests pass.

Co-authored-by: Cursor <cursoragent@cursor.com>

- npu_node.py: simplified NPU pyCircuit RTL (HBM inject + UB ports + FIFO)
- sw5809s.py: simplified SW5809s pyCircuit RTL (VOQ + crossbar + RR)
- fm16_system.py: behavioral system simulator with real-time visualization
  16 NPU full-mesh, all-to-all 512B traffic, BW + latency stats
- Results: 12.8 Tbps aggregate BW, Avg lat=3.2, P95=4, P99=5 cycles

Co-authored-by: Cursor <cursoragent@cursor.com>

Rewrote fm16_system.py to simulate both topologies in parallel:
  FM16: 16 NPU full mesh (4 links/pair, direct)
  SW16: 16 NPU star via SW5809s (32 links/NPU, VOQ+crossbar+RR)

Side-by-side real-time visualization: bandwidth, per-NPU bars, latency stats (avg/P50/P95/P99/max), latency histograms.

Results (3000 cycles, 4Tbps HBM, all-to-all):
  FM16: 14.3 Tbps BW, avg lat 3.2, P99=5
  SW16: 1.8 Tbps BW, avg lat 439, P99=485
  (SW16 bottlenecked at crossbar: 1 pkt/output/cycle)

Co-authored-by: Cursor <cursoragent@cursor.com>

- BW statistics now show per-NPU and aggregate separately
- Added bottleneck explanation in final summary:
  FM16: 60 direct links per NPU = 6720 Gbps capacity
  SW16: 1 pkt/output/cycle per NPU = 112 Gbps (1.7% of FM16)
  Crossbar is the bottleneck, not the NPU→switch links

Co-authored-by: Cursor <cursoragent@cursor.com>

SW5809s now correctly modeled:
- 512×512 physical links (112Gbps each)
- 4 links bundled per logical port → 128×128 port crossbar
- Each port independently arbitrated, serves 4 pkt/cycle
- Each NPU uses 8 logical ports (32 links) to the switch
- ECMP: round-robin across dest NPU's 8 output ports
- VOQ per (input_port, output_port)

Results (both HBM-limited at 4Tbps):
  FM16: 895 Gbps/NPU, avg lat 3.2, 1-hop direct
  SW16: 895 Gbps/NPU, avg lat 5.0, 2-hop via switch
  Switch capacity: 57.3 Tbps (53% of FM16 mesh)

Co-authored-by: Cursor <cursoragent@cursor.com>
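
The capacity figures quoted in these commits can be reproduced with a few lines of arithmetic. The derivation below is one reading of the numbers, not taken from fm16_system.py: it assumes a 16-NPU full mesh with 4 links per NPU pair at 112 Gbps per link, and that "FM16 mesh" capacity means 16 times the per-NPU link capacity.

```python
LINK_GBPS = 112          # per physical link, from the commit message
NPUS = 16
LINKS_PER_PAIR = 4       # full-mesh links between each NPU pair
SWITCH_LINKS = 512       # physical links on the switch
LINKS_PER_PORT = 4       # links bundled per logical port

# Full mesh: each NPU connects to the 15 others.
links_per_npu = (NPUS - 1) * LINKS_PER_PAIR            # 60
mesh_per_npu_gbps = links_per_npu * LINK_GBPS          # 6720
mesh_total_tbps = NPUS * mesh_per_npu_gbps / 1000      # assumed definition of "FM16 mesh"

# Switch: bundle physical links into logical ports and share them among NPUs.
logical_ports = SWITCH_LINKS // LINKS_PER_PORT         # 128
ports_per_npu = logical_ports // NPUS                  # 8
switch_links_per_npu = ports_per_npu * LINKS_PER_PORT  # 32
switch_tbps = SWITCH_LINKS * LINK_GBPS / 1000          # 57.344

switch_vs_mesh = switch_tbps / mesh_total_tbps
single_link_share = LINK_GBPS / mesh_per_npu_gbps      # one 112 Gbps link vs 6720 Gbps
```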

SW5809s now correctly models:
- Each of 128 input ports has its OWN independent RR pointer per dest NPU
- When multiple input ports independently pick same egress port → VOQ collision
- Compare 'independent' (real HW) vs 'coordinated' (ideal) ECMP modes

3-way comparison: FM16, SW16-independent, SW16-coordinated

Under high load (INJECT_BATCH=32):
  P99: FM16=8, SW16-indep=45, SW16-coord=35 (+29% from collision)
  Max: FM16=16, SW16-indep=506, SW16-coord=452
  Port load imbalance: independent 1.00x (subtle but impactful on tail)

Co-authored-by: Cursor <cursoragent@cursor.com>
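
Why independent round-robin pointers can collide while the cumulative port load still looks balanced can be shown with a deliberately small, deterministic toy. It is a worst-case sketch, not the model in sw5809s.py: every input sends one packet per cycle to the same destination, all independent pointers start at 0 (so they stay in lockstep), and simulate_ecmp is a hypothetical name.

```python
from collections import Counter


def simulate_ecmp(num_inputs, num_ports, cycles, coordinated):
    """Return (peak arrivals at one egress port in a cycle, cumulative load per port)."""
    pointers = [0] * num_inputs  # one RR pointer per input (independent mode)
    shared = 0                   # single RR pointer (coordinated mode)
    total = Counter()
    peak = 0
    for _ in range(cycles):
        per_cycle = Counter()
        for i in range(num_inputs):
            if coordinated:
                port = shared
                shared = (shared + 1) % num_ports
            else:
                port = pointers[i]
                pointers[i] = (pointers[i] + 1) % num_ports
            per_cycle[port] += 1
        total.update(per_cycle)
        peak = max(peak, max(per_cycle.values()))
    return peak, total


indep_peak, indep_total = simulate_ecmp(16, 8, 8, coordinated=False)
coord_peak, coord_total = simulate_ecmp(16, 8, 8, coordinated=True)
```

In this toy both modes deliver the same number of packets to every port over the run, yet the independent mode sends all 16 packets of a cycle to one port while the coordinated mode sends 2 to each. If an egress port drains only a bounded number of packets per cycle, that burstiness is what builds queue depth and tail latency.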

Each of 128 egress ports independently arbitrates to pick exactly 1 packet per cycle from all input VOQs. Total switch: 128 pkt/cycle.

INJECT_BATCH=8 to match switch capacity point.

VOQ collision now clearly visible:
  Independent RR: P99=168, Max=768
  Coordinated RR: P99=89, Max=364
  Collision adds +89% P99, +111% max latency
  Port load imbalance: 1.02x (small but tail-impactful)

Co-authored-by: Cursor <cursoragent@cursor.com>

Track per-egress-port VOQ depth every cycle (snapshot before schedule). Report avg/peak/max-peak depth alongside cumulative enqueue imbalance.

VOQ collision effect now clearly quantified:
  Independent RR: avg depth 21.8, peak 101
  Coordinated RR: avg depth 12.0, peak 60
  Independent VOQ is 1.8× deeper on average, 1.7× worse at peak
  → directly explains the P99 latency gap (168 vs 89 cycles)

Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

Removes SyntaxError from misplaced `from __future__ import annotations` and drops unused pycircuit import in calculator emulator.

Made-with: Cursor

… syntax

- Add CycleAwareCircuit/CycleAwareDomain/CycleAwareSignal/StateSignal V5 frontend (v5.py)
- Add CycleAwareTb wrapper for testbenches with .next() cycle advancement
- Migrate 35 designs to V5: use cas(), mux(), domain.state(), domain.next()
- Migrate 32 testbenches to V5: replace at=N with CycleAwareTb.next()
- Add V5 programming tutorial and cycle-aware API documentation
- Move examples (fm16, fmac, digital_filter, etc.) into designs/examples/
- Add iplib with V5-compatible IP blocks

Made-with: Cursor

Mac and others added 27 commits February 10, 2026 19:14

Merge pull request #1 from zhoubot/codex/hengliao-sync

2171f78

Sync pyCircuit cycle-aware additions with Janus Core design

chore: add .DS_Store, .pdf, .dSYM to .gitignore

31b8fd5

Co-authored-by: Cursor <cursoragent@cursor.com>

Merge pull request #4 from sheyuheng/tmu-impl-and-spec-hengliao

b03ff91

janus/tmu: add TMU ring interconnect implementation and spec

Merge pull request #2 from fengzhazha/cube-accelerator

368b8f5

Add 16x16 systolic array matrix multiplication accelerator (Cube module)

Add traffic lights pyCircuit example

ea79aa1

Fix traffic lights countdown and add debug

b5fc5da

Improve traffic lights visualization

d129cad

Add dodgeball game pycircuit demo

db8d434

Merge pull request #5 from Auyuir/traffic-lights-ce-pyc

deeb190

Port Verilog design to pyCircuit (traffic lights + dodgeball game)

Merge PR #6: Enhanced pyCircuit simulation/verification capability

636115c

examples/fm16: sync fm16 updates (sw5809s.py)

83f0cdf

Co-authored-by: Cursor <cursoragent@cursor.com>

Merge branch 'LinxISA:main' into main

bd7098d

fix(examples): put __future__ imports first in emulate scripts

b07f034

Removes SyntaxError from misplaced `from __future__ import annotations` and drops unused pycircuit import in calculator emulator.

Made-with: Cursor

hengliao1972 merged commit 1173c8c into LinxISA:main Mar 27, 2026
3 of 4 checks passed

hengliao1972 merged 27 commits into LinxISA:main from hengliao1972:main
