HengLiao updates to PyCircuit language by hengliao1972 · Pull Request #37 · LinxISA/pyCircuit

hengliao1972 · 2026-03-17T07:54:51Z

No description provided.

Direct-form FIR filter: y[n] = c0·x[n] + c1·x[n-1] + c2·x[n-2] + c3·x[n-3] with 16-bit signed input, 16-bit coefficients, 34-bit accumulator.

- digital_filter.py: pyCircuit RTL (shift register + parallel MAC)
- filter_capi.cpp: C API wrapper for compiled RTL
- emulate_filter.py: terminal UI with delay line, waveform display, 5 test scenarios (impulse, step, ramp, alternating, large values)
- All tests verified against true RTL simulation via ctypes

Co-authored-by: Cursor <cursoragent@cursor.com>
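The filter equation above can be checked against a small software reference model. The sketch below is plain Python, not pyCircuit, and is not taken from digital_filter.py; the names `to_signed` and `fir4` are hypothetical. It assumes two's-complement wrapping at 16 bits for inputs and coefficients and at 34 bits for the accumulator. A 16×16 signed product needs 32 bits and summing four of them adds 2 more, which is presumably where the 34-bit width comes from.

```python
from collections import deque


def to_signed(value: int, bits: int) -> int:
    """Wrap an integer into the two's-complement signed range of `bits` bits."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def fir4(samples, coeffs, acc_bits=34):
    """4-tap direct-form FIR: y[n] = c0*x[n] + c1*x[n-1] + c2*x[n-2] + c3*x[n-3]."""
    assert len(coeffs) == 4
    # delay[0] holds x[n], delay[3] holds x[n-3]; the oldest sample falls off the end.
    delay = deque([0, 0, 0, 0], maxlen=4)
    outputs = []
    for x in samples:
        delay.appendleft(to_signed(x, 16))
        acc = sum(to_signed(c, 16) * d for c, d in zip(coeffs, delay))
        outputs.append(to_signed(acc, acc_bits))
    return outputs


# Impulse response reproduces the coefficients; step response is their running sum.
print(fir4([1, 0, 0, 0, 0], [1, 2, 3, 4]))
print(fir4([1, 1, 1, 1, 1], [1, 2, 3, 4]))
```

A model like this could serve as the expected-value source for the impulse, step, and large-value scenarios, although the PR itself verifies against the compiled RTL.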

Sync pyCircuit cycle-aware additions with Janus Core design

Co-authored-by: Cursor <cursoragent@cursor.com>

…spec

Add the Tile Management Unit (TMU) with 8-station bidirectional ring interconnect, SPB/MGB buffering, configurable 1MB TileReg, and cycle-accurate C++/SV testbenches. Include architecture spec document.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add run/build scripts for C++ and Verilator simulation, RTL generation script, and trace visualization tools (SVG timeline, ring animation, VCD-based ring animation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

janus/tmu: add TMU ring interconnect implementation and spec

Add 16x16 systolic array matrix multiplication accelerator (Cube module)

Port Verilog design to pyCircuit (traffic lights + dodgeball game)

BF16 fused multiply-accumulate: acc(FP32) += a(BF16) × b(BF16)

Built from first principles using HA, FA, RCA, CSA, Wallace tree, barrel shifters, and LZC, all from primitive_standard_cells.py.

4-stage pipeline with critical path analysis:
  Stage 1: Unpack + Exp Add          depth=8
  Stage 2: 8×8 Multiply (Wallace)    depth=46
  Stage 3: Align + Add               depth=21
  Stage 4: Normalize + Pack          depth=31

100/100 test cases pass (true RTL simulation via ctypes).
Max relative error: 5.36e-04 (limited by BF16 7-bit mantissa).

Co-authored-by: Cursor <cursoragent@cursor.com>
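For reference, the operation acc(FP32) += a(BF16) × b(BF16) can be modeled in software in a few lines. This is a sketch under stated assumptions, not the golden model used by the PR's tests: the helper names are hypothetical, and the model forms the product exactly in FP64 and then rounds the sum to FP32, which may differ from the RTL's rounding (the commit reports a nonzero relative error, so the hardware is presumably not bit-exact against such a model).

```python
import struct


def bf16_to_float(bits: int) -> float:
    """Decode a BF16 bit pattern; BF16 is the upper 16 bits of an IEEE-754 FP32 word."""
    return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]


def to_fp32(value: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value.

    struct raises OverflowError if the value is outside the finite FP32 range.
    """
    return struct.unpack(">f", struct.pack(">f", value))[0]


def bf16_fma_ref(acc: float, a_bits: int, b_bits: int) -> float:
    """Reference acc + a*b with BF16 operands and an FP32 accumulator."""
    # Two 8-bit significands give a 16-bit product, so this multiply is exact in FP64.
    product = bf16_to_float(a_bits) * bf16_to_float(b_bits)
    return to_fp32(acc + product)


# 0x3FC0 is 1.5 and 0x4000 is 2.0 in BF16, so this computes 0.5 + 1.5 * 2.0.
print(bf16_fma_ref(0.5, 0x3FC0, 0x4000))
```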

- Add carry-select adder to primitive_standard_cells.py: splits N-bit addition into parallel halves, depth N+2 instead of 2N
- Fix Wallace tree depth tracking: parallel CSAs share same depth level
- Use carry-select adder for multiplier final addition
- Pipeline now balanced: S1=8, S2=28, S3=21, S4=31 (critical path=31)
- 100/100 tests still pass

Co-authored-by: Cursor <cursoragent@cursor.com>
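The carry-select idea in the first bullet can be illustrated with a behavioral sketch. It is plain Python rather than pyCircuit and does not reproduce the code in primitive_standard_cells.py; `ripple_add` and `carry_select_add` are hypothetical names. The upper half is computed twice, once for each possible carry-in, while the lower half is still rippling, and a mux then selects the correct result.

```python
def ripple_add(a: int, b: int, cin: int, bits: int):
    """Bit-by-bit ripple-carry addition; returns (sum, carry_out)."""
    total, carry = 0, cin
    for i in range(bits):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return total, carry


def carry_select_add(a: int, b: int, bits: int, cin: int = 0):
    """Split the addition into two halves; precompute the upper half for both carries."""
    lo_bits = bits // 2
    hi_bits = bits - lo_bits
    lo_mask = (1 << lo_bits) - 1
    lo_sum, lo_carry = ripple_add(a & lo_mask, b & lo_mask, cin, lo_bits)
    # In hardware these two adders run in parallel with the lower half.
    hi0, cout0 = ripple_add(a >> lo_bits, b >> lo_bits, 0, hi_bits)
    hi1, cout1 = ripple_add(a >> lo_bits, b >> lo_bits, 1, hi_bits)
    # The lower half's carry-out drives the select input of the mux.
    hi_sum, cout = (hi1, cout1) if lo_carry else (hi0, cout0)
    return lo_sum | (hi_sum << lo_bits), cout


print(carry_select_add(0xFFFF, 1, 16))
```

Assuming about two gate levels per bit of ripple carry, an N-bit ripple adder has depth 2N, while each N/2-bit half has depth N and the final mux adds roughly 2 more, which matches the N+2 figure quoted in the commit.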

Move partial product generation + 2 CSA compression rounds into Stage 1 (alongside unpack/exponent). Stage 2 now only completes remaining CSA rounds + carry-select final addition.

Pipeline depth: S1=13, S2=22, S3=21, S4=31 (was S1=8, S2=28)
Critical path unchanged at 31 (Stage 4), but S1/S2 gap reduced from 20 to 9 for better balance.
100/100 tests pass.

Co-authored-by: Cursor <cursoragent@cursor.com>

- npu_node.py: simplified NPU pyCircuit RTL (HBM inject + UB ports + FIFO)
- sw5809s.py: simplified SW5809s pyCircuit RTL (VOQ + crossbar + RR)
- fm16_system.py: behavioral system simulator with real-time visualization
  16 NPU full-mesh, all-to-all 512B traffic, BW + latency stats
- Results: 12.8 Tbps aggregate BW, Avg lat=3.2, P95=4, P99=5 cycles

Co-authored-by: Cursor <cursoragent@cursor.com>

Rewrote fm16_system.py to simulate both topologies in parallel:
  FM16: 16 NPU full mesh (4 links/pair, direct)
  SW16: 16 NPU star via SW5809s (32 links/NPU, VOQ+crossbar+RR)

Side-by-side real-time visualization: bandwidth, per-NPU bars, latency stats (avg/P50/P95/P99/max), latency histograms.

Results (3000 cycles, 4Tbps HBM, all-to-all):
  FM16: 14.3 Tbps BW, avg lat 3.2, P99=5
  SW16: 1.8 Tbps BW, avg lat 439, P99=485
  (SW16 bottlenecked at crossbar: 1 pkt/output/cycle)

Co-authored-by: Cursor <cursoragent@cursor.com>

- BW statistics now show per-NPU and aggregate separately
- Added bottleneck explanation in final summary:
  FM16: 60 direct links per NPU = 6720 Gbps capacity
  SW16: 1 pkt/output/cycle per NPU = 112 Gbps (1.7% of FM16)
  Crossbar is the bottleneck, not the NPU→switch links

Co-authored-by: Cursor <cursoragent@cursor.com>
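The capacity figures in the bottleneck explanation follow from simple arithmetic. The sketch below reproduces them, assuming 16 NPUs, 4 links per NPU pair, and 112 Gbps per link (values that appear in this and neighboring commit messages), and treating "1 pkt/output/cycle" as one link's worth of bandwidth, as the commit does. The constant names are illustrative only.

```python
NUM_NPUS = 16
LINKS_PER_PAIR = 4
LINK_GBPS = 112

# Full mesh: each NPU has direct links to the 15 other NPUs.
fm16_links_per_npu = (NUM_NPUS - 1) * LINKS_PER_PAIR
fm16_gbps_per_npu = fm16_links_per_npu * LINK_GBPS

# Early switch model: one packet per output per cycle, i.e. a single link's rate.
sw16_gbps_per_npu = 1 * LINK_GBPS
ratio_pct = 100 * sw16_gbps_per_npu / fm16_gbps_per_npu

print(fm16_links_per_npu, fm16_gbps_per_npu, sw16_gbps_per_npu, round(ratio_pct, 1))
```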

SW5809s now correctly modeled:
- 512×512 physical links (112Gbps each)
- 4 links bundled per logical port → 128×128 port crossbar
- Each port independently arbitrated, serves 4 pkt/cycle
- Each NPU uses 8 logical ports (32 links) to the switch
- ECMP: round-robin across dest NPU's 8 output ports
- VOQ per (input_port, output_port)

Results (both HBM-limited at 4Tbps):
  FM16: 895 Gbps/NPU, avg lat 3.2, 1-hop direct
  SW16: 895 Gbps/NPU, avg lat 5.0, 2-hop via switch
  Switch capacity: 57.3 Tbps (53% of FM16 mesh)

Co-authored-by: Cursor <cursoragent@cursor.com>
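The port bundling and capacity numbers above can be derived as follows. This is a back-of-the-envelope sketch using only figures quoted in the commit messages; the constant names are hypothetical, and the FM16 mesh total is assumed to be 16 NPUs × 60 links × 112 Gbps.

```python
PHYS_LINKS = 512
LINK_GBPS = 112
LINKS_PER_PORT = 4
NUM_NPUS = 16
LINKS_PER_PAIR = 4

logical_ports = PHYS_LINKS // LINKS_PER_PORT          # size of the port crossbar
ports_per_npu = logical_ports // NUM_NPUS             # logical ports facing each NPU
links_per_npu = ports_per_npu * LINKS_PER_PORT

switch_tbps = PHYS_LINKS * LINK_GBPS / 1000
mesh_tbps = NUM_NPUS * (NUM_NPUS - 1) * LINKS_PER_PAIR * LINK_GBPS / 1000
switch_vs_mesh_pct = 100 * switch_tbps / mesh_tbps

print(logical_ports, ports_per_npu, links_per_npu)
print(round(switch_tbps, 1), round(mesh_tbps, 1), round(switch_vs_mesh_pct))
```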

SW5809s now correctly models:
- Each of 128 input ports has its OWN independent RR pointer per dest NPU
- When multiple input ports independently pick same egress port → VOQ collision
- Compare 'independent' (real HW) vs 'coordinated' (ideal) ECMP modes

3-way comparison: FM16, SW16-independent, SW16-coordinated

Under high load (INJECT_BATCH=32):
  P99: FM16=8, SW16-indep=45, SW16-coord=35 (+29% from collision)
  Max: FM16=16, SW16-indep=506, SW16-coord=452
  Port load imbalance: independent 1.00x (subtle but impactful on tail)

Co-authored-by: Cursor <cursoragent@cursor.com>
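The difference between the two ECMP modes can be shown with a toy model. The sketch below is not taken from sw5809s.py or fm16_system.py; the class names, the egress port numbering (`dest_npu * 8 + index`), and the all-pointers-start-at-zero initial state are assumptions chosen to expose the worst case. With independent pointers, input ports that happen to be in the same round-robin phase all choose the same egress port in a given cycle, even though each port's long-run distribution is perfectly even.

```python
from collections import Counter

PORTS_PER_NPU = 8  # logical egress ports per destination NPU


class IndependentEcmp:
    """Each input port keeps its own round-robin pointer per destination NPU."""

    def __init__(self):
        self.ptr = {}  # (input_port, dest_npu) -> next egress index

    def pick(self, input_port: int, dest_npu: int) -> int:
        key = (input_port, dest_npu)
        idx = self.ptr.get(key, 0)
        self.ptr[key] = (idx + 1) % PORTS_PER_NPU
        return dest_npu * PORTS_PER_NPU + idx


class CoordinatedEcmp:
    """Idealized mode: one round-robin pointer per destination shared by all inputs."""

    def __init__(self):
        self.ptr = {}  # dest_npu -> next egress index

    def pick(self, input_port: int, dest_npu: int) -> int:
        idx = self.ptr.get(dest_npu, 0)
        self.ptr[dest_npu] = (idx + 1) % PORTS_PER_NPU
        return dest_npu * PORTS_PER_NPU + idx


def peak_egress_load(ecmp, input_ports, dest_npu: int) -> int:
    """Packets landing on the busiest egress port when each input sends one packet."""
    picks = Counter(ecmp.pick(p, dest_npu) for p in input_ports)
    return max(picks.values())


print(peak_egress_load(IndependentEcmp(), range(8), dest_npu=0))
print(peak_egress_load(CoordinatedEcmp(), range(8), dest_npu=0))
```

In a real run the pointers drift apart, so collisions are partial rather than total, which is consistent with the commit's observation that cumulative port load stays near 1.00x while tail latency still suffers.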

Each of 128 egress ports independently arbitrates to pick exactly 1 packet per cycle from all input VOQs. Total switch: 128 pkt/cycle.

INJECT_BATCH=8 to match switch capacity point.

VOQ collision now clearly visible:
  Independent RR: P99=168, Max=768
  Coordinated RR: P99=89, Max=364
  Collision adds +89% P99, +111% max latency
  Port load imbalance: 1.02x (small but tail-impactful)

Co-authored-by: Cursor <cursoragent@cursor.com>

Track per-egress-port VOQ depth every cycle (snapshot before schedule). Report avg/peak/max-peak depth alongside cumulative enqueue imbalance.

VOQ collision effect now clearly quantified:
  Independent RR: avg depth 21.8, peak 101
  Coordinated RR: avg depth 12.0, peak 60
  Independent VOQ is 1.8× deeper on average, 1.7× worse at peak
  → directly explains the P99 latency gap (168 vs 89 cycles)

Co-authored-by: Cursor <cursoragent@cursor.com>
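One way to collect the depth statistics described above is a small accumulator that is fed a snapshot of every egress port's queue depth once per cycle, before the scheduler runs. The sketch below is an assumption about how such tracking could look, not the code in fm16_system.py. The class name is hypothetical, and "avg/peak/max-peak" is interpreted here as the mean over all snapshots, the mean of per-port peaks, and the largest per-port peak.

```python
class VoqDepthStats:
    """Accumulates per-egress-port queue depth snapshots, one call per cycle."""

    def __init__(self, num_ports: int):
        self.total_depth = 0
        self.samples = 0
        self.peak = [0] * num_ports  # highest depth seen on each port

    def snapshot(self, depths):
        """Record the current depth of every egress port (call before scheduling)."""
        for port, depth in enumerate(depths):
            self.total_depth += depth
            self.samples += 1
            if depth > self.peak[port]:
                self.peak[port] = depth

    def report(self):
        avg = self.total_depth / self.samples if self.samples else 0.0
        return {
            "avg": avg,
            "avg_peak": sum(self.peak) / len(self.peak),
            "max_peak": max(self.peak),
        }


# Two ports observed over two cycles.
stats = VoqDepthStats(num_ports=2)
stats.snapshot([0, 2])
stats.snapshot([4, 2])
print(stats.report())
```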

Co-authored-by: Cursor <cursoragent@cursor.com>

Removes SyntaxError from misplaced from __future__ import annotations and drops unused pycircuit import in calculator emulator.

Made-with: Cursor
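The SyntaxError mentioned here comes from a Python rule: `from __future__` imports must appear at the top of the module, preceded only by the module docstring, comments, blank lines, and other future imports. The snippet below demonstrates the rule with the built-in `compile()`; the two source strings are made-up examples, not the contents of the emulate scripts.

```python
GOOD_SOURCE = (
    '"""Module docstring is allowed before the future import."""\n'
    "from __future__ import annotations\n"
    "import sys\n"
)

BAD_SOURCE = (
    "import sys\n"
    "from __future__ import annotations\n"  # too late: a normal import came first
)


def compiles(source: str) -> bool:
    """Return True if the source compiles, False if it raises SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


print(compiles(GOOD_SOURCE))
print(compiles(BAD_SOURCE))
```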

hengliao1972 · 2026-03-27T02:47:07Z

This PR is obsolete; abandoning it.

Mac and others added 26 commits February 10, 2026 19:14

Merge pull request #1 from zhoubot/codex/hengliao-sync

2171f78

Sync pyCircuit cycle-aware additions with Janus Core design

chore: add .DS_Store, .pdf, .dSYM to .gitignore

31b8fd5

Co-authored-by: Cursor <cursoragent@cursor.com>

Merge pull request #4 from sheyuheng/tmu-impl-and-spec-hengliao

b03ff91

janus/tmu: add TMU ring interconnect implementation and spec

Merge pull request #2 from fengzhazha/cube-accelerator

368b8f5

Add 16x16 systolic array matrix multiplication accelerator (Cube module)

Add traffic lights pyCircuit example

ea79aa1

Fix traffic lights countdown and add debug

b5fc5da

Improve traffic lights visualization

d129cad

Add dodgeball game pycircuit demo

db8d434

Merge pull request #5 from Auyuir/traffic-lights-ce-pyc

deeb190

Port Verilog design to pyCircuit (traffic lights + dodgeball game)

Merge PR #6: Enhanced pyCircuit simulation/verification capability

636115c

examples/fm16: sync fm16 updates (sw5809s.py)

83f0cdf

Co-authored-by: Cursor <cursoragent@cursor.com>

Merge branch 'LinxISA:main' into main

bd7098d

fix(examples): put __future__ imports first in emulate scripts

b07f034

Removes SyntaxError from misplaced from __future__ import annotations and drops unused pycircuit import in calculator emulator.

Made-with: Cursor

hengliao1972 closed this Mar 27, 2026

hengliao1972 wants to merge 26 commits into LinxISA:main from hengliao1972:main


2 participants
