HengLiao updates to PyCircuit language #37

Closed
hengliao1972 wants to merge 26 commits into LinxISA:main from hengliao1972:main

@hengliao1972
Collaborator

No description provided.

Mac and others added 26 commits February 10, 2026 19:14
Direct-form FIR filter: y[n] = c0·x[n] + c1·x[n-1] + c2·x[n-2] + c3·x[n-3]
with 16-bit signed input, 16-bit coefficients, 34-bit accumulator.

- digital_filter.py: pyCircuit RTL (shift register + parallel MAC)
- filter_capi.cpp: C API wrapper for compiled RTL
- emulate_filter.py: terminal UI with delay line, waveform display,
  5 test scenarios (impulse, step, ramp, alternating, large values)
- All tests verified against true RTL simulation via ctypes
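
The filter equation above can be sketched as a small Python reference model, the kind of golden model one might compare RTL output against. The names `fir_reference` and `signed_bits_needed` and the tap values are illustrative assumptions, not code from this PR. The last lines check that the worst-case sum of four 16x16-bit signed products needs a 34-bit accumulator.

```python
def fir_reference(samples, coeffs):
    """Direct-form FIR: y[n] = sum(c[k] * x[n-k]); delay line starts at zero."""
    delay = [0] * len(coeffs)
    outputs = []
    for x in samples:
        # Shift the new sample into the delay line, dropping the oldest one.
        delay = [x] + delay[:-1]
        # Parallel multiply-accumulate over all taps.
        outputs.append(sum(c * d for c, d in zip(coeffs, delay)))
    return outputs


def signed_bits_needed(value):
    """Smallest two's-complement width that can hold value."""
    bits = 1
    while not (-(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits


# Impulse response reproduces the coefficients (hypothetical taps 1, 2, 3, 4).
impulse = fir_reference([1, 0, 0, 0, 0], [1, 2, 3, 4])

# Worst case: every sample and every coefficient is the most negative int16.
worst_case = fir_reference([-32768] * 4, [-32768] * 4)[-1]
accumulator_bits = signed_bits_needed(worst_case)
```

With all four taps and samples at -32768 the sum reaches 2^32, which does not fit in 33 signed bits, so the 34-bit accumulator is the minimum safe width for this configuration.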

Co-authored-by: Cursor <cursoragent@cursor.com>
Sync pyCircuit cycle-aware additions with Janus Core design
Co-authored-by: Cursor <cursoragent@cursor.com>
…spec

Add the Tile Management Unit (TMU) with 8-station bidirectional ring
interconnect, SPB/MGB buffering, configurable 1MB TileReg, and
cycle-accurate C++/SV testbenches. Include architecture spec document.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add run/build scripts for C++ and Verilator simulation, RTL generation
script, and trace visualization tools (SVG timeline, ring animation,
VCD-based ring animation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
janus/tmu: add TMU ring interconnect implementation and spec
Add 16x16 systolic array matrix multiplication accelerator (Cube module)
Port Verilog design to pyCircuit (traffic lights + dodgeball game)
BF16 fused multiply-accumulate: acc(FP32) += a(BF16) × b(BF16)
Built from first principles using HA, FA, RCA, CSA, Wallace tree,
barrel shifters, and LZC — all from primitive_standard_cells.py.

4-stage pipeline with critical path analysis:
  Stage 1: Unpack + Exp Add        depth=8
  Stage 2: 8×8 Multiply (Wallace)  depth=46
  Stage 3: Align + Add             depth=21
  Stage 4: Normalize + Pack        depth=31

100/100 test cases pass (true RTL simulation via ctypes).
Max relative error: 5.36e-04 (limited by BF16 7-bit mantissa).
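
A software reference for the operation acc(FP32) += a(BF16) × b(BF16) might look like the sketch below. It is not the test harness from this PR: the helper names are hypothetical, rounding to BF16 is assumed to be round-to-nearest-even, and NaN/infinity handling is ignored.

```python
import struct


def float_to_bits(x):
    """Raw IEEE-754 single-precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits):
    """Single-precision value for a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_bf16_bits(x):
    """Round an FP32 value to BF16 (upper 16 bits), nearest-even (assumed)."""
    bits = float_to_bits(x)
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF


def bf16_to_float(b):
    """BF16 is the top half of an FP32 word, so widening is a shift."""
    return bits_to_float(b << 16)


def fma_reference(acc, a_bits, b_bits):
    """acc + a*b, with the result rounded back to FP32."""
    product = bf16_to_float(a_bits) * bf16_to_float(b_bits)
    return bits_to_float(float_to_bits(acc + product))


result = fma_reference(1.0, to_bf16_bits(2.0), to_bf16_bits(3.0))
```

The product of two BF16 values has at most 16 significant bits, so it is exact in Python's double-precision arithmetic; only the final addition rounds.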

Co-authored-by: Cursor <cursoragent@cursor.com>
- Add carry-select adder to primitive_standard_cells.py: splits N-bit
  addition into parallel halves, depth N+2 instead of 2N
- Fix Wallace tree depth tracking: parallel CSAs share same depth level
- Use carry-select adder for multiplier final addition
- Pipeline now balanced: S1=8, S2=28, S3=21, S4=31 (critical path=31)
- 100/100 tests still pass
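
As a behavioural illustration of the carry-select idea described above (not the implementation in primitive_standard_cells.py), the sketch below computes both candidate upper halves and selects one with the lower half's carry. The depth formulas assume 2 gate levels per ripple-carry bit and 2 levels for the select mux, an assumption chosen because it reproduces the 2N and N+2 figures quoted here.

```python
def ripple_add(a, b, carry_in, width):
    """Behavioural stand-in for a width-bit ripple-carry adder."""
    total = a + b + carry_in
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a, b, width):
    """Add two width-bit values using a two-way carry-select split."""
    half = width // 2
    upper_width = width - half
    low_mask = (1 << half) - 1

    # Lower half is a plain adder with carry-in 0.
    low_sum, low_carry = ripple_add(a & low_mask, b & low_mask, 0, half)

    # Upper half is computed twice in parallel, once per possible carry-in.
    upper_if_0 = ripple_add(a >> half, b >> half, 0, upper_width)
    upper_if_1 = ripple_add(a >> half, b >> half, 1, upper_width)

    # The real carry out of the lower half selects the correct upper result.
    upper_sum, carry_out = upper_if_1 if low_carry else upper_if_0
    return (upper_sum << half) | low_sum, carry_out


def ripple_depth(width):
    return 2 * width  # assumed: 2 gate levels per bit


def carry_select_depth(width):
    return 2 * (width // 2) + 2  # assumed: half-width ripple plus a 2-level mux
```

Under these assumptions a 16-bit final adder drops from depth 32 to depth 18.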

Co-authored-by: Cursor <cursoragent@cursor.com>
Move partial product generation + 2 CSA compression rounds into Stage 1
(alongside unpack/exponent). Stage 2 now only completes remaining CSA
rounds + carry-select final addition.

Pipeline depth: S1=13, S2=22, S3=21, S4=31 (was S1=8, S2=28)
Critical path unchanged at 31 (Stage 4), but S1/S2 gap reduced from
20 to 9 for better balance. 100/100 tests pass.
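
The balance figures quoted above follow directly from the per-stage depths; a short check, with variable names chosen only for illustration:

```python
# Stage depths before and after moving work from Stage 2 into Stage 1.
before = {"S1": 8, "S2": 28, "S3": 21, "S4": 31}
after = {"S1": 13, "S2": 22, "S3": 21, "S4": 31}


def critical_path(stages):
    """The slowest stage sets the clock period."""
    return max(stages.values())


def s1_s2_gap(stages):
    """Depth imbalance between the first two stages."""
    return abs(stages["S2"] - stages["S1"])


critical_before, critical_after = critical_path(before), critical_path(after)
gap_before, gap_after = s1_s2_gap(before), s1_s2_gap(after)
```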

Co-authored-by: Cursor <cursoragent@cursor.com>
- npu_node.py: simplified NPU pyCircuit RTL (HBM inject + UB ports + FIFO)
- sw5809s.py: simplified SW5809s pyCircuit RTL (VOQ + crossbar + RR)
- fm16_system.py: behavioral system simulator with real-time visualization
  16 NPU full-mesh, all-to-all 512B traffic, BW + latency stats
- Results: 12.8 Tbps aggregate BW, Avg lat=3.2, P95=4, P99=5 cycles

Co-authored-by: Cursor <cursoragent@cursor.com>
Rewrote fm16_system.py to simulate both topologies in parallel:
  FM16: 16 NPU full mesh (4 links/pair, direct)
  SW16: 16 NPU star via SW5809s (32 links/NPU, VOQ+crossbar+RR)

Side-by-side real-time visualization: bandwidth, per-NPU bars,
latency stats (avg/P50/P95/P99/max), latency histograms.

Results (3000 cycles, 4Tbps HBM, all-to-all):
  FM16: 14.3 Tbps BW, avg lat 3.2, P99=5
  SW16: 1.8 Tbps BW, avg lat 439, P99=485
  (SW16 bottlenecked at crossbar: 1 pkt/output/cycle)

Co-authored-by: Cursor <cursoragent@cursor.com>
- BW statistics now show per-NPU and aggregate separately
- Added bottleneck explanation in final summary:
  FM16: 60 direct links per NPU = 6720 Gbps capacity
  SW16: 1 pkt/output/cycle per NPU = 112 Gbps (1.7% of FM16)
  Crossbar is the bottleneck, not the NPU→switch links
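
The capacity figures in this summary can be reproduced with a few lines of arithmetic. The sketch assumes 112 Gbps per link (the per-link rate given for the SW5809s elsewhere in this PR) and 4 links per NPU pair; the constant names are illustrative.

```python
LINK_GBPS = 112        # per-link rate
NUM_NPUS = 16
LINKS_PER_PAIR = 4     # FM16: direct links between each pair of NPUs

# FM16: every NPU connects to the 15 others with 4 links each.
fm16_links_per_npu = (NUM_NPUS - 1) * LINKS_PER_PAIR
fm16_capacity_gbps = fm16_links_per_npu * LINK_GBPS

# SW16 as modelled in this commit: 1 packet per output per cycle,
# i.e. one link's worth of bandwidth per NPU.
sw16_capacity_gbps = 1 * LINK_GBPS

ratio_percent = round(100 * sw16_capacity_gbps / fm16_capacity_gbps, 1)
```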

Co-authored-by: Cursor <cursoragent@cursor.com>
SW5809s now correctly modeled:
- 512×512 physical links (112Gbps each)
- 4 links bundled per logical port → 128×128 port crossbar
- Each port independently arbitrated, serves 4 pkt/cycle
- Each NPU uses 8 logical ports (32 links) to the switch
- ECMP: round-robin across dest NPU's 8 output ports
- VOQ per (input_port, output_port)

Results (both HBM-limited at 4Tbps):
  FM16: 895 Gbps/NPU, avg lat 3.2, 1-hop direct
  SW16: 895 Gbps/NPU, avg lat 5.0, 2-hop via switch
  Switch capacity: 57.3 Tbps (53% of FM16 mesh)
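
One reading of the port and capacity numbers above that reproduces the quoted figures is sketched below. Treating the "FM16 mesh" capacity as 16 NPUs × 60 links × 112 Gbps is an assumption; it is used here because it yields the stated 53%.

```python
LINK_GBPS = 112
PHYSICAL_LINKS = 512
LINKS_PER_PORT = 4
NUM_NPUS = 16

# 4 physical links are bundled into each logical crossbar port.
logical_ports = PHYSICAL_LINKS // LINKS_PER_PORT
ports_per_npu = logical_ports // NUM_NPUS
links_per_npu = ports_per_npu * LINKS_PER_PORT

# Aggregate switch capacity across all physical links.
switch_capacity_gbps = PHYSICAL_LINKS * LINK_GBPS

# Assumed FM16 reference: each NPU has 15 peers x 4 links.
fm16_mesh_gbps = NUM_NPUS * (NUM_NPUS - 1) * 4 * LINK_GBPS

share_percent = round(100 * switch_capacity_gbps / fm16_mesh_gbps)
```

The switch total comes to 57,344 Gbps, which rounds to the 57.3 Tbps reported.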

Co-authored-by: Cursor <cursoragent@cursor.com>
SW5809s now correctly models:
- Each of 128 input ports has its OWN independent RR pointer per dest NPU
- When multiple input ports independently pick same egress port → VOQ collision
- Compare 'independent' (real HW) vs 'coordinated' (ideal) ECMP modes

3-way comparison: FM16, SW16-independent, SW16-coordinated
Under high load (INJECT_BATCH=32):
  P99: FM16=8, SW16-indep=45, SW16-coord=35 (+29% from collision)
  Max: FM16=16, SW16-indep=506, SW16-coord=452
Port load imbalance: independent 1.00x (subtle but impactful on tail)
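
The difference between the two ECMP modes can be illustrated with a deliberately small, worst-case sketch. This is not the simulator code from fm16_system.py: the function names, the egress numbering (`dest_npu * 8 + index`), and the assumption that all pointers start at zero are hypothetical.

```python
from collections import Counter

PORTS_PER_NPU = 8  # logical egress ports per destination NPU


def pick_independent(pointers, input_port, dest_npu):
    """Each input port keeps its own round-robin pointer per destination."""
    key = (input_port, dest_npu)
    index = pointers.get(key, 0)
    pointers[key] = (index + 1) % PORTS_PER_NPU
    return dest_npu * PORTS_PER_NPU + index


def pick_coordinated(pointers, dest_npu):
    """All input ports share one round-robin pointer per destination."""
    index = pointers.get(dest_npu, 0)
    pointers[dest_npu] = (index + 1) % PORTS_PER_NPU
    return dest_npu * PORTS_PER_NPU + index


def egress_load(mode, input_ports, dest_npu):
    """Count packets per egress port when each input sends one packet."""
    pointers = {}
    load = Counter()
    for port in input_ports:
        if mode == "independent":
            egress = pick_independent(pointers, port, dest_npu)
        else:
            egress = pick_coordinated(pointers, dest_npu)
        load[egress] += 1
    return load


independent = egress_load("independent", range(8), dest_npu=3)
coordinated = egress_load("coordinated", range(8), dest_npu=3)
```

In this synthetic case all eight independent pointers choose the same egress port in the same cycle, while the shared pointer spreads the eight packets over eight ports. Real traffic desynchronises the pointers, which is consistent with the long-run imbalance staying near 1.00x while the tail latency still suffers.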

Co-authored-by: Cursor <cursoragent@cursor.com>
Each of 128 egress ports independently arbitrates to pick exactly 1
packet per cycle from all input VOQs. Total switch: 128 pkt/cycle.
INJECT_BATCH=8 to match switch capacity point.

VOQ collision now clearly visible:
  Independent RR: P99=168, Max=768
  Coordinated RR: P99=89,  Max=364
  Collision adds +89% P99, +111% max latency
  Port load imbalance: 1.02x (small but tail-impactful)
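
A per-egress arbiter that serves exactly one packet per cycle could be sketched as follows. Round-robin selection across input VOQs is assumed here (the SW5809s model is described as VOQ + crossbar + RR); the function name and data layout are illustrative only.

```python
from collections import deque


def arbitrate(voqs, rr_pointer):
    """Pick one packet for a single egress port.

    voqs is a list of deques, one per input port, all targeting this egress.
    Returns (packet or None, next round-robin pointer).
    """
    count = len(voqs)
    for offset in range(count):
        index = (rr_pointer + offset) % count
        if voqs[index]:
            # Serve this input and start the next search just after it.
            return voqs[index].popleft(), (index + 1) % count
    return None, rr_pointer  # nothing queued this cycle


# Three inputs contend for one egress port; input 1 is idle.
queues = [deque(["a0", "a1"]), deque(), deque(["c0"])]
pointer = 0
served = []
for _ in range(4):
    packet, pointer = arbitrate(queues, pointer)
    served.append(packet)
```

Because each egress drains only one packet per cycle, any cycle in which several inputs target the same egress leaves the extra packets waiting in their VOQs, which is the collision effect measured above.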

Co-authored-by: Cursor <cursoragent@cursor.com>
Track per-egress-port VOQ depth every cycle (snapshot before schedule).
Report avg/peak/max-peak depth alongside cumulative enqueue imbalance.

VOQ collision effect now clearly quantified:
  Independent RR: avg depth 21.8, peak 101
  Coordinated RR: avg depth 12.0, peak 60
  Independent VOQ is 1.8× deeper on average, 1.7× worse at peak
  → directly explains the P99 latency gap (168 vs 89 cycles)
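
Per-cycle depth tracking of this kind can be sketched with a small accumulator class. The class name and the exact meaning of the three reported values (mean depth over all ports and cycles, mean of per-port peaks, and the largest per-port peak) are assumptions about how "avg/peak/max-peak" is defined.

```python
class VoqDepthStats:
    """Accumulates per-egress-port queue depth snapshots."""

    def __init__(self, num_ports):
        self.depth_sum = [0] * num_ports
        self.peak = [0] * num_ports
        self.cycles = 0

    def snapshot(self, depths):
        """Record one cycle; call before the scheduler drains the queues."""
        for port, depth in enumerate(depths):
            self.depth_sum[port] += depth
            self.peak[port] = max(self.peak[port], depth)
        self.cycles += 1

    def report(self):
        """Return (average depth, average per-port peak, maximum peak)."""
        ports = len(self.depth_sum)
        average = sum(self.depth_sum) / (self.cycles * ports)
        average_peak = sum(self.peak) / ports
        return average, average_peak, max(self.peak)


stats = VoqDepthStats(num_ports=2)
stats.snapshot([0, 2])
stats.snapshot([4, 2])
summary = stats.report()
```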

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Removes SyntaxError from misplaced from __future__ import annotations
and drops unused pycircuit import in calculator emulator.
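
For context on the SyntaxError: Python requires `from __future__` imports to appear at the top of a module, before any other import or statement (only the module docstring, comments, and blank lines may precede them). The sketch below demonstrates this with `compile()`; the source strings are made-up examples, not the file fixed in this commit.

```python
GOOD_SOURCE = "from __future__ import annotations\nimport os\n"
BAD_SOURCE = "import os\nfrom __future__ import annotations\n"


def compiles(source):
    """Return True if the source compiles, False on SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


good_ok = compiles(GOOD_SOURCE)
bad_ok = compiles(BAD_SOURCE)
```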

Made-with: Cursor
@hengliao1972
Collaborator Author

This PR is obsolete. Abandoning it.


2 participants