Performance Benchmark: Pure C++ Cycle-Accurate TLM against pyCircuit Simulator #44
Description
While validating the 64-node NoC layout with pyCircuit simulation (tb_perf_9round.py), we developed and validated a cycle-accurate, pure C++ Transaction-Level Model (TLM) to measure the maximum simulation throughput achievable for behaviorally equivalent network hardware.
This issue tracks the cycle-accurate equivalence result and proposes adopting C-Models for exploring routing and topology limits, where standard object-oriented simulation is bottlenecked by object-lookup trees.
Methodology & Golden Trace Verification:
- A dump of the 147,456 multicast/broadcast random injections from the 9-round testbench was generated to freeze the randomness into a constant sequence (golden_trace.txt).
- A 350-line monolithic C++ TLM was implemented natively, honoring identical FIFO depth, round-robin logic, routing computation (unicast, multicast, and broadcast splits), and node logic (replication registers).
Performance Baseline:
Tested against pyCircuit version 67774d4ffa57a0dec21676d3de146df2385981c2.
- Native pyCircuit (C++ wrapper): 6.72 s (~24k print ticks before the hard stop)
  ```bash
  bash scripts/build_tb.sh designs/contest_module/tb_perf_9round.py --run-cpp
  ```
- Verilator (flattened RTL bits): 0.81 s (25k-cycle hard stop)
  ```bash
  bash scripts/build_tb.sh designs/contest_module/tb_perf_9round.py --run-verilator
  ```
- Pure behavioral C++ TLM: 0.068 s (9,704 true finish cycles)
The TLM completes 286,441 true local pops natively in under 70 ms.
```bash
# 1. Trace dump
PYTHONPATH=../pyCircuit/compiler/frontend python3 designs/contest_module/dump_trace.py

# 2. Build
g++ -O3 -std=c++11 tests/fair_benchmark/cpp_tlm_noc.cpp -o tests/fair_benchmark/cpp_tlm_noc

# 3. Simulate
./tests/fair_benchmark/cpp_tlm_noc
```

The stark contrast clearly showcases the benefit of a native, architecture-focused TLM layer that avoids explicit wire-state replication. The 24,000-"cycle" pyCircuit duration is an artifact of the timeout rather than a true network limit: traffic actually drains around physical tick ~9,700.
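For step 1, the TLM replays the frozen injections rather than re-rolling randomness. As a sketch of that replay path, one record could be parsed per line; the field layout below (cycle, source, destination, payload, whitespace-separated) is an assumed format, since the actual golden_trace.txt layout is defined by dump_trace.py.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical golden-trace record; the real file layout may differ.
struct Injection {
    uint64_t cycle;   // tick at which the flit enters the network
    int src, dst;     // injecting node and destination node
    uint64_t payload; // flit contents
};

// Parse one whitespace-separated line: "<cycle> <src> <dst> <payload>".
// Returns false on a malformed line so the replay loop can abort early.
bool parse_injection(const std::string& line, Injection& out) {
    std::istringstream iss(line);
    return static_cast<bool>(iss >> out.cycle >> out.src >> out.dst >> out.payload);
}
```

Freezing the stimulus this way is what makes the pop counts directly comparable across pyCircuit, Verilator, and the TLM: all three consume the identical injection sequence.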
Can we introduce a dedicated models/cmodel interface layer that bridges pyCircuit stimulus with external lightweight TLMs for extreme algorithmic exploration?