
Performance Benchmark: Pure C++ Cycle-Accurate TLM against pyCircuit Simulator #44

@winfredsu

Description

While validating the 64-node NoC layout using pyCircuit simulation (tb_perf_9round.py), we developed and validated a cycle-accurate pure C++ Transaction-Level Model (TLM) to measure maximum simulation throughput for behaviorally equivalent network hardware.

This issue tracks cycle-accurate equivalence between the two models and proposes maintaining C-models for exploring routing and topology limits where standard object-oriented simulation is bottlenecked by object-lookup overhead.

Methodology & Golden Trace Verification:

  1. The 147,456 random multicast/broadcast injections from the 9-round testbench were dumped to golden_trace.txt, freezing the random stimulus to a constant sequence.
  2. A ~350-line monolithic C++ TLM was then implemented, honoring the identical FIFO depth, round-robin arbitration, routing computation (unicast, multicast, and broadcast splits), and node logic (replication registers).

Performance Baseline:
Tested against pyCircuit Version: 67774d4ffa57a0dec21676d3de146df2385981c2

  1. Native pyCircuit (C++ wrapper): 6.72 s (~24k printed ticks before the hard stop)
     bash scripts/build_tb.sh designs/contest_module/tb_perf_9round.py --run-cpp
  2. Verilator (flattened RTL bits): 0.81 s (25k-cycle hard stop)
     bash scripts/build_tb.sh designs/contest_module/tb_perf_9round.py --run-verilator
  3. Pure behavioral C++ TLM: 0.068 s (9,704 true finish cycles)
     The TLM completes 286,441 true local pops natively in under 70 ms.
Reproduction steps:

```bash
# 1. Trace dump
PYTHONPATH=../pyCircuit/compiler/frontend python3 designs/contest_module/dump_trace.py
# 2. Build
g++ -O3 -std=c++11 tests/fair_benchmark/cpp_tlm_noc.cpp -o tests/fair_benchmark/cpp_tlm_noc
# 3. Simulate
./tests/fair_benchmark/cpp_tlm_noc
```

The contrast showcases the benefit of a native, architecture-focused TLM layer that avoids replicating explicit wire state. The 24,000-"cycle" pyCircuit duration is an artifact of the timeout rather than a network limit: the network actually drains around physical tick ~9,700.

Can we introduce a dedicated models/cmodel interface layer that bridges pyCircuit stimulus to external lightweight TLMs for extreme algorithmic exploration?
