This repository demonstrates a complete end-to-end flow from training a Convolutional Neural Network (CNN) in PyTorch to validating its exact cycle-accurate execution on a simulated hardware accelerator designed in Amaranth HDL.
This project bridges the gap between Data Science / AI and Digital ASIC/FPGA Design, showing that a quantized software model can be reproduced exactly in integer-based hardware logic.
- Hardware-Aware ML Architecture: A custom lightweight CNN designed in PyTorch (`src.ml.model.SimpleCNN`) optimized for hardware deployment.
- Automated Training Pipeline: Full training loop on CIFAR-10 with validation and model checkpointing (`src.ml.train`).
- Custom Hardware MAC Unit: A cycle-accurate Multiply-Accumulate (MAC) unit described in Python using Amaranth HDL (`src.hardware.units.mac`), capable of exporting to Verilog.
- Exact Co-Simulation: A robust testbench (`src.hardware.testbenches.tb_mac`) that extracts weights and image patches, quantizes them to `int8`, simulates the hardware clock cycle by cycle, and mathematically asserts that the Amaranth hardware output perfectly matches the PyTorch software prediction.
- Modern Python Stack: Managed entirely with `uv` for blazing-fast dependency resolution and `Pydantic` for strict configuration validation.
- Quantization: Bridging the gap between PyTorch's `float32` and the hardware's `int8` data bus.
- Synchronous Logic: The Amaranth MAC unit uses a clock domain (`sync`) to accurately model pipeline latencies.
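As a rough illustration of the quantization step, the sketch below maps floats to `int8` codes and shows that the integer dot product, once rescaled, approximates the float one. The symmetric max-abs scheme and the helper names `quantize_int8` and `int_dot` are assumptions made for this example; the project's testbench may quantize differently.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization (assumed scheme): returns (codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest code, then clamp into the signed 8-bit range.
    codes = [max(-128, min(127, round(v / scale))) for v in values]
    return codes, scale


def int_dot(a, b):
    """Integer multiply-accumulate, the operation a MAC unit performs."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc


# Illustrative inputs, not taken from the project.
pixels = [0.5, -0.25, 1.0, 0.0]
weights = [0.2, 0.4, -0.1, 0.3]

q_pixels, s_p = quantize_int8(pixels)
q_weights, s_w = quantize_int8(weights)

acc = int_dot(q_pixels, q_weights)   # exact integer result, what the hardware computes
approx = acc * s_p * s_w             # rescaled back to the float domain
exact = sum(p * w for p, w in zip(pixels, weights))
print(acc, approx, exact)
```

The integer accumulation is exact, so a hardware MAC can match it bit for bit; the only approximation is the rounding introduced when the floats are turned into codes.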
.
├── pyproject.toml          # Project configuration and dependencies (uv/poetry/pip)
├── requirements.txt        # pip-compatible dependency export
├── environment.yml         # conda environment definition
├── initializer.sh          # All-in-one pipeline runner (uv/poetry/conda/pip)
├── cleaner.sh              # Removes all runtime-generated files and directories
├── data_purge.py           # Releases GPU VRAM and system RAM held by PyTorch
├── mac.v                   # Generated Verilog RTL (Amaranth export)
├── mac_simulation.vcd      # Hardware waveform dump (GTKWave)
└── src/
    ├── config.py           # Global configuration (Pydantic models)
    ├── ml/                 # Machine Learning Domain (PyTorch)
    │   ├── dataset.py      # Multi-dataset factory (CIFAR-10, Tiny-ImageNet, ImageNet)
    │   ├── model.py        # CNN Architecture & Hardware Target extraction
    │   └── train.py        # Training loop and evaluation
    └── hardware/           # Microelectronics Domain (Amaranth)
        ├── units/
        │   └── mac.py      # MAC unit design (Generates Verilog)
        ├── bus/            # Reserved for AXI4/AXI-Stream wrappers
        └── testbenches/
            └── tb_mac.py   # Co-simulation and VCD waveform generation
This project requires Python ≥ 3.13. The default is 3.13 because I use a GTX 1070 Max-Q, which the latest CUDA release does not support; the project is nonetheless fully compatible with Python 3.14. Ensure you have at least one of the following package managers installed:
| Tool | Install |
|---|---|
| uv (recommended) | astral.sh/uv |
| poetry | python-poetry.org |
| conda | docs.conda.io |
| pip | Bundled with Python ≥ 3.13 |
Manual command substitution table: replace `uv run python` in all steps below according to your manager:
| Manager | Replace `uv run python` with |
|---|---|
| uv | `uv run python` |
| poetry | `poetry run python` |
| conda | `conda run -n edgevision python` |
| pip | `source .venv/bin/activate && python` |
The fastest way to run the full pipeline is via the provided shell script, which auto-detects your package manager and handles environment checks, dataset download, training, RTL generation, and co-simulation in one command:
chmod +x initializer.sh cleaner.sh
./initializer.sh
To skip training and reuse an existing checkpoint:
./initializer.sh --skip-train
To force a specific package manager:
./initializer.sh --manager pip
./initializer.sh --manager conda
./initializer.sh --manager poetry
The script will guide you interactively if a checkpoint already exists (retrain or keep it).
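For readers integrating the pipeline into their own tooling, the auto-detection idea can be sketched in Python as below. This is not the logic of `initializer.sh` itself: the function name `detect_manager`, the preference order, and the injectable `which` parameter are all assumptions for illustration.

```python
import shutil

# Assumed preference order; the real script may rank managers differently.
PREFERENCE = ("uv", "poetry", "conda", "pip")


def detect_manager(forced=None, which=shutil.which):
    """Return the forced manager if given, else the first one found on PATH."""
    if forced is not None:
        # Mirrors the idea of a --manager override: validate, then use as-is.
        if forced not in PREFERENCE:
            raise ValueError(f"unknown manager: {forced}")
        return forced
    for name in PREFERENCE:
        if which(name) is not None:
            return name
    return None  # nothing usable was found


if __name__ == "__main__":
    print(detect_manager())
```

Passing `which` as a parameter keeps the function testable without touching the real `PATH`.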
Manual steps are documented below for users who prefer to run each stage individually or integrate them into their own workflow.
Clone the repository:
git clone git@github.com:Engeryu/EdgeVision-HW-Bridge.git
cd EdgeVision-HW-Bridge
Then install dependencies with your package manager:
# uv (recommended)
uv sync
# poetry
poetry install
# conda
conda env create -f environment.yml
conda activate edgevision
# pip
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uv only: to use Python 3.14 instead of the default 3.13:
echo "3.14" > .python-version
uv sync
The active dataset is configured via cfg.ml.dataset in src/config.py. Three datasets are supported:
| Dataset | Classes | Resolution | Download |
|---|---|---|---|
| cifar10 | 10 | 32×32 | Automatic |
| tiny-imagenet | 200 | 64×64 | Automatic (~236 MB) |
| imagenet | 1000 | 224×224 | Manual (see below) |
For the dataset download only, `uv run python` is used directly, as there is no dedicated CLI entry point. To trigger the download manually:
uv run python -c "from src.ml.dataset import get_dataloaders; get_dataloaders()"
Replace `uv run python` with your manager's equivalent (see substitution table above).
(Note: Running the training or testbench scripts will also trigger the download automatically if the dataset is missing.)
ImageNet cannot be downloaded automatically. Register at image-net.org and place the dataset at `./data/imagenet/train/` and `./data/imagenet/val/`.
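Before launching a long ImageNet run, it can help to confirm that both split directories are in place. The helper below is a minimal sketch; `missing_imagenet_dirs` is a hypothetical name and is not part of the repository.

```python
from pathlib import Path


def missing_imagenet_dirs(root="./data/imagenet"):
    """Return the expected split directories that do not exist under root."""
    base = Path(root)
    return [
        str(base / split)
        for split in ("train", "val")
        if not (base / split).is_dir()
    ]


if __name__ == "__main__":
    missing = missing_imagenet_dirs()
    if missing:
        print("Missing ImageNet directories:", ", ".join(missing))
    else:
        print("ImageNet layout looks complete.")
```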
To train the CNN on CIFAR-10 and generate the model weights inside the ./checkpoints directory:
uv run edgevision-train
Replace `uv run` with your manager's equivalent (see substitution table above).
The training loop prints its metrics as it runs and ends with:
Checkpoint saved: checkpoints/<dataset>.pth (epoch=N, acc=XX.XX%)
The model is ready to be transferred to Hardware!
Best Test Acc: XX.XX%
The active model is selected automatically based on the configured dataset:
| Dataset | Model | Notes |
|---|---|---|
| cifar10 | SimpleCNN | Hardware-bridge compatible |
| tiny-imagenet | ResNet-18 | Trained from scratch |
| imagenet | ResNet-50 | Pretrained ImageNet weights |
Hyperparameters (optimizer, LR, scheduler) are automatically preset per dataset. Manual overrides in src/config.py always take precedence.
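The "preset, then override" behaviour can be pictured with plain dictionaries, as in the sketch below. The dataset-to-model mapping comes from the table above; the preset keys and values and the function `resolve_training_setup` are placeholders invented for this example and do not reflect the project's actual settings or its Pydantic models.

```python
# Dataset-to-model mapping, as listed in the table above.
MODEL_FOR_DATASET = {
    "cifar10": "SimpleCNN",
    "tiny-imagenet": "ResNet-18",
    "imagenet": "ResNet-50",
}

# Placeholder presets: illustrative values only, not the project's defaults.
PRESETS = {
    "cifar10": {"optimizer": "adam", "lr": 1e-3},
    "tiny-imagenet": {"optimizer": "sgd", "lr": 1e-1},
    "imagenet": {"optimizer": "sgd", "lr": 1e-2},
}


def resolve_training_setup(dataset, overrides=None):
    """Pick the model for a dataset and merge manual overrides over its preset."""
    if dataset not in MODEL_FOR_DATASET:
        raise ValueError(f"unsupported dataset: {dataset}")
    params = dict(PRESETS[dataset])  # copy so the preset table is never mutated
    # Only explicitly set overrides (not None) replace preset values.
    params.update({k: v for k, v in (overrides or {}).items() if v is not None})
    return MODEL_FOR_DATASET[dataset], params


print(resolve_training_setup("cifar10", {"lr": 0.5}))
```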
Before running the simulation, you can export the Amaranth hardware design into standard Verilog RTL, the usual input format for FPGA and ASIC synthesis tools:
uv run edgevision-rtl
Replace `uv run` with your manager's equivalent (see substitution table above).
This will generate a mac.v file in the root directory.
To extract the data, quantize it to int8, run the hardware simulation, and mathematically compare the Amaranth signals with the PyTorch tensors:
uv run edgevision-sim
Replace `uv run` with your manager's equivalent (see substitution table above).
If successful, the console will output:
Software Result (PyTorch): <value>
Hardware Result (Amaranth): <value>
Success: The Software/Hardware bridge is perfect!
Waveform file generated: 'mac_simulation.vcd'
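The comparison performed by the co-simulation can be pictured with a pure-Python behavioural model, sketched below. It is a stand-in, not the Amaranth design: the signal names follow the MAC pins (`clear`, `pixel_in`, `weight_in`, `result_out`), but the class `MacModel`, the one-cycle accumulate behaviour, the priority of `clear`, and the unbounded accumulator width are assumptions.

```python
class MacModel:
    """Behavioural stand-in for a synchronous MAC: state changes only on tick()."""

    def __init__(self):
        self.clear = 0
        self.pixel_in = 0
        self.weight_in = 0
        self.result_out = 0  # registered output (accumulator width not modelled)

    def tick(self):
        """One rising clock edge: clear wins, otherwise accumulate the product."""
        if self.clear:
            self.result_out = 0
        else:
            self.result_out += self.pixel_in * self.weight_in


def run_cosim(pixels, weights):
    """Drive the model one operand pair per cycle and return the final sum."""
    mac = MacModel()
    mac.clear = 1
    mac.tick()  # reset cycle
    mac.clear = 0
    for p, w in zip(pixels, weights):
        mac.pixel_in, mac.weight_in = p, w
        mac.tick()
    return mac.result_out


# Illustrative int8 operands, not extracted from a trained model.
pixels = [12, -7, 127, -128, 3]
weights = [5, 9, -2, 1, -128]

software = sum(p * w for p, w in zip(pixels, weights))  # reference result
hardware = run_cosim(pixels, weights)                   # cycle-by-cycle result
assert hardware == software, (hardware, software)
print("Software:", software, "Hardware model:", hardware)
```

Because both sides work on the same integers, the check is an exact equality rather than a tolerance test.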
The simulation generates a standard .vcd file. You can open it with GTKWave to inspect the cycle-by-cycle electrical signals of the MAC unit:
# Install GTKWave from your package manager or from source:
# https://gtkwave.github.io/gtkwave/install/unix_linux.html
gtkwave mac_simulation.vcd
GTKWave can be intimidating at first glance. Follow these quick steps to visualize your MAC unit's clock cycles:
- Find the component: In the top-left pane (SST / Search Hierarchy), click on `bench` then `top`.
- Select the signals: In the bottom-left pane, you will see all the pins of our MAC unit (`clk`, `clear`, `pixel_in`, `weight_in`, `result_out`).
- Append them: Select all signals (Shift+Click), then click the Append button at the bottom of that pane.
- Adjust the view: Click the Zoom Fit button to fit the entire simulation into your screen.
- Analyze: Click anywhere on the waveform to see the exact integer values at any given microsecond, and watch `result_out` update on every rising edge of `clk`.
To remove all runtime-generated files and directories (data/, checkpoints/, .venv/, __pycache__/, mac.v, *.vcd):
./cleaner.sh # Interactive: previews targets and asks for confirmation
./cleaner.sh --force # Non-interactive: skips prompt
After training or co-simulation, PyTorch may hold onto GPU memory. To release it without restarting your Python process:
uv run edgevision-purge
Replace `uv run` with your manager's equivalent (see substitution table above).
This runs Python's garbage collector, empties the CUDA/ROCm cache, resets memory statistics, and attempts to return fragmented RAM to the OS (Linux only via malloc_trim).
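The sequence described above can be approximated as follows. This is a hedged sketch rather than the contents of `data_purge.py`: the function name `purge_memory` and the returned report are assumptions, and PyTorch is imported optionally so the example also runs on a machine without it.

```python
import ctypes
import gc
import sys


def purge_memory():
    """Best-effort memory release; returns which steps actually ran."""
    report = {
        "gc_collected": gc.collect(),  # number of unreachable objects found
        "cuda_cache_emptied": False,
        "malloc_trim": False,
    }

    try:
        import torch  # optional here so the sketch runs without PyTorch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()              # same API on ROCm builds of PyTorch
        torch.cuda.reset_peak_memory_stats()
        report["cuda_cache_emptied"] = True

    if sys.platform.startswith("linux"):
        try:
            # glibc-specific call asking the allocator to return free memory to the OS.
            ctypes.CDLL("libc.so.6").malloc_trim(0)
            report["malloc_trim"] = True
        except (OSError, AttributeError):
            pass  # libc without malloc_trim (e.g. musl)

    return report


if __name__ == "__main__":
    print(purge_memory())
```

Note that `empty_cache` only releases cached blocks that no live tensor still references, so any model or tensor variables need to be deleted first.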
While this repository successfully demonstrates the core Software-to-Hardware bridge and cycle-accurate MAC operations, scaling this into a full-fledged Edge AI accelerator (ASIC/FPGA) would require the following architectural additions:
- On-Chip Memory (SRAM/BRAM): Integrating local memory blocks to store the quantized weights and input feature maps directly on the chip, reducing off-chip memory bottlenecks.
- Control Logic (FSM): Implementing a Finite State Machine to orchestrate the read/write addresses, controlling the loops over the image patches without needing Python to inject data cycle-by-cycle.
- Systolic Array / Spatial Architecture: Expanding the single MAC unit into a 2D array of MACs to process multiple pixels and filters in parallel, maximizing throughput.
- System Bus Integration: Wrapping the accelerator with an industry-standard bus interface (e.g., AXI4 or AXI-Stream) to allow a host processor (like an ARM Cortex or RISC-V) to offload ML tasks to our IP.