End-to-end flow for taking a PyTorch model through quantization, per-target kernel generation (reference, hand-curated, or LLM-written), Zephyr build, and validation on spike, RTL sim, or FireSim. Plus the multi-model + XPURT-schedule layer that runs N networks on M cores in one binary with explicit core pinning and inter-network synchronization.
Full canonical pipeline diagram (workload JSON → scheduler → codegen → FireSim → trace plot):
modelblaster/notes/pipeline_overview.md. Parent-repo cross-link to the XPURT side of the same flow: ../../docs/end_to_end_xpurt_firesim.md.
modelblaster/
models/ PyTorch model classes (one .py per model)
pipeline/ codegen — extract IR, emit C, pick kernels, build, profile
reference_kernels KernelSpec per op: signature, semantics, scalar C oracle,
AlgorithmCandidate list (the "alternatives" the picker
and the LLM are allowed to consider)
kernels/ hand-curated kernels, organized by HW target
cores/ vendored target SDKs (gemmini.h, saturn_opu.h, ...)
harness/ single-model Zephyr app template
harness_multi/ N-models-in-one-ELF harness (for pool sweeps)
harness_xpurt/ schedule-driven multi-model harness (XPU-RT execution)
harness_microros/ micro-ROS variant (DDS broker + N model nodes)
validation/ spike + firesim runners; profile CSV writer
examples/ per-model run.sh + cached artifacts
notes/ working design docs (deep-dives by topic)
One-time per shell:
source tools/miniforge3/etc/profile.d/conda.sh && conda activate zephyr
source scripts/set_envvars_sdk.sh
source ../set_api_keys.sh # only for BACKEND=llm or --optimize

The simplest single-model run:
# scalar fp32 reference kernels on spike — fastest sanity check
bash modelblaster/examples/mlp_generic/run.sh
# int8 PTQ, rvv backend, with curated kernels probed before LLM fallback
QUANT=int8 TARGET=rvv BACKEND=reference \
GLOBAL_CURATED_DIR=$PWD/modelblaster/kernels \
bash modelblaster/examples/dronet/run.sh
# fp16 + RVV+Zvfh widening on spike (the rvv_f16 backend)
QUANT=fp16 TARGET=rvv BACKEND=reference \
GLOBAL_CURATED_DIR=$PWD/modelblaster/kernels \
bash modelblaster/examples/vint/run.sh
# Saturn OPU integer matmul, spike via the custom OPU extension
QUANT=int8 TARGET=rvv_opu BACKEND=reference \
GLOBAL_CURATED_DIR=$PWD/modelblaster/kernels \
bash modelblaster/examples/dronet/run.sh
# FireSim runtime (any backend; runner copies the elf, runs infrasetup +
# runworkload, tails uartlog until OUTPUT_END markers)
RUNNER=firesim QUANT=int8 TARGET=rvv \
bash modelblaster/examples/dronet/run.sh

Each stage's outputs are deterministic, on disk, and re-enterable.
[1] extract_graph[_export] PyTorch → graph.json + weights.npz + io.npz
(per-quant; int8 PTQ + fp16 cast + mixed-prec
auto-cast all live in extract)
[2] generate_skeleton IR → model.{c,h} + weights.{c,h} + test_io.h
+ buffers.c (per-net scratch; extern-shared
across backends for the het multi-net path)
[3] generate_kernels IR → kernels.{c,h}
three sources, in priority order:
global curated dir (modelblaster/kernels/)
per-model LLM cache
LLM generation (--backend llm)
fastest-wins among curated + cached
optional --optimize beam-search per op
[4] west build modelblaster/harness + generated/<target>/* → .elf
(or harness_multi / harness_xpurt for the
multi-net + schedule paths)
[5] spike or firesim run the elf, parse OUTPUT/PROFILE/WALL
markers, compare to PyTorch / int8-sim
golden, write profile.csv
Single-model orchestration: modelblaster/examples/<model>/run.sh sources
modelblaster/examples/_run_lib.sh. Multi-net + schedule: multi_demo/run.sh
and xpurt_demo/run.sh chain the per-model flow then run a single
harness build linking N models.
Registered in modelblaster/pipeline/backends.py::BACKENDS. Each is a
Backend(...) declaration plus a harness/backends/<name>.conf
overlay; nothing else hard-codes per-target logic.
| target | base ISA | extras | verify path |
|---|---|---|---|
| scalar | rv64imafdc | none | host ctypes |
| scalar_f16 | rv64imafdc | Zfh | host ctypes |
| rvv | rv64gcv | none | spike harness |
| rvv_f16 | rv64gcv | Zfh + Zvfh | spike harness |
| rvv_opu | rv64gcv | Saturn OPU custom .insn (i8 outer-product) | spike harness (needs OPU spike fork; see below) |
| gemmini | rv64imafdc | Gemmini int8 RoCC (DIM=16, f32 acc_scale) | chipyard spike harness |
| gemmini_q31 | rv64imafdc | Gemmini int8 RoCC + Q0.31 mvout requantize | chipyard spike harness |
| QUANT | what it does |
|---|---|
| fp32 | no quantization; reference and curated kernels operate on float. |
| fp16 | extract path casts to fp16; needs an _f16 backend variant. |
| int8 | per-tensor symmetric PTQ; one calibration sample drives scale choice. |
| int8 + --per-channel | per-output-channel weight scales for conv/linear (CMSIS-NN / TFLite convention). |
| Mixed precision | per-op overrides via get_precision_spec() in the model file. Walker inserts cast_i8_to_f16 / cast_f16_to_i8 at dtype boundaries. Most useful when one op family (e.g. ViNT's goal-encoder linear) has too wide a range for int8 but the rest of the net is fine. See modelblaster/notes/mixed_precision_plan.md. |
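The difference between the per-tensor and per-channel int8 rows can be illustrated with a minimal sketch of symmetric quantization. It assumes the common convention scale = max|x| / 127 with round-to-nearest and clamping to [-127, 127]; the function names are hypothetical and this is not the extractor's actual code.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` onto qmax."""
    peak = max((abs(v) for v in values), default=0.0)
    return peak / qmax if peak > 0 else 1.0


def quantize(values, scale, qmax=127):
    """Round to the nearest integer step and clamp to the symmetric range."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def per_channel_scales(weight_rows):
    """One scale per output channel (one row of a linear weight matrix)."""
    return [symmetric_scale(row) for row in weight_rows]


# Two output channels with very different magnitudes (illustrative values).
weights = [[0.6, -1.0, 0.25], [0.01, 0.03, -0.04]]
flat = [w for row in weights for w in row]

per_tensor = symmetric_scale(flat)        # one scale for the whole tensor
per_channel = per_channel_scales(weights)  # one scale per row

q_tensor = [quantize(row, per_tensor) for row in weights]
q_channel = [quantize(row, s) for row, s in zip(weights, per_channel)]
```

With a single per-tensor scale the small second channel collapses onto a handful of integer levels, while the per-channel scales let each row use the full int8 range.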
_run_lib.sh auto-promotes TARGET=rvv → rvv_f16 (or scalar →
scalar_f16) when the IR contains any _f16 op, so mixed-precision
runs don't need extra environment variables.
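A minimal sketch of that promotion rule, written in Python for illustration. The real logic lives in _run_lib.sh; the IR layout assumed here (a list of op dicts with an "op" field), the op names other than the cast ops and conv2d_s8, and the function name are all hypothetical.

```python
# Promotion pairs taken from the paragraph above.
F16_PROMOTIONS = {"rvv": "rvv_f16", "scalar": "scalar_f16"}


def promote_target(target, ops):
    """Return the _f16 variant of `target` when any op name contains '_f16'."""
    has_f16 = any("_f16" in op.get("op", "") for op in ops)
    if has_f16:
        # Targets without a listed _f16 variant are returned unchanged.
        return F16_PROMOTIONS.get(target, target)
    return target


mixed_ops = [{"op": "linear_s8"}, {"op": "cast_i8_to_f16"}, {"op": "linear_f16"}]
int8_ops = [{"op": "conv2d_s8"}, {"op": "linear_s8"}]

promoted = promote_target("rvv", mixed_ops)
unchanged = promote_target("rvv", int8_ops)
```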
Each example dir contains a tiny run.sh plus a <quant>/cache/<target>/
of post-verify curated/LLM kernels that persist in git.
| example | what it is | notable |
|---|---|---|
| mlp_generic | random-init 16→32→32→10 MLP | smallest demo |
| mlp_control | trained rsl_rl PPO actor (steering) | trained weights |
| lenet | random-init LeNet-5 | first int8 PTQ smoke target |
| mobilenet_v2 | MobileNetV2 stem (no classifier) | exercises depthwise + SE |
| dronet | trained DroNet (3×112×112, steer+collision) | the canonical "real" model |
| yolov8_nano | YOLOv8 nano stem | int8 + RVV cache-blocked conv2d_s8 |
| vint | ViNT visual-navigation transformer | torch.export path, mixed-precision, mt-attention |
| microros_demo | micro-ROS broker + N model nodes | runtime + DDS integration |
| multi_demo | run N models in one ELF | pool-size sweeps, profile amortization |
| xpurt_demo | run an XPURT schedule.json on the harness | het core pinning, k_sem chains |
| fp16_smoke, gemmini_smoke, gemmini_unittests, kernelbench, v_save_smoke | targeted unit tests | each isolates one ISA / quant feature |
| var | values | default | notes |
|---|---|---|---|
| BACKEND | reference, llm | reference | source of impls. reference also probes GLOBAL_CURATED_DIR. |
| TARGET | scalar, rvv, rvv_f16, rvv_opu, gemmini, gemmini_q31, ... | scalar | HW backend. |
| QUANT | fp32, fp16, int8 | fp32 | quant pass at extract. |
| OPTIMIZE | 0, 1 | 0 | beam-search after correctness. requires BACKEND=llm. |
| ALGORITHMS | all, csv | all | per-op algorithm filter (e.g. direct,im2col_gemm). |
| BEAM, EXPANSIONS, ITERATIONS | int | 2, 3, 2 | beam-search knobs. |
| GLOBAL_CURATED_DIR | path | unset | enables the modelblaster/kernels/ probe; safe to leave on. |
| RUNNER | spike, firesim | spike | downstream simulator. |
| FIRESIM_TIMEOUT | seconds | 600 | wallclock cap for firesim runworkload. |
| FIRESIM_SKIP_INFRASETUP | 0, 1 | 0 | skip firesim infrasetup (advanced; only when the bitstream + driver are known fresh). |
| MAX_ACCURACY_CLASS | bit_exact, numeric_drift, approximate | unset | tighten verify (drop curated algos with looser declared class). |
| FIRESIM_EVAL, CACHE_AWARE_PROMPT | 0, 1 | 0 | optimize-phase FireSim re-rank + cache-aware prompt. See notes/firesim_eval_design.md. |
Bedrock model id is meta.llama4-maverick-17b-instruct-v1:0 by default,
overridable via MODEL.
modelblaster/examples/<model>/<quant>/
generated/
graph.json
weights.npz
io.npz PyTorch reference input/output + golden
profile.csv last spike/firesim run's per-kernel cycles
<target>/ per-backend codegen output
model.{c,h} run_model() driver + rdcycle profile array
weights.{c,h} packed const arrays (per-backend layout)
kernels.{c,h} per-op implementations
buffers.c scratch storage (extern-shared in het multi-net)
test_io.h model_test_input + model_test_golden
optimize_summary.json beam-search history (--optimize only)
build/<target>/ west build tree → zephyr.elf
cache/<target>/ PASSing kernels keyed <target>_<op>_<algo>.c
(committed to git so re-runs skip LLM)
generated/ and build/ are regenerated by run.sh and gitignored.
cache/ is not gitignored — successful kernels persist across
machines. modelblaster/kernels/ is the global curated dir (hand-authored
kernels reusable across models); the per-model cache/ is its
LLM-iterated cousin.
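To make the lookup concrete, here is a small sketch of probing the two on-disk sources for a kernel file named <target>_<op>_<algo>.c. It assumes both the curated dir and the cache hold one subdirectory per target, as described above. The function name and the op names in the demo are hypothetical, and the fastest-wins comparison between curated and cached candidates is left out.

```python
import tempfile
from pathlib import Path


def find_kernel(target, op, algo, curated_dir, cache_dir):
    """Return (source, path) for the first existing kernel file, else None.

    Probes the global curated dir first, then the per-model cache. A None
    result is the point where LLM generation would take over.
    """
    filename = f"{target}_{op}_{algo}.c"
    candidates = []
    if curated_dir is not None:  # GLOBAL_CURATED_DIR may be unset
        candidates.append(("curated", Path(curated_dir) / target / filename))
    candidates.append(("cache", Path(cache_dir) / target / filename))
    for source, path in candidates:
        if path.is_file():
            return source, path
    return None


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    curated = Path(tmp) / "kernels"
    cache = Path(tmp) / "cache"
    (curated / "rvv").mkdir(parents=True)
    (cache / "rvv").mkdir(parents=True)
    (cache / "rvv" / "rvv_conv2d_s8_direct.c").write_text("/* cached */\n")
    (curated / "rvv" / "rvv_linear_s8_direct.c").write_text("/* curated */\n")
    (cache / "rvv" / "rvv_linear_s8_direct.c").write_text("/* cached */\n")

    hit_cache = find_kernel("rvv", "conv2d_s8", "direct", curated, cache)
    hit_curated = find_kernel("rvv", "linear_s8", "direct", curated, cache)
    miss = find_kernel("rvv", "softmax", "direct", curated, cache)
    sources = (hit_cache[0], hit_curated[0], miss)
```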
Three depths of profile, from cheapest to most accurate:
Default — no extra flags. run_model() brackets each kernel call with
rdcycle() (read of the mcycle CSR — 1 insn). spike_runner parses
the MODELBLASTER_PROFILE_BEGIN/END block from stdout and writes
generated/profile.csv with (dispatch_id, name, op, shape, cycles).
QUANT=int8 TARGET=rvv BACKEND=reference \
bash modelblaster/examples/dronet/run.sh
cat modelblaster/examples/dronet/int8/generated/profile.csv

Same flow with RUNNER=firesim. The runner takes care of the XDMA chmod,
runs firesim infrasetup, then runworkload, and tails the uartlog
until the expected MODELBLASTER_WALL_CYCLES count is hit.
RUNNER=firesim QUANT=int8 TARGET=rvv \
bash modelblaster/examples/dronet/run.sh

The same profile.csv format is produced; spike and FireSim differ only in the cycle counts (FireSim reflects pipeline and cache-locality effects that spike can't model).
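Since both runners write the same columns, the two runs can be compared per dispatch with a few lines of Python. This sketch assumes profile.csv starts with a header row naming the columns (dispatch_id, name, op, shape, cycles); the sample rows, shapes, and cycle counts are made up for illustration.

```python
import csv
import io


def load_cycles(csv_text):
    """Map dispatch_id -> (name, cycles) from profile.csv contents."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {int(r["dispatch_id"]): (r["name"], int(r["cycles"])) for r in rows}


def slowdown(spike_csv, firesim_csv):
    """Per-dispatch FireSim/spike cycle ratio, keyed by kernel name."""
    spike = load_cycles(spike_csv)
    firesim = load_cycles(firesim_csv)
    return {
        spike[d][0]: firesim[d][1] / spike[d][1]
        for d in sorted(spike)
        if d in firesim and spike[d][1] > 0
    }


# Illustrative data only; real files come from generated/profile.csv.
SPIKE = """dispatch_id,name,op,shape,cycles
0,conv1,conv2d_s8,1x3x112x112,1000
1,fc,linear_s8,1x128,200
"""
FIRESIM = """dispatch_id,name,op,shape,cycles
0,conv1,conv2d_s8,1x3x112x112,2500
1,fc,linear_s8,1x128,300
"""

ratios = slowdown(SPIKE, FIRESIM)
```

Dispatches with a high ratio are the ones where memory behaviour matters most, which makes them natural candidates for FireSim-side tuning.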
When PROFILE_OUT_ROOT is set, the runner additionally writes IREE-
schema results.csv files at
gen/profile/<backend>/<cpu>/<model>/.../topo_<cores>/results.csv.
XPU-RT consumes these directly. See modelblaster/notes/profile_emission.md.
PROFILE_OUT_ROOT=gen/profile \
PROFILE_CPU=firesim_rocket_saturn PROFILE_CORES=0,1,2,3 \
PROFILE_CLOCK_MHZ=1000.0 \
RUNNER=firesim QUANT=int8 TARGET=rvv \
bash modelblaster/examples/dronet/run.sh

modelblaster/examples/multi_demo/run.sh builds one ELF that runs every
constituent model under each pool size in succession — useful for
amortizing FireSim infrasetup across a sweep:
MODELS=dronet,yolov8_nano TARGET=rvv QUANT=int8 \
POOL_SIZES=1,2,4 RUNNER=firesim \
bash modelblaster/examples/multi_demo/run.sh

Produces one topo_<cores>/results.csv per pool size, side by side.
With BACKEND=llm OPTIMIZE=1, each op's algorithms run through a
beam-search:
BACKEND=llm OPTIMIZE=1 TARGET=rvv \
bash modelblaster/examples/lenet/run.sh

Each candidate must verify AND have lower cycles than its parent to
survive. Winners are written into the per-model cache/<target>/.
With FIRESIM_EVAL=1, the top-K spike survivors get re-ranked on
FireSim for cache-locality wins spike misses. See
notes/firesim_eval_design.md.
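The survival rule can be sketched as a generic beam search. This is a minimal illustration under stated assumptions, not the pipeline's implementation: how BEAM, EXPANSIONS, and ITERATIONS interact is guessed from their names, and expand, verify, and cycles are hypothetical callables standing in for the LLM call, the correctness check, and the cycle measurement.

```python
def beam_search(seed, expand, verify, cycles, beam=2, expansions=3, iterations=2):
    """Keep the `beam` fastest verified candidates after each round.

    expand(parent, n) -> list of n child candidates
    verify(candidate) -> bool, correctness against the reference oracle
    cycles(candidate) -> int, measured cost (lower is better)
    """
    frontier = [seed]
    for _ in range(iterations):
        survivors = list(frontier)  # parents stay eligible
        for parent in frontier:
            parent_cost = cycles(parent)
            for child in expand(parent, expansions):
                # A child survives only if it verifies AND beats its parent.
                if verify(child) and cycles(child) < parent_cost:
                    survivors.append(child)
        frontier = sorted(set(survivors), key=cycles)[:beam]
    return frontier[0]


# Toy problem: a candidate is just its own cycle count.
def toy_expand(parent, n):
    return [parent - 10 * (i + 1) for i in range(n)]


def toy_verify(candidate):
    return candidate % 20 == 0  # pretend only these pass verification


best = beam_search(100, toy_expand, toy_verify, cycles=lambda c: c)
```

In the toy run the seed costs 100 cycles; the first round keeps 80 and the second round reaches 60, while faster but failing candidates such as 70 and 50 are discarded.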
The single-model flow above produces the inputs the XPURT scheduler
needs (per-dispatch IREE-shape profile CSVs). The schedule comes back
as a JSON; xpurt_demo consumes it.
# For each model and each candidate HW backend, emit one IREE-schema
# profile CSV at the chosen pool size:
for m in dronet yolov8_nano; do
  for t in scalar rvv gemmini_q31; do
    PROFILE_OUT_ROOT=gen/profile \
    PROFILE_CPU=firesim_rocket_saturn PROFILE_CORES=0,1,2,3 \
    PROFILE_CLOCK_MHZ=1000.0 \
    RUNNER=firesim QUANT=int8 TARGET=$t \
    bash modelblaster/examples/$m/run.sh
  done
done

Edit a top-level data/toplevel/<workload>.json in the parent
FreshScheduler repo:
# From the parent FreshScheduler repo root:
python scripts/run_xpurt_schedule.py \
data/toplevel/<workload>.json \
--scheduler greedy_periodic \
--out schedules/scheduled_<workload>.json

The scheduler reads gen/profile/.../results.csv per (network,
backend), runs the MILP / greedy assignment, and emits
schedules/scheduled_<workload>.json plus a predicted-timeline plot.
SCHEDULE_JSON=$PWD/schedules/scheduled_<workload>.json \
MODELS=dronet,yolov8_nano \
BACKENDS=scalar,rvv,gemmini_q31 \
QUANT=int8 \
RUNNER=firesim \
XPURT_TRACE=1 \
bash modelblaster/examples/xpurt_demo/run.sh

The harness links every (model × backend) object library; the dispatch
table generated from the schedule selects the right one per entry. With
XPURT_TRACE=1, the uartlog includes per-entry begin/end timestamps
that modelblaster/scripts/plot_xpurt_trace.py renders as a Gantt vs the
predicted timeline.
See modelblaster/notes/scheduler_investigation.md for the schedule.json
format and modelblaster/notes/dispatch_and_cores.md for the core-registry
and pinning model.
pipeline/backends.py is usually the only Python file that changes.
Register a Backend(...) entry:
NEW_TGT = Backend(
    name="new_tgt",
    description="…",
    kernel_cflags=("-march=…", "-mabi=lp64d", "-DMODELBLASTER_NEW_TGT=1"),
    kernel_includes=("<some_header.h>",),
    prj_conf_overlay="new_tgt.conf",
    spike_args=("--isa=…",),  # if any
    optimization_guide="optimization_guide_new_tgt.md",
    verify_method=VERIFY_SPIKE_HARNESS,  # or VERIFY_HOST_CTYPES
    # atol_override / rtol_override if needed
)
BACKENDS[NEW_TGT.name] = NEW_TGT

Then drop the supporting files:
modelblaster/harness/backends/new_tgt.conf # Kconfig overlay
modelblaster/pipeline/prompts/optimization_guide_new_tgt.md # LLM guide (optional —
can reuse scalar/rvv)
modelblaster/cores/new_tgt/include/... # vendored SDK headers (optional)
modelblaster/kernels/new_tgt/ # curated kernels go here
If the backend has a vendored SDK (gemmini, OPU), include paths use the
<repo_root> placeholder — Backend.resolved_kernel_cflags()
substitutes at build time. If the backend needs a custom spike fork
(gemmini, rvv_opu), wire the --spike lookup in
modelblaster/examples/_run_lib.sh (mirror the existing MODELBLASTER_GEMMINI_SPIKE
/ MODELBLASTER_OPU_SPIKE env knobs).
Worked example: modelblaster/notes/saturn_opu_backend.md documents the
full set of changes to add the OPU backend, end to end.
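The <repo_root> substitution can be pictured with the following sketch. It is a hypothetical stand-in, not the actual Backend class: only the kernel_cflags field and the resolved_kernel_cflags() method name come from the text above, and passing the root as an argument is an assumption made to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendSketch:
    """Hypothetical, trimmed-down stand-in for the real Backend declaration."""
    name: str
    kernel_cflags: tuple = ()

    def resolved_kernel_cflags(self, repo_root):
        """Replace every <repo_root> placeholder with the given root."""
        root = str(repo_root).rstrip("/")
        return tuple(flag.replace("<repo_root>", root) for flag in self.kernel_cflags)


demo = BackendSketch(
    name="new_tgt",
    kernel_cflags=(
        "-mabi=lp64d",
        "-I<repo_root>/modelblaster/cores/new_tgt/include",
        "-DMODELBLASTER_NEW_TGT=1",
    ),
)
flags = demo.resolved_kernel_cflags("/work/repo")
```

Keeping the placeholder in the declaration means the registry never hard-codes a checkout location; the path is only bound when the flags are resolved.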
- Drop modelblaster/models/<name>.py with get_model() (returns a torch nn.Module with weights loaded) and get_sample_input() (returns the calibration / golden input tensor). For trained models, load weights inside get_model() from a checkpoint path. Optionally define get_precision_spec() for per-op mixed-precision overrides ({"default": "int8", "fp16_upstream_of": ["op_name"], "fp16_ops": [...]}).
- Register the name in modelblaster/pipeline/extract_graph.py's --model choices, OR, for models that don't FX-trace (anything with nn.TransformerEncoder internals, len(...), etc.), add a torch.export branch in modelblaster/pipeline/extract_graph_export.py. See ViNT's example for the export-path pattern.
- Copy modelblaster/examples/mlp_generic/run.sh to modelblaster/examples/<name>/run.sh and change MODEL_NAME=<name>.
- If the model uses ops not yet registered, add them (see below).
- (Optional) calibration data: drop modelblaster/datasets/<spec>.json pointing at a list of input tensors; the extractor's per-channel activation calibration consumes it.
- New KernelSpec in modelblaster/pipeline/reference_kernels.py::KERNEL_SPECS:
  - signature (exact string used in kernels.h)
  - semantics (English description for the LLM prompt)
  - reference_impl (correct naive scalar C: the verify oracle and the --backend reference output)
  - extra_shapes (verify shapes beyond what the IR happens to have)
  - argtypes_factory (ctypes signature for host verify)
  - algorithms list (optional: AlgorithmCandidates with target_affinity, weight_layout, accuracy_class)
- Wire the op in extract_graph[_export].py (FX/export node → IR op) and generate_skeleton.py (IR op → kernel call site).
- For int8 op kinds, add the matching path in extract_graph.py's integer pipeline simulator so the bit-exact golden stays in sync.
- Verify with --backend reference first; then write a curated kernel for the relevant target.
A curated kernel is a hand-written .c file at
modelblaster/kernels/<target>/<target>_<op>_<algo>.c. The pipeline picks
it up automatically when GLOBAL_CURATED_DIR is set, as long as the
algorithm name is registered in reference_kernels.py with
target_affinity=("<target>",).
Minimum recipe:
- Write the .c file. First two lines must be:

      /* source: curated */
      /* algorithm: <algo_name> */

  Body implements the canonical signature from the KernelSpec.
- Add an AlgorithmCandidate in the matching spec:

      AlgorithmCandidate(
          name="<algo_name>",
          target_affinity=("<target>",),
          description="…",
          reference_impl="",  # the curated file supplies it
      ),

- Run any example with the matching target; the log shows "curated swap from .../<file>.c" when the kernel gets picked up.
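The two-line header requirement is easy to check before committing a kernel. The sketch below is a hypothetical pre-commit style helper, not part of the pipeline; it assumes the header lines must match exactly apart from surrounding whitespace and that algorithm names contain no spaces.

```python
import re

SOURCE_LINE = "/* source: curated */"
ALGO_RE = re.compile(r"^/\* algorithm: (\S+) \*/$")


def curated_algorithm(c_source):
    """Return the algorithm name if the two-line header is valid, else None."""
    lines = c_source.splitlines()
    if len(lines) < 2 or lines[0].strip() != SOURCE_LINE:
        return None
    match = ALGO_RE.match(lines[1].strip())
    return match.group(1) if match else None


good = "/* source: curated */\n/* algorithm: im2col_gemm */\nvoid k(void) {}\n"
swapped = "/* algorithm: im2col_gemm */\n/* source: curated */\n"

algo = curated_algorithm(good)
rejected = curated_algorithm(swapped)
```

The returned name can then be compared against the AlgorithmCandidate name registered in reference_kernels.py, since a mismatch there would keep the kernel from being picked up.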
Worked examples in this repo:
- rvv_f16 widening MAC (linear / conv2d / depthwise): modelblaster/kernels/rvv_f16/. Ported from the canonical scalar fp16 reference, vectorized via vfwmacc.
- gemmini_q31 tiled conv + linear: modelblaster/kernels/gemmini_q31/. Routes through gemmini RoCC with bit-exact Q0.31 requantize.
- rvv_opu outer-product matmul + linear: modelblaster/kernels/rvv_opu/. Ported from upstream saturn benchmarks/opu-gemm/kernel.h::i8_mm_bme_sq; cited in the file headers. Exercises the Saturn OPU custom .insn programming model.
See modelblaster/kernels/README.md for the curated-vs-cache distinction
and the picker priority order.
For backends with custom instructions, you need a spike build that decodes them. Two existing paths:
- gemmini: chipyard ships a --extension=gemmini spike fork. modelblaster/examples/_run_lib.sh finds it via the MODELBLASTER_GEMMINI_SPIKE env, defaults to /scratch2/dima/chipyard-fsim/.conda-env/....
- rvv_opu: custom spike extension at hw/chipyard/toolchains/riscv-tools/riscv-isa-sim/customext/saturn_opu.cc (in-repo functional model of VOPACC/OPMVINBCAST/VMV_VR/VMV_RV). _run_lib.sh finds the built spike via MODELBLASTER_OPU_SPIKE. See notes/saturn_opu_spike_support.md for build instructions.
For a brand-new accelerator, the path is the same: extend
riscv-isa-sim/customext/ with a functional model, register via
REGISTER_EXTENSION, and point _run_lib.sh at the built binary.
The modelblaster/notes/ directory holds focused design notes per topic.
Highlights for this README's surface area:
| topic | note |
|---|---|
| Canonical pipeline diagram | pipeline_overview.md |
| int8 PTQ flow | int8_quantization_flow.md |
| Mixed-precision plan + experiments | mixed_precision_plan.md, vint_mixed_precision_experiments.md |
| Per-dispatch profile schema (IREE-shape) | profile_emission.md |
| FireSim re-rank in the optimize loop | firesim_eval_design.md |
| XPURT schedule format + the dispatch table | scheduler_investigation.md, dispatch_and_cores.md |
| Multi-model threading + modelblaster_pool | multi_model_threading.md |
| POSIX affinity on Zephyr | posix_affinity_investigation.md |
| Saturn OPU backend status | saturn_opu_backend.md |
| Saturn OPU spike extension design | saturn_opu_spike_support.md |
| Gemmini extension status | gemmini_extension_plan.md, gemmini_firesim_status.md |
| Gemmini LUT optimization (FPGA-side) | gemmini_lut_optimization.md |
| Saturn FP-precision stripping (FPGA area) | saturn_fp_precision_stripping.md |
| Conv weight layout (OIHW / HWIO / IHWOC) | conv_weight_layout_decisions.md |
| Caveats from real bugs (Saturn strided memop, V context, FireSim quirks) | saturn_strided_memop_bug.md, firesim_* |
- Spike is an ISA simulator with flat memory. Cycle counts reward pipeline-pattern wins (multiple accumulators, breaking fp dependency chains, unrolling); they're blind to cache locality. Use RUNNER=firesim for memory-realistic profiling.
- Reference impls are the trusted oracle, not the signature strings in kernels.h. If you change a KernelSpec.signature, the reference impl's first line must match or host-ctypes verify will silently misalign.
- conv2d_s8 RVV via OPU im2col: not yet curated; conv2d on the OPU backend currently falls back to scalar reference.
- Saturn OPU bitstream availability: the V256D128 OPU+Q31Gemmini config exists in scala but the FireSim bitstream side is in flux (see saturn_opu_backend.md). Spike + verilator paths work today.
- Stale Vitis cmake on PATH: run.sh prepends /usr/bin to dodge it; do the same if you invoke west outside run.sh.
{
  "machines": {
    "CPU_P": "rvv",
    "CPU_E": "scalar",
    "GEMMINI": "gemmini_q31"
  },
  "networks": [
    {"name": "dronet", "period_ms": 50},
    {"name": "yolov8_nano", "period_ms": 100}
  ],
  "profile_target": "firesim_rocket_saturn"
}
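Assuming the JSON above is an example of a data/toplevel/<workload>.json file, a quick sanity check can catch typos before the scheduler runs. This sketch relies only on the three keys visible in that example; the set of known targets is copied from the backend table, while the function name and the rule that period_ms must be positive are assumptions.

```python
import json

KNOWN_TARGETS = {
    "scalar", "scalar_f16", "rvv", "rvv_f16", "rvv_opu", "gemmini", "gemmini_q31",
}


def check_workload(text):
    """Return a list of problems found in a workload JSON string."""
    data = json.loads(text)
    problems = []
    for machine, target in data.get("machines", {}).items():
        if target not in KNOWN_TARGETS:
            problems.append(f"machine {machine}: unknown target {target!r}")
    for net in data.get("networks", []):
        if net.get("period_ms", 0) <= 0:  # assumed rule, not from the source
            problems.append(f"network {net.get('name')}: period_ms must be positive")
    if not data.get("profile_target"):
        problems.append("profile_target is missing")
    return problems


WORKLOAD = """
{ "machines": { "CPU_P": "rvv", "CPU_E": "scalar", "GEMMINI": "gemmini_q31" },
  "networks": [ {"name": "dronet", "period_ms": 50},
                {"name": "yolov8_nano", "period_ms": 100} ],
  "profile_target": "firesim_rocket_saturn" }
"""
TYPO = WORKLOAD.replace('"rvv"', '"rvvv"')

ok_problems = check_workload(WORKLOAD)
typo_problems = check_workload(TYPO)
```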