
Synapse

Edge-native inference stack built from scratch in Rust + Zig + Metal.

Synapse is a modular local inference engine with native SIMD kernels, optional Metal acceleration, an embeddable C boundary, and a pure-Rust WASM runtime for browser demos. The strongest near-term wedge is local inference across native and browser targets, not generic framework sprawl.

Positioning Snapshot

Edge-native inference stack for local ML across native and browser targets.

  • Native builds use Rust orchestration with Zig SIMD kernels and optional Metal acceleration.
  • Browser builds use a pure-Rust WASM runtime for portability and client-side demos.
  • Public benchmark rows are measured on Apple Silicon and synced from status/benchmark_matrix.json.

Benchmark Snapshot

| Family | Configuration | Prompt | Prefill (tok/s) | Decode (tok/s) | Notes |
| --- | --- | --- | --- | --- | --- |
| Qwen3 | f32 CPU | hello | 11 | 7.3 | Runtime backend=cpu_simd; prompt=hello |
| Qwen3 | INT8 CPU | hello | 23 | 27.3 | Runtime backend=cpu_simd; prompt=hello |
| LLaMA 3.2 | f32 CPU | hello | 1 | 2.1 | Runtime backend=cpu_simd; prompt=hello |
| LLaMA 3.2 | INT8 CPU | hello | 8 | 9.7 | Runtime backend=cpu_simd; prompt=hello |
| Reference | llama.cpp Q4_K_M | reference_only | 5518 | 173 | Reference only, not a parity claim |

Runtime Profiles

| Runtime Profile | Support | Targets | Backends | Quantization |
| --- | --- | --- | --- | --- |
| Native Performance | Stable | aarch64-apple-darwin, x86_64-unknown-linux-gnu | cpu_simd, metal | f32, f16, int8, q4_0, q4_k, q6_k, q8_0 |
| ARM Compact | Beta | aarch64-unknown-linux-musl, aarch64-unknown-linux-gnu | cpu_simd | f32, int8, q4_0, q4_k |
| WASM Portable | Stable | wasm32-unknown-unknown | pure_rust_wasm | f32 |
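How a runtime might route between these profiles can be sketched with Rust's `cfg!` macro. This is a minimal illustration only: the `Backend` enum and `select_backend` function are hypothetical names, not Synapse's actual API.

```rust
// Hypothetical sketch of per-target backend selection; names are illustrative.
#[derive(Debug, PartialEq)]
enum Backend {
    CpuSimd,      // Zig SIMD kernels on native targets
    Metal,        // optional GPU acceleration on Apple Silicon
    PureRustWasm, // portable f32-only browser runtime
}

fn select_backend(prefer_gpu: bool) -> Backend {
    if cfg!(target_arch = "wasm32") {
        Backend::PureRustWasm
    } else if prefer_gpu && cfg!(all(target_os = "macos", target_arch = "aarch64")) {
        Backend::Metal
    } else {
        Backend::CpuSimd
    }
}
```

Using `cfg!` rather than `#[cfg]` keeps every branch compiling on every target, which catches breakage in rarely built backends early.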

Artifact Budgets

| Artifact | Current | Budget | Status |
| --- | --- | --- | --- |
| WASM core | ~158 KB | ~160 KB | ok |
| WASM JS wrapper | ~20 KB | ~32 KB | ok |

Architecture

```
synapse/
├── zig/src/ops/           # SIMD kernels (ARM NEON): matmul, RMSNorm, SiLU, INT8, KV-cache
├── crates/
│   ├── synapse-core/      # Core tensor ops, Zig FFI bindings
│   ├── synapse-inference/ # Inference engine
│   │   ├── config/        # Model config (JSON + HuggingFace format parser)
│   │   ├── registry/      # Pluggable components: attention, norm, FFN, position
│   │   ├── model/         # CausalLM, DecoderLayer, ModelBuilder
│   │   ├── generation/    # Pipeline, sampler, stopping conditions
│   │   ├── weight_loading/# Safetensors + GGUF, weight mapping per model
│   │   ├── tokenizer/     # BPE tokenizer (HuggingFace format)
│   │   ├── kv_cache/      # Pre-allocated KV-cache with append/slice
│   │   ├── quantization/  # INT8 per-channel quantization
│   │   └── metal/         # Apple Metal GPU backend (shaders + dispatch)
│   ├── synapse-nn/        # Neural network layers (training)
│   ├── synapse-autograd/  # Automatic differentiation
│   └── synapse-train/     # Training loop, optimizers
├── tests/
│   ├── integration/       # E2E inference, KV-cache, quantization accuracy
│   └── benchmarks/        # Throughput, memory, SIMD vs naive comparisons
├── examples/
│   ├── qwen3_chat.rs      # Interactive chat with real or demo models
│   └── model_benchmark.rs # Benchmark any model via config
└── configs/               # Model configs: Qwen3-0.6B, LLaMA-3.2-1B, Mistral-7B
```
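As one example of what lives in these modules, INT8 per-channel quantization (one scale per output channel, mapping the channel's max absolute weight onto the int8 range) can be sketched as below. Function names are illustrative, not the crate's real API.

```rust
// Sketch of per-channel INT8 quantization: each weight row (output channel)
// gets its own scale, chosen so the largest |w| maps to 127.
fn quantize_row_int8(row: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = row.iter().fold(0.0f32, |m, &w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = row
        .iter()
        .map(|&w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

// Dequantize a row back to f32 with its per-channel scale.
fn dequantize_row(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}
```

Per-channel scales keep outlier channels from crushing the precision of the rest of the weight matrix, which is why they are preferred over a single per-tensor scale.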

Component Registry

Every architectural element is a pluggable trait with config-driven instantiation:

| Component | Variants |
| --- | --- |
| Attention | GQA, MHA, MQA, SlidingWindow |
| Normalization | RMSNorm, LayerNorm |
| FFN | SwiGLU, GELU, GeGLU |
| Position | RoPE, Learned, Sinusoidal |
| Quantization | F32, F16, INT8 |
| Weights | Safetensors, GGUF |

Adding a new model amounts to writing its config JSON plus a weight mapper; no engine changes are required.
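The registry pattern can be sketched as trait objects behind config-keyed factories. All names below (`Normalization`, `build_registry`, the string keys) are hypothetical, not Synapse's actual types; this only illustrates the config-driven instantiation idea.

```rust
use std::collections::HashMap;

// Illustrative component trait; each variant from the table implements it.
trait Normalization {
    fn name(&self) -> &'static str;
}

struct RmsNorm;
struct LayerNorm;

impl Normalization for RmsNorm {
    fn name(&self) -> &'static str { "rms_norm" }
}
impl Normalization for LayerNorm {
    fn name(&self) -> &'static str { "layer_norm" }
}

// Factories keyed by the string a model's config JSON would name.
type NormFactory = fn() -> Box<dyn Normalization>;

fn build_registry() -> HashMap<&'static str, NormFactory> {
    let mut reg: HashMap<&'static str, NormFactory> = HashMap::new();
    reg.insert("rms_norm", || Box::new(RmsNorm) as Box<dyn Normalization>);
    reg.insert("layer_norm", || Box::new(LayerNorm) as Box<dyn Normalization>);
    reg
}

// Config-driven instantiation: look the component up by its config key.
fn build_norm(
    reg: &HashMap<&'static str, NormFactory>,
    cfg_norm: &str,
) -> Option<Box<dyn Normalization>> {
    reg.get(cfg_norm).map(|f| f())
}
```

With this shape, a new architecture only has to name already-registered variants in its config; the engine never needs to know about the model ahead of time.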

Quick Start

```sh
# Demo mode (random weights, no downloads)
cargo run --example qwen3_chat --release -- --demo

# Real Qwen3-0.6B (download model first)
# pip install huggingface_hub
# python3 -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3-0.6B', local_dir='/tmp/qwen3-0.6b')"
cargo run --example qwen3_chat --release -- --model-dir /tmp/qwen3-0.6b

# Run benchmarks
cargo run --example model_benchmark --release -- --full-scale

# Run tests
cargo test -p synapse-inference
cargo test --test inference_e2e

# Benchmark vs llama.cpp
./bench_vs_llamacpp.sh
```

Development Phases

| Phase | Status | What |
| --- | --- | --- |
| Phase 1 | Done | Zig SIMD tensor engine, Rust autograd, training framework (~30k lines) |
| Phase 2 | Done | Transformer stack, attention kernels, LayerNorm, RoPE (~15k lines) |
| Phase 3 | Done | Inference engine, component registry, INT8 quantization, Qwen3 support (~14k lines) |
| Phase 4 | Done | Wire SIMD kernels, KV-cache, Metal GPU shaders, benchmark harness |
| Phase 4.5 | TODO | Wire Metal shaders into forward path, fix output correctness |
| Phase 5 | Planned | Q4_K block quantization, Flash Attention, simdgroup_matrix, match llama.cpp |

TODO (Next Steps)

Correctness (blocking)

  • Debug gibberish output — forward pass produces wrong tokens with real Qwen3 weights
    • Likely: attention masking, RoPE application, or weight loading order bug
    • Test: compare logits at each layer against HuggingFace reference implementation
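That layer-by-layer comparison can be sketched as a small helper, assuming per-layer activations can be dumped from both this engine and the reference run; the function name and tolerance are illustrative.

```rust
// Debugging sketch: find the first layer whose activations diverge from a
// reference dump (e.g. exported from the HuggingFace implementation).
// `layers` pairs this engine's per-layer outputs with the reference ones.
fn first_divergent_layer(
    layers: &[(Vec<f32>, Vec<f32>)],
    tol: f32,
) -> Option<(usize, f32)> {
    for (i, (ours, theirs)) in layers.iter().enumerate() {
        let max_diff = ours
            .iter()
            .zip(theirs)
            .map(|(a, b)| (a - b).abs())
            .fold(0.0f32, f32::max);
        if max_diff > tol {
            return Some((i, max_diff)); // first layer exceeding tolerance
        }
    }
    None
}
```

Bisecting to the first divergent layer separates weight-loading bugs (wrong from layer 0) from masking or RoPE bugs (wrong only once attention mixes positions).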

Performance (Phase 4.5 — tighten native path)

  • Improve Metal/native path consistency with the published benchmark surface
  • Add reproducible benchmark generation so public numbers come from a single manifest
  • Keep browser and native claims explicitly separated in docs and site copy
  • Target: make the edge/native story coherent before chasing broad parity claims

Multi-Model Support (Phase 3.5)

  • Generic weight mappers for LLaMA 3.2, Mistral 7B, Phi-3
  • SentencePiece tokenizer (for LLaMA/Mistral)
  • Config parser: dynamic norm/FFN detection (LayerNorm for Phi-3, GELU FFN)
  • Sliding window attention kernel (Mistral)
  • Engine auto-detection from HuggingFace config.json model_type field
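The auto-detection step above can be sketched as pulling the `model_type` string out of a HuggingFace `config.json`. To stay dependency-free this sketch uses naive string search; a real implementation would use a proper JSON parser, and the function name is hypothetical.

```rust
// Sketch: extract the `model_type` value from a HuggingFace config.json so
// the engine can pick a weight mapper. Naive string scan for illustration
// only; production code should parse the JSON properly.
fn detect_model_type(config_json: &str) -> Option<&str> {
    let key = "\"model_type\"";
    let start = config_json.find(key)? + key.len();
    let rest = &config_json[start..];
    let open = rest.find('"')? + 1;           // opening quote of the value
    let close = open + rest[open..].find('"')?; // closing quote
    Some(&rest[open..close])
}
```

The returned string ("qwen3", "llama", "mistral", ...) would then index into a mapper registry, mirroring the component registry pattern.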

Performance (Phase 5 — match llama.cpp)

  • Q4_K block quantization (4-bit weights, ~4x less memory bandwidth)
  • Metal simdgroup_matrix (hardware matrix multiply on M-series)
  • Flash Attention with tiled online softmax
  • Kernel fusion (RMSNorm + matmul, attention + softmax in single dispatch)
  • GGUF native inference (skip safetensors → f32 conversion)
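Of the items above, the tiled online softmax is the core trick Flash Attention builds on: a single streaming pass that maintains a running max and a running normalizer, rescaling the normalizer whenever the max grows. A minimal scalar sketch (not a fused kernel):

```rust
// Online (streaming) softmax: one pass over the scores keeping a running
// max `m` and running denominator `d`. When a new max appears, the old
// denominator is rescaled into the new max's frame. Flash Attention applies
// this same recurrence per tile so the full score row never materializes.
fn online_softmax(xs: &[f32]) -> Vec<f32> {
    let (mut m, mut d) = (f32::NEG_INFINITY, 0.0f32);
    for &x in xs {
        let new_m = m.max(x);
        d = d * (m - new_m).exp() + (x - new_m).exp();
        m = new_m;
    }
    xs.iter().map(|&x| (x - m).exp() / d).collect()
}
```

The same rescaling applied to partial attention outputs is what lets the tiled kernel stay numerically stable without a separate max-finding pass.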

Built With

  • Rust — inference engine, autograd, training framework
  • Zig — SIMD kernels (ARM NEON), custom allocators, FFI exports
  • Metal Shading Language — GPU compute shaders for Apple Silicon
  • Swarm development — Phases 1–4 built using attoswarm parallel agent orchestration

About

Modular LLM inference engine in Rust + Zig SIMD kernels. Runs on desktop (Metal GPU), browser (WASM), and ESP32. INT8/Q4 quantization, speculative decoding, multi-model support.
