Fused Triton kernels for Transformer inference: RMSNorm+RoPE, Gated MLP, FP8 GEMM — CPU-testable references, autotuning, and benchmarking
Updated May 25, 2026 · Python
A specialized compiler that optimizes deep learning models for AI accelerators with operator fusion, memory optimization, and hardware-specific passes.
Native Rust edge inference engine with zero-copy memmap2 tensor loading, register-fused Linear+ReLU kernels, and scenario-aware MoE routing via rayon work-stealing, achieving 352 µs (lightweight) and 1.39 ms (dense) expert execution.
TensorMorph is an AI-assisted MLIR compiler for TOSA graph optimization and operator fusion.
C++17 ONNX inference optimizer and CPU runtime for Apple Silicon. Operator fusion via IR passes and Accelerate AMX-backed sgemm; benchmarked against the ONNX Runtime CPU execution provider on DistilBERT (1.26x baseline speedup, 6.99x ORT on raw MatMul).