Context
MLX (Apple, MIT-licensed) achieves ~27% memory bandwidth utilization on M2 Max vs lattice's ~18%. Their Metal compute shaders are plain .metal files at mlx/backend/metal/kernels/, written against the same public Metal API we use.
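For reference, a minimal sketch of how a utilization figure like the ones above could be computed. This assumes utilization means bytes read per decoded token times tokens per second, divided by peak bandwidth; the issue does not state how the 27% and 18% were measured, and `bandwidth_utilization` is a hypothetical helper, not part of lattice or MLX.

```rust
/// Fraction of peak memory bandwidth used during decode.
/// Assumption: every decoded token streams `bytes_per_token` bytes from memory once.
pub fn bandwidth_utilization(
    bytes_per_token: f64,
    tokens_per_sec: f64,
    peak_bytes_per_sec: f64,
) -> f64 {
    assert!(peak_bytes_per_sec > 0.0, "peak bandwidth must be positive");
    bytes_per_token * tokens_per_sec / peak_bytes_per_sec
}
```

With the M2 Max peak of 400 GB/s and purely illustrative figures of 4 GB read per token at 27 tokens/s, `bandwidth_utilization(4.0e9, 27.0, 400.0e9)` comes out at about 0.27. The byte count and token rate are made up to show the arithmetic, not measurements.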
Goal
Study MLX's Metal shader techniques and adapt applicable patterns to lattice's crates/inference/src/forward/metal_qwen35.rs. No dependency on MLX — port the ideas, not the code.
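Key kernels to study
gemv.metal / gemv_masked.metal
scaled_dot_product_attention.metal
What to study: Flash-style attention in Metal compute, shared (threadgroup) memory usage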
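As a CPU-side reference for the flash-style idea, here is a minimal sketch of single-query (decode) attention computed in one streaming pass over the KV cache, without materializing the full score vector. It is not taken from MLX or lattice; `attention_streaming` and its flat row-major layout are assumptions for illustration, and a Metal version would tile keys and values through threadgroup memory rather than loop serially.

```rust
/// Single-query attention over cached keys/values of head dimension `d`.
/// `keys` and `values` are flat, row-major: n rows of d floats each (assumed layout).
/// A running max and running denominator are updated per key, so no
/// n-length score buffer is needed.
pub fn attention_streaming(q: &[f32], keys: &[f32], values: &[f32], d: usize) -> Vec<f32> {
    assert!(d > 0 && q.len() == d);
    assert!(!keys.is_empty() && keys.len() == values.len() && keys.len() % d == 0);
    let scale = 1.0 / (d as f32).sqrt();
    let mut m = f32::NEG_INFINITY; // running max of scores seen so far
    let mut denom = 0.0f32;        // running sum of exp(score - m)
    let mut acc = vec![0.0f32; d]; // running weighted sum of value rows
    for (k, v) in keys.chunks_exact(d).zip(values.chunks_exact(d)) {
        let score = q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() * scale;
        let m_new = m.max(score);
        // Rescale everything accumulated under the old max.
        let correction = (m - m_new).exp();
        let p = (score - m_new).exp();
        denom = denom * correction + p;
        for (a, &vv) in acc.iter_mut().zip(v) {
            *a = *a * correction + p * vv;
        }
        m = m_new;
    }
    acc.iter().map(|a| a / denom).collect()
}
```

A reference like this can be used to check a Metal kernel's output within a tolerance before benchmarking it.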
quantized.metal
Lattice equivalent: Q8_0/Q4_0 dequant+GEMV
What to study: dequantization fused into GEMV, block format handling
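To make "dequantization fused into GEMV" concrete, a minimal CPU sketch follows. It assumes a GGML-style Q8_0 layout of 32 int8 quants per block sharing one scale, with the scale widened to f32 for simplicity. `BlockQ8`, `quantize_q8` and `gemv_q8_fused` are hypothetical names, not lattice or MLX APIs.

```rust
/// Assumed block size for a Q8_0-style format.
const QK: usize = 32;

/// Hypothetical Q8_0-style block: 32 signed 8-bit quants sharing one scale.
#[derive(Clone)]
pub struct BlockQ8 {
    pub scale: f32,
    pub quants: [i8; QK],
}

/// Quantize a slice (length a multiple of QK) into blocks. Only used to build inputs.
pub fn quantize_q8(x: &[f32]) -> Vec<BlockQ8> {
    assert!(x.len() % QK == 0);
    x.chunks_exact(QK)
        .map(|chunk| {
            let amax = chunk.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            let scale = amax / 127.0;
            let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
            let mut quants = [0i8; QK];
            for (q, v) in quants.iter_mut().zip(chunk) {
                *q = (v * inv).round() as i8;
            }
            BlockQ8 { scale, quants }
        })
        .collect()
}

/// Fused dequant + GEMV: y[r] = sum over c of W[r][c] * x[c], with W stored
/// row-major as Q8 blocks. Weights are never expanded into an f32 buffer;
/// each block's dot product is accumulated and multiplied by its scale once.
pub fn gemv_q8_fused(w: &[BlockQ8], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert!(cols % QK == 0 && x.len() == cols);
    let blocks_per_row = cols / QK;
    assert!(w.len() == rows * blocks_per_row);
    let mut y = vec![0.0f32; rows];
    for r in 0..rows {
        let mut acc = 0.0f32;
        for b in 0..blocks_per_row {
            let blk = &w[r * blocks_per_row + b];
            let xs = &x[b * QK..(b + 1) * QK];
            let mut partial = 0.0f32;
            for (q, xv) in blk.quants.iter().zip(xs) {
                partial += (*q as f32) * xv;
            }
            acc += blk.scale * partial; // one scale multiply per block
        }
        y[r] = acc;
    }
    y
}
```

The point of the fusion is that the quantized weights are read once and the scale is applied per block rather than per element, which is what matters for a bandwidth-bound decode path.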
softmax.metal
Lattice equivalent: softmax in decode path
What to study: online softmax, numerical stability tricks
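For the analysis doc, a minimal sketch of one common online-softmax formulation: the running max and the running denominator are updated together in a single pass, and the denominator is rescaled whenever the max changes. This is a CPU reference under that assumption, not MLX's kernel; `online_softmax` is a hypothetical name.

```rust
/// Numerically stable softmax with the max and the denominator computed in one pass.
pub fn online_softmax(x: &[f32]) -> Vec<f32> {
    if x.is_empty() {
        return Vec::new();
    }
    let mut m = f32::NEG_INFINITY; // running max
    let mut d = 0.0f32;            // running sum of exp(x_i - m)
    for &v in x {
        let m_new = m.max(v);
        // Rescale the old sum to the new max, then add the new term.
        d = d * (m - m_new).exp() + (v - m_new).exp();
        m = m_new;
    }
    // Second pass only normalizes; subtracting m keeps exp() from overflowing.
    x.iter().map(|&v| (v - m).exp() / d).collect()
}
```

Compared with the usual three-pass version (max, sum, normalize), this reads the input twice instead of three times, which is the part relevant to a bandwidth-bound decode path. Inputs are assumed finite.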
normalization.metal
Lattice equivalent: RMSNorm/LayerNorm Metal path
What to study: parallel reduction, warp-level (SIMD-group) primitives
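The reduction pattern can be sketched on the CPU as follows: each simulated lane accumulates a strided partial sum of squares, then the partials are combined in a halving tree. This is an illustrative model only. `LANES = 32` is an assumed SIMD-group width, the function names are hypothetical, and on the GPU the tree steps would typically be replaced by SIMD-group functions such as `simd_sum`.

```rust
/// Assumed lane count for the simulated reduction; must be a power of two.
const LANES: usize = 32;

/// Sum of squares using per-lane strided partials followed by a tree reduction.
pub fn sum_squares_tree(x: &[f32]) -> f32 {
    let mut partial = [0.0f32; LANES];
    // Phase 1: lane `i % LANES` accumulates its strided share of the input.
    for (i, &v) in x.iter().enumerate() {
        partial[i % LANES] += v * v;
    }
    // Phase 2: halve the number of active lanes each step.
    let mut stride = LANES / 2;
    while stride > 0 {
        for lane in 0..stride {
            partial[lane] += partial[lane + stride];
        }
        stride /= 2;
    }
    partial[0]
}

/// RMSNorm: x_i * weight_i / sqrt(mean(x^2) + eps).
pub fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len());
    assert!(!x.is_empty());
    let mean_sq = sum_squares_tree(x) / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(&v, &w)| v * inv_rms * w).collect()
}
```

Because the tree changes the order of floating-point additions relative to a serial sum, comparisons against a GPU kernel should use a tolerance rather than exact equality.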
What we CAN'T port
MLX's C++ dispatch layer uses MPSGraph and MPSMatrixMultiplication which access Apple's AMX hardware blocks via private frameworks. This is NOT in their .metal files. The ~10% gap attributable to AMX acceleration cannot be closed through shader study alone.
Deliverables
Analysis doc: per-kernel comparison of MLX vs lattice approach, with specific line references
PRs implementing applicable optimizations (each with make bench-compare before/after)
Priority
P1 — directly blocks closing the MLX decode throughput gap.
References
lib/mlx/backend/metal/kernels/)