perf(metal): study MLX Metal kernels for GEMV/attention optimization #85

@ohdearquant

Context

MLX (Apple, MIT-licensed) achieves ~27% memory bandwidth utilization on M2 Max vs lattice's ~18%. Their Metal compute shaders are plain .metal files at mlx/backend/metal/kernels/ — same public Metal API we use.
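
For reference, the utilization figures above can be related to decode throughput with a back-of-the-envelope model. This is only a sketch under stated assumptions: decode is treated as memory-bound, each token is assumed to stream a fixed number of bytes from DRAM exactly once, and the 2 GB-per-token figure in the usage note is hypothetical rather than a lattice measurement. The function names are illustrative, not part of lattice.

```rust
/// Fraction of peak memory bandwidth used during decode.
///
/// Assumption: every generated token streams `bytes_per_token` bytes
/// (weights plus KV cache) from DRAM exactly once.
fn bandwidth_utilization(
    bytes_per_token: f64,
    tokens_per_sec: f64,
    peak_bytes_per_sec: f64,
) -> f64 {
    (bytes_per_token * tokens_per_sec) / peak_bytes_per_sec
}

/// Decode rate implied by a target utilization under the same assumption.
fn tokens_per_sec_for(
    target_utilization: f64,
    bytes_per_token: f64,
    peak_bytes_per_sec: f64,
) -> f64 {
    target_utilization * peak_bytes_per_sec / bytes_per_token
}
```

With Apple's advertised 400 GB/s peak for M2 Max and a hypothetical 2 GB streamed per token, `tokens_per_sec_for(0.18, 2.0e9, 400.0e9)` gives about 36 tokens/s and `tokens_per_sec_for(0.27, 2.0e9, 400.0e9)` about 54 tokens/s, so under this model the utilization gap maps directly onto decode throughput.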

Goal

Study MLX's Metal shader techniques and adapt applicable patterns to lattice's crates/inference/src/forward/metal_qwen35.rs. No dependency on MLX — port the ideas, not the code.

Key kernels to study

| MLX kernel | Lattice equivalent | What to look for |
| --- | --- | --- |
| gemv.metal / gemv_masked.metal | Q8/Q4 GEMV in metal_qwen35.rs | Tiling strategy, threadgroup sizing, SIMD-group reductions |
| scaled_dot_product_attention.metal | decode_attention kernel | Flash-style attention in Metal compute, threadgroup (shared) memory usage |
| quantized.metal | Q8_0/Q4_0 dequant+GEMV | Dequantization fused into GEMV, block format handling |
| softmax.metal | softmax in decode path | Online softmax, numerical stability tricks |
| normalization.metal | RMSNorm/LayerNorm Metal path | Parallel reduction, SIMD-group (warp-level) primitives |
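
When comparing kernels, a small CPU reference can help check that a ported idea is numerically equivalent before it is written as a shader. The sketch below illustrates two of the techniques named above, online softmax and dequantization fused into GEMV, in plain Rust. It is not MLX's or lattice's code. The `BlockQ8` layout (32 int8 values sharing one scale, stored here as `f32` to stay within the standard library) is an assumption modeled on the common Q8_0 convention, and all names are hypothetical.

```rust
/// Online softmax statistics: one pass that tracks the running max and the
/// running sum of exp(x - max), rescaling the sum whenever the max grows.
fn online_softmax_stats(x: &[f32]) -> (f32, f32) {
    let mut max = f32::NEG_INFINITY;
    let mut sum = 0.0f32;
    for &v in x {
        let new_max = max.max(v);
        if new_max == f32::NEG_INFINITY {
            continue; // only masked (-inf) values so far; nothing to accumulate
        }
        // Rescale what was accumulated under the old max, then add the new term.
        sum = sum * (max - new_max).exp() + (v - new_max).exp();
        max = new_max;
    }
    (max, sum)
}

/// Softmax built on the single-pass statistics.
/// Assumes at least one finite input.
fn online_softmax(x: &[f32]) -> Vec<f32> {
    let (max, sum) = online_softmax_stats(x);
    x.iter().map(|&v| (v - max).exp() / sum).collect()
}

/// Conventional two-pass softmax, used only as a comparison baseline.
fn two_pass_softmax(x: &[f32]) -> Vec<f32> {
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let sum: f32 = x.iter().map(|&v| (v - max).exp()).sum();
    x.iter().map(|&v| (v - max).exp() / sum).collect()
}

/// Number of quantized values per block (assumed, following Q8_0 convention).
const QK: usize = 32;

/// Hypothetical Q8_0-style block: one scale shared by QK int8 values.
struct BlockQ8 {
    scale: f32,
    quants: [i8; QK],
}

/// GEMV with dequantization fused into the dot product: the int8 values are
/// widened and multiplied in the inner loop, and the scale is applied once per
/// block, so no dequantized copy of the weights is ever materialized.
/// Assumes `x.len()` equals `QK` times the number of blocks per row.
fn gemv_q8_fused(rows: &[Vec<BlockQ8>], x: &[f32]) -> Vec<f32> {
    rows.iter()
        .map(|row| {
            row.iter()
                .zip(x.chunks_exact(QK))
                .map(|(block, xs)| {
                    let dot: f32 = block
                        .quants
                        .iter()
                        .zip(xs)
                        .map(|(&q, &xv)| q as f32 * xv)
                        .sum();
                    block.scale * dot
                })
                .sum::<f32>()
        })
        .collect()
}
```

In a GPU kernel the per-row loops would be split across threads and combined with SIMD-group reductions; the `(max, sum)` pair from `online_softmax_stats` is the state that partial results would need to merge.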

What we CAN'T port

MLX's C++ dispatch layer uses MPSGraph and MPSMatrixMultiplication which access Apple's AMX hardware blocks via private frameworks. This is NOT in their .metal files. The ~10% gap attributable to AMX acceleration cannot be closed through shader study alone.

Deliverables

  1. Analysis doc: per-kernel comparison of MLX vs lattice approach, with specific line references
  2. PRs implementing applicable optimizations (each with make bench-compare before/after)
  3. Updated issue #77 ("perf: Metal decode throughput degrades 25% under concurrent GPU load (vs 2.8% for MLX)") with a revised bandwidth utilization target

Priority

P1 — directly blocks closing the MLX decode throughput gap.
