perf(metal): study MLX Metal kernels for GEMV/attention optimization #85

## Context

MLX (Apple, MIT-licensed) achieves ~27% memory bandwidth utilization on M2 Max vs lattice's ~18%. Their Metal compute shaders are plain `.metal` files at `mlx/backend/metal/kernels/` — same public Metal API we use.
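For reference when reproducing these figures, here is a minimal sketch of one common way to estimate bandwidth utilization for batch-1 decode. It assumes every decoded token streams all weight bytes from memory exactly once, and it ignores KV-cache and activation traffic. The issue does not state how the ~27% and ~18% numbers were measured, so this model is an assumption. The function names and the 4 GB weight size in the usage note are hypothetical. 400 GB/s is the nominal M2 Max peak.

```rust
/// Estimated fraction of peak memory bandwidth used during decode.
/// Assumption: each token reads all `weight_bytes` once (batch size 1).
fn bandwidth_utilization(weight_bytes: f64, tokens_per_sec: f64, peak_bytes_per_sec: f64) -> f64 {
    weight_bytes * tokens_per_sec / peak_bytes_per_sec
}

/// Inverse of the estimate above: decode rate needed to reach a target utilization.
fn required_tokens_per_sec(weight_bytes: f64, target_utilization: f64, peak_bytes_per_sec: f64) -> f64 {
    target_utilization * peak_bytes_per_sec / weight_bytes
}
```

Under this model, a hypothetical 4 GB weight set decoding at 18 tok/s on a 400 GB/s part gives `bandwidth_utilization(4.0e9, 18.0, 400.0e9)` of about 0.18, and reaching 0.27 would need roughly 27 tok/s.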

## Goal

Study MLX's Metal shader techniques and adapt applicable patterns to lattice's `crates/inference/src/forward/metal_qwen35.rs`. No dependency on MLX — port the ideas, not the code.

## Key kernels to study

| MLX kernel | Lattice equivalent | What to look for |
|---|---|---|
| `gemv.metal` / `gemv_masked.metal` | Q8/Q4 GEMV in `metal_qwen35.rs` | Tiling strategy, threadgroup sizing, SIMD-group reductions |
| `scaled_dot_product_attention.metal` | decode_attention kernel | Flash-style attention in Metal compute, threadgroup (shared) memory usage |
| `quantized.metal` | Q8_0/Q4_0 dequant+GEMV | Dequantization fused into GEMV, block format handling |
| `softmax.metal` | softmax in decode path | Online softmax, numerical stability tricks |
| `normalization.metal` | RMSNorm/LayerNorm Metal path | Parallel reduction, SIMD-group primitives (Metal's equivalent of warp-level) |
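As a concrete reference for the "online softmax" entry in the table, the following is a CPU-side sketch in Rust of the single-pass technique: keep a running max and a running sum of `exp(x - max)`, rescaling the sum whenever the max grows. It is not taken from MLX or from lattice. All function names are hypothetical, and it assumes finite inputs (an all `-inf` input would produce NaN). The `merge_stats` helper illustrates how partial results from separate chunks, for example one per SIMD-group, could be combined. A reference like this may be useful for checking a ported kernel's output.

```rust
/// Single-pass softmax statistics: (running max, running sum of exp(x - max)).
/// Assumes finite inputs.
fn online_softmax_stats(xs: &[f32]) -> (f32, f32) {
    let mut max = f32::NEG_INFINITY;
    let mut sum = 0.0f32;
    for &x in xs {
        let new_max = max.max(x);
        // Rescale the old sum to the new max, then add the new term.
        sum = sum * (max - new_max).exp() + (x - new_max).exp();
        max = new_max;
    }
    (max, sum)
}

/// Combine the statistics of two disjoint chunks into those of their union.
fn merge_stats(a: (f32, f32), b: (f32, f32)) -> (f32, f32) {
    let m = a.0.max(b.0);
    (m, a.1 * (a.0 - m).exp() + b.1 * (b.0 - m).exp())
}

/// Numerically stable softmax built on the single-pass statistics.
fn softmax(xs: &[f32]) -> Vec<f32> {
    let (max, sum) = online_softmax_stats(xs);
    xs.iter().map(|&x| (x - max).exp() / sum).collect()
}
```

Because the max is subtracted before every `exp`, large logits such as `[1000.0, 1001.0, 1002.0]` do not overflow, which a naive `exp(x) / sum(exp(x))` would.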

## What we CAN'T port

MLX's C++ dispatch layer uses `MPSGraph` and `MPSMatrixMultiplication`, which access Apple's AMX hardware blocks via private frameworks. That path is NOT in their `.metal` files, so the ~10% gap attributable to AMX acceleration cannot be closed through shader study alone.

## Deliverables

1. Analysis doc: per-kernel comparison of MLX vs lattice approach, with specific line references
2. PRs implementing applicable optimizations (each with `make bench-compare` before/after)
3. Updated issue #77 with revised bandwidth utilization target

## Priority

P1 — directly blocks closing the MLX decode throughput gap.

## References

- MLX repo: github.com/ml-explore/mlx (`mlx/backend/metal/kernels/`)
- Issue #77: GPU contention variance
- Issue #84: cross-framework benchmark suite
- ADR-025: GPU backend design
