perf: Metal decode throughput degrades 25% under concurrent GPU load (vs 2.8% for MLX)

## Problem

Lattice's Metal GPU decode throughput shows 13.8% run-to-run variance and ~25% absolute degradation when other GPU processes are running concurrently. MLX handles the same contention with only 2.8% variance and negligible throughput loss.

## Evidence

Precise benchmark (15 runs, 2 warmup, trimmed mean, N1=64 N2=512, Qwen3.5-0.8B Q8, M2 Max):

With 10 concurrent `li play` GPU processes:
| Engine | tok/s | ±95% CI | Spread |
|---|---|---|---|
| Lattice | 117.8 | ±3.9 | 13.8% |
| MLX | 258.6 | ±3.7 | 2.8% |
| Ollama | 81.7 | ±0.6 | 2.5% |

Without concurrent load (earlier measurement):
| Engine | tok/s |
|---|---|
| Lattice | 150 |
| MLX | 260 |

Lattice drops 150→118 (21% loss) while MLX stays 260→259 (0.4% loss).

## Root cause analysis

Our Metal compute shaders use `dispatch_thread_groups` with many small threadgroups (~378 dispatches per decode step). Under GPU contention, the Apple GPU scheduler deprioritizes our compute work. MLX uses Apple's MPS/MPSGraph framework which has higher scheduling priority and better GPU resource management.

Contributing factors:
- ~378 Metal dispatches per forward step (each has scheduling overhead)
- No Metal GPU priority hints (`.priority` on command queues is not used)
- `StorageModeShared` buffers may compete with other processes for cache coherency
- The lm_head GEMV (N=248320) creates 124K threadgroups, flooding the scheduler

## Potential fixes

1. **Set Metal command queue priority**: Use `MTLCommandQueue` priority hints
2. **Reduce dispatch count**: Fuse more operations into fewer kernels
3. **Use `MTLResourceStorageModePrivate`** for read-only weight buffers
4. **Investigate MPS integration** for GEMV operations
5. **Double-buffer command encoding**: Overlap GPU execution with CPU encoding

## Benchmark script

`scripts/bench_apples_precise.sh` — 15-run trimmed-mean benchmark with warmup, spread%, and 95% CI.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: Metal decode throughput degrades 25% under concurrent GPU load (vs 2.8% for MLX) #77

Problem

Evidence

Root cause analysis

Potential fixes

Benchmark script

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

perf: Metal decode throughput degrades 25% under concurrent GPU load (vs 2.8% for MLX) #77

Description

Problem

Evidence

Root cause analysis

Potential fixes

Benchmark script

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions