Problem
Lattice's Metal GPU decode throughput shows 13.8% run-to-run variance and ~25% absolute degradation when other GPU processes are running concurrently. MLX handles the same contention with only 2.8% variance and negligible throughput loss.
Evidence
Precise benchmark (15 runs, 2 warmup, trimmed mean, N1=64 N2=512, Qwen3.5-0.8B Q8, M2 Max):
With 10 concurrent li play GPU processes:
| Engine |
tok/s |
±95% CI |
Spread |
| Lattice |
117.8 |
±3.9 |
13.8% |
| MLX |
258.6 |
±3.7 |
2.8% |
| Ollama |
81.7 |
±0.6 |
2.5% |
Without concurrent load (earlier measurement):
| Engine |
tok/s |
| Lattice |
150 |
| MLX |
260 |
Lattice drops 150→118 (21% loss) while MLX stays 260→259 (0.4% loss).
Root cause analysis
Our Metal compute shaders use dispatch_thread_groups with many small threadgroups (~378 dispatches per decode step). Under GPU contention, the Apple GPU scheduler deprioritizes our compute work. MLX uses Apple's MPS/MPSGraph framework which has higher scheduling priority and better GPU resource management.
Contributing factors:
- ~378 Metal dispatches per forward step (each has scheduling overhead)
- No Metal GPU priority hints (
.priority on command queues is not used)
StorageModeShared buffers may compete with other processes for cache coherency
- The lm_head GEMV (N=248320) creates 124K threadgroups, flooding the scheduler
Potential fixes
- Set Metal command queue priority: Use
MTLCommandQueue priority hints
- Reduce dispatch count: Fuse more operations into fewer kernels
- Use
MTLResourceStorageModePrivate for read-only weight buffers
- Investigate MPS integration for GEMV operations
- Double-buffer command encoding: Overlap GPU execution with CPU encoding
Benchmark script
scripts/bench_apples_precise.sh — 15-run trimmed-mean benchmark with warmup, spread%, and 95% CI.
Problem
Lattice's Metal GPU decode throughput shows 13.8% run-to-run variance and ~25% absolute degradation when other GPU processes are running concurrently. MLX handles the same contention with only 2.8% variance and negligible throughput loss.
Evidence
Precise benchmark (15 runs, 2 warmup, trimmed mean, N1=64 N2=512, Qwen3.5-0.8B Q8, M2 Max):
With 10 concurrent
li playGPU processes:Without concurrent load (earlier measurement):
Lattice drops 150→118 (21% loss) while MLX stays 260→259 (0.4% loss).
Root cause analysis
Our Metal compute shaders use
dispatch_thread_groupswith many small threadgroups (~378 dispatches per decode step). Under GPU contention, the Apple GPU scheduler deprioritizes our compute work. MLX uses Apple's MPS/MPSGraph framework which has higher scheduling priority and better GPU resource management.Contributing factors:
.priorityon command queues is not used)StorageModeSharedbuffers may compete with other processes for cache coherencyPotential fixes
MTLCommandQueuepriority hintsMTLResourceStorageModePrivatefor read-only weight buffersBenchmark script
scripts/bench_apples_precise.sh— 15-run trimmed-mean benchmark with warmup, spread%, and 95% CI.