perf: Metal decode throughput degrades ~21% under concurrent GPU load (vs ~0.5% for MLX) #77

@ohdearquant

Problem

Lattice's Metal GPU decode throughput shows a 13.8% run-to-run spread and loses about 21% of its throughput (150 → 118 tok/s) when other GPU processes run concurrently. MLX handles the same contention with a 2.8% spread and negligible throughput loss.

Evidence

Precise benchmark (15 runs, 2 warmup, trimmed mean, N1=64 N2=512, Qwen3.5-0.8B Q8, M2 Max):

With 10 concurrent li play GPU processes:

| Engine  | tok/s | ±95% CI | Spread |
|---------|-------|---------|--------|
| Lattice | 117.8 | ±3.9    | 13.8%  |
| MLX     | 258.6 | ±3.7    | 2.8%   |
| Ollama  | 81.7  | ±0.6    | 2.5%   |

Without concurrent load (earlier measurement):

| Engine  | tok/s |
|---------|-------|
| Lattice | 150   |
| MLX     | 260   |

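Lattice drops from 150 to 117.8 tok/s (about 21% loss), while MLX moves from 260 to 258.6 tok/s (about 0.5% loss). The unloaded baselines come from an earlier run, so both percentages are approximate.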
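
The loss figures above follow from a plain relative-difference calculation. A minimal sketch, with `throughput_loss_pct` as a hypothetical helper name that is not part of Lattice or the benchmark script:

```rust
/// Relative throughput loss in percent when going from `baseline` tok/s
/// (no concurrent load) to `loaded` tok/s (with concurrent GPU processes).
/// Hypothetical helper for illustration only.
fn throughput_loss_pct(baseline: f64, loaded: f64) -> f64 {
    (baseline - loaded) / baseline * 100.0
}
```

Calling `throughput_loss_pct(150.0, 117.8)` gives roughly 21.5 for Lattice, and `throughput_loss_pct(260.0, 258.6)` gives roughly 0.5 for MLX.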

Root cause analysis

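Our Metal compute shaders use dispatch_thread_groups with many small threadgroups (~378 dispatches per decode step). Our working hypothesis is that, under GPU contention, the Apple GPU scheduler deprioritizes this compute work. MLX is assumed to go through Apple's MPS/MPSGraph framework, which may get higher scheduling priority and better GPU resource management; this has not been verified against MLX's Metal backend and should be checked before acting on it.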
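
To put the dispatch count in perspective, here is a back-of-envelope sketch of the average time available per dispatch. It assumes one decode step produces one token and that the step time is spread evenly over all dispatches, which real kernels will not follow; `per_dispatch_budget_us` is a hypothetical name.

```rust
/// Average wall-clock time per dispatch in microseconds.
/// Assumptions (simplifications, not measured): one decode step yields one
/// token, and the step time is divided evenly across `dispatches`.
fn per_dispatch_budget_us(tok_per_s: f64, dispatches: u32) -> f64 {
    let step_us = 1.0e6 / tok_per_s; // duration of one decode step
    step_us / dispatches as f64
}
```

With the numbers in this issue, `per_dispatch_budget_us(150.0, 378)` is about 17.6 µs and `per_dispatch_budget_us(117.8, 378)` is about 22.5 µs, so the observed slowdown corresponds to roughly 5 µs of extra time per dispatch on average.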

Contributing factors:

  • ~378 Metal dispatches per decode step (each carries scheduling overhead)
  • No GPU priority hints are set on our command queues (whether Metal exposes a usable priority control for MTLCommandQueue still needs to be confirmed)
  • Buffers use MTLStorageModeShared, which may add cache-coherency traffic when other processes are contending for the GPU
  • The lm_head GEMV (N=248320) creates ~124K threadgroups, flooding the scheduler
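
The ~124K figure for lm_head is consistent with each threadgroup computing two output rows (248320 / 2 = 124160). That mapping is inferred from the numbers and not confirmed against the kernel; the sketch below, using a hypothetical `threadgroup_count` helper, shows how the grid shrinks if each threadgroup handles more rows.

```rust
/// Number of threadgroups needed to cover `n_rows` output rows when each
/// threadgroup computes `rows_per_group` rows (ceiling division).
/// `rows_per_group` is an assumed tuning knob, not a known Lattice parameter.
fn threadgroup_count(n_rows: usize, rows_per_group: usize) -> usize {
    assert!(rows_per_group > 0, "rows_per_group must be non-zero");
    (n_rows + rows_per_group - 1) / rows_per_group
}
```

For N=248320, two rows per threadgroup gives 124160 threadgroups, eight rows gives 31040, and thirty-two rows gives 7760. Whether fewer, larger threadgroups actually behave better under contention would need to be measured.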

Potential fixes

  1. Set Metal command queue priority: check whether Metal exposes a priority hint for MTLCommandQueue and, if so, raise it for decode work
  2. Reduce dispatch count: Fuse more operations into fewer kernels
  3. Use MTLResourceStorageModePrivate for read-only weight buffers
  4. Investigate MPS integration for GEMV operations
  5. Double-buffer command encoding: Overlap GPU execution with CPU encoding
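
Fix 5 can be sketched without any Metal calls. The model below is a CPU-only simulation under stated assumptions: a consumer thread stands in for the GPU, `EncodedStep` stands in for an encoded command buffer, and a rendezvous channel limits the pipeline to two buffers (one executing, one encoded and waiting). All names are hypothetical and none of this is Lattice code.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Stand-in for an encoded command buffer; `step` identifies the decode step.
struct EncodedStep {
    step: usize,
    commands: Vec<u32>,
}

/// Hypothetical CPU-side encoding: builds the command list for one step.
fn encode_step(step: usize, dispatches: usize) -> EncodedStep {
    EncodedStep { step, commands: (0..dispatches as u32).collect() }
}

/// Runs `steps` decode steps with two buffers alive at most: one being
/// "executed" by the consumer thread and one encoded by the producer.
/// Returns the step indices in the order the consumer finished them.
fn run_double_buffered(steps: usize, dispatches: usize) -> Vec<usize> {
    // Capacity 0 makes `send` block until the consumer is ready to receive,
    // so the producer can be at most one encoded step ahead.
    let (tx, rx) = sync_channel::<EncodedStep>(0);

    // Consumer thread models the GPU draining command buffers in order.
    let gpu = thread::spawn(move || {
        let mut completed = Vec::new();
        for buf in rx {
            // Placeholder for GPU execution of the encoded commands.
            let _work: u64 = buf.commands.iter().map(|&c| c as u64).sum();
            completed.push(buf.step);
        }
        completed
    });

    // Producer models the CPU encoding step i+1 while step i executes.
    for step in 0..steps {
        tx.send(encode_step(step, dispatches)).expect("consumer hung up");
    }
    drop(tx); // closing the channel lets the consumer loop end

    gpu.join().expect("consumer thread panicked")
}
```

In a real Metal pipeline the same back-pressure is commonly built with a counting semaphore that is signaled from the command buffer's completion handler; the channel here only illustrates the overlap and the in-flight limit.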

Benchmark script

scripts/bench_apples_precise.sh — 15-run trimmed-mean benchmark with warmup, spread%, and 95% CI.
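
For readers without access to the script, the reported statistics can be approximated as follows. The definitions here are assumptions, not taken from the script: the trimmed mean drops the single lowest and highest run, the 95% CI uses a normal approximation, and spread is (max - min) / mean of the kept runs. `Summary` and `summarize` are hypothetical names.

```rust
/// Summary of one benchmark series (tok/s, except `spread_pct` in percent).
#[derive(Debug)]
struct Summary {
    trimmed_mean: f64,
    ci95_half_width: f64,
    spread_pct: f64,
}

/// Summarizes timed runs after discarding the first `warmup` samples.
/// Assumed definitions (not confirmed against the script): trim one minimum
/// and one maximum, normal-approximation CI, spread = (max - min) / mean.
fn summarize(samples: &[f64], warmup: usize) -> Option<Summary> {
    let mut runs: Vec<f64> = samples.iter().skip(warmup).copied().collect();
    if runs.len() < 4 {
        return None; // need at least two samples left after trimming
    }
    runs.sort_by(|a, b| a.total_cmp(b));
    let kept = &runs[1..runs.len() - 1]; // drop the single lowest and highest
    let n = kept.len() as f64;
    let mean = kept.iter().sum::<f64>() / n;
    // Sample variance with Bessel's correction.
    let var = kept.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    let ci95_half_width = 1.96 * var.sqrt() / n.sqrt();
    let spread_pct = (kept[kept.len() - 1] - kept[0]) / mean * 100.0;
    Some(Summary { trimmed_mean: mean, ci95_half_width, spread_pct })
}
```

With only around 15 runs, a Student's t quantile would give a somewhat wider and more honest interval than the 1.96 used here.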
