
feat: Apple Silicon (MPS) local generation support#99

Open
Sergio Gil Jiménez (jimeneztion) wants to merge 3 commits into Lightricks:main
from jimeneztion:feature/mps-apple-silicon-support

Conversation

@jimeneztion

Summary

Enables local AI video generation on Apple Silicon Macs. Previously, all macOS
users were forced to use the LTX cloud API regardless of hardware. With this
change, users with ≥15 GB of unified memory can run full local inference on
their GPU via Metal Performance Shaders (MPS).

The work has three layers:

1. Low-level MPS compatibility patches (backend/services/patches/)

The upstream ltx_pipelines library is CUDA-first and makes several API calls
that don't exist on MPS. We monkey-patch at server startup to fix them without
forking upstream:

  • mps_gpu_model_fix.py — replaces CUDA stream-based async layer streaming
    with a synchronous MPS-aware wrapper. On CUDA, layers are streamed to GPU
    using async CUDA streams for overlap; on MPS we do it synchronously since
    Metal has no equivalent primitive.
  • mps_layer_streaming_fix.py — skips pin_memory() when moving layers to
    MPS. Pinned host memory is CUDA-only; calling it on MPS silently corrupts
    tensors.
  • mps_vocoder_fix.py — fixes a float32 dtype mismatch in the vocoder. MPS
    autocast doesn't support float32, so we temporarily cast model weights
    instead of relying on autocast.
  • safetensors_loader_fix.py — sets non_blocking=False and copy=False
    when moving memory-mapped tensors to MPS. Async transfers on mmap buffers
    can segfault on Metal.
  • ltx_text_encoder.py — inlines device-aware memory cleanup to avoid
    unconditional torch.cuda.synchronize() calls.
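These patches share one pattern: branch on the target device and fall back to a
conservative, synchronous path on MPS. A minimal sketch of the idea behind
mps_layer_streaming_fix.py and safetensors_loader_fix.py follows — the function
name and duck-typed tensor interface are illustrative, not the actual patch code:

```python
def safe_to_device(tensor, device):
    """Move a tensor to `device` without CUDA-only fast paths.

    Works with any object exposing torch-like `.pin_memory()` and
    `.to(device, **kwargs)` methods.
    """
    if device.type == "mps":
        # pin_memory() is CUDA-only, and async (non_blocking) copies of
        # memory-mapped safetensors buffers can segfault on Metal, so
        # force a plain synchronous move without an extra copy.
        return tensor.to(device, non_blocking=False, copy=False)
    # CUDA path: pinned host memory enables a truly async transfer.
    return tensor.pin_memory().to(device, non_blocking=True)
```

At server startup the real patches monkey-patch the upstream call sites so the
MPS branch is taken automatically, with no fork of ltx_pipelines.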

2. Local generation policy (backend/runtime_config/runtime_policy.py)

  • Before: Darwin → always API-only.
  • After: Darwin with ≥15 GB unified memory → local generation allowed.
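The policy change reduces to a single predicate. A sketch, with illustrative
names (the real check lives in runtime_policy.py, and the non-macOS behavior is
assumed permissive here for illustration):

```python
MIN_UNIFIED_MEMORY_GB = 15  # threshold described in this PR

def local_generation_allowed(system: str, total_memory_gb: float) -> bool:
    """Return True when local (on-device) generation is permitted.

    Before this change, Darwin was unconditionally API-only; now it
    qualifies once unified memory meets the threshold.
    """
    if system == "Darwin":
        return total_memory_gb >= MIN_UNIFIED_MEMORY_GB
    # Non-macOS platforms keep whatever policy they already had.
    return True
```

In practice the first argument would come from platform.system() and the second
from the machine's total unified memory.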

All four pipelines (fast, a2v, ic_lora, retake) set
streaming_prefetch_count=1 on MPS (synchronous) instead of 2 (async with
CUDA streams). This prevents OOM when streaming the transformer without CUDA
stream overlap.
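The prefetch choice above amounts to picking a depth per backend. A sketch with
a hypothetical helper name (the value is what gets passed as
streaming_prefetch_count to each pipeline):

```python
def streaming_prefetch_count(device_type: str) -> int:
    # With CUDA streams, prefetching 2 layers overlaps copy and compute.
    # MPS has no async stream primitive, so prefetching ahead only holds
    # extra layers in unified memory and risks OOM; keep it at 1.
    return 2 if device_type == "cuda" else 1
```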

3. Warmup skip on MPS (health_handler.py, pipelines_handler.py)

A full inference warmup at startup takes several minutes on a cold Metal device
and blocks all generation requests. On MPS we therefore skip warmup: the
pipeline loads, and the first real generation acts as the warmup pass. The CUDA
warmup flow is preserved, and its state machine now correctly handles
WARMING → WARM transitions, so requests that arrive during warmup wait instead
of failing.
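The warmup states described above can be sketched compactly (enum and helper
names are illustrative, not the actual handler code):

```python
from enum import Enum, auto

class WarmupState(Enum):
    COLD = auto()     # not yet warmed up
    WARMING = auto()  # warmup inference in flight (CUDA only)
    WARM = auto()     # ready to serve requests

def initial_state(device_type: str) -> WarmupState:
    # MPS skips the multi-minute cold-start warmup; the first real
    # generation warms the Metal device instead.
    return WarmupState.WARM if device_type == "mps" else WarmupState.COLD

def should_wait(state: WarmupState) -> bool:
    # Requests arriving mid-warmup wait for the WARMING -> WARM
    # transition instead of failing.
    return state is WarmupState.WARMING
```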

How to test

  1. Apple Silicon Mac (M1/M2/M3/M4) with ≥15 GB unified memory
  2. pnpm dev
  3. Settings → disable "Use LTX API"
  4. Generate a video — should run locally on MPS, no cloud API call
  5. Logs should show [t2v] Generation started (model=fast, ...) with no CUDA errors
  6. pnpm typecheck && pnpm backend:test
