
feat: Apple Silicon (MPS) local generation support#99

Open
Sergio Gil Jiménez (jimeneztion) wants to merge 3 commits into Lightricks:main
from jimeneztion:feature/mps-apple-silicon-support

Conversation

@jimeneztion

Summary

Enables local AI video generation on Apple Silicon Macs. Previously, all macOS
users were forced to use the LTX cloud API regardless of hardware. With this
change, users with ≥15 GB of unified memory can run full local inference on
their GPU via Metal Performance Shaders (MPS).

The work has three layers:

1. Low-level MPS compatibility patches (backend/services/patches/)

The upstream ltx_pipelines library is CUDA-first and makes several API calls
that don't exist on MPS. We monkey-patch at server startup to fix them without
forking upstream:

  • mps_gpu_model_fix.py — replaces CUDA stream-based async layer streaming
    with a synchronous MPS-aware wrapper. On CUDA, layers are streamed to GPU
    using async CUDA streams for overlap; on MPS we do it synchronously since
    Metal has no equivalent primitive.
  • mps_layer_streaming_fix.py — skips pin_memory() when moving layers to
    MPS. Pinned host memory is CUDA-only; calling it on MPS silently corrupts
    tensors.
  • mps_vocoder_fix.py — fixes a float32 dtype mismatch in the vocoder. MPS
    autocast doesn't support float32, so we temporarily cast model weights
    instead of relying on autocast.
  • safetensors_loader_fix.py — sets non_blocking=False and copy=False
    when moving memory-mapped tensors to MPS. Async transfers on mmap buffers
    can segfault on Metal.
  • ltx_text_encoder.py — inlines device-aware memory cleanup to avoid
    unconditional torch.cuda.synchronize() calls.
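These patches share one pattern: branch on the target device and fall back to a
conservative, synchronous path on MPS. A minimal sketch of the idea behind
mps_layer_streaming_fix.py and safetensors_loader_fix.py follows — the function
name and duck-typed tensor interface are illustrative, not the actual patch code:

```python
def safe_to_device(tensor, device):
    """Move a tensor to `device` without CUDA-only fast paths.

    Works with any object exposing torch-like `.pin_memory()` and
    `.to(device, **kwargs)` methods.
    """
    if device.type == "mps":
        # pin_memory() is CUDA-only, and async (non_blocking) copies of
        # memory-mapped safetensors buffers can segfault on Metal, so
        # force a plain synchronous move without an extra copy.
        return tensor.to(device, non_blocking=False, copy=False)
    # CUDA path: pinned host memory enables a truly async transfer.
    return tensor.pin_memory().to(device, non_blocking=True)
```

At server startup the real patches monkey-patch the upstream call sites so the
MPS branch is taken automatically, with no fork of ltx_pipelines.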

2. Local generation policy (backend/runtime_config/runtime_policy.py)

  • Before: Darwin → always API-only.
  • After: Darwin with ≥15 GB unified memory → local generation allowed.
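The policy change reduces to a single predicate. A sketch, with illustrative
names (the real check lives in runtime_policy.py, and the non-macOS behavior is
assumed permissive here for illustration):

```python
MIN_UNIFIED_MEMORY_GB = 15  # threshold described in this PR

def local_generation_allowed(system: str, total_memory_gb: float) -> bool:
    """Return True when local (on-device) generation is permitted.

    Before this change, Darwin was unconditionally API-only; now it
    qualifies once unified memory meets the threshold.
    """
    if system == "Darwin":
        return total_memory_gb >= MIN_UNIFIED_MEMORY_GB
    # Non-macOS platforms keep whatever policy they already had.
    return True
```

In practice the first argument would come from platform.system() and the second
from the machine's total unified memory.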

All four pipelines (fast, a2v, ic_lora, retake) set
streaming_prefetch_count=1 on MPS (synchronous) instead of 2 (async with
CUDA streams). This prevents OOM when streaming the transformer without CUDA
stream overlap.
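The prefetch choice above amounts to picking a depth per backend. A sketch with
a hypothetical helper name (the value is what gets passed as
streaming_prefetch_count to each pipeline):

```python
def streaming_prefetch_count(device_type: str) -> int:
    # With CUDA streams, prefetching 2 layers overlaps copy and compute.
    # MPS has no async stream primitive, so prefetching ahead only holds
    # extra layers in unified memory and risks OOM; keep it at 1.
    return 2 if device_type == "cuda" else 1
```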

3. Warmup skip on MPS (health_handler.py, pipelines_handler.py)

A full inference warmup at startup takes several minutes on a cold Metal device
and blocks all generation requests. On MPS we therefore skip warmup: the
pipeline loads, and the first real generation acts as the warmup pass. The CUDA
warmup flow is preserved, and its state machine now correctly handles
WARMING → WARM transitions, so requests that arrive during warmup wait instead
of failing.
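The warmup states described above can be sketched compactly (enum and helper
names are illustrative, not the actual handler code):

```python
from enum import Enum, auto

class WarmupState(Enum):
    COLD = auto()     # not yet warmed up
    WARMING = auto()  # warmup inference in flight (CUDA only)
    WARM = auto()     # ready to serve requests

def initial_state(device_type: str) -> WarmupState:
    # MPS skips the multi-minute cold-start warmup; the first real
    # generation warms the Metal device instead.
    return WarmupState.WARM if device_type == "mps" else WarmupState.COLD

def should_wait(state: WarmupState) -> bool:
    # Requests arriving mid-warmup wait for the WARMING -> WARM
    # transition instead of failing.
    return state is WarmupState.WARMING
```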

How to test

  1. Apple Silicon Mac (M1/M2/M3/M4) with ≥15 GB unified memory
  2. pnpm dev
  3. Settings → disable "Use LTX API"
  4. Generate a video — should run locally on MPS, no cloud API call
  5. Logs should show [t2v] Generation started (model=fast, ...) with no CUDA errors
  6. pnpm typecheck && pnpm backend:test
