diff --git a/backends/mlx/pte_inspector.py b/backends/mlx/pte_inspector.py index df11a21f370..20ec9e9d9df 100644 --- a/backends/mlx/pte_inspector.py +++ b/backends/mlx/pte_inspector.py @@ -788,9 +788,9 @@ def main(): # noqa: C901 parser.add_argument( "--delegate-index", type=int, - default=None, + default=0, metavar="N", - help="Index of delegate to extract (0-based). If not specified, extracts first matching delegate.", + help="Index of delegate to extract (0-based, default: 0).", ) parser.add_argument( "--parse-mlx", diff --git a/docs/source/backends-overview.md b/docs/source/backends-overview.md index d1c48eb4032..f361a87e6b0 100644 --- a/docs/source/backends-overview.md +++ b/docs/source/backends-overview.md @@ -23,6 +23,7 @@ Backends are the bridge between your exported model and the hardware it runs on. | [XNNPACK](backends/xnnpack/xnnpack-overview.md) | All | CPU | General-purpose, fallback | | [CUDA](/backends/cuda/cuda-overview.md) | Linux/Windows | GPU | NVIDIA GPU acceleration | | [Core ML](/backends/coreml/coreml-overview.md) | iOS, macOS | NPU/GPU/CPU | Apple devices, high performance | +| [MLX](/backends/mlx/mlx-overview.md) | macOS | GPU | Apple Silicon GPU (MLX) | | [Metal Performance Shaders](/backends/mps/mps-overview.md) | iOS, macOS | GPU | Apple GPU acceleration | | [Vulkan ](/backends/vulkan/vulkan-overview.md) | Android | GPU | Android GPU acceleration | | [Qualcomm](backends-qualcomm) | Android | NPU | Qualcomm SoCs | @@ -55,6 +56,7 @@ Backends are the bridge between your exported model and the hardware it runs on. 
backends/xnnpack/xnnpack-overview backends/cuda/cuda-overview backends/coreml/coreml-overview +backends/mlx/mlx-overview backends/mps/mps-overview backends/vulkan/vulkan-overview backends-qualcomm diff --git a/docs/source/backends/mlx/mlx-op-support.md b/docs/source/backends/mlx/mlx-op-support.md new file mode 100644 index 00000000000..4f9acd576c3 --- /dev/null +++ b/docs/source/backends/mlx/mlx-op-support.md @@ -0,0 +1,10 @@ +# Op Support + +The MLX backend supports ~90 ATen operators plus multi-node fused patterns and custom ops. The partitioner automatically determines which ops in your model can be delegated to MLX. Unsupported ops fall back to ExecuTorch's portable CPU runtime. + +For the current list of supported operators and fused patterns, see the source: + +- **[ops.py](https://github.com/pytorch/executorch/blob/main/backends/mlx/ops.py)** — Single-op handlers (ATen op → MLX IR node) +- **[patterns.py](https://github.com/pytorch/executorch/blob/main/backends/mlx/patterns.py)** — Multi-node fused patterns (quantized linear, SDPA, KV cache, etc.) + +During lowering, the MLX partitioner prints a summary of supported and unsupported ops so you can see which ones are delegated and which fall back to CPU. diff --git a/docs/source/backends/mlx/mlx-overview.md b/docs/source/backends/mlx/mlx-overview.md new file mode 100644 index 00000000000..c22e4d7b67d --- /dev/null +++ b/docs/source/backends/mlx/mlx-overview.md @@ -0,0 +1,144 @@ +# MLX Backend + +The MLX delegate is the ExecuTorch backend for Apple Silicon GPUs via the [MLX](https://github.com/ml-explore/mlx) framework. It compiles PyTorch models into a custom FlatBuffer bytecode format at export time and executes them using MLX GPU primitives at runtime. + +::::{note} +The MLX delegate is experimental and under active development. +:::: + +## Features + +- GPU acceleration on Apple Silicon (M1 and later) via MLX. +- INT2/INT4/INT8 weight quantization via [TorchAO](https://github.com/pytorch/ao). 
+- Dynamic shape support. +- Mutable buffers for persistent state across inference calls (e.g., KV cache). +- Zero-copy constant loading on unified memory. + +## Target Requirements + +- Apple Silicon Mac (M1 or later) +- [macOS](https://developer.apple.com/macos) >= 14.0 + +## Development Requirements + +- [macOS](https://developer.apple.com/macos) on Apple Silicon (M1 or later) +- [Xcode](https://developer.apple.com/xcode/) (full installation, not just Command Line Tools — the Metal compiler is required) + +Verify the Metal compiler is available: + +```bash +xcrun -sdk macosx --find metal +``` + +If this prints a path (e.g., `/Applications/Xcode.app/.../metal`), you're set. If it errors, install Xcode from [developer.apple.com](https://developer.apple.com/xcode/), then switch the active developer directory: + +```bash +sudo xcode-select -s /Applications/Xcode.app/Contents/Developer +``` + +---- + +## Using the MLX Backend + +To target the MLX backend during export and lowering, pass an instance of `MLXPartitioner` to `to_edge_transform_and_lower`. The MLX backend also provides a set of graph optimization passes via `get_default_passes()` that should be passed as `transform_passes`. 
The example below demonstrates this process using MobileNet V2: + +```python +import torch +import torchvision.models as models +from torchvision.models.mobilenetv2 import MobileNet_V2_Weights +from executorch.backends.mlx import MLXPartitioner +from executorch.backends.mlx.passes import get_default_passes +from executorch.exir import to_edge_transform_and_lower + +mobilenet_v2 = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval() +sample_inputs = (torch.randn(1, 3, 224, 224), ) + +et_program = to_edge_transform_and_lower( + torch.export.export(mobilenet_v2, sample_inputs), + transform_passes=get_default_passes(), + partitioner=[MLXPartitioner()], +).to_executorch() + +with open("mv2_mlx.pte", "wb") as file: + et_program.write_to_file(file) +``` + +`get_default_passes()` includes RMSNorm fusion, consecutive view/permute/dtype-cast collapsing, no-op removal, and common subexpression elimination. These are recommended for all models and required for optimal LLM performance. + +::::{note} +The MLX backend is primarily designed for LLM and generative AI workloads on Apple Silicon. The MobileNet V2 example above is shown for simplicity, but in practice you would use this backend for models like Llama, Whisper, and other transformer-based architectures. See [LLM example](https://github.com/pytorch/executorch/tree/main/backends/mlx/examples/llm) for a more representative use case. +:::: + +See [Partitioner API](mlx-partitioner.md) for a reference on available partitioner options. + +---- + +## Quantization + +The MLX backend supports INT4, INT8, and NVFP4 weight quantization via TorchAO for both linear and embedding layers. This is particularly useful for LLM inference. See [MLX Quantization](mlx-quantization.md) for details. + +---- + +## Runtime Integration + +### Python (pybindings) + +The simplest way to get started is to install ExecuTorch with Python bindings. 
From the repo root: + +```bash +python install_executorch.py +``` + +On Apple Silicon, when the Metal compiler is available, the MLX backend is automatically included. You can then export models in Python using the MLX partitioner and run them via the ExecuTorch Python API. + +### C++ (CMake preset) + +To build the C++ runtime with the MLX delegate, use the `mlx-release` CMake workflow preset from the repo root: + +```bash +cmake --workflow --preset mlx-release +``` + +This configures and builds a Release build of the ExecuTorch runtime with the MLX delegate and installs artifacts into `cmake-out/`. The preset enables the MLX delegate along with commonly needed extensions (module, data loader, flat tensor, LLM runner, etc.). + +Downstream C++ apps can then `find_package(executorch)` and link against `mlxdelegate` and `mlx`. The `executorch_target_link_options_shared_lib` utility handles whole-archive linkage (required for static initializer registration) cross-platform, and `executorch_target_copy_mlx_metallib` copies the Metal kernel library next to the binary so MLX can find it at runtime: + +```cmake +# CMakeLists.txt +find_package(executorch REQUIRED) + +# Link MLX delegate (with whole-archive for static initializer registration) +target_link_libraries(my_target PRIVATE mlxdelegate mlx) +executorch_target_link_options_shared_lib(mlxdelegate) + +# Copy mlx.metallib next to the binary for runtime +executorch_target_copy_mlx_metallib(my_target) +``` + +No additional steps are necessary to use the backend beyond linking the target. An MLX-delegated `.pte` file will automatically run on the registered backend. 
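Once linked, running an MLX-delegated program from C++ looks like running any other ExecuTorch model. Below is a minimal sketch using the `Module` extension; the model filename and input shape are assumptions (a MobileNet V2 exported as `mv2_mlx.pte`), and since the backend is experimental, helper names and header paths should be checked against the current ExecuTorch headers:

```cpp
#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

#include <vector>

using executorch::extension::Module;
using executorch::extension::make_tensor_ptr;

int main() {
  // Load the MLX-delegated program; the registered MLX backend is
  // resolved automatically from the delegate blob inside the .pte.
  Module module("mv2_mlx.pte");

  // Prepare a float input matching the export-time shape (1, 3, 224, 224).
  std::vector<float> input(1 * 3 * 224 * 224, 0.0f);
  auto tensor = make_tensor_ptr({1, 3, 224, 224}, input.data());

  // Run inference; delegated partitions execute on the MLX backend.
  auto result = module.forward(tensor);
  if (result.ok()) {
    const auto& output = result->at(0).toTensor();
    // output.const_data_ptr<float>() points at the model's logits.
  }
  return 0;
}
```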
+ +There is also an `mlx-debug` preset useful during development: + +```bash +cmake --workflow --preset mlx-debug +``` + +## Reference + +**→{doc}`/backends/mlx/mlx-troubleshooting` — Debug common issues.** + +**→{doc}`/backends/mlx/mlx-partitioner` — Partitioner options.** + +**→{doc}`/backends/mlx/mlx-quantization` — Supported quantization schemes.** + +**→{doc}`/backends/mlx/mlx-op-support` — Supported operators.** + +```{toctree} +:maxdepth: 2 +:hidden: +:caption: MLX Backend +mlx-troubleshooting +mlx-partitioner +mlx-quantization +mlx-op-support +``` diff --git a/docs/source/backends/mlx/mlx-partitioner.md b/docs/source/backends/mlx/mlx-partitioner.md new file mode 100644 index 00000000000..57fb2468538 --- /dev/null +++ b/docs/source/backends/mlx/mlx-partitioner.md @@ -0,0 +1,41 @@ +# Partitioner API + +The MLX partitioner API allows for configuration of model delegation to the MLX backend. Passing an `MLXPartitioner` instance with no additional parameters will run as much of the model as possible on the MLX backend with default settings. This is the most common use case. + +## Usage + +```python +import torch +from executorch.backends.mlx import MLXPartitioner +from executorch.exir import to_edge_transform_and_lower + +et_program = to_edge_transform_and_lower( + torch.export.export(model, example_inputs), + partitioner=[MLXPartitioner()], +).to_executorch() +``` + +::::{important} +`MLXPartitioner` must be used with `to_edge_transform_and_lower()`. The legacy `to_edge()` + `to_backend()` workflow is **not supported** because it decomposes ops that MLX has optimized implementations for. +:::: + +## Unsupported Op Logging + +During partitioning, the partitioner logs a summary of any unsupported ops. 
This is useful for understanding what will fall back to CPU: + +``` +================================================================================ +MLX Partitioner: UNSUPPORTED OPS SUMMARY +================================================================================ + [UNSUPPORTED x2] aten.some_op.default + Reason: No handler registered +================================================================================ +``` + +If all ops are supported, you'll see: + +``` + (All call_function nodes are supported!) +``` + +Set `ET_MLX_DEBUG=1` to see detailed per-node support decisions during partitioning. diff --git a/docs/source/backends/mlx/mlx-quantization.md b/docs/source/backends/mlx/mlx-quantization.md new file mode 100644 index 00000000000..14244bd11ca --- /dev/null +++ b/docs/source/backends/mlx/mlx-quantization.md @@ -0,0 +1,88 @@ +# Quantization + +The MLX backend supports weight-only quantization via [TorchAO](https://github.com/pytorch/ao) for reducing model size and improving inference performance, particularly for LLMs on Apple Silicon. Quantization is applied to the eager model in-place **before** `torch.export()`. + +## `quantize_` + +The MLX backend uses TorchAO's [`quantize_`](https://github.com/pytorch/ao/blob/main/torchao/quantization/quant_api.py) API under the hood. You can call it directly for full control over quantization configs and granularity. 
The key TorchAO configs are: + +- [`IntxWeightOnlyConfig`](https://github.com/pytorch/ao/blob/main/torchao/quantization/quant_api.py) — for INT2/INT4/INT8 weight-only quantization with per-group granularity (group sizes 32, 64, 128) +- [`ExportableNVFP4Config`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/nvfp4.py) — for NVFP4 weight-only quantization + +```python +import torch +from torchao.quantization.quant_api import quantize_, IntxWeightOnlyConfig +from torchao.quantization.granularity import PerGroup + +# INT4 weight-only quantization for linear layers (group_size=32) +quantize_( + model, + IntxWeightOnlyConfig(weight_dtype=torch.int4, granularity=PerGroup(32)), + filter_fn=lambda m, fqn: isinstance(m, torch.nn.Linear), +) + +# INT8 weight-only quantization for embedding layers (group_size=128) +quantize_( + model, + IntxWeightOnlyConfig(weight_dtype=torch.int8, granularity=PerGroup(128)), + filter_fn=lambda m, fqn: isinstance(m, torch.nn.Embedding), +) +``` + +## `quantize_model_` + +For convenience, ExecuTorch provides `quantize_model_` which wraps `quantize_` with sensible defaults for common LLM quantization configurations: + +```python +from executorch.extension.llm.export.quantize import quantize_model_ + +# Quantize linear layers with INT4, embedding layers with INT8 +# Note: 8w defaults to per-axis grouping, which MLX does not support. +# Always pass an explicit group size when using 8w with MLX. +quantize_model_(model, qlinear_config="4w", qembedding_config="8w", qembedding_group_size=128) +``` + +### Supported configs + +| Config | Description | +|--------|-------------| +| `4w` | INT4 weight-only quantization (per-group) | +| `8w` | INT8 weight-only quantization (per-group) | +| `nvfp4` | NVIDIA FP4 weight-only quantization | + +These can be applied independently to linear layers and embedding layers. 
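To make per-group granularity concrete, here is a standalone illustration in plain NumPy — not the TorchAO or MLX implementation — of symmetric INT4 weight-only quantization with one scale per group of consecutive input-channel weights:

```python
import numpy as np


def quantize_per_group_int4(w: np.ndarray, group_size: int = 32):
    """Sketch of symmetric per-group INT4 weight quantization.

    w: (out_features, in_features); in_features must be divisible by group_size.
    Returns (quantized int8 values in [-8, 7], per-group float scales).
    """
    out_f, in_f = w.shape
    groups = w.reshape(out_f, in_f // group_size, group_size)
    # One scale per group, mapping the group's max |w| onto the INT4 maximum (7).
    scales = np.abs(groups).max(axis=-1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero for all-zero groups
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q.reshape(out_f, in_f), scales.squeeze(-1)


def dequantize_per_group(q: np.ndarray, scales: np.ndarray, group_size: int = 32):
    """Reconstruct approximate float weights from per-group quantized values."""
    out_f, in_f = q.shape
    groups = q.reshape(out_f, in_f // group_size, group_size).astype(np.float32)
    return (groups * scales[..., None]).reshape(out_f, in_f)
```

Each group of `group_size` weights shares a single scale, so smaller groups track local weight magnitude more closely at the cost of storing more scale metadata — which is the trade-off behind choosing 32 vs. 128.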
+ +### Using the LLM Export Script + +The simplest way to export a quantized model is via the `export_llm_hf` script, which calls `quantize_model_` internally: + +```bash +# INT4 quantization for both linear and embedding layers +python -m executorch.backends.mlx.examples.llm.export_llm_hf \ + --model-id "unsloth/Llama-3.2-1B-Instruct" \ + --output llama_int4.pte \ + --use-custom-sdpa \ + --use-custom-kv-cache \ + --qlinear 4w \ + --qembedding 4w + +# INT8 quantization for linear layers only +# Note: --qlinear-group-size is required for 8w (default is per-axis, which MLX does not support) +python -m executorch.backends.mlx.examples.llm.export_llm_hf \ + --model-id "unsloth/Llama-3.2-1B-Instruct" \ + --output llama_int8.pte \ + --use-custom-sdpa \ + --use-custom-kv-cache \ + --qlinear 8w \ + --qlinear-group-size 128 +``` + +### CLI Quantization Options + +| Option | Default | Description | +|--------|---------|-------------| +| `--qlinear` | None | Quantization for linear layers (`4w`, `8w`, `nvfp4`) | +| `--qembedding` | None | Quantization for embedding layers (`4w`, `8w`, `nvfp4`) | +| `--qlinear-group-size` | Depends on config | Group size for linear layer quantization (32, 64, or 128). Defaults to 32 for `4w`, 16 for `nvfp4`. **Required for `8w`** (default is per-axis, which MLX does not support). | +| `--qembedding-group-size` | Depends on config | Group size for embedding layer quantization (32, 64, or 128). Defaults to 32 for `4w`, 16 for `nvfp4`. **Required for `8w`** (default is per-axis, which MLX does not support). 
| +| `--no-tie-word-embeddings` | False | Disable re-tying lm_head to embedding after quantization | diff --git a/docs/source/backends/mlx/mlx-troubleshooting.md b/docs/source/backends/mlx/mlx-troubleshooting.md new file mode 100644 index 00000000000..6be472ff902 --- /dev/null +++ b/docs/source/backends/mlx/mlx-troubleshooting.md @@ -0,0 +1,118 @@ +# Troubleshooting + +This page describes common issues when using the MLX backend and how to debug them. + +## Debug Logging + +### AOT (export/compilation) debugging + +Set `ET_MLX_DEBUG=1` during export to see detailed debug logging from the partitioner and preprocessor — including ops-to-not-decompose lists, graph dumps, per-node support decisions, and serialization details: + +```bash +ET_MLX_DEBUG=1 python my_export_script.py +``` + +### Runtime per-op logging + +Per-op logging prints each MLX instruction as it executes, showing op names and tensor IDs. This is useful for correlating runtime behavior with the compiled graph. + +When using **pybindings** (i.e. `python install_executorch.py`), per-op logging is compiled in by default. Just set the environment variable: + +```bash +ET_MLX_ENABLE_OP_LOGGING=1 python my_inference_script.py +``` + +For **C++ builds**, you need to build with the debug preset first (which compiles in the logging code), then set the environment variable: + +```bash +# Build with debug preset +cmake --workflow --preset mlx-debug + +# Run with per-op logging enabled +ET_MLX_ENABLE_OP_LOGGING=1 ./cmake-out/my_app model.pte +``` + +The release preset (`mlx-release`) strips the logging code for performance. + +## Inspecting `.pte` Files + +The MLX backend includes a `.pte` inspector for debugging exported models. It can parse the ExecuTorch program structure, extract and decode the MLX delegate payload, and display instructions, tensor metadata, and I/O maps. 
### Basic usage + +Dump the full PTE structure as JSON: + +```bash +python -m executorch.backends.mlx.pte_inspector model.pte +``` + +### MLX summary + +Show a high-level summary of the MLX delegate (tensor counts, I/O maps, mutable buffers): + +```bash +python -m executorch.backends.mlx.pte_inspector model.pte --mlx-summary +``` + +### MLX instructions + +Show every instruction in the compiled graph with operands and parameters. This is useful for verifying quantization, inspecting fused patterns, and debugging incorrect outputs: + +```bash +python -m executorch.backends.mlx.pte_inspector model.pte --mlx-instructions +``` + +### Extract delegate payload + +Extract the raw MLX delegate payload to a binary file: + +```bash +python -m executorch.backends.mlx.pte_inspector model.pte --extract-delegate mlx -o delegate.bin +``` + +Parse and dump the extracted payload as JSON: + +```bash +python -m executorch.backends.mlx.pte_inspector model.pte --extract-delegate mlx --parse-mlx -o mlx_graph.json +``` + +### All options + +| Flag | Description | +|------|-------------| +| `--mlx-summary` | High-level summary (tensor counts, I/O maps) | +| `--mlx-instructions` | Detailed instruction list with operands | +| `--extract-delegate ID` | Extract raw delegate payload by ID | +| `--parse-mlx` | Parse extracted MLX payload to JSON (use with `--extract-delegate mlx`) | +| `--delegate-index N` | Index of delegate to extract (0-based, default: 0) | +| `--format json`/`summary` | Output format (default: json) | +| `-o FILE` | Write output to file instead of stdout | + +## Common Issues + +### Metal compiler not found + +**Error:** `xcrun -sdk macosx --find metal` fails. + +**Solution:** Install the full Xcode application (not just Command Line Tools). The Metal compiler ships with Xcode.
If Xcode is installed but not selected: + +```bash +sudo xcode-select -s /Applications/Xcode.app/Contents/Developer +``` + +### `MLXPartitioner must be used with to_edge_transform_and_lower()` + +**Error:** RuntimeError when using the legacy `to_edge()` + `to_backend()` workflow. + +**Solution:** Use `to_edge_transform_and_lower()` instead: + +```python +import torch +from executorch.backends.mlx import MLXPartitioner +from executorch.exir import to_edge_transform_and_lower + +et_program = to_edge_transform_and_lower( + torch.export.export(model, example_inputs), + partitioner=[MLXPartitioner()], +).to_executorch() +``` diff --git a/docs/source/desktop-backends.md b/docs/source/desktop-backends.md index c894e01843c..44be7af25e2 100644 --- a/docs/source/desktop-backends.md +++ b/docs/source/desktop-backends.md @@ -11,6 +11,7 @@ Available hardware acceleration backends for desktop platforms. ## macOS Backends - {doc}`backends/coreml/coreml-overview` — CoreML (recommended for Apple Silicon) +- {doc}`backends/mlx/mlx-overview` — MLX (Apple Silicon GPU) - {doc}`backends/mps/mps-overview` — Metal Performance Shaders (Apple Silicon GPU) - {doc}`desktop-xnnpack` — XNNPACK (CPU acceleration)