You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
- 🧠 **Smart Model Routing**: Loads HuggingFace format models directly, with native Safetensors parsing.
83
83
- ⚡️ **TurboQuantization Integrated**: Custom low-level MLX Metal primitives that apply extremely fast quantization for KV caching out-of-the-box.
84
-
- 💾 **SSD Expert Streaming**: *Experimental* zero-copy streaming that swaps Mixture of Experts (MoE) layers directly from the NVMe SSD to the GPU command buffer without trashing macOS Unified Memory (prevents Watchdog OS kernel panics on 122B+ models).
84
+
- 💾 **SSD Expert Streaming (10x)**: High-performance NVMe streaming that loads Mixture of Experts (MoE) layers directly from SSD to GPU — engineered by [@ericjlake](https://github.com/ericjlake), achieving **10x speedup** (0.58 → 5.91 tok/s) on 122B+ models with only ~10 GB resident memory. Uses cross-projection batching, concurrent pread (QD=24), asyncEval pipeline, and runtime top-k expert selection.
85
+
- 🔮 **Speculative Decoding**: Load a small draft model (e.g. 9B) alongside a large main model to generate candidate tokens and verify in bulk — accelerating in-RAM inference.
85
86
- 🎛️ **Granular Memory Control**: Integrated Layer Partitioning (`--gpu-layers`) and Wisdom Auto-Calibration for squeezing massive models into RAM.
SwiftLM implements a **rewritten SSD expert streaming pipeline** (engineered by [Eric Lake](https://github.com/ericjlake)) that achieves 10x generation speedup for massive Mixture of Experts (MoE) models running on memory-constrained Apple Silicon. This enables running models like **Qwen3.5-122B** (69.6 GB) and **Qwen3.5-397B** (209 GB) on a **64 GB Mac** by streaming expert weights from NVMe SSD.
|**This PR (top-k=4, speed mode)**|**5.91**|**10.2×**| Best quality/speed tradeoff |
162
+
|**This PR (top-k=2, turbo mode)**|**6.52**|**11.2×**| Still coherent output |
163
+
164
+
> Memory stable at **~10.6 GB resident**, no swap activity. Tested over 200-token generation runs.
165
+
166
+
### The Approach: Small Model Helps Large Model
167
+
168
+
A novel aspect of this architecture is the **dual-model speculative decoding** pattern: a small draft model (e.g. Qwen3.5-9B at 73 tok/s) runs **entirely in RAM** while the large MoE model (e.g. 122B) streams experts from SSD. The draft model generates candidate tokens at high speed, and the main model verifies them in bulk — dramatically reducing the number of SSD-bound generation rounds needed.
169
+
170
+
> **Important finding:** Speculative decoding is **counterproductive for SSD-streaming MoE** specifically. The verify pass sends N+1 tokens, each routing to *different* experts — SSD I/O scales with the *union* of all positions' expert selections. Speculative decoding is therefore routed exclusively to **in-RAM models**.
171
+
172
+
### Optimization Techniques
173
+
174
+
1.**Cross-Projection Batching**: Collapses ~1,400 per-expert `eval()` calls down to ~48 per token by orchestrating gate/up/down projections together in `SwitchGLU`.
3.**AsyncEval Pipeline with Speculative Pread**: Overlaps GPU compute with SSD I/O — uses previous-token routing to speculatively pre-load experts for the next token during the GPU async window (~70% hit rate). Only missed experts (~30%) require on-demand pread after routing sync.
177
+
4.**Persistent Metal Buffers**: Expert weight buffers are allocated once per `SwitchGLU` layer and reused across tokens, eliminating per-token allocation overhead.
178
+
5.**Runtime Top-K Expert Selection**: The `SWIFTLM_TOP_K` environment variable reduces the number of active experts per token at runtime without model recompilation — trading marginal quality for significant speed gains.
179
+
180
+
### Key Engineering Findings
181
+
182
+
| Finding | Detail |
183
+
|---|---|
184
+
|**GPU compute is the bottleneck**| At steady state, GPU compute is ~190ms of ~200ms per-token time. The OS page cache serves ~90% of expert reads from RAM. |
185
+
|**Don't cache experts in application memory**| An LRU expert cache *stole* from the OS page cache and regressed performance (4.84 → 4.01 tok/s). Let the kernel manage it. |
186
+
|**MambaCache requires checkpoint rollback**| Unlike attention KV caches (trim = decrement offset), Mamba's recurrent state integrates all history and cannot be partially undone. We implemented `checkpoint()`/`restore()` for speculative decoding on hybrid Attention+Mamba architectures (Qwen3.5). |
|`--draft-model`| (none) | Draft model path/ID for speculative decoding (in-RAM models only) |
291
+
|`--num-draft-tokens`|`4`| Number of draft tokens per speculation round |
231
292
232
293
## 📦 Requirements
233
294
@@ -247,7 +308,13 @@ The model instantly woke up from "whispering" whitespace and successfully respon
247
308
248
309
## 🙏 Acknowledgments & Credits
249
310
250
-
`SwiftLM` leverages the powerful foundation of the Apple MLX community and relies heavily on the open-source ecosystem. While the custom C++ implementations, Metal optimizations, and high-performance pipeline architecture were engineered natively for this engine, we owe massive thanks to the following projects for their indispensable reference materials and underlying protocols:
311
+
`SwiftLM` leverages the powerful foundation of the Apple MLX community and relies heavily on the open-source ecosystem. While the custom C++ implementations, Metal optimizations, and high-performance pipeline architecture were engineered natively for this engine, we owe massive thanks to the following projects and contributors for their indispensable reference materials and underlying protocols:
312
+
313
+
### Contributors
314
+
315
+
-**[Eric Lake](https://github.com/ericjlake)** — Engineered the **SSD Expert Streaming 10x rewrite** ([PR #26](https://github.com/SharpAI/SwiftLM/pull/26)), achieving 10× generation speedup on 122B+ MoE models via cross-projection batching, concurrent NVMe pread (QD=24), asyncEval pipeline with speculative pread, and runtime top-k expert selection. Also implemented the **speculative decoding infrastructure** with `DraftModelRef`, dual-model loading, and **MambaCache checkpoint/restore** for hybrid Attention+Mamba architectures.
316
+
317
+
### Projects & References
251
318
252
319
-**[mlx-swift](https://github.com/ml-explore/mlx-swift)** — The core Apple MLX wrapper bringing Metal-accelerated operations into the Swift ecosystem.
253
320
-**[mlx-lm](https://github.com/ml-explore/mlx/tree/main/mlx_lm)** — The official Python language models implementation, serving as the core inspiration for our chunked-prefill architecture and attention manipulation logic.
0 commit comments