Large Language Model (LLM) serving faces critical bottlenecks when scaling from single-node deployments to heterogeneous clusters. While engines like vLLM and SGLang optimize throughput on individual GPUs, they lack cluster-wide awareness. This leads to three primary crises:
- The Latency Crisis: Naive scheduling (e.g., Round-Robin) routes short, latency-sensitive requests to slower or overloaded workers, causing severe head-of-line blocking and inflated tail latencies (p99).
- The Memory Crisis: Simple routing policies ignore VRAM constraints. Routing a long-context request to a worker with insufficient memory triggers immediate Out-Of-Memory (OOM) failures or expensive swapping.
- The Deployment Crisis: Existing solutions often require invasive modifications to the inference engine (e.g., custom forks of vLLM), creating a "maintenance trap" that prevents users from easily upgrading their underlying serving software.
Goal: To build a model-agnostic, non-invasive control plane that achieves predictive, memory-safe orchestration for heterogeneous GPU clusters.
Current approaches to LLM orchestration fall into three categories, each with distinct limitations:
| Category | Mechanism | Flaw |
|---|---|---|
| Naive Load Balancers (Round-Robin, Least-Loaded) | Route based on simple counters or cyclic order. | Blind to request complexity (token count) and hardware heterogeneity (A100 vs. T4). Fails to prevent OOMs. |
| General-Purpose Frameworks (Ray Serve, Kubernetes) | Scale based on coarse metrics such as CPU/RAM usage or average latency. | Metrics are "lagging indicators": by the time average latency spikes, the system is already congested. They also lack the granularity to manage KV-cache memory pressure. |
| Specialized Research Systems (NexusSched, DistServe) | Use sophisticated offline profiling or disaggregated prefill/decode pipelines. | Invasive and complex: NexusSched requires deep modifications to the engine's scheduler, and some systems use computationally expensive predictors ($O(N^2)$) or fragile offline profiles that drift over time. |
DIO's Contribution: A lightweight sidecar architecture that uses online learning (NLMS) to adapt to heterogeneity without offline profiling, enforcing strict memory safety via Roofline admission control.
DIO employs a decoupled Control Plane / Data Plane architecture (see Figure 1).
Control Plane: A high-performance, centralized orchestrator responsible for all routing decisions.
- HTTP/gRPC API: Accepts requests and parses metadata (tier, prompt length).
- Scheduler: Runs the core logic (Algorithm 1) to score workers based on predicted latency, queue wait time, and VRAM penalties.
- Online Predictor: Maintains the NLMS state vector for every worker.
Data Plane: Stateless workers running standard inference engines (vLLM).
- Sidecar Pattern: Wraps the engine to intercept requests and piggyback telemetry (latency, free VRAM) onto responses. This avoids modifying the engine core.
- Execution: Purely responsible for inference; no local scheduling logic.
To predict request latency, DIO maintains a per-worker linear model over the prompt length:
$$ \hat{L} = b + s \cdot n $$
Where $\hat{L}$ is the predicted latency, $b$ is the worker's base latency, $s$ is the learned slope (cost per token), and $n$ is the prompt token count.
Standard Least Mean Squares (LMS) suffers from gradient explosion with large inputs (long prompts). DIO uses Normalized LMS, which scales the error gradient by the input energy, to stabilize updates:
$$ \nabla_{\text{norm}} = \frac{(L_{\text{obs}} - \hat{L}) \cdot n}{\epsilon + n^2} $$
To distinguish between random noise (jitter) and persistent hardware changes (drift/thermal throttling), DIO maintains two slope estimates:
- Fast Slope ($s_{\text{fast}}$): Updates quickly ($\alpha = 0.1$) to react to immediate contention.
$$ s_{\text{fast}} \leftarrow s_{\text{fast}} + 0.1 \cdot \nabla_{\text{norm}} $$
- Slow Slope ($s_{\text{slow}}$): Updates slowly ($\alpha = 0.01$) to track long-term trends.
$$ s_{\text{slow}} \leftarrow s_{\text{slow}} + 0.01 \cdot \nabla_{\text{norm}} $$
- Final Effective Slope: The scheduler uses a weighted combination of $s_{\text{fast}}$ and $s_{\text{slow}}$ as the effective slope $s$.
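The update rules above can be sketched as a small Go predictor. The 50/50 blend of the two slopes and the epsilon value are assumptions (the text only specifies "a weighted combination"); the learning rates 0.1 and 0.01 come from the equations above.

```go
package main

import "fmt"

// Predictor keeps the per-worker NLMS state described above.
type Predictor struct {
	Base float64 // base latency b in ms at zero tokens
	Fast float64 // fast slope, alpha = 0.1
	Slow float64 // slow slope, alpha = 0.01
}

// Predict blends the two slopes into an effective slope. The equal 50/50
// weighting is an assumption for this sketch.
func (p *Predictor) Predict(tokens float64) float64 {
	s := 0.5*p.Fast + 0.5*p.Slow
	return p.Base + s*tokens
}

// Observe applies the NLMS update: the gradient is normalized by the input
// energy (eps + tokens^2), preventing gradient explosion on long prompts.
func (p *Predictor) Observe(tokens, latencyMs float64) {
	const eps = 1e-6
	err := latencyMs - p.Predict(tokens)
	grad := err * tokens / (eps + tokens*tokens)
	p.Fast += 0.1 * grad  // reacts to immediate contention
	p.Slow += 0.01 * grad // tracks drift (e.g. thermal throttling)
}

func main() {
	p := &Predictor{Base: 30}
	// Simulate a worker whose true cost is 30 ms + 0.8 ms/token.
	for i := 0; i < 200; i++ {
		p.Observe(1000, 30+0.8*1000)
	}
	fmt.Printf("predicted for 1000 tokens: %.1f ms\n", p.Predict(1000))
	// → predicted for 1000 tokens: 830.0 ms
}
```

Because the gradient is divided by the input energy, the same code stays stable whether a probe carries 10 tokens or 100,000, which is exactly the failure mode plain LMS hits on long prompts.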
DIO was evaluated on both constrained (L4) and high-performance (A100) hardware using production traces (ShareGPT, arXiv, Azure Code).
- Zero-Config Convergence (T1): The NLMS predictor converged to <50ms error within 60 seconds (approx. 20 probes) on both L4 and A100 GPUs, proving no offline profiling is needed.
- Scalability (T7): The Go-based control plane successfully managed 32 concurrent workers with a scheduling overhead of just 14µs per request, validating the $O(1)$ complexity claim.
- Tail Latency Reduction (T9-T11):
  - arXiv (Long Context) Stress Test: Baselines suffered VRAM thrashing with p99 latency >80s; DIO's Roofline admission kept p99 at 13.5s (a ~6x reduction).
  - Azure Code (Bursty): DIO maintained 88% SLO attainment versus 12% for Round-Robin.
- Resilience (T4): In a "Roofline" stress test (flooding the system beyond capacity), DIO achieved a 0% failure rate by gracefully shedding load, whereas baselines crashed with OOM errors.