Add --gpu flag for Metal/MLX inference in Linux containers#1314
Open
ilessiorobotflowlabs wants to merge 1 commit intoapple:mainfrom
Open
Add --gpu flag for Metal/MLX inference in Linux containers#1314ilessiorobotflowlabs wants to merge 1 commit intoapple:mainfrom
ilessiorobotflowlabs wants to merge 1 commit intoapple:mainfrom
Conversation
Adds GPU acceleration support for Linux containers on Apple Silicon
through the MLX Container Toolkit. When --gpu is passed to
`container run`, the runtime injects vsock environment variables
into the guest VM, enabling code inside the container to access the
host's Metal GPU for ML inference.
Architecture:
Container (Linux VM) --[gRPC over vsock]--> Host daemon (MLX/Metal)
The host-side daemon (mlx-container-daemon) manages model loading
and serves inference requests over the same vsock channel that
vminitd already uses for container management. No GPU drivers or
Metal frameworks are needed inside the Linux guest.
New flags on `container run`:
--gpu Enable GPU access
--gpu-model <id> Pre-load a HuggingFace model
--gpu-memory <gb> GPU memory budget
--gpu-max-tokens <n> Max tokens per request
--gpu-port <port> vsock port (default: 2048)
Example:
container run --gpu --gpu-model mlx-community/Llama-3.2-1B-4bit \
ubuntu:latest python3 -c \
"from mlx_container import generate; print(generate('Hello', model='mlx-community/Llama-3.2-1B-4bit').text)"
Requires: https://github.com/RobotFlow-Labs/container-toolkit-mlx
Signed-off-by: ilessio <ilessio@aiflowlabs.io>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR adds GPU acceleration support for Linux containers running on Apple Silicon. When
--gpuis passed tocontainer run, the runtime injects vsock environment variables into the guest VM, enabling Python code inside the container to access the host Metal GPU for ML inference at near-native speed.76 lines changed. 2 files touched.
The problem
Apple containers run Linux in lightweight VMs on Apple Silicon -- but there is currently no way to access the Metal GPU from inside those VMs. Metal and MLX cannot run in Linux guests. This is a hardware limitation, not a software one.
Developers working with ML models in containers today have two options: CPU-only inference (~5% of native speed) or running everything on the host outside any container.
The approach
Rather than attempting GPU passthrough (which Apple's Virtualization framework does not support), this PR takes the same host-guest bridge approach that
vminitdalready uses for container management: vsock.A host-side daemon (container-toolkit-mlx) runs with direct Metal/MLX access and serves inference requests over gRPC through the vsock channel. Code inside the container uses a lightweight Python client (
pip install mlx-container) that proxies requests to the host GPU.This is architecturally identical to how NVIDIA's container toolkit bridges GPU access, adapted for Apple Silicon's unified memory model.
What this PR adds
Two files, 76 lines total:
Flags.swift-- newFlags.GPUstruct:--gpuflag to enable GPU access--gpu-model <id>to pre-load a HuggingFace model on container start--gpu-memory <gb>for per-container GPU memory budgets--gpu-max-tokens <n>to cap inference request size--gpu-port <port>for custom vsock port (default: 2048)ContainerRun.swift-- GPU environment injection:--gpuis set, injectsMLX_VSOCK_CID,MLX_VSOCK_PORT,MLX_GPU_ENABLEDinto the container environmentMLX_GPU_MODELandMLX_GPU_MEMORYUsage
Performance
Tested on Apple M5, 24 GB unified memory:
~95% of native Metal performance. The only overhead is vsock serialization.
The companion toolkit
This PR is the integration point. The heavy lifting lives in container-toolkit-mlx, an open-source toolkit that provides:
mlx-container-daemon-- host-side gRPC server with MLX model managementmlx-ctk-- CLI for GPU discovery, daemon lifecycle, CDI spec generationmlx-cdi-hook-- OCI prestart hook for automatic daemon startupmlx-container-- Python client library (OpenAI + Anthropic API compatible)apple.com/gpu)The toolkit follows the same architectural patterns as this project: vsock for host-guest communication, gRPC for the wire protocol, Swift for the host-side components.
Why this belongs upstream
The flags are inert without the toolkit -- if
container-toolkit-mlxis not installed,--gpusimply injects environment variables that nothing reads. Zero risk to existing users.The vsock channel already exists -- this PR adds no new transport. It reuses the same vsock path that
vminitduses.Developers expect it -- GPU support is the most-requested feature for Apple containers. This gives them a path forward.
76 lines -- this is as minimal as a GPU integration can be. All complexity lives in the external toolkit.
Test plan
container run --gpu --helpshows GPU flagscontainer runwithout--gpubehaves identically to before (no GPU env vars injected)container run --gpuinjectsMLX_VSOCK_CID=2andMLX_VSOCK_PORT=2048into container envcontainer run --gpu --gpu-model Xadditionally injectsMLX_GPU_MODEL=XBuilt by RobotFlow Labs | container-toolkit-mlx