Add --gpu flag for Metal/MLX inference in Linux containers#1314

Open
ilessiorobotflowlabs wants to merge 1 commit into apple:main from RobotFlow-Labs:feature/gpu-support

Conversation

@ilessiorobotflowlabs

Summary

This PR adds GPU acceleration support for Linux containers running on Apple Silicon. When --gpu is passed to container run, the runtime injects vsock environment variables into the guest VM, enabling Python code inside the container to access the host Metal GPU for ML inference at near-native speed.

76 lines changed. 2 files touched.

The problem

Apple containers run Linux in lightweight VMs on Apple Silicon, but there is currently no way to access the Metal GPU from inside those VMs. Metal and MLX cannot run in Linux guests: Apple's Virtualization framework does not expose the GPU to guests, and Metal has no Linux driver. This is a platform limitation, not something a container runtime can work around on its own.

Developers working with ML models in containers today have two options: CPU-only inference (~5% of native speed) or running everything on the host outside any container.

The approach

Rather than attempting GPU passthrough (which Apple's Virtualization framework does not support), this PR takes the same host-guest bridge approach that vminitd already uses for container management: vsock.

A host-side daemon (container-toolkit-mlx) runs with direct Metal/MLX access and serves inference requests over gRPC through the vsock channel. Code inside the container uses a lightweight Python client (pip install mlx-container) that proxies requests to the host GPU.

Container (Linux VM) --[gRPC over vsock]--> Host Daemon (MLX/Metal GPU)

This is conceptually similar to how NVIDIA's container toolkit brokers GPU access for Linux containers, adapted here to Apple Silicon's unified memory model. (NVIDIA's toolkit mounts host drivers and devices into the container; this design instead proxies inference requests over RPC.)
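The guest-to-host path above can be sketched with Python's built-in vsock support. This is a rough illustration only: the CID and port values come from this PR, but the helper names are hypothetical, and the real client runs gRPC over the stream rather than using it directly.

```python
import socket

# CID 2 is the well-known vsock address of the host; 2048 matches
# this PR's default --gpu-port. The function names below are
# illustrative, not the toolkit's real API.
HOST_CID = 2
DEFAULT_PORT = 2048

def vsock_address(cid=HOST_CID, port=DEFAULT_PORT):
    """Build the (cid, port) pair a guest uses to reach the host daemon."""
    return (cid, port)

def connect_to_host_gpu(port=DEFAULT_PORT):
    """Open a raw vsock stream to the host-side inference daemon.

    AF_VSOCK is only available inside a Linux guest; the real client
    would layer gRPC on top of this stream.
    """
    sock = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
    sock.connect(vsock_address(port=port))
    return sock
```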

What this PR adds

Two files, 76 lines total:

Flags.swift -- new Flags.GPU struct:

  • --gpu flag to enable GPU access
  • --gpu-model <id> to pre-load a HuggingFace model on container start
  • --gpu-memory <gb> for per-container GPU memory budgets
  • --gpu-max-tokens <n> to cap inference request size
  • --gpu-port <port> for custom vsock port (default: 2048)

ContainerRun.swift -- GPU environment injection:

  • When --gpu is set, injects MLX_VSOCK_CID, MLX_VSOCK_PORT, MLX_GPU_ENABLED into the container environment
  • Optionally injects MLX_GPU_MODEL and MLX_GPU_MEMORY
  • Logs GPU configuration at info level
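On the guest side, client code only has to read the injected variables. A minimal sketch of that parsing, assuming the variable names from this PR; the truthy value of MLX_GPU_ENABLED and the helper name are assumptions, not the toolkit's actual behavior:

```python
import os
from typing import Optional

def gpu_config_from_env(env: Optional[dict] = None) -> Optional[dict]:
    """Parse the MLX_* variables injected by `container run --gpu`.

    Returns None when the container was started without --gpu.
    The exact value of MLX_GPU_ENABLED is an assumption here.
    """
    env = os.environ if env is None else env
    if not env.get("MLX_GPU_ENABLED"):
        return None
    return {
        "cid": int(env.get("MLX_VSOCK_CID", "2")),
        "port": int(env.get("MLX_VSOCK_PORT", "2048")),
        "model": env.get("MLX_GPU_MODEL"),       # optional, from --gpu-model
        "memory_gb": env.get("MLX_GPU_MEMORY"),  # optional, from --gpu-memory
    }
```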

Usage

# Run GPU-accelerated inference inside a Linux container
container run --gpu --gpu-model mlx-community/Llama-3.2-1B-4bit \
  ubuntu:latest python3 -c "
from mlx_container import generate, load_model
load_model('mlx-community/Llama-3.2-1B-4bit')
result = generate('Explain Apple Silicon', model='mlx-community/Llama-3.2-1B-4bit')
print(result.text)
print(f'{result.tokens_per_second:.0f} tok/s on host Metal GPU')
"

Performance

Tested on Apple M5, 24 GB unified memory:

Method                               Tokens/sec   Runs in container
This PR + container-toolkit-mlx      99 tok/s     Yes
Native MLX (macOS, no container)     ~103 tok/s   No
CPU fallback (no GPU)                ~5 tok/s     Yes

Roughly 96% of native Metal performance (99 vs ~103 tok/s). The remaining overhead comes from vsock serialization and the gRPC round trip.
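The overhead figure follows directly from the numbers in the table:

```python
native = 103.0   # tok/s, native MLX on macOS (approximate)
bridged = 99.0   # tok/s, through the vsock/gRPC bridge
cpu = 5.0        # tok/s, CPU-only fallback

relative = bridged / native       # 99 / 103 ≈ 0.96 of native throughput
speedup_over_cpu = bridged / cpu  # ≈ 20x faster than CPU-only inference
```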

The companion toolkit

This PR is the integration point. The heavy lifting lives in container-toolkit-mlx, an open-source toolkit that provides:

  • mlx-container-daemon -- host-side gRPC server with MLX model management
  • mlx-ctk -- CLI for GPU discovery, daemon lifecycle, CDI spec generation
  • mlx-cdi-hook -- OCI prestart hook for automatic daemon startup
  • mlx-container -- Python client library (OpenAI + Anthropic API compatible)
  • CDI v0.5.0 spec support (apple.com/gpu)
  • 259 tests (Swift + Python), security audited

The toolkit follows the same architectural patterns as this project: vsock for host-guest communication, gRPC for the wire protocol, Swift for the host-side components.

Why this belongs upstream

  1. The flags are inert without the toolkit -- if container-toolkit-mlx is not installed, --gpu simply injects environment variables that nothing reads. Zero risk to existing users.

  2. The vsock channel already exists -- this PR adds no new transport. It reuses the same vsock path that vminitd uses.

  3. Developers expect it -- GPU support is the most-requested feature for Apple containers. This gives them a path forward.

  4. 76 lines -- this is as minimal as a GPU integration can be. All complexity lives in the external toolkit.

Test plan

  • container run --gpu --help shows GPU flags
  • container run without --gpu behaves identically to before (no GPU env vars injected)
  • container run --gpu injects MLX_VSOCK_CID=2 and MLX_VSOCK_PORT=2048 into container env
  • container run --gpu --gpu-model X additionally injects MLX_GPU_MODEL=X
  • End-to-end inference from container at 99 tok/s verified on M5

Built by RobotFlow Labs | container-toolkit-mlx

Adds GPU acceleration support for Linux containers on Apple Silicon
through the MLX Container Toolkit. When --gpu is passed to
`container run`, the runtime injects vsock environment variables
into the guest VM, enabling code inside the container to access the
host's Metal GPU for ML inference.

Architecture:
  Container (Linux VM) --[gRPC over vsock]--> Host daemon (MLX/Metal)

The host-side daemon (mlx-container-daemon) manages model loading
and serves inference requests over the same vsock channel that
vminitd already uses for container management. No GPU drivers or
Metal frameworks are needed inside the Linux guest.

New flags on `container run`:
  --gpu                  Enable GPU access
  --gpu-model <id>       Pre-load a HuggingFace model
  --gpu-memory <gb>      GPU memory budget
  --gpu-max-tokens <n>   Max tokens per request
  --gpu-port <port>      vsock port (default: 2048)

Example:
  container run --gpu --gpu-model mlx-community/Llama-3.2-1B-4bit \
    ubuntu:latest python3 -c \
    "from mlx_container import generate; print(generate('Hello', model='mlx-community/Llama-3.2-1B-4bit').text)"

Requires: https://github.com/RobotFlow-Labs/container-toolkit-mlx

Signed-off-by: ilessio <ilessio@aiflowlabs.io>