## Environment

| Component | Version |
| --- | --- |
| Docker image | intel/llm-scaler-vllm:latest (1.4 / 0.14.0-b8) |
| vLLM | 0.14.1.dev0+gb17039bcc.d20260227 |
| PyTorch | 2.10.0+xpu |
| IPEX | 2.10.10.post1+xpu |
| oneAPI | 2025.3.2 (hotfix) |
| Level Zero | 1.26.2 |
| OS (host) | Ubuntu 25.04 |
| Kernel | 6.14.0-37-generic (xe driver) |
| GPUs | 2x Intel Arc B580 (BMG-G21, device ID 0xe20b) |
| PCI addresses | 0000:00:07.0, 0000:00:08.0 |
| DRM devices | /dev/dri/card0, /dev/dri/card2, /dev/dri/renderD128, /dev/dri/renderD129 |
## Description

When running `intel/llm-scaler-vllm:latest` (1.4) with 2x Arc B580 GPUs, any attempt to allocate a tensor on XPU fails with a false `OutOfMemoryError`, even though the GPU reports 11.33 GiB free. The error occurs on the first `.xpu()` call, after `torch.xpu.device_count()` correctly returns 2.

This is a regression from `intel/llm-scaler-vllm:1.3` (PyTorch 2.9), which works correctly.
## Minimal Reproduction

```bash
# With both GPUs visible (device_count = 2) — FAILS
source /opt/intel/oneapi/setvars.sh --force
unset ZE_AFFINITY_MASK
python3 -c "
import torch
print('devices:', torch.xpu.device_count())  # prints 2
x = torch.zeros(1).xpu(0)                    # OutOfMemoryError
print('GPU0:', x)
"
```
Output:
```
devices: 2
torch.OutOfMemoryError: XPU out of memory. Tried to allocate 2.00 MiB.
GPU 0 has a total capacity of 11.33 GiB of which 11.33 GiB is free.
Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes
is reserved by PyTorch but unallocated.
```
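The free-memory figure in the error message can be cross-checked immediately before the failing call. A minimal sketch, assuming `torch.xpu.mem_get_info` is available in this PyTorch build and returns `(free, total)` in bytes:

```bash
# Report free/total memory on each visible XPU, then attempt the allocation that fails
python3 -c "
import torch
for i in range(torch.xpu.device_count()):
    free, total = torch.xpu.mem_get_info(i)  # (free, total) in bytes
    print(f'xpu:{i} free={free/2**30:.2f} GiB total={total/2**30:.2f} GiB')
x = torch.zeros(1).xpu(0)                     # still raises OutOfMemoryError
"
```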
## Key Observations
Single GPU works fine — the error only occurs when both GPUs are visible:
```bash
# GPU 0 alone — WORKS
export ZE_AFFINITY_MASK=0
python3 -c "import torch; print(torch.zeros(1).xpu())"
# tensor([0.], device='xpu:0') ✓

# GPU 1 alone — WORKS
export ZE_AFFINITY_MASK=1
python3 -c "import torch; print(torch.zeros(1).xpu())"
# tensor([0.], device='xpu:0') ✓

# Both GPUs visible — FAILS on whichever device is allocated first
unset ZE_AFFINITY_MASK
python3 -c "import torch; torch.xpu.set_device(1); print(torch.zeros(1).xpu(1))"
# OutOfMemoryError on GPU 1 ✗
```
P2P access reports `True` between both devices:

```python
torch.xpu.can_device_access_peer(0, 1)  # True
torch.xpu.can_device_access_peer(1, 0)  # True
```
`ZE_DEBUG=1` produces no errors — the failure is happening inside PyTorch/IPEX/UMF, above the Level Zero layer.
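For reference, the Level Zero debug check was of roughly this form; the exact invocation is an assumption, the Python one-liner is the same repro as above:

```bash
# Same failing allocation with Level Zero debug output enabled: no Level Zero errors are printed
unset ZE_AFFINITY_MASK
ZE_DEBUG=1 python3 -c "import torch; print(torch.zeros(1).xpu(0))"
```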
None of the following workarounds fix it with 2 GPUs visible (a retest sweep is sketched after this list):

- `SYCL_UR_USE_LEVEL_ZERO_V2=0`
- `SYCL_UR_USE_LEVEL_ZERO_V2=1`
- `SYCL_ENABLE_DEFAULT_CONTEXTS=0`
- `ZE_FLAT_DEVICE_HIERARCHY=FLAT`
- `ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE`
- `PYTORCH_XPU_ALLOC_CONF=backend:native`
- `UMF_PROXY=0`
- `ZE_ENABLE_PCI_ID_DEVICE_ORDER=1`
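A sketch of how such a sweep can be scripted against the one-line repro (illustrative only, not the exact commands that were run):

```bash
# Run the minimal repro once per candidate workaround; all of them still fail with 2 GPUs visible
unset ZE_AFFINITY_MASK
for opt in \
    SYCL_UR_USE_LEVEL_ZERO_V2=0 \
    SYCL_UR_USE_LEVEL_ZERO_V2=1 \
    SYCL_ENABLE_DEFAULT_CONTEXTS=0 \
    ZE_FLAT_DEVICE_HIERARCHY=FLAT \
    ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE \
    PYTORCH_XPU_ALLOC_CONF=backend:native \
    UMF_PROXY=0 \
    ZE_ENABLE_PCI_ID_DEVICE_ORDER=1; do
  echo "=== $opt ==="
  env "$opt" python3 -c "import torch; print(torch.zeros(1).xpu(0))"
done
```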
## Impact

- `vllm serve -tp 1` fails (EngineCore subprocess spawned by the V1 engine can't allocate); see the invocation sketch after this list
- `vllm serve -tp 2` fails (same)
- `VLLM_USE_V1=0` does not help — the allocator failure is pre-vLLM
- The only workaround is `ZE_AFFINITY_MASK=0` or `=1` for single-GPU inference
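For reference, the failing invocations have this shape; the model argument is a placeholder, not the exact model that was used:

```bash
# Both fail with the same XPU OutOfMemoryError inside the EngineCore subprocess
vllm serve <model> -tp 1
vllm serve <model> -tp 2

# Current workaround: pin one GPU before starting the server
ZE_AFFINITY_MASK=0 vllm serve <model> -tp 1
```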
## Working Reference

`intel/llm-scaler-vllm:1.3` (PyTorch 2.9, vLLM 0.11.1) works correctly with both GPUs visible and `-tp 2`.
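A quick way to confirm the regression is to run the same one-liner in both images. The `docker run` flags below are a typical way to expose the GPUs and are an assumption, not the exact command line used:

```bash
# 1.4 image (PyTorch 2.10): fails with both B580s visible
docker run --rm --device /dev/dri intel/llm-scaler-vllm:latest \
  python3 -c "import torch; print(torch.zeros(1).xpu(0))"

# 1.3 image (PyTorch 2.9): the same command works
docker run --rm --device /dev/dri intel/llm-scaler-vllm:1.3 \
  python3 -c "import torch; print(torch.zeros(1).xpu(0))"
```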