
[Bug] intel/llm-scaler-vllm:latest (1.4 / vLLM 0.14.0 / PyTorch 2.10): XPU OutOfMemoryError on first tensor allocation when 2x Arc B580 are visible #305

@anil-motupalli

Environment

Component       Version
Docker image    intel/llm-scaler-vllm:latest (1.4 / 0.14.0-b8)
vLLM            0.14.1.dev0+gb17039bcc.d20260227
PyTorch         2.10.0+xpu
IPEX            2.10.10.post1+xpu
oneAPI          2025.3.2 (hotfix)
Level Zero      1.26.2
OS (host)       Ubuntu 25.04
Kernel          6.14.0-37-generic (xe driver)
GPUs            2x Intel Arc B580 (BMG-G21, device ID 0xe20b)
PCI addresses   0000:00:07.0, 0000:00:08.0
DRM devices     /dev/dri/card0, /dev/dri/card2, /dev/dri/renderD128, /dev/dri/renderD129
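
For reference, the versions above can be re-collected with standard tools. This is a sketch run inside the container; tool availability and module names are assumed to match the image:

# Sketch: collect the environment details listed above
uname -r                                    # host kernel version
lspci -nn | grep -iE 'vga|display'          # GPU device IDs / PCI addresses
sycl-ls                                     # oneAPI view of available XPU devices
python3 -c "import torch; print(torch.__version__)"                              # PyTorch
python3 -c "import intel_extension_for_pytorch as ipex; print(ipex.__version__)" # IPEX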

Description

When running intel/llm-scaler-vllm:latest (1.4) with 2x Arc B580 GPUs, any attempt to allocate a tensor on XPU fails with a spurious OutOfMemoryError, even though the GPU reports 11.33 GiB free. The error occurs on the first .xpu() call, after torch.xpu.device_count() has correctly returned 2.

This is a regression from intel/llm-scaler-vllm:1.3 (PyTorch 2.9), which works correctly.


Minimal Reproduction

# With both GPUs visible (device_count = 2) — FAILS
source /opt/intel/oneapi/setvars.sh --force
unset ZE_AFFINITY_MASK

python3 -c "
import torch
print('devices:', torch.xpu.device_count())  # prints 2
x = torch.zeros(1).xpu(0)                    # OutOfMemoryError
print('GPU0:', x)
"

Output:

devices: 2
torch.OutOfMemoryError: XPU out of memory. Tried to allocate 2.00 MiB.
GPU 0 has a total capacity of 11.33 GiB of which 11.33 GiB is free.
Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes
is reserved by PyTorch but unallocated.
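
The "11.33 GiB free" claim in the message can be cross-checked independently, provided the allocator initializes far enough to answer the query. A minimal sketch using torch.xpu.mem_get_info (available in PyTorch >= 2.6):

# Sketch: confirm free/total device memory before the failing allocation
python3 -c "
import torch
for d in range(torch.xpu.device_count()):
    free, total = torch.xpu.mem_get_info(d)  # bytes
    print(f'xpu:{d} free={free/2**30:.2f} GiB total={total/2**30:.2f} GiB')
"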

Key Observations

Single GPU works fine — the error only occurs when both GPUs are visible:

# GPU 0 alone — WORKS
export ZE_AFFINITY_MASK=0
python3 -c "import torch; print(torch.zeros(1).xpu())"
# tensor([0.], device='xpu:0')  ✓

# GPU 1 alone — WORKS  
export ZE_AFFINITY_MASK=1
python3 -c "import torch; print(torch.zeros(1).xpu())"
# tensor([0.], device='xpu:0')  ✓

# Both GPUs visible — FAILS on whichever device is allocated first
unset ZE_AFFINITY_MASK
python3 -c "import torch; torch.xpu.set_device(1); print(torch.zeros(1).xpu(1))"
# OutOfMemoryError on GPU 1  ✗

P2P access reports True between both devices:

import torch
print(torch.xpu.can_device_access_peer(0, 1))  # True
print(torch.xpu.can_device_access_peer(1, 0))  # True

ZE_DEBUG=1 produces no errors — the failure is happening inside PyTorch/IPEX/UMF above the Level Zero layer.
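
One way to reproduce that check (assuming ZE_DEBUG is the debug variable recognized by the SYCL Level Zero plugin in this build):

# Sketch: run the repro under Level Zero plugin debug output; nothing is logged before the throw
unset ZE_AFFINITY_MASK
ZE_DEBUG=1 python3 -c "import torch; torch.zeros(1).xpu(0)" 2>&1 | tail -n 20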

None of the following workarounds fixes it with 2 GPUs visible (a quick sweep to re-verify them is sketched after the list):

  • SYCL_UR_USE_LEVEL_ZERO_V2=0
  • SYCL_UR_USE_LEVEL_ZERO_V2=1
  • SYCL_ENABLE_DEFAULT_CONTEXTS=0
  • ZE_FLAT_DEVICE_HIERARCHY=FLAT
  • ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
  • PYTORCH_XPU_ALLOC_CONF=backend:native
  • UMF_PROXY=0
  • ZE_ENABLE_PCI_ID_DEVICE_ORDER=1
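
A minimal sweep along these lines re-tests each variable against the one-line repro; the variable names are taken verbatim from the list above:

# Sketch: try each candidate workaround against the minimal repro
unset ZE_AFFINITY_MASK
for v in SYCL_UR_USE_LEVEL_ZERO_V2=0 SYCL_UR_USE_LEVEL_ZERO_V2=1 \
         SYCL_ENABLE_DEFAULT_CONTEXTS=0 ZE_FLAT_DEVICE_HIERARCHY=FLAT \
         ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE PYTORCH_XPU_ALLOC_CONF=backend:native \
         UMF_PROXY=0 ZE_ENABLE_PCI_ID_DEVICE_ORDER=1; do
  env "$v" python3 -c "import torch; torch.zeros(1).xpu(0)" >/dev/null 2>&1 \
    && echo "$v: allocation OK" || echo "$v: OutOfMemoryError"
done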

Impact

  • vllm serve -tp 1 fails (the EngineCore subprocess spawned by the V1 engine can't allocate)
  • vllm serve -tp 2 fails the same way
  • VLLM_USE_V1=0 does not help; the allocator failure occurs before vLLM is involved
  • The only workaround is ZE_AFFINITY_MASK=0 or =1 for single-GPU inference (see the sketch below)
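
The single-GPU workaround, as a sketch ($MODEL is a placeholder I'm introducing, not a value from the report):

# Sketch: pin one device so the process only ever sees a single B580
export ZE_AFFINITY_MASK=0        # or =1 for the second card
vllm serve "$MODEL" -tp 1        # $MODEL is a placeholder for any model id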

Working Reference

intel/llm-scaler-vllm:1.3 (PyTorch 2.9, vLLM 0.11.1) works correctly with both GPUs visible and -tp 2.
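
A minimal way to cross-check against the known-good image, assuming standard /dev/dri passthrough (docker flags are illustrative, and the image's entrypoint may need overriding):

# Sketch: verify the 1.3 image allocates fine with both GPUs visible
docker run --rm --device /dev/dri intel/llm-scaler-vllm:1.3 \
  python3 -c "import torch; print(torch.zeros(1).xpu(0))"
# expected: tensor([0.], device='xpu:0')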
