## Environment

| Component | Version |
| --- | --- |
| Docker image | intel/llm-scaler-vllm:latest (1.4 / 0.14.0-b8) |
| vLLM | 0.14.1.dev0+gb17039bcc.d20260227 |
| PyTorch | 2.10.0+xpu |
| IPEX | 2.10.10.post1+xpu |
| oneAPI | 2025.3.2 (hotfix) |
| Level Zero | 1.26.2 |
| OS (host) | Ubuntu 25.04 |
| Kernel | 6.14.0-37-generic (xe driver) |
| GPUs | 2x Intel Arc B580 (BMG-G21, device ID 0xe20b) |
| PCI addresses | 0000:00:07.0, 0000:00:08.0 |
| DRM devices | /dev/dri/card0, /dev/dri/card2, /dev/dri/renderD128, /dev/dri/renderD129 |
## Description

When running `intel/llm-scaler-vllm:latest` (1.4) with 2x Arc B580 GPUs, any attempt to allocate a tensor on XPU fails with a false `OutOfMemoryError`, even though the GPU reports 11.33 GiB free. The error occurs on the first `.xpu()` call, after `torch.xpu.device_count()` correctly returns 2.

This is a regression from `intel/llm-scaler-vllm:1.3` (PyTorch 2.9), which works correctly.
## Minimal Reproduction

```bash
# With both GPUs visible (device_count = 2) — FAILS
source /opt/intel/oneapi/setvars.sh --force
unset ZE_AFFINITY_MASK
python3 -c "
import torch
print('devices:', torch.xpu.device_count())  # prints 2
x = torch.zeros(1).xpu(0)                    # OutOfMemoryError
print('GPU0:', x)
"
```
Output:
```
devices: 2
torch.OutOfMemoryError: XPU out of memory. Tried to allocate 2.00 MiB.
GPU 0 has a total capacity of 11.33 GiB of which 11.33 GiB is free.
Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes
is reserved by PyTorch but unallocated.
```
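The free-memory figure in the error message can be cross-checked immediately before the failing call. A minimal sketch, assuming `torch.xpu.mem_get_info` is available in this PyTorch build and returns `(free, total)` in bytes:

```bash
# Report free/total memory on each visible XPU, then attempt the allocation that fails
python3 -c "
import torch
for i in range(torch.xpu.device_count()):
    free, total = torch.xpu.mem_get_info(i)  # (free, total) in bytes
    print(f'xpu:{i} free={free/2**30:.2f} GiB total={total/2**30:.2f} GiB')
x = torch.zeros(1).xpu(0)                     # still raises OutOfMemoryError
"
```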
## Key Observations
Single GPU works fine — the error only occurs when both GPUs are visible:
```bash
# GPU 0 alone — WORKS
export ZE_AFFINITY_MASK=0
python3 -c "import torch; print(torch.zeros(1).xpu())"
# tensor([0.], device='xpu:0') ✓

# GPU 1 alone — WORKS
export ZE_AFFINITY_MASK=1
python3 -c "import torch; print(torch.zeros(1).xpu())"
# tensor([0.], device='xpu:0') ✓

# Both GPUs visible — FAILS on whichever device is allocated first
unset ZE_AFFINITY_MASK
python3 -c "import torch; torch.xpu.set_device(1); print(torch.zeros(1).xpu(1))"
# OutOfMemoryError on GPU 1 ✗
```
P2P access reports `True` between both devices:

```python
torch.xpu.can_device_access_peer(0, 1)  # True
torch.xpu.can_device_access_peer(1, 0)  # True
```
`ZE_DEBUG=1` produces no errors — the failure is happening inside PyTorch/IPEX/UMF, above the Level Zero layer.
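For reference, the Level Zero debug check was of roughly this form; the exact invocation is an assumption, the Python one-liner is the same repro as above:

```bash
# Same failing allocation with Level Zero debug output enabled: no Level Zero errors are printed
unset ZE_AFFINITY_MASK
ZE_DEBUG=1 python3 -c "import torch; print(torch.zeros(1).xpu(0))"
```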
None of the following workarounds fix it with 2 GPUs visible (a retest sweep is sketched after this list):

- `SYCL_UR_USE_LEVEL_ZERO_V2=0`
- `SYCL_UR_USE_LEVEL_ZERO_V2=1`
- `SYCL_ENABLE_DEFAULT_CONTEXTS=0`
- `ZE_FLAT_DEVICE_HIERARCHY=FLAT`
- `ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE`
- `PYTORCH_XPU_ALLOC_CONF=backend:native`
- `UMF_PROXY=0`
- `ZE_ENABLE_PCI_ID_DEVICE_ORDER=1`
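A sketch of how such a sweep can be scripted against the one-line repro (illustrative only, not the exact commands that were run):

```bash
# Run the minimal repro once per candidate workaround; all of them still fail with 2 GPUs visible
unset ZE_AFFINITY_MASK
for opt in \
    SYCL_UR_USE_LEVEL_ZERO_V2=0 \
    SYCL_UR_USE_LEVEL_ZERO_V2=1 \
    SYCL_ENABLE_DEFAULT_CONTEXTS=0 \
    ZE_FLAT_DEVICE_HIERARCHY=FLAT \
    ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE \
    PYTORCH_XPU_ALLOC_CONF=backend:native \
    UMF_PROXY=0 \
    ZE_ENABLE_PCI_ID_DEVICE_ORDER=1; do
  echo "=== $opt ==="
  env "$opt" python3 -c "import torch; print(torch.zeros(1).xpu(0))"
done
```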
## Impact

- `vllm serve -tp 1` fails (EngineCore subprocess spawned by the V1 engine can't allocate); see the invocation sketch after this list
- `vllm serve -tp 2` fails (same)
- `VLLM_USE_V1=0` does not help — the allocator failure is pre-vLLM
- The only workaround is `ZE_AFFINITY_MASK=0` or `=1` for single-GPU inference
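For reference, the failing invocations have this shape; the model argument is a placeholder, not the exact model that was used:

```bash
# Both fail with the same XPU OutOfMemoryError inside the EngineCore subprocess
vllm serve <model> -tp 1
vllm serve <model> -tp 2

# Current workaround: pin one GPU before starting the server
ZE_AFFINITY_MASK=0 vllm serve <model> -tp 1
```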
## Working Reference

`intel/llm-scaler-vllm:1.3` (PyTorch 2.9, vLLM 0.11.1) works correctly with both GPUs visible and `-tp 2`.
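A quick way to confirm the regression is to run the same one-liner in both images. The `docker run` flags below are a typical way to expose the GPUs and are an assumption, not the exact command line used:

```bash
# 1.4 image (PyTorch 2.10): fails with both B580s visible
docker run --rm --device /dev/dri intel/llm-scaler-vllm:latest \
  python3 -c "import torch; print(torch.zeros(1).xpu(0))"

# 1.3 image (PyTorch 2.9): the same command works
docker run --rm --device /dev/dri intel/llm-scaler-vllm:1.3 \
  python3 -c "import torch; print(torch.zeros(1).xpu(0))"
```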