HIP/ROCm: system RAM grows unbounded with parallel slots due to CUDA graph cache never being evicted

## Environment

- **GPU**: AMD Radeon AI Pro R9700, gfx1201 (RDNA4), 32 GB VRAM
- **ROCm**: 7.2.3
- **Build**: b9595-1bfbdb134 (dynamically linked, `GGML_HIP_GRAPHS=ON`, `GGML_CUDA_GRAPHS=ON`)
- **OS**: Linux (systemd service, running as root)
- **Mode**: `llama-server` with `-np 3` (3 parallel slots), `--flash-attn on`, no KV quantization

## What happens

When 3 parallel inference sessions run simultaneously over a long task (e.g. parallel agent reviewers), system RAM grows continuously and does **not** return to baseline after work completes:

| State | System RAM (llama-server RSS) | VRAM |
|---|---|---|
| Model loaded, idle | ~8 GB | 30.5 GB |
| 3 slots active, mid-task | ~13.7 GB | 33.0 GB |
| Work done, slots idle | ~12 GB | 33.0 GB |

The large anonymous block (observed via `pmap -x`) grows from ~5 GB to ~7 GB during a run and **stays elevated** after requests complete.

## Root cause — traced to `common.cuh`

The HIP graph cache in `ggml_backend_cuda_context` is an **unbounded** `unordered_map`:

```cpp
// common.cuh:1402
// Map from first_node_ptr to cuda_graph - allows multiple graphs per context
std::unordered_map<const void *, std::unique_ptr<ggml_cuda_graph>> cuda_graphs;
```

With `-np 3`, each slot produces compute graphs of different shapes, because sequence lengths differ as the conversations grow. Per the comment above, the map is keyed by the graph's first node pointer, so every graph that arrives under a new key adds a new `hipGraph_t` + `hipGraphExec_t` entry. These entries are **never evicted** during normal operation, so the map grows for the lifetime of the process.

The `last_used_time` field exists on `ggml_cuda_graph` but no eviction/cleanup path appears to use it to trim the map.
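To make the growth pattern concrete, here is a minimal toy model of a pointer-keyed cache with lookup-or-create semantics and no eviction. `toy_graph`, `toy_context`, `get_graph`, and `simulate` are hypothetical names for illustration; only the map type mirrors the declaration quoted above, and the assumption that the real lookup behaves like this has not been checked against the llama.cpp source.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for ggml_cuda_graph; the real struct owns HIP graph handles.
struct toy_graph {};

struct toy_context {
    // Same container type as the cuda_graphs member quoted from common.cuh.
    std::unordered_map<const void *, std::unique_ptr<toy_graph>> cuda_graphs;

    // Assumed behaviour: look up by first node pointer, create the entry on a miss.
    toy_graph * get_graph(const void * first_node_ptr) {
        auto & entry = cuda_graphs[first_node_ptr];
        if (!entry) {
            entry = std::make_unique<toy_graph>();
        }
        return entry.get();
    }
};

// Request a graph for every element address in `nodes` and return the cache size.
inline std::size_t simulate(toy_context & ctx, const std::vector<int> & nodes) {
    assert(!nodes.empty());
    for (const int & n : nodes) {
        ctx.get_graph(&n); // each distinct address becomes a new key
    }
    return ctx.cuda_graphs.size();
}
```

In this model the cache size equals the number of distinct keys ever seen. Repeating the same keys does not grow it, but nothing ever shrinks it either.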

## Workaround

Setting `GGML_CUDA_DISABLE_GRAPHS=1` (checked at `common.cuh:1234`) disables HIP graph capture entirely and stops the leak:

```bash
GGML_CUDA_DISABLE_GRAPHS=1 llama-server -m model.gguf -np 3 ...
```

Confirmed working: RAM stays stable across long parallel runs with this set.

## Expected fix

Either:
1. Cap `cuda_graphs` map size and evict LRU entries using the existing `last_used_time` field, or
2. Clear the map between requests / when a slot becomes idle
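Option 1 could look roughly like the sketch below. It is a standalone illustration, not a patch: `toy_graph`, `graph_map`, `evict_lru`, and `touch` are hypothetical names, `last_used_time` is modelled as a plain integer tick because its real type is not shown in this report, and the cap is an arbitrary parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>

// Hypothetical stand-in for ggml_cuda_graph; only the timestamp matters here.
struct toy_graph {
    std::uint64_t last_used_time = 0; // assumed integer tick, real type unknown
};

using graph_map = std::unordered_map<const void *, std::unique_ptr<toy_graph>>;

// Erase least-recently-used entries until at most max_entries remain.
// Returns the number of entries removed.
inline std::size_t evict_lru(graph_map & graphs, std::size_t max_entries) {
    std::size_t evicted = 0;
    while (graphs.size() > max_entries) {
        auto oldest = graphs.begin();
        for (auto it = graphs.begin(); it != graphs.end(); ++it) {
            if (it->second->last_used_time < oldest->second->last_used_time) {
                oldest = it;
            }
        }
        graphs.erase(oldest); // the unique_ptr destructor frees the entry
        ++evicted;
    }
    return evicted;
}

// Look up or create the entry for `key`, making room first so the map
// never exceeds max_entries, then stamp it with the current tick.
inline toy_graph & touch(graph_map & graphs, const void * key,
                         std::uint64_t now, std::size_t max_entries) {
    assert(max_entries > 0);
    auto it = graphs.find(key);
    if (it == graphs.end()) {
        evict_lru(graphs, max_entries - 1);
        it = graphs.emplace(key, std::make_unique<toy_graph>()).first;
    }
    it->second->last_used_time = now;
    return *it->second;
}
```

The linear scan keeps the sketch short and is cheap for a small cap. A real implementation would also need to ensure that an evicted graph is not still in use by an in-flight launch, which this sketch does not address.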

## Related issues

- #21967 — HIP/ROCM: Memory-usage growing until server crashes
- #19979 — Eval bug: Memory leak? using ROCm
- #20315 — RPC server leaks CUDA graphs during inference
