Environment
- GPU: AMD Radeon AI Pro R9700, gfx1201 (RDNA4), 32 GB VRAM
- ROCm: 7.2.3
- Build: b9595-1bfbdb134 (dynamically linked, GGML_HIP_GRAPHS=ON, GGML_CUDA_GRAPHS=ON)
- OS: Linux (systemd service, running as root)
- Mode: llama-server with -np 3 (3 parallel slots), --flash-attn on, no KV quantization
What happens
When 3 parallel inference sessions run simultaneously over a long task (e.g. parallel agent reviewers), system RAM grows continuously and does not return to baseline after work completes:
| State | System RAM (llama-server RSS) | VRAM |
|---|---|---|
| Model loaded, idle | ~8 GB | 30.5 GB |
| 3 slots active, mid-task | ~13.7 GB | 33.0 GB |
| Work done, slots idle | ~12 GB | 33.0 GB |
The large anonymous block (observed via pmap -x) grows from ~5 GB to ~7 GB during a run and stays elevated after requests complete.
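For anyone who wants to log the RSS figures above over the course of a run rather than sampling pmap by hand, here is a minimal sketch that reads the VmRSS line from /proc/<pid>/status on Linux. The helper names parse_vmrss_kib and read_rss_kib are made up for illustration and are not part of llama.cpp.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Parse the VmRSS value (in kB, as reported by the kernel) from text in
// /proc/<pid>/status format. Returns -1 if no usable VmRSS line is found.
long parse_vmrss_kib(std::istream & in) {
    const std::string key = "VmRSS:";
    std::string line;
    while (std::getline(in, line)) {
        // Match lines that start with "VmRSS:".
        if (line.compare(0, key.size(), key) == 0) {
            std::istringstream fields(line.substr(key.size()));
            long kib = -1;
            if (fields >> kib) {
                return kib;
            }
            return -1;
        }
    }
    return -1;
}

// Hypothetical convenience wrapper: read the RSS of a running process by PID.
long read_rss_kib(int pid) {
    assert(pid > 0);
    std::ifstream status("/proc/" + std::to_string(pid) + "/status");
    if (!status) {
        return -1; // process gone or /proc not available
    }
    return parse_vmrss_kib(status);
}
```

Calling read_rss_kib with the llama-server PID at idle, mid-task, and after the slots go idle should give numbers comparable to the RSS column in the table.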
Root cause — traced to common.cuh
The HIP graph cache in ggml_backend_cuda_context is an unbounded unordered_map:
// common.cuh:1402
// Map from first_node_ptr to cuda_graph - allows multiple graphs per context
std::unordered_map<const void *, std::unique_ptr<ggml_cuda_graph>> cuda_graphs;
With -np 3, each slot produces compute graphs of different shapes (different sequence lengths as conversations grow). Each unique shape maps to a new key → a new hipGraph_t + hipGraphExec_t entry. These entries are never evicted during normal operation, causing the map to grow for the lifetime of the process.
The last_used_time field exists on ggml_cuda_graph but no eviction/cleanup path appears to use it to trim the map.
Workaround
Setting GGML_CUDA_DISABLE_GRAPHS=1 (checked at common.cuh:1234) disables HIP graph capture entirely and stops the leak:
GGML_CUDA_DISABLE_GRAPHS=1 llama-server -m model.gguf -np 3 ...
Confirmed working: RAM stays stable across long parallel runs with this set.
Expected fix
Either:
- Cap cuda_graphs map size and evict LRU entries using the existing last_used_time field, or
- Clear the map between requests / when a slot becomes idle
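To make the two options concrete, here is a rough standalone sketch of a size-capped map with LRU eviction plus a clear() hook. It is not a patch against common.cuh: graph_entry, bounded_graph_cache, and max_entries are hypothetical names, a logical tick counter stands in for whatever type last_used_time really has, and it assumes that destroying an entry releases the underlying graph handles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>

// Hypothetical stand-in for ggml_cuda_graph; only the timestamp matters here.
struct graph_entry {
    std::uint64_t last_used_time = 0; // logical tick (assumption, not the real field type)
};

// Hypothetical bounded cache, keyed like cuda_graphs (by first node pointer).
class bounded_graph_cache {
public:
    explicit bounded_graph_cache(std::size_t max_entries) : max_entries_(max_entries) {
        assert(max_entries > 0);
    }

    // Return the entry for key, creating it if needed. When the cache is
    // full, the least recently used entry is evicted first.
    graph_entry * get_or_create(const void * key) {
        auto it = graphs_.find(key);
        if (it == graphs_.end()) {
            if (graphs_.size() >= max_entries_) {
                evict_lru();
            }
            it = graphs_.emplace(key, std::make_unique<graph_entry>()).first;
        }
        it->second->last_used_time = ++tick_; // mark as most recently used
        return it->second.get();
    }

    // Option 2 from the list: drop everything, e.g. when a slot goes idle.
    void clear() { graphs_.clear(); }

    std::size_t size() const { return graphs_.size(); }
    bool contains(const void * key) const { return graphs_.count(key) != 0; }

private:
    // Linear scan for the oldest entry; acceptable for a small cap.
    void evict_lru() {
        auto oldest = graphs_.begin();
        for (auto it = graphs_.begin(); it != graphs_.end(); ++it) {
            if (it->second->last_used_time < oldest->second->last_used_time) {
                oldest = it;
            }
        }
        if (oldest != graphs_.end()) {
            graphs_.erase(oldest); // unique_ptr destroys the entry
        }
    }

    std::size_t   max_entries_;
    std::uint64_t tick_ = 0;
    std::unordered_map<const void *, std::unique_ptr<graph_entry>> graphs_;
};
```

With a cap of 2, touching keys A, B, A and then inserting C evicts B, since A was used more recently. A sensible cap for real workloads would need measuring; too small a value would presumably force frequent re-capture and lose the benefit of graphs.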
Related issues