HIP/ROCm: system RAM grows unbounded with parallel slots due to CUDA graph cache never being evicted #25082

Description

@lukascechovic

Environment

  • GPU: AMD Radeon AI Pro R9700, gfx1201 (RDNA4), 32 GB VRAM
  • ROCm: 7.2.3
  • Build: b9595-1bfbdb134 (dynamically linked, GGML_HIP_GRAPHS=ON, GGML_CUDA_GRAPHS=ON)
  • OS: Linux (systemd service, running as root)
  • Mode: llama-server with -np 3 (3 parallel slots), --flash-attn on, no KV quantization

What happens

When 3 parallel inference sessions run simultaneously over a long task (e.g. parallel agent reviewers), system RAM grows continuously and does not return to baseline after work completes:

| State | System RAM (llama-server RSS) | VRAM |
|---|---|---|
| Model loaded, idle | ~8 GB | 30.5 GB |
| 3 slots active, mid-task | ~13.7 GB | 33.0 GB |
| Work done, slots idle | ~12 GB | 33.0 GB |

The large anonymous block (observed via pmap -x) grows from ~5 GB to ~7 GB during a run and stays elevated after requests complete.
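For anyone trying to reproduce the numbers above without pmap, the resident set size can also be sampled from /proc/<pid>/status on Linux. The following is a minimal sketch of that measurement, not the method used for the table; the helper names parse_vm_rss_kb and read_vm_rss_kb are hypothetical.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Returns the VmRSS value in kB from a stream in /proc/<pid>/status format,
// or -1 if no parsable VmRSS line is present.
long parse_vm_rss_kb(std::istream & in) {
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind("VmRSS:", 0) == 0) {          // line starts with "VmRSS:"
            std::istringstream fields(line.substr(6));
            long kb = -1;
            fields >> kb;                             // skips leading whitespace
            return fields ? kb : -1;
        }
    }
    return -1;
}

// Convenience wrapper (Linux only): reads /proc/<pid>/status for a given pid.
long read_vm_rss_kb(int pid) {
    assert(pid > 0);
    std::ifstream status("/proc/" + std::to_string(pid) + "/status");
    return status ? parse_vm_rss_kb(status) : -1;
}
```

Calling read_vm_rss_kb with the llama-server pid before, during, and after a parallel run should show the same pattern as the table if the growth reproduces.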

Root cause — traced to common.cuh

The HIP graph cache in ggml_backend_cuda_context is an unbounded unordered_map:

// common.cuh:1402
// Map from first_node_ptr to cuda_graph - allows multiple graphs per context
std::unordered_map<const void *, std::unique_ptr<ggml_cuda_graph>> cuda_graphs;

With -np 3, each slot produces compute graphs of different shapes (sequence lengths differ as conversations grow). The map is keyed by first-node pointer, so each graph whose first node has not been seen before gets a new key and with it a new hipGraph_t + hipGraphExec_t entry. These entries are never evicted during normal operation, so the map grows for the lifetime of the process.

The last_used_time field exists on ggml_cuda_graph but no eviction/cleanup path appears to use it to trim the map.
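The growth pattern described above can be illustrated in isolation. This is a simplified model, not the actual backend code: fake_graph and fake_context are hypothetical stand-ins for ggml_cuda_graph and ggml_backend_cuda_context, and the lookup logic is an assumption about how an insert-on-miss map without eviction behaves.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <memory>
#include <unordered_map>
#include <vector>

// Stand-in for ggml_cuda_graph. The real struct owns hipGraph_t and
// hipGraphExec_t handles, which is where the host memory would go.
struct fake_graph {
    std::chrono::steady_clock::time_point last_used_time;
};

// Stand-in for the relevant part of ggml_backend_cuda_context.
struct fake_context {
    std::unordered_map<const void *, std::unique_ptr<fake_graph>> cuda_graphs;

    // Look up by first-node pointer, inserting on a miss.
    // last_used_time is updated but nothing ever reads it to erase entries.
    fake_graph * get_graph(const void * first_node_ptr) {
        auto & entry = cuda_graphs[first_node_ptr];
        if (!entry) {
            entry = std::make_unique<fake_graph>();
        }
        entry->last_used_time = std::chrono::steady_clock::now();
        return entry.get();
    }
};

// Simulates n_graphs compute graphs with distinct first nodes and
// returns how many cache entries exist afterwards.
std::size_t simulate_distinct_graphs(std::size_t n_graphs) {
    fake_context ctx;
    std::vector<int> nodes(n_graphs);   // each element's address is a distinct key
    for (auto & node : nodes) {
        ctx.get_graph(&node);
    }
    return ctx.cuda_graphs.size();
}
```

In this model the entry count equals the number of distinct keys ever seen, which matches the behaviour reported here: memory tracks the history of the process rather than its current working set.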

Workaround

Setting GGML_CUDA_DISABLE_GRAPHS=1 (checked at common.cuh:1234) disables HIP graph capture entirely and stops the leak:

GGML_CUDA_DISABLE_GRAPHS=1 llama-server -m model.gguf -np 3 ...

Confirmed working: RAM stays stable across long parallel runs with this set.

Expected fix

Either:

  1. Cap cuda_graphs map size and evict LRU entries using the existing last_used_time field, or
  2. Clear the map between requests / when a slot becomes idle
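Option 1 could look roughly like the sketch below. It is an illustration under stated assumptions, not a patch: bounded_graph_cache and fake_graph are hypothetical names, last_used_time is modelled as a logical tick rather than whatever type the real field uses, and the cap is an arbitrary constructor argument. Eviction is a linear scan for the oldest entry, which is cheap as long as the cap is small.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>

// Stand-in for ggml_cuda_graph; the tick-based timestamp is an assumption.
struct fake_graph {
    std::uint64_t last_used_time = 0;
};

// Hypothetical capped replacement for the cuda_graphs map.
class bounded_graph_cache {
public:
    explicit bounded_graph_cache(std::size_t max_entries) : max_entries_(max_entries) {
        assert(max_entries > 0);
    }

    // Returns the graph for this key, creating it on a miss.
    // On a miss with a full cache, the least recently used entry is evicted first.
    fake_graph * get(const void * first_node_ptr) {
        auto it = graphs_.find(first_node_ptr);
        if (it == graphs_.end()) {
            if (graphs_.size() >= max_entries_) {
                evict_lru();
            }
            it = graphs_.emplace(first_node_ptr, std::make_unique<fake_graph>()).first;
        }
        it->second->last_used_time = ++tick_;
        return it->second.get();
    }

    bool contains(const void * key) const { return graphs_.count(key) != 0; }
    std::size_t size() const { return graphs_.size(); }

private:
    // Erases the entry with the smallest last_used_time (O(n) scan).
    void evict_lru() {
        auto victim = std::min_element(graphs_.begin(), graphs_.end(),
            [](const auto & a, const auto & b) {
                return a.second->last_used_time < b.second->last_used_time;
            });
        graphs_.erase(victim);
    }

    std::size_t max_entries_;
    std::uint64_t tick_ = 0;
    std::unordered_map<const void *, std::unique_ptr<fake_graph>> graphs_;
};
```

In a real backend the evicted entry would also have to release its graph and exec handles, and eviction would need to avoid a graph that is still in flight on a stream; both points are outside what this sketch models.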
