server: add strict prompt cache RAM limit#25070

Open
tarruda wants to merge 1 commit into ggml-org:master from tarruda:cache-ram-strict

@tarruda tarruda commented Jun 27, 2026

Overview

Implement --cache-ram-strict option, which makes --cache-ram a hard cache limit.

When this option is enabled, a prompt state that would not fit within the specified limit is skipped. If the state fits, older entries are evicted before the new entry is allocated.
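
To illustrate the intended strict behavior, here is a minimal sketch. It is not the code from this PR: `Entry`, `StrictCache`, and `try_store` are hypothetical names, sizes are plain byte counts, and eviction order is assumed to be oldest-first.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical stand-in for a cached prompt state; only its size matters here.
struct Entry {
    std::string key;
    std::size_t bytes;
};

// Hypothetical strict cache: `used` never exceeds `limit`, not even briefly.
struct StrictCache {
    std::size_t limit;          // hard limit in bytes
    std::size_t used = 0;       // bytes currently held
    std::deque<Entry> entries;  // front = oldest entry

    explicit StrictCache(std::size_t limit_bytes) : limit(limit_bytes) {}

    // Returns true if the state was stored, false if it was skipped.
    bool try_store(const std::string & key, std::size_t bytes) {
        if (bytes > limit) {
            // The state can never fit: skip it and keep existing entries.
            return false;
        }
        // Evict oldest entries BEFORE allocating, so there is no temporary spike.
        while (used + bytes > limit) {
            used -= entries.front().bytes;
            entries.pop_front();
        }
        entries.push_back({key, bytes});
        used += bytes;
        assert(used <= limit);  // invariant of the strict policy
        return true;
    }
};
```

With a limit of 100 bytes, storing states of 60, 30, and then 50 bytes evicts the 60-byte entry before the 50-byte one is added, while a 101-byte state is skipped and leaves the cache untouched.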

Additional information

By default, the --cache-ram option acts as a soft limit on cache RAM. When enabled (--cache-ram > 0), it always keeps one entry, even if that entry exceeds the value passed to --cache-ram. Another problem is that it creates the new entry before evicting old ones, which can temporarily cause a large increase in memory use.
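
The two effects described above can be modeled with a small sketch. This is an illustration of the described behavior only, not the server's actual implementation: `SoftCache`, `store`, and `peak` are hypothetical names, and sizes are plain byte counts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical model of a soft limit: allocate first, evict afterwards,
// and never drop the last remaining entry.
struct SoftCache {
    std::size_t limit;              // soft limit in bytes
    std::size_t used = 0;           // bytes currently held
    std::size_t peak = 0;           // highest value `used` ever reached
    std::deque<std::size_t> sizes;  // entry sizes, front = oldest

    explicit SoftCache(std::size_t limit_bytes) : limit(limit_bytes) {}

    void store(std::size_t bytes) {
        // The new entry is created before anything is evicted.
        sizes.push_back(bytes);
        used += bytes;
        peak = std::max(peak, used);
        // Evict oldest entries, but always keep at least one.
        while (used > limit && sizes.size() > 1) {
            used -= sizes.front();
            sizes.pop_front();
        }
        assert(!sizes.empty());
    }
};
```

With a limit of 100 bytes, storing 60, 30, and 50 bytes briefly raises usage to 140 before eviction brings it back to 80. Storing a 150-byte state afterwards peaks at 230 and leaves a single 150-byte entry that stays above the limit.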

When the user has a significant amount of free memory, the current behavior is fine. For users like me who run models that use most of the device's RAM, it would be useful to have more control over the maximum amount of RAM used and to prevent unnecessary swapping, which can wear SSDs.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Minor codex assistance to get familiar with the relevant parts of the code.

@tarruda tarruda requested review from a team as code owners June 27, 2026 02:13
@tarruda tarruda force-pushed the cache-ram-strict branch from 7702d13 to 4782510 Compare June 28, 2026 10:15