server: add strict prompt cache RAM limit #25070
Open
tarruda wants to merge 1 commit into
Overview
Implement the `--cache-ram-strict` option, which makes `--cache-ram` a hard cache limit.

When this option is enabled, prompt states that wouldn't fit within the specified limit are skipped. If a state does fit, older entries are evicted before the new entry is allocated.
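To make the intended behavior concrete, here is a minimal Python sketch of the policy described above. It is a toy model, not the code in this PR: the class name `StrictPromptCache`, the method `try_save`, and the byte-based accounting are all assumptions made for illustration.

```python
from collections import OrderedDict


class StrictPromptCache:
    """Toy model of a hard-limited prompt cache (hypothetical names)."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit_bytes = limit_bytes
        self.entries = OrderedDict()  # key -> size in bytes, oldest first
        self.used_bytes = 0

    def try_save(self, key: str, size_bytes: int) -> bool:
        # A state that can never fit is skipped; nothing is evicted for it.
        if size_bytes > self.limit_bytes:
            return False
        # Replacing an existing key: release its old size first.
        if key in self.entries:
            self.used_bytes -= self.entries.pop(key)
        # Evict the oldest entries BEFORE storing the new one, so the
        # total never exceeds the limit, not even temporarily.
        while self.used_bytes + size_bytes > self.limit_bytes:
            _, evicted_size = self.entries.popitem(last=False)
            self.used_bytes -= evicted_size
        self.entries[key] = size_bytes
        self.used_bytes += size_bytes
        return True
```

In this model, saving a 50-byte state into a 100-byte cache that already holds a 60-byte state evicts the older state first, while a 120-byte state is rejected and leaves the cache untouched.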
Additional information
The `--cache-ram` option works as a soft limit on cache RAM. When enabled (`--cache-ram > 0`), it always keeps one entry, even if that entry exceeds the value passed to `--cache-ram`. Another problem is that it creates the new entry before evicting old ones, which can temporarily cause a large increase in memory use.

When the user has a significant amount of free memory, the current behavior is fine. For users like me who run models that use most of the device's RAM, it would be useful to have more control over the maximum amount of RAM used and to prevent unnecessary swapping, which can wear out SSDs.
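The two problems with the soft limit can be illustrated with a small Python model. This is only a sketch of the behavior as described here, not the server's actual code; the function name `soft_limit_save` and the unit-less sizes are assumptions.

```python
def soft_limit_save(entries, new_size, limit):
    """Toy model of the soft limit: allocate first, evict afterwards.

    entries: list of entry sizes, oldest first (mutated in place).
    Returns the peak total size observed during the save.
    """
    entries.append(new_size)   # the new entry is created first
    peak = sum(entries)        # old and new entries coexist at this point
    # Evict oldest entries, but always keep at least one entry.
    while sum(entries) > limit and len(entries) > 1:
        entries.pop(0)
    return peak


entries = [60, 30]
peak = soft_limit_save(entries, 120, limit=100)
print(peak, entries)  # prints "210 [120]"
```

With a limit of 100, the peak reaches 210 while the old and new entries coexist, and the single entry that remains (120) still exceeds the limit.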
Requirements