We're exploring the use of prefix caching to optimize inference performance in our application. Our use case involves only three distinct system prompts, and each incoming request begins with one of them.
Currently, prefix caching is controlled by the enable_prefix_caching parameter, which is a simple on/off switch. The documentation states that it caches and reuses the KV blocks of identical prompt prefixes, improving performance by avoiding redundant computation.
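For reference, this is how we enable the feature today; a minimal sketch using the offline LLM entry point (the model name and prompt texts are placeholders):

```python
from vllm import LLM, SamplingParams

# Prefix caching today is a single on/off switch on the engine.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_prefix_caching=True,
)

# Our three distinct system prompts (shortened here for illustration).
SYSTEM_PROMPTS = [
    "You are a helpful general-purpose assistant. ...",
    "You are a careful code reviewer. ...",
    "You answer strictly in JSON. ...",
]

# Every request begins with one of the three prefixes; identical
# prefixes should hit the same cached KV blocks on repeat requests.
params = SamplingParams(max_tokens=64)
outputs = llm.generate([SYSTEM_PROMPTS[0] + "\n\nUser: Hello!"], params)
```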
Our specific question is:
Is it possible to control the number of prefixes that are cached?
Specifically, we want to ensure that all three of our distinct system prompts are always cached, regardless of the order in which requests are received. Given that we only have three system prompts, we want the cache to be large enough to accommodate all of them. We would prefer not to risk one of the prompts not being cached due to cache replacement or capacity limitations.
We understand that KV blocks are the smallest unit of reuse, so there is no performance gain if a prefix is shorter than one block. We are also aware that the current implementation can only enable or disable the feature as a whole.
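As a rough illustration of the granularity point, a small sketch that estimates how many full blocks each prompt occupies (assuming vLLM's default KV block size of 16 tokens; the tokenizer and prompt texts are placeholders):

```python
from transformers import AutoTokenizer

# vLLM's default KV block size is 16 tokens (configurable via --block-size).
BLOCK_SIZE = 16

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

for prompt in [
    "You are a helpful general-purpose assistant. ...",
    "You are a careful code reviewer. ...",
    "You answer strictly in JSON. ...",
]:
    n_tokens = len(tok(prompt).input_ids)
    # Only complete blocks can be reused; a prefix shorter than one
    # block produces no sharing at all, and the trailing partial
    # block is recomputed on every request.
    print(f"{n_tokens} tokens -> {n_tokens // BLOCK_SIZE} reusable block(s)")
```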
Feature Request:
We propose exploring options to:
Configure the maximum number of cached prefixes, so that a known number of commonly used prompts can reliably remain resident in the cache.
Implement a pinning mechanism that keeps specific prefixes in the cache permanently, preventing them from being evicted under capacity pressure.
Cache all encountered prefixes up to a given limit (or until the cache is full). This would be more dynamic than having to configure the specific prompts to keep.
This level of control would be highly beneficial in scenarios like ours, where a small set of distinct prompts is used frequently; it would make prefix caching more predictable and closer to optimal. A hypothetical sketch of such an API follows below.
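To make the options above concrete, here is what such configuration could look like. To be clear, none of these parameters exist in vLLM today; the names are invented purely for illustration:

```python
from vllm import LLM

PROMPT_A = "You are a helpful general-purpose assistant. ..."
PROMPT_B = "You are a careful code reviewer. ..."
PROMPT_C = "You answer strictly in JSON. ..."

# HYPOTHETICAL sketch -- neither max_cached_prefixes nor
# pinned_prefixes is a real vLLM parameter; only
# enable_prefix_caching exists today.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
    # Option 1: cap the number of distinct cached prefixes.
    max_cached_prefixes=3,
    # Option 2: pin specific prefixes so eviction never drops them.
    pinned_prefixes=[PROMPT_A, PROMPT_B, PROMPT_C],
)
```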
This is similar to VLLM/8333, and I have been looking at related implementations in LMCache recently. I think offloading KV blocks to other storage media is also a good approach.