
[Feature] Control over Prefix Cache Capacity and Guaranteed Caching #2942

Open
ZhuQian0909 opened this issue Dec 23, 2024 · 1 comment
Motivation

Issue Description:

We're exploring prefix caching to optimize inference performance in our application. Our use case involves only three distinct system prompts; each incoming request uses one of these three prompts.

Currently, the prefix caching feature is controlled by the enable_prefix_caching parameter, which is a simple on/off switch. The documentation states that it caches and reuses k/v blocks of identical prompt prefixes, improving performance by avoiding redundant computations.

Our specific question is:

Is it possible to control the number of prefixes that are cached?

Specifically, we want to ensure that all three of our distinct system prompts are always cached, regardless of the order in which requests are received. Given that we only have three system prompts, we want the cache to be large enough to accommodate all of them. We would prefer not to risk one of the prompts not being cached due to cache replacement or capacity limitations.

We understand that k/v blocks are the smallest unit of reuse, and that there is no performance gain if the shared prefix is shorter than one block. We are also aware that the current implementation only enables or disables the feature.
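The block-granularity point above can be illustrated with a small sketch (this is an illustration only, not vLLM's actual implementation; `reusable_blocks` and `BLOCK_SIZE` are hypothetical names):

```python
# Hypothetical sketch of block-granularity prefix matching: the token
# stream is split into fixed-size blocks, and only whole blocks whose
# contents match a cached block can be reused. A shared prefix shorter
# than one block matches zero blocks, so it yields no reuse.

BLOCK_SIZE = 16  # tokens per k/v block (illustrative value)

def reusable_blocks(prompt_tokens, cache):
    """Count how many leading blocks of the prompt hit the cache."""
    hits = 0
    for start in range(0, len(prompt_tokens), BLOCK_SIZE):
        block = tuple(prompt_tokens[start:start + BLOCK_SIZE])
        if len(block) < BLOCK_SIZE or block not in cache:
            break  # a partial or uncached block ends the reusable prefix
        hits += 1
    return hits

# Example: the cache holds the first two blocks of a 40-token system prompt.
system_prompt = list(range(40))  # 2 full blocks + 1 partial block
cache = {tuple(system_prompt[0:16]), tuple(system_prompt[16:32])}

print(reusable_blocks(system_prompt, cache))    # 2 full blocks reused
print(reusable_blocks(list(range(10)), cache))  # prefix < one block: 0
```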

Feature Request:

We propose exploring options to:

Configure the maximum number of cached prefixes. This would allow users to ensure that a specific number of commonly used prompts are always available in the cache.

Possibly implement a pinning mechanism that ensures specific prefixes are always kept in the cache, preventing them from being evicted due to capacity limitations.

Consider a configuration to cache all encountered prefixes up to a given limit or until the cache is full. This would be a more dynamic approach than needing to configure which specific prompts to keep.

This level of control would be highly beneficial in scenarios like ours, where a small set of distinct prompts is used very frequently; it would make prefix-cache behavior more predictable and closer to optimal.
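The pinning mechanism proposed above could be sketched roughly as follows. This is a minimal illustration under our own assumptions, not vLLM's actual cache design; the class `PinnedLRUBlockCache` and its methods are hypothetical names:

```python
from collections import OrderedDict

class PinnedLRUBlockCache:
    """Hypothetical LRU cache of k/v blocks with a pin mechanism.

    Pinned keys are never evicted; unpinned keys are evicted in
    least-recently-used order once capacity is exceeded.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # key -> k/v block payload
        self.pinned = set()

    def pin(self, key):
        self.pinned.add(key)

    def put(self, key, kv_block):
        self.blocks[key] = kv_block
        self.blocks.move_to_end(key)
        while len(self.blocks) > self.capacity:
            # Evict the least-recently-used *unpinned* entry.
            victim = next((k for k in self.blocks if k not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; tolerate temporary overflow
            del self.blocks[victim]

    def get(self, key):
        if key in self.blocks:
            self.blocks.move_to_end(key)
            return self.blocks[key]
        return None

# Usage: pin the blocks of the three system prompts so they survive churn.
cache = PinnedLRUBlockCache(capacity=4)
for name in ("sys_a", "sys_b", "sys_c"):
    cache.pin(name)
    cache.put(name, kv_block=f"kv({name})")
cache.put("user_1", kv_block="kv(user_1)")
cache.put("user_2", kv_block="kv(user_2)")  # evicts user_1, never a pinned prompt
print(sorted(cache.blocks))  # ['sys_a', 'sys_b', 'sys_c', 'user_2']
```

Capping the number of pinnable entries (or reserving a fixed share of blocks for them) would keep pinning from starving regular requests of cache space.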


Contributor

akai-shuuichi commented Dec 24, 2024

This request is similar to VLLM/8333. I have also been looking at related implementations in LMcache recently; I think offloading KV blocks to other storage media is also a good approach.
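The offloading idea can be sketched as a two-tier store: on eviction from the fast tier, a block is demoted to a slower medium rather than discarded, and promoted back on a hit. This is an illustration only, not LMcache's actual design; `TieredKVStore` and its tiers are hypothetical:

```python
class TieredKVStore:
    """Hypothetical two-tier KV block store: demote on eviction, promote on hit."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = {}  # fast tier (stand-in for GPU memory)
        self.cpu = {}  # slow tier (stand-in for CPU RAM or disk)

    def put(self, key, block):
        if len(self.gpu) >= self.gpu_capacity:
            # Demote the oldest fast-tier block instead of dropping it.
            victim, victim_block = next(iter(self.gpu.items()))
            del self.gpu[victim]
            self.cpu[victim] = victim_block
        self.gpu[key] = block

    def get(self, key):
        if key in self.gpu:
            return self.gpu[key]
        if key in self.cpu:
            # Slow-tier hit: promote the block back to the fast tier.
            self.put(key, self.cpu.pop(key))
            return self.gpu[key]
        return None  # miss in both tiers: must be recomputed
```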
