
[Feature] Control over Prefix Cache Capacity and Guaranteed Caching #2942

Open
ZhuQian0909 opened this issue Dec 23, 2024 · 1 comment
Motivation

Issue Description:

We're exploring prefix caching to optimize inference performance in our application. Our use case involves only three distinct system prompts; each incoming request uses one of these three prompts.

Currently, the prefix caching feature is controlled by the enable_prefix_caching parameter, which is a simple on/off switch. The documentation states that it caches and reuses k/v blocks of identical prompt prefixes, improving performance by avoiding redundant computations.

Our specific question is:

Is it possible to control the number of prefixes that are cached?

Specifically, we want to ensure that all three of our distinct system prompts are always cached, regardless of the order in which requests are received. Given that we only have three system prompts, we want the cache to be large enough to accommodate all of them. We would prefer not to risk one of the prompts not being cached due to cache replacement or capacity limitations.

We understand that k/v blocks are the smallest unit of reuse, and that there is no performance gain if the shared prefix is shorter than one block. We are also aware that the current implementation only enables or disables the feature.
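The block-granularity point above can be illustrated with a small sketch (this is an illustration only, not vLLM's actual implementation; `reusable_blocks` and `BLOCK_SIZE` are hypothetical names):

```python
# Hypothetical sketch of block-granularity prefix matching: the token
# stream is split into fixed-size blocks, and only whole blocks whose
# contents match a cached block can be reused. A shared prefix shorter
# than one block matches zero blocks, so it yields no reuse.

BLOCK_SIZE = 16  # tokens per k/v block (illustrative value)

def reusable_blocks(prompt_tokens, cache):
    """Count how many leading blocks of the prompt hit the cache."""
    hits = 0
    for start in range(0, len(prompt_tokens), BLOCK_SIZE):
        block = tuple(prompt_tokens[start:start + BLOCK_SIZE])
        if len(block) < BLOCK_SIZE or block not in cache:
            break  # a partial or uncached block ends the reusable prefix
        hits += 1
    return hits

# Example: the cache holds the first two blocks of a 40-token system prompt.
system_prompt = list(range(40))  # 2 full blocks + 1 partial block
cache = {tuple(system_prompt[0:16]), tuple(system_prompt[16:32])}

print(reusable_blocks(system_prompt, cache))    # 2 full blocks reused
print(reusable_blocks(list(range(10)), cache))  # prefix < one block: 0
```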

Feature Request:

We propose exploring options to:

Configure the maximum number of cached prefixes. This would allow users to ensure that a specific number of commonly used prompts are always available in the cache.

Possibly implement a pinning mechanism that ensures specific prefixes are always kept in the cache, preventing them from being evicted due to capacity limitations.

Consider a configuration to cache all encountered prefixes up to a given limit or until the cache is full. This would be a more dynamic approach than needing to configure which specific prompts to keep.

This level of control would be highly beneficial in scenarios like ours, where a small set of distinct prompts is used very frequently; it would make prefix-cache behavior more predictable and closer to optimal.
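The pinning mechanism proposed above could be sketched roughly as follows. This is a minimal illustration under our own assumptions, not vLLM's actual cache design; the class `PinnedLRUBlockCache` and its methods are hypothetical names:

```python
from collections import OrderedDict

class PinnedLRUBlockCache:
    """Hypothetical LRU cache of k/v blocks with a pin mechanism.

    Pinned keys are never evicted; unpinned keys are evicted in
    least-recently-used order once capacity is exceeded.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # key -> k/v block payload
        self.pinned = set()

    def pin(self, key):
        self.pinned.add(key)

    def put(self, key, kv_block):
        self.blocks[key] = kv_block
        self.blocks.move_to_end(key)
        while len(self.blocks) > self.capacity:
            # Evict the least-recently-used *unpinned* entry.
            victim = next((k for k in self.blocks if k not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; tolerate temporary overflow
            del self.blocks[victim]

    def get(self, key):
        if key in self.blocks:
            self.blocks.move_to_end(key)
            return self.blocks[key]
        return None

# Usage: pin the blocks of the three system prompts so they survive churn.
cache = PinnedLRUBlockCache(capacity=4)
for name in ("sys_a", "sys_b", "sys_c"):
    cache.pin(name)
    cache.put(name, kv_block=f"kv({name})")
cache.put("user_1", kv_block="kv(user_1)")
cache.put("user_2", kv_block="kv(user_2)")  # evicts user_1, never a pinned prompt
print(sorted(cache.blocks))  # ['sys_a', 'sys_b', 'sys_c', 'user_2']
```

Capping the number of pinnable entries (or reserving a fixed share of blocks for them) would keep pinning from starving regular requests of cache space.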


Contributor

akai-shuuichi commented Dec 24, 2024

This request is similar to VLLM/8333. I have also been looking at related implementations in LMcache recently; I think offloading KV blocks to other storage media is also a good approach.
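The offloading idea can be sketched as a two-tier store: on eviction from the fast tier, a block is demoted to a slower medium rather than discarded, and promoted back on a hit. This is an illustration only, not LMcache's actual design; `TieredKVStore` and its tiers are hypothetical:

```python
class TieredKVStore:
    """Hypothetical two-tier KV block store: demote on eviction, promote on hit."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = {}  # fast tier (stand-in for GPU memory)
        self.cpu = {}  # slow tier (stand-in for CPU RAM or disk)

    def put(self, key, block):
        if len(self.gpu) >= self.gpu_capacity:
            # Demote the oldest fast-tier block instead of dropping it.
            victim, victim_block = next(iter(self.gpu.items()))
            del self.gpu[victim]
            self.cpu[victim] = victim_block
        self.gpu[key] = block

    def get(self, key):
        if key in self.gpu:
            return self.gpu[key]
        if key in self.cpu:
            # Slow-tier hit: promote the block back to the fast tier.
            self.put(key, self.cpu.pop(key))
            return self.gpu[key]
        return None  # miss in both tiers: must be recomputed
```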
