[Bug]: KV Cache exploded #91
@rakkit Hello, could you provide a minimal script for reproduction? I didn't encounter the errors you reported by setting
Hi, I tried to reproduce this for a while, but I can't see the problem anymore. I'm not sure if I messed something up there; sorry about that.
@rakkit Great to know!
Actually, the cache update reads `window_size = cache_kwargs.get('window_size', None)`. However, `cache_kwargs` is set to `None` by default, as defined here: `cache_kwargs: Optional[Dict[str, Any]] = None`. The calling attention layer is expected to build it as `cache_kwargs = dict(window_size=self.window_size)`, though users who are unaware of this requirement and leave `cache_kwargs` as `None` will get an error.
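For context, a minimal sketch of the pitfall (the function name, signature, and argument types are simplified assumptions, not the library's exact API):

```python
from typing import Any, Dict, Optional, Tuple

import torch


def update(attn_state: Tuple[torch.Tensor, torch.Tensor],
           cache_kwargs: Optional[Dict[str, Any]] = None):
    # cache_kwargs defaults to None, so for any caller that does not
    # build it explicitly, the next line raises:
    # AttributeError: 'NoneType' object has no attribute 'get'
    window_size = cache_kwargs.get('window_size', None)
    return window_size
```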
Thank you, I will check it later. Are you willing to make some PRs?
The easiest solution here would be to guard the `window_size` lookup against a `None` `cache_kwargs`, or to normalize the argument once at the top of the update method; both options are sketched below.
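As a minimal sketch against the simplified signature above (not the library's exact code), either guard the lookup at the point of use:

```python
# Treat a missing cache_kwargs as an empty dict at the point of use.
window_size = (cache_kwargs or {}).get('window_size', None)
```

or normalize the argument once on entry:

```python
from typing import Any, Dict, Optional


def update(attn_state, cache_kwargs: Optional[Dict[str, Any]] = None):
    # Normalize once so every later lookup can assume a dict.
    if cache_kwargs is None:
        cache_kwargs = {}
    window_size = cache_kwargs.get('window_size', None)
    return window_size
```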
Describe the bug
In the case of using softmax attention, or any other attention with `window_size=None`, the KV cache update falls into this branch. This logic concatenates all historical sequence states with the new states (`attn_state[0]` and `attn_state[1]`), causing exponential growth in the KV cache, as sketched below.
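A minimal, self-contained sketch of the failure mode (the shapes, and the assumption that `attn_state` carries the full history rather than only the new tokens, are illustrative, not taken from the library):

```python
import torch

# [batch, heads, seq_len, head_dim]; start with an 8-token history.
cached_k = torch.zeros(1, 1, 8, 16)

for step in range(4):
    # If attn_state[0] already contains the full history, concatenating
    # it back onto the cache doubles the cached length every step.
    new_k = cached_k
    cached_k = torch.cat([cached_k, new_k], dim=2)
    print(step, cached_k.shape[2])  # 16, 32, 64, 128: exponential growth
```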
Steps to reproduce the bug
Inference with attention with `window_size=None`
Expected behavior
KV-cache exploded
Environment info