Optimize HSH_Lookup #4236
base: master
Conversation
As you rightly explained, the root cause of your problems lies with the excessive use of variants and bans. Regarding variants, we introduced vary_notice, which does not avoid the problem, but at least helps awareness of it. Regarding bans, we added ban_cutoff, which, if used, would have avoided the problematic scenario you described.
That said, I can see that the change you propose makes sense, so 👍
Thank you for reviewing and approving my proposal, @nigoroll!

@sobolevsv Yes, I agree that you did the right thing by tracking down the root cause. Can you please squash your commits regarding …?
Thanks for your contribution. I can only see one issue that could be problematic with this change: it introduces a change of behavior in that a request matching a req.* ban expression would only evict the variant it matches, whereas previously all the variants would be evicted by the same request. This means that for variants that are only requested once in a while, the objects would stay for much longer, potentially until they expire (since the ban lurker thread does not act on req.* based bans). These objects would keep a reference to the ban, which will also take much longer to complete.
I am not sure about the impact that this could have, but it seems like something we should think about.
Good point @walid-git. I agree that this could be a reason not to add this change; on the other hand, we clearly discourage …

Maybe not relevant for a minor version bump, I agree, but would not a …
It depends on the intent of the ban: …

Maybe a solution could be to teach the ban code to be smarter about variants and do the right thing in the right situation.
I find @asadsa92's argument compelling that this is actually a bug fix. Without checking Vary, we do not know if the ban should even apply.
Force-pushed from 540477b to f81e912
@nigoroll, the commits have been squashed into a single one. I agree that the list of …

We should consider the space-time tradeoff: increased memory consumption for maintaining a longer ban list (and variants that have not been banned yet) is an acceptable tradeoff if it helps reduce response times or prevents Varnish from getting stuck for minutes in edge cases, as I described above. From the user's perspective, what do they expect? …

This PR addresses both expectations. It reduces the computational workload required for a worker to serve client requests. By distributing the banning of all variants fairly across multiple workers (those handling a particular object's variants), it ensures that a single worker holds the mutex for only a brief period, allowing other workers to serve other object variants without a big delay. It might be worth considering introducing the changes in this PR as part of a new major version.
The workaround suggested here is simple and effective, but I think we should keep the current behavior since it appears to act as a better alarm than the …

The problem faced here is "too many bans" times "too many variants" grinding lookups to a halt. We already "solved" the ban accumulation with ban_cutoff. The variants accumulation could be solved with a …
We should not change the order, for the reasons @dridi has already mentioned: requests have to do "community service" to help us get rid of banned, but unrequested, objects.

However, it would make a lot of sense to have a parameter which prevents the first request in the morning from having to sweep the entire floor. I think that limit should be expressed as "You have to check up to this many objects for bans, before you are allowed to just hunt for a vary match." Of course, those vary match(es) must still be ban-checked before they are used.

We can safely set the default parameter quite high, 10, 100 or 1000, but I'm not even sure we need to. Even a parameter of one will - eventually - get the unloved req-banned objects tossed. (The current code operates as if the value of that parameter is infinity.)
FTR, here is a vtc that demonstrates the "bug" being fixed with this PR:

varnishtest "test ban + vary behavior"

server s0 {
    rxreq
    txresp -hdr "vary: version" -body "Variant A"
    rxreq
    txresp -hdr "vary: version" -body "Variant B"
    rxreq
    txresp -hdr "vary: version" -body "New variant A"
    rxreq
    txresp -hdr "vary: version" -body "New variant B"
} -start

varnish v1 -vcl+backend {} -start

client c1 {
    txreq -hdr "version: a"
    rxresp
    expect resp.body == "Variant A"
} -run

client c2 {
    txreq -hdr "version: b"
    rxresp
    expect resp.body == "Variant B"
} -run

varnish v1 -cliok "ban req.http.version == a"

# Should this remove a single variant from cache ?
client c3 {
    txreq -hdr "version: a"
    rxresp
    expect resp.body == "New variant A"
} -run

# It does not seem to be the case ..
client c4 {
    txreq -hdr "version: b"
    rxresp
    expect resp.body == "Variant B"
} -run
Thank you. This is what I had in mind when I got confused during bugwash. It should be noted that the VTC currently fails:

…

So the question really is what the scope of a … ban should be.
Yes, it fails on master but passes with this PR. We could probably consider adding it to our test suite, as part of this PR.
Fix potential case when varnish stops responding with a large number of bans and variations
Force-pushed from f81e912 to 770ae90
@walid-git Thank you for the good example of what else this PR fixes. I added your test to the PR.
bugwash:

… for …, … for …. Proposed parameter names were ….

@walid-git volunteered to write up a doc proposal.
I've checked ancient notes: the current behavior of also banning non-matching variants was intentional. The worry was that a matching req might never arrive, needlessly holding the variant in cache until TTL, which might in principle be forever, so it was decided that req-processed bans (= all of them, this was before the lurker) would take all variants. This had something to do with ancient mobile phones and Norwegian newspapers, and today we might decide that it would be more useful to be able to ban specific variants.

However, changing the order of the checks now could massively change cache behavior for existing users (Vary: User-Agent!), so that would be a breaking change. One way to deal with this is to introduce a new parameter, which limits how many ban-checks a req has to do before the vary check, so the code looks like this:
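A minimal sketch of such a loop, not the actual varnish-cache source; struct variant, obj_is_banned(), vary_matches(), evict() and threshold are hypothetical stand-ins:

    /* Hypothetical stand-ins for varnish-cache internals. */
    struct req;
    struct variant { struct variant *next; };
    int obj_is_banned(const struct variant *, const struct req *);
    int vary_matches(const struct req *, const struct variant *);
    void evict(struct variant *);

    /*
     * Walk the variants of one objhead. The first `threshold` variants
     * are ban-checked whether or not they match the request's Vary
     * ("community service"); after that, only vary-matching variants
     * are ban-checked.
     */
    static struct variant *
    lookup_sketch(const struct req *req, struct variant *variants,
        unsigned threshold)
    {
        struct variant *v;
        unsigned ban_checked = 0;

        for (v = variants; v != NULL; v = v->next) {
            if (ban_checked < threshold) {
                ban_checked++;
                if (obj_is_banned(v, req)) {
                    evict(v);
                    continue;
                }
                if (!vary_matches(req, v))
                    continue;
            } else {
                if (!vary_matches(req, v))
                    continue;
                /* A vary match must still be ban-checked. */
                if (obj_is_banned(v, req)) {
                    evict(v);
                    continue;
                }
            }
            return (v);    /* usable, ban-checked variant */
        }
        return (NULL);
    }

With threshold effectively infinite this degenerates to the current behavior; with threshold at zero, only vary-matching variants are ever ban-checked by a request.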
A sensible value for threshold solves the problem this ticket was opened on. As a side effect, setting the threshold to zero changes the behavior, so req-processed bans only act on vary-matching hits.
This will control the number of object variants that we evaluate against the ban list before looking for a vary match. Refs: varnishcache#4236
This PR optimizes the HSH_Lookup function to prevent Varnish Cache from becoming unresponsive for extended periods when handling a large number of bans and variations.
Earlier this year we encountered issues where Varnish would suddenly become unresponsive.
The issues always followed the same pattern. They occurred infrequently—about once a week or less, typically in the early morning on weekends. Before Varnish stopped responding, the n_expired counter would suddenly stop updating. Simultaneously, the rate of bans_tests_tested would increase while bans_tested did not significantly change. Then, the number of threads would increase until it reached the thread pool limit, at which point Varnish stopped accepting new connections.
Initially, I suspected a deadlock. We upgraded from version 5 to 6.6, but this did not resolve the issue. We also increased the thread pool size, which helped once—Varnish managed to unblock itself before the DevOps team restarted it. This indicated that the root cause was not a deadlock.
A thorough investigation revealed the following:
We used a TTL of 900 seconds for cacheable pages and 1 day for non-cacheable pages. Several non-cacheable pages were configured in the VCL file to bypass the cache. For other non-cacheable pages, hit-for-miss was configured with a TTL of 1 day.
After 1 day of uptime, Varnish's cache could contain 500,000–1.5 million objects, many of which were hit-for-miss objects (as the investigation showed).
We use lurker-friendly bans to invalidate objects in the cache. Each ban uses a regex to match objects by their URL. We add around 50k bans per hour. The lurker was unable to keep up with the rate of adding bans due to the large number of objects in the cache. After 1 day of uptime, Varnish could have as many as 300k active bans.
Additionally, we use Vary by User-Agent. While this is not ideal, it is part of our current configuration and has worked for years. As a result, a single page could be accessed by more than a thousand unique User-Agents daily, which creates a long list of variations for a single object in the cache.
Bans were added more or less at a constant rate throughout the day, but traffic significantly decreased during nighttime. Many objects were not requested overnight, meaning they needed to be checked against more bans when traffic resumed if the lurker had not yet processed them.
A rough estimate showed that the lurker could take up to 1 day to check all bans against all objects, so we can ignore its role in the investigation.
In the morning, as traffic increased, Varnish received a request for an object that had not been accessed overnight. This object needed to be checked against a large number of bans. If the requested object was non-cacheable with a TTL of 1 day, it also had to go through thousands of variations to find a match. Recently accessed objects were at the beginning of the list, so the probability of the requested object being near the end of the list was low. What is more likely is that a ban invalidating the requested object was added at the start of the night, followed by many more bans added throughout the night.
During lookup, once an objhead is found, Varnish needs to check all variants against all bans from the start of the ban list to the ban that invalidated the object. This process could take minutes to complete (hundreds of thousands of bans multiplied by thousands of variations). During this operation, the mutex oh->mtx is held, which blocks the expire thread. This explains why n_expired stopped updating. This mutex also blocks new incoming requests for the same object (page). This explains why threads started increasing.
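To make the cost concrete, a heavily simplified sketch of that order of operations (not the actual HSH_Lookup() code; struct variant, struct ban, ban_matches(), vary_matches() and evict() are hypothetical stand-ins) might look like this:

    /* Hypothetical stand-ins for varnish-cache internals. */
    struct req;
    struct ban { struct ban *next; };    /* newest ban first */
    struct variant {
        struct variant *next;
        struct ban *last_checked_ban;    /* newest ban already tested */
    };
    int ban_matches(const struct ban *, const struct variant *,
        const struct req *);
    int vary_matches(const struct req *, const struct variant *);
    void evict(struct variant *);

    /*
     * Current order: every variant on the objhead is tested against
     * every ban added since it was last checked, before Vary is even
     * considered, all while oh->mtx is held. With hundreds of
     * thousands of pending bans and thousands of variants this means
     * hundreds of millions of ban tests under the lock.
     */
    static struct variant *
    lookup_current_order(const struct req *req, struct variant *variants,
        struct ban *ban_head)
    {
        struct variant *v;
        struct ban *b;
        int banned;

        for (v = variants; v != NULL; v = v->next) {
            banned = 0;
            for (b = ban_head; b != v->last_checked_ban; b = b->next)
                if (ban_matches(b, v, req)) {
                    banned = 1;
                    break;
                }
            if (banned) {
                evict(v);
                continue;
            }
            v->last_checked_ban = ban_head;
            if (!vary_matches(req, v))
                continue;            /* wrong variant */
            return (v);              /* usable variant */
        }
        return (NULL);
    }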
We fixed the configuration—reduced the TTL for hit-for-miss objects and added additional non-cacheable pages to bypass the cache. This helped to reduce the number of objects in the cache so the lurker could now keep up with newly added bans. As a result, the issue with Varnish disappeared. We will also revise the Vary header to populate the cache with fewer object variants.
While our configuration was far from ideal, it had allowed Varnish to work for years without any issues; still, critical system software should not stop responding unexpectedly because of a wrong or suboptimal configuration.
To prevent Varnish from becoming stuck unexpectedly for minutes, we need to swap the order of the Vary check and the Ban check. If we first find the object's variant that exactly matches the request and only then check it against all bans, the worker will return the object from the cache (or from backends) with a much smaller latency spike. The worker will hold the mutex for a much shorter time, allowing other requests for the same object to be served. Latency should be a top priority—returning a requested object variant to the client as quickly as possible. Let other object variants expire, or let the lurker clean them up asynchronously.
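Under the same caveats as the sketch above (hypothetical stand-ins, not the actual HSH_Lookup() code), the proposed reordering looks roughly like this: the Vary check moves in front of the ban walk, so only the variant that can actually serve the request is tested against the pending bans.

    /* Same hypothetical declarations as in the previous sketch. */
    struct req;
    struct ban { struct ban *next; };    /* newest ban first */
    struct variant {
        struct variant *next;
        struct ban *last_checked_ban;    /* newest ban already tested */
    };
    int ban_matches(const struct ban *, const struct variant *,
        const struct req *);
    int vary_matches(const struct req *, const struct variant *);
    void evict(struct variant *);

    /*
     * Proposed order: find the vary match first, then ban-check only
     * that variant. The time spent under oh->mtx becomes proportional
     * to the pending bans for one variant, not bans multiplied by all
     * variants; the other variants are left to expire or to be cleaned
     * up asynchronously.
     */
    static struct variant *
    lookup_proposed_order(const struct req *req, struct variant *variants,
        struct ban *ban_head)
    {
        struct variant *v;
        struct ban *b;

        for (v = variants; v != NULL; v = v->next) {
            if (!vary_matches(req, v))
                continue;            /* skip without any ban walk */
            for (b = ban_head; b != v->last_checked_ban; b = b->next)
                if (ban_matches(b, v, req)) {
                    evict(v);
                    break;
                }
            if (b != v->last_checked_ban)
                continue;            /* the match was banned */
            v->last_checked_ban = ban_head;
            return (v);              /* usable, ban-checked variant */
        }
        return (NULL);
    }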