
Optimize HSH_Lookup #4236

Open
wants to merge 1 commit into base: master
Conversation

@sobolevsv commented Nov 26, 2024

This PR optimizes the HSH_Lookup function to prevent Varnish Cache from becoming unresponsive for extended periods when handling a large number of bans and variations.

Earlier this year we encountered issues where Varnish would suddenly become unresponsive.

The issues always followed the same pattern. They occurred infrequently—about once a week or less, typically in the early morning on weekends. Before Varnish stopped responding, the n_expired counter would suddenly stop updating. Simultaneously, the rate of bans_tests_tested would increase while bans_tested did not significantly change. Then, the number of threads would increase until it reached the thread pool limit, at which point Varnish stopped accepting new connections.

Initially, I suspected a deadlock. We upgraded from version 5 to 6.6, but this did not resolve the issue. We also increased the thread pool size, which helped once—Varnish managed to unblock itself before the DevOps team restarted it. This indicated that the root cause was not a deadlock.

A thorough investigation revealed the following:

We used a TTL of 900 seconds for cacheable pages and 1 day for non-cacheable pages. Several non-cacheable pages were configured in the VCL file to bypass the cache. For other non-cacheable pages, hit-for-miss was configured with a TTL of 1 day.

After 1 day of uptime, Varnish's cache could contain 500,000–1.5 million objects, many of which were hit-for-miss objects (as the investigation showed).

We use lurker-friendly bans to invalidate objects in the cache. Each ban uses a regex to match objects by their URL. We add around 50k bans per hour. The lurker was unable to keep up with the rate of adding bans due to the large number of objects in the cache. After 1 day of uptime, Varnish could have as many as 300k active bans.

Additionally, we use Vary by User-Agent. While this is not ideal, it is part of our current configuration and has worked for years. As a result, a single page could be accessed by more than a thousand unique User-Agents daily, which creates a long list of variants for a single object in the cache.

Bans were added more or less at a constant rate throughout the day, but traffic significantly decreased during nighttime. Many objects were not requested overnight, meaning they needed to be checked against more bans when traffic resumed if the lurker had not yet processed them.

A rough estimate showed that the lurker could take up to 1 day to check all bans against all objects, so we can ignore its role in the investigation.

In the morning, as traffic increased, Varnish received a request for an object that had not been accessed overnight. This object needed to be checked against a large number of bans. If the requested object was non-cacheable with a TTL of 1 day, the lookup also had to go through thousands of variants to find a match. Recently accessed objects sat at the beginning of the list, so the probability of the requested object being near the end of the list was low. More likely, a ban invalidating the requested object had been added at the start of the night, followed by many more bans added throughout the night.

During lookup, once an objhead is found, Varnish needs to check all variants against all bans from the start of the ban list down to the ban that invalidated the object. This process could take minutes to complete (hundreds of thousands of bans multiplied by thousands of variants). During this operation the mutex oh->mtx is held, which blocks the expire thread; this explains why n_expired stopped updating. The same mutex also blocks new incoming requests for the same object (page), which explains why the number of threads kept increasing.
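
To illustrate the cost, here is a heavily simplified C sketch of the lookup order described above; it is not the actual cache_hash.c code, and the types and helper functions (ban_check_object(), vary_matches()) are stand-ins. With 300,000 bans and 1,000 variants, the loop performs on the order of 3 × 10^8 ban tests; assuming roughly a microsecond per regex test, that alone is about five minutes spent under oh->mtx:

#include <stddef.h>

/* Illustrative stand-ins for the real Varnish internals. */
struct req;                                     /* opaque here */
struct objcore {
        struct objcore  *next;                  /* variant list */
};
struct objhead {
        struct objcore  *objcs;                 /* list of variants */
};

/* Hypothetical helpers standing in for the ban/vary machinery. */
extern int ban_check_object(struct objcore *, const struct req *);
extern int vary_matches(const struct objcore *, const struct req *);

/*
 * Sketch of the pre-PR lookup order: every variant walks the ban
 * list before the Vary comparison, all while oh->mtx is held.
 */
struct objcore *
lookup_sketch_old(struct objhead *oh, const struct req *req)
{
        struct objcore *oc;

        /* Lck_Lock(&oh->mtx) in the real code */
        for (oc = oh->objcs; oc != NULL; oc = oc->next) {
                if (ban_check_object(oc, req))  /* walks the ban list */
                        continue;               /* banned: skip variant */
                if (vary_matches(oc, req))      /* cheap header compare */
                        break;                  /* first usable variant */
        }
        /* Lck_Unlock(&oh->mtx) */
        return (oc);
}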

We fixed the configuration—reduced the TTL for hit-for-miss objects and added additional non-cacheable pages to bypass the cache. This helped to reduce the number of objects in the cache so the lurker could now keep up with newly added bans. As a result, the issue with Varnish disappeared. We will also revise the Vary header to populate the cache with fewer object variants.

While our configuration was not ideal, it had allowed Varnish to work for years without any issues; critical system software should not stop responding unexpectedly because of a wrong or suboptimal configuration.

To prevent Varnish from becoming stuck unexpectedly for minutes, we need to swap the order of the Vary check and the Ban check. If we first find the object's variant that exactly matches the request and only then check it against all bans, the worker will return the object from the cache (or from backends) with a much smaller latency spike. The worker will hold the mutex for a much shorter time, allowing other requests for the same object to be served. Latency should be a top priority—returning a requested object variant to the client as quickly as possible. Let other object variants expire, or let the lurker clean them up asynchronously.
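
For comparison, a minimal sketch of the proposed order, reusing the illustrative types and helpers from the sketch above (again a sketch of the idea, not the actual patch): the Vary comparison runs first, so at most one walk of the ban list happens per request while oh->mtx is held:

/*
 * Sketch of the reordered lookup proposed in this PR: only the
 * variant selected for delivery is tested against the ban list.
 */
struct objcore *
lookup_sketch_new(struct objhead *oh, const struct req *req)
{
        struct objcore *oc;

        /* Lck_Lock(&oh->mtx) in the real code */
        for (oc = oh->objcs; oc != NULL; oc = oc->next) {
                if (!vary_matches(oc, req))     /* cheap header compare */
                        continue;               /* not our variant */
                if (!ban_check_object(oc, req)) /* single ban-list walk */
                        break;                  /* usable variant found */
        }
        /* Lck_Unlock(&oh->mtx) */
        return (oc);
}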

@nigoroll (Member) left a comment:

As you rightly explained, the root cause of your problems lies with the excessive use of variants and bans. Regarding variants, we introduced vary_notice, which does not avoid the problem but at least raises awareness of it. Regarding bans, we added ban_cutoff, which, if used, would have avoided the problematic scenario you described.

That said, I can see that the change you propose makes sense, so 👍

@sobolevsv (Author) commented:

Thank you for reviewing and approving my proposal, @nigoroll!
Regarding ban_cutoff - yes, I came across this during the investigation. I understood that it could serve as a workaround, but I wanted to identify the root cause. After pinpointing the issue—an excessive number of variations for non-cacheable objects—we addressed it by fixing the configuration, and the issue disappeared. Moving forward, we will consider both ban_cutoff and vary_notice as safety measures.

@nigoroll (Member) commented:

@sobolevsv Yes, I agree that you did the right thing by tracking down the root cause.

Can you please squash your commits regarding .gitignore and force-push?

@walid-git (Member) left a comment:

Thanks for your contribution. I can only see one issue that could be problematic with this change: it introduces a change of behavior in that a request matching a req.* ban expression would only evict the variant it matches, whereas previously all the variants would be evicted by the same request. This means that variants that are only requested once in a while would stay in the cache for much longer, potentially until they expire (since the ban lurker thread does not act on req.*-based bans). These objects would keep a reference to the ban, which would therefore also take much longer to complete.
I am not sure about the impact that this could have, but it seems like something we should think about.

@nigoroll (Member) commented:

Good point, @walid-git. I agree that this could be a reason not to add this change; on the other hand, we clearly discourage req.* bans.

@asadsa92 (Contributor) commented:

Maybe not relevant for a minor version bump, I agree, but would not a req.* ban hitting only the relevant variants be what the user expects? Maybe this needs to be a doc fix? https://varnish-cache.org/docs/trunk/users-guide/purging.html
This behavior seems to be defined for purges, but not for bans.

@walid-git (Member) commented:

but would not a req.* ban hitting only the relevant variants be what the user expects?

It depends on the intent of the ban:

  • If you issue a ban on req.http.user-agent while you have several objects varying on this header in cache, then yes, this PR can even be considered a bug fix.
  • But on the other hand, if you issue a ban on req.url, then having the first request matching the ban expression remove all the variants is probably the best thing to do.

Maybe a solution could be to teach the ban code to be smarter about variants and do the right thing in each situation.

@nigoroll (Member) commented:

I find @asadsa92's argument compelling that this is actually a bug fix. Without checking Vary, we do not know if the ban should even apply.

@sobolevsv (Author) commented Dec 1, 2024

@nigoroll, the commits have been squashed into a single one.

I agree that the list of req.* bans could grow longer, and this PR introduces new behavior for bans, which is likely not clearly specified in the documentation. As a result, many users may not be fully aware of the current implementation details.

We should consider the space-time tradeoff: increased memory consumption for maintaining a longer ban list (and variants that have not been banned yet) is an acceptable tradeoff if it helps reduce response times or prevents Varnish from getting stuck for minutes in edge cases, as I described above.

From the user's perspective, what do they expect?

  • Stability and resilience
  • Low latency

This PR addresses both expectations. It reduces the computational workload required for a worker to serve client requests. By distributing the banning of all variants fairly across multiple workers (those handling a particular object's variant), it ensures that a single worker holds the mutex for only a brief period, allowing other workers to serve other object variants without a big delay.

It might be worth considering introducing the changes in this PR as part of a new major version.

@dridi (Member) commented Dec 2, 2024

The workaround suggested here is simple and effective, but I think we should keep the current behavior since it appears to act as a better alarm than vary_notice. I believe the change is correct in the sense that we would no longer evaluate bans for objects we wouldn't consider for delivery, but it's just hiding the problem. However, in a sane environment with a low number of variants, I would actually prefer a single lookup to test them all, especially for ban expressions with request criteria, so that they are processed sooner (a better optimization than "correct behavior", in my opinion).

The problem faced here is "too many bans" times "too many variants" grinding lookups to a halt.

We already "solved" the ban accumulation with ban_cutoff but it does nothing by default. We should consider giving it a non-zero default value.

The variants accumulation could be solved with a vary_limit parameter that could default for example to 20 (twice vary_notice) that would turn new variants exceeding the limit into pass transactions. I was planning to submit (or ask someone to submit) such a change after #4073 where I briefly mention it in the pull request description.
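
To make the proposal concrete, here is a hypothetical sketch of such a check; the vary_limit parameter, its default, and the count_variants() helper are assumptions drawn from this comment, not existing Varnish code:

/*
 * Hypothetical vary_limit enforcement (sketch only): once an objhead
 * already holds vary_limit variants, a new variant is not inserted
 * and the transaction is turned into a pass instead.
 */
struct objhead;                         /* opaque here */

extern unsigned count_variants(const struct objhead *);

static unsigned vary_limit = 20;        /* e.g. twice vary_notice */

static int
variant_would_exceed_limit(const struct objhead *oh)
{
        if (vary_limit > 0 && count_variants(oh) >= vary_limit)
                return (1);     /* caller turns this fetch into a pass */
        return (0);
}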

@bsdphk (Contributor) commented Dec 2, 2024

We should not change the order, for the reasons @dridi has already mentioned: requests have to do "community service" to help us get rid of banned but unrequested objects.

However, it would make a lot of sense to have a parameter which prevents the first request in the morning from having to sweep the entire floor.

I think that limit should be expressed as "You have to check up to this many objects for bans before you are allowed to just hunt for a vary match".

Of course, those vary match(es) must still be ban-checked before they are used.

We can safely set the default parameter quite high, 10, 100 or 1000, but I'm not even sure we need to. Even a parameter of one will - eventually - get the unloved req-banned objects tossed.

(The current code operates as if the value of that parameter is infinity)

@walid-git (Member) commented:

FTR, here is a vtc that demonstrates the "bug" being fixed with this PR:

varnishtest "test ban + vary behavior"

server s0 {
	rxreq
	txresp -hdr "vary: version" -body "Variant A"
	rxreq
	txresp -hdr "vary: version" -body "Variant B"
	rxreq
	txresp -hdr "vary: version" -body "New variant A"
	rxreq
	txresp -hdr "vary: version" -body "New variant B"
} -start

varnish v1 -vcl+backend {} -start

client c1 {
	txreq -hdr "version: a"
	rxresp
	expect resp.body == "Variant A"
} -run

client c2 {
	txreq -hdr "version: b"
	rxresp
	expect resp.body == "Variant B"
} -run

varnish v1 -cliok "ban req.http.version == a"

# Should this remove a single variant from cache ?
client c3 {
	txreq -hdr "version: a"
	rxresp
	expect resp.body == "New variant A"
} -run

# It does not seem to be the case ..
client c4 {
	txreq -hdr "version: b"
	rxresp
	expect resp.body == "Variant B"
} -run

@nigoroll (Member) commented Dec 2, 2024

FTR, here is a vtc that demonstrates the "bug" being fixed with this PR:

Thank you. This is what I had in mind when I got confused during bugwash.

It should be noted that the VTC currently fails:

**** c4    c-l|New variant B
**** c4    bodylen = 13
**   c4    === expect resp.body == "Variant B"
---- c4    EXPECT resp.body (New variant B) == "Variant B" failed

So the question really is what the scope of a req.* ban is: Is it the hash or the object? We currently implement the former, but I lean towards the latter: my implicit understanding has always been that a req.* ban applies if a matching request would otherwise find the object, which is clearly not the case for the demonstration VTC.

@walid-git (Member) commented:

It should be noted that the VTC currently fails:

**** c4    c-l|New variant B
**** c4    bodylen = 13
**   c4    === expect resp.body == "Variant B"
---- c4    EXPECT resp.body (New variant B) == "Variant B" failed

Yes, it fails on master but passes with this PR. We could probably consider adding it to our test suite, as part of this PR.

Commit: "Fix potential case when varnish stops responding with a large number of bans and variations"
@sobolevsv (Author) commented:

@walid-git Thank you for the good example of what else this PR fixes. I added your test to the PR.

@nigoroll (Member) commented Jan 13, 2025

Bugwash: because phk's proposal covers both cases, we want to take that route. The proposal basically is:

(15:36:24) phk:     if (counter < threshold)
(15:36:27) phk:         ban-check
(15:36:38) phk:         counter++
(15:36:42) phk:     vary check
(15:36:53) phk:     if counter == threshold
(15:37:01) phk:         ban-check

For threshold == 0, we get the POLA semantics at the expense of potentially longer object lifetimes. For this case, ban_cutoff can be used to avoid holding onto long-tail req-banned objects for too long.

For threshold > 0, we get the old semantics (one ban applies to all variants).

Proposed parameter names were mandatory_ban_community_service, ban_variants, ban_vary_limit, max_ban_checks(_per_request).

@walid-git volunteered to write up a doc proposal.

@bsdphk (Contributor) commented Jan 13, 2025

I've checked ancient notes: The current behavior of also banning non-matching variants was intentional.

The worry was that a matching req might never arrive, needlessly holding the variant in cache until TTL, which might in principle be forever, so it was decided that req-processed bans (= all of them; this was before the lurker) would take out all variants.

This had something to do with ancient mobile phones and Norwegian newspapers, and today we might decide that it would be more useful to be able to ban specific variants.

However, changing the order of the checks now could massively change cache behavior for existing users (Vary: User-Agent!), so that would be a breaking change.

One way to deal with this is to introduce a new parameter which limits how many ban-checks a req has to do before the vary check, so the code looks like this:

if counter < threshold:
    ban_check
    counter++
vary_check
if not ban_checked:
    ban_check

A sensible value for threshold solves the problem this ticket was opened on.

As a side effect, setting the threshold to zero changes the behavior so that req-processed bans only act on vary-matching hits.
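
Putting the two comments together, a sketch of the threshold logic in C, with the same illustrative types and helpers as the earlier sketches (the final parameter name was still under discussion):

/*
 * Sketch of phk's proposal: up to `threshold` variants are ban-checked
 * as "community service"; after that, the request only hunts for a
 * Vary match, and the matched variant is still ban-checked before use.
 */
struct objcore *
lookup_sketch_threshold(struct objhead *oh, const struct req *req,
    unsigned threshold)
{
        struct objcore *oc;
        unsigned counter = 0;

        /* Lck_Lock(&oh->mtx) in the real code */
        for (oc = oh->objcs; oc != NULL; oc = oc->next) {
                int ban_checked = 0;

                if (counter < threshold) {
                        counter++;
                        ban_checked = 1;
                        if (ban_check_object(oc, req))
                                continue;       /* banned: skip variant */
                }
                if (!vary_matches(oc, req))
                        continue;               /* not our variant */
                if (!ban_checked && ban_check_object(oc, req))
                        continue;               /* matched, but banned */
                break;                          /* usable variant found */
        }
        /* Lck_Unlock(&oh->mtx) */
        return (oc);
}

With threshold set to zero this reduces to the new semantics discussed above (req-processed bans only act on vary-matching hits); with a sufficiently large threshold it behaves like the current code.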

walid-git added commits to walid-git/varnish-cache referencing this pull request (Jan 15, 2025); one commit message reads:

This will control the number of object variants that we evaluate
against the ban list before looking for a vary match.

Refs: varnishcache#4236