
Optimize HSH_Lookup #4236

Open
wants to merge 1 commit into base: master
Conversation

@sobolevsv commented Nov 26, 2024

This PR optimizes the HSH_Lookup function to prevent Varnish Cache from becoming unresponsive for extended periods when handling a large number of bans and variations.

Earlier this year we encountered issues where Varnish would suddenly become unresponsive.

The issues always followed the same pattern. They occurred infrequently—about once a week or less, typically in the early morning on weekends. Before Varnish stopped responding, the n_expired counter would suddenly stop updating. Simultaneously, the rate of bans_tests_tested would increase while bans_tested did not significantly change. Then, the number of threads would increase until it reached the thread pool limit, at which point Varnish stopped accepting new connections.

Initially, I suspected a deadlock. We upgraded from version 5 to 6.6, but this did not resolve the issue. We also increased the thread pool size, which helped once—Varnish managed to unblock itself before the DevOps team restarted it. This indicated that the root cause was not a deadlock.

A thorough investigation revealed the following:

We used a TTL of 900 seconds for cacheable pages and 1 day for non-cacheable pages. Several non-cacheable pages were configured in the VCL file to bypass the cache. For other non-cacheable pages, hit-for-miss was configured with a TTL of 1 day.

After 1 day of uptime, Varnish's cache could contain 500,000–1.5 million objects, many of which were hit-for-miss objects (as the investigation showed).

We use lurker-friendly bans to invalidate objects in the cache. Each ban uses a regex to match objects by their URL. We add around 50k bans per hour. The lurker was unable to keep up with the rate of adding bans due to the large number of objects in the cache. After 1 day of uptime, Varnish could have as many as 300k active bans.

Additionally, we use Vary by User-Agent. While this is not ideal, it is part of our current configuration and has worked for years. As a result, a single page could be accessed by more than a thousand unique User-Agents daily, which creates a long list of variants for a single object in the cache.

Bans were added more or less at a constant rate throughout the day, but traffic significantly decreased during nighttime. Many objects were not requested overnight, meaning they needed to be checked against more bans when traffic resumed if the lurker had not yet processed them.

A rough estimate showed that the lurker could take up to 1 day to check all bans against all objects, so we can ignore its role in the investigation.

In the morning, as traffic increased, Varnish received a request for an object that had not been accessed overnight. This object needed to be checked against a large number of bans. If the requested object was non-cacheable with a TTL of 1 day, the lookup also had to go through thousands of variants to find a match. Recently accessed objects sat at the beginning of the list, so the probability of the requested object being near the end of the list was low. More likely, a ban invalidating the requested object had been added at the start of the night, followed by many more bans added throughout the night.

During lookup, once an objhead is found, Varnish needs to check all variants against all bans from the start of the ban list down to the ban that invalidated the object. This process could take minutes to complete (hundreds of thousands of bans multiplied by thousands of variants). During this operation the mutex oh->mtx is held, which blocks the expire thread; this explains why n_expired stopped updating. The same mutex also blocks new incoming requests for the same object (page), which explains why the number of threads kept increasing.
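
To illustrate the cost, here is a heavily simplified C sketch of the lookup order described above; it is not the actual cache_hash.c code, and the types and helper functions (ban_check_object(), vary_matches()) are stand-ins. With 300,000 bans and 1,000 variants, the loop performs on the order of 3 × 10^8 ban tests; assuming roughly a microsecond per regex test, that alone is about five minutes spent under oh->mtx:

#include <stddef.h>

/* Illustrative stand-ins for the real Varnish internals. */
struct req;                                     /* opaque here */
struct objcore {
        struct objcore  *next;                  /* variant list */
};
struct objhead {
        struct objcore  *objcs;                 /* list of variants */
};

/* Hypothetical helpers standing in for the ban/vary machinery. */
extern int ban_check_object(struct objcore *, const struct req *);
extern int vary_matches(const struct objcore *, const struct req *);

/*
 * Sketch of the pre-PR lookup order: every variant walks the ban
 * list before the Vary comparison, all while oh->mtx is held.
 */
struct objcore *
lookup_sketch_old(struct objhead *oh, const struct req *req)
{
        struct objcore *oc;

        /* Lck_Lock(&oh->mtx) in the real code */
        for (oc = oh->objcs; oc != NULL; oc = oc->next) {
                if (ban_check_object(oc, req))  /* walks the ban list */
                        continue;               /* banned: skip variant */
                if (vary_matches(oc, req))      /* cheap header compare */
                        break;                  /* first usable variant */
        }
        /* Lck_Unlock(&oh->mtx) */
        return (oc);
}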

We fixed the configuration—reduced the TTL for hit-for-miss objects and added additional non-cacheable pages to bypass the cache. This helped to reduce the number of objects in the cache so the lurker could now keep up with newly added bans. As a result, the issue with Varnish disappeared. We will also revise the Vary header to populate the cache with fewer object variants.

While our configuration was not ideal, it had allowed Varnish to work for years without any issues; critical system software should not stop responding unexpectedly because of a wrong or suboptimal configuration.

To prevent Varnish from becoming stuck unexpectedly for minutes, we need to swap the order of the Vary check and the Ban check. If we first find the object's variant that exactly matches the request and only then check it against all bans, the worker will return the object from the cache (or from backends) with a much smaller latency spike. The worker will hold the mutex for a much shorter time, allowing other requests for the same object to be served. Latency should be a top priority—returning a requested object variant to the client as quickly as possible. Let other object variants expire, or let the lurker clean them up asynchronously.
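
For comparison, a minimal sketch of the proposed order, reusing the illustrative types and helpers from the sketch above (again a sketch of the idea, not the actual patch): the Vary comparison runs first, so at most one walk of the ban list happens per request while oh->mtx is held:

/*
 * Sketch of the reordered lookup proposed in this PR: only the
 * variant selected for delivery is tested against the ban list.
 */
struct objcore *
lookup_sketch_new(struct objhead *oh, const struct req *req)
{
        struct objcore *oc;

        /* Lck_Lock(&oh->mtx) in the real code */
        for (oc = oh->objcs; oc != NULL; oc = oc->next) {
                if (!vary_matches(oc, req))     /* cheap header compare */
                        continue;               /* not our variant */
                if (!ban_check_object(oc, req)) /* single ban-list walk */
                        break;                  /* usable variant found */
        }
        /* Lck_Unlock(&oh->mtx) */
        return (oc);
}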

@nigoroll (Member) left a comment:

As you rightly explained, the root cause of your problems lies with the excessive use of variants and bans. Regarding variants, we introduced vary_notice, which does not avoid the problem but at least raises awareness of it. Regarding bans, we added ban_cutoff, which, if used, would have avoided the problematic scenario you described.

That said, I can see that the change you propose makes sense, so 👍

@sobolevsv (Author) commented:

Thank you for reviewing and approving my proposal, @nigoroll!
Regarding ban_cutoff - yes, I came across this during the investigation. I understood that it could serve as a workaround, but I wanted to identify the root cause. After pinpointing the issue—an excessive number of variations for non-cacheable objects—we addressed it by fixing the configuration, and the issue disappeared. Moving forward, we will consider both ban_cutoff and vary_notice as safety measures.

@nigoroll (Member) commented:

@sobolevsv Yes, I agree that you did the right thing by tracking down the root cause.

Can you please squash your commits regarding .gitignore and force-push?

@walid-git (Member) left a comment:

Thanks for your contribution. I can only see one issue that could be problematic with this change: it introduces a change of behavior in that a request matching a req.* ban expression would only evict the variant it matches, whereas previously all the variants would be evicted by the same request. This means that variants that are only requested once in a while would stay in the cache for much longer, potentially until they expire (since the ban lurker thread does not act on req.*-based bans). These objects would keep a reference to the ban, which would therefore also take much longer to complete.
I am not sure about the impact that this could have, but it seems like something we should think about.

@nigoroll (Member) commented:

Good point, @walid-git. I agree that this could be a reason not to add this change; on the other hand, we clearly discourage req.* bans.

@asadsa92 (Contributor) commented:

Maybe not relevant for a minor version bump, I agree, but would not a req.* ban hitting only the relevant variants be what the user expects? Maybe this needs to be a doc fix? https://varnish-cache.org/docs/trunk/users-guide/purging.html
This behavior seems to be defined for purges, but not for bans.

@walid-git (Member) commented:

but would not a req.* ban hitting only the relevant variants be what the user expects?

It depends on the intent of the ban:

  • If you issue a ban on req.http.user-agent while you have several objects varying on this header in cache, then yes, this PR can even be considered a bug fix.
  • But on the other hand, if you issue a ban on req.url, then having the first request matching the ban expression remove all the variants is probably the best thing to do.

Maybe a solution could be to teach the ban code to be smarter about variants and do the right thing in each situation.

@nigoroll (Member) commented:

I find @asadsa92's argument compelling that this is actually a bug fix. Without checking Vary, we do not know if the ban should even apply.

@sobolevsv (Author) commented Dec 1, 2024

@nigoroll, the commits have been squashed into a single one.

I agree that the list of req.* bans could grow longer, and this PR introduces new behavior for bans, which is likely not clearly specified in the documentation. As a result, many users may not be fully aware of the current implementation details.

We should consider the space-time tradeoff: increased memory consumption for maintaining a longer ban list (and variants that have not been banned yet) is an acceptable tradeoff if it helps reduce response times or prevents Varnish from getting stuck for minutes in edge cases, as I described above.

From the user's perspective, what do they expect?

  • Stability and resilience
  • Low latency

This PR addresses both expectations. It reduces the computational workload required for a worker to serve client requests. By distributing the banning of all variants fairly across multiple workers (those handling a particular object's variant), it ensures that a single worker holds the mutex for only a brief period, allowing other workers to serve other object variants without a big delay.

It might be worth considering introducing the changes in this PR as part of a new major version.

@dridi (Member) commented Dec 2, 2024

The workaround suggested here is simple and effective, but I think we should keep the current behavior since it appears to act as a better alarm than vary_notice. I believe the change is correct in the sense that we would no longer evaluate bans for objects we wouldn't consider for delivery, but it's just hiding the problem. However, in a sane environment with a low number of variants, I would actually prefer a single lookup to test them all, especially for ban expressions with request criteria, so that they are processed sooner (a better optimization than "correct behavior", in my opinion).

The problem faced here is "too many bans" times "too many variants" grinding lookups to a halt.

We already "solved" the ban accumulation with ban_cutoff but it does nothing by default. We should consider giving it a non-zero default value.

The variants accumulation could be solved with a vary_limit parameter that could default for example to 20 (twice vary_notice) that would turn new variants exceeding the limit into pass transactions. I was planning to submit (or ask someone to submit) such a change after #4073 where I briefly mention it in the pull request description.
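
To make the proposal concrete, here is a hypothetical sketch of such a check; the vary_limit parameter, its default, and the count_variants() helper are assumptions drawn from this comment, not existing Varnish code:

/*
 * Hypothetical vary_limit enforcement (sketch only): once an objhead
 * already holds vary_limit variants, a new variant is not inserted
 * and the transaction is turned into a pass instead.
 */
struct objhead;                         /* opaque here */

extern unsigned count_variants(const struct objhead *);

static unsigned vary_limit = 20;        /* e.g. twice vary_notice */

static int
variant_would_exceed_limit(const struct objhead *oh)
{
        if (vary_limit > 0 && count_variants(oh) >= vary_limit)
                return (1);     /* caller turns this fetch into a pass */
        return (0);
}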

@bsdphk (Contributor) commented Dec 2, 2024

We should not change the order, for the reasons @dridi has already mentioned: requests have to do "community service" to help us get rid of banned but unrequested objects.

However, it would make a lot of sense to have a parameter which prevents the first request in the morning from having to sweep the entire floor.

I think that limit should be expressed as "You have to check up to this many objects for bans before you are allowed to just hunt for a vary match".

Of course, those vary match(es) must still be ban-checked before they are used.

We can safely set the default parameter quite high, 10, 100 or 1000, but I'm not even sure we need to. Even a parameter of one will - eventually - get the unloved req-banned objects tossed.

(The current code operates as if the value of that parameter is infinity)

@walid-git (Member) commented:

FTR, here is a vtc that demonstrates the "bug" being fixed with this PR:

varnishtest "test ban + vary behavior"

server s0 {
	rxreq
	txresp -hdr "vary: version" -body "Variant A"
	rxreq
	txresp -hdr "vary: version" -body "Variant B"
	rxreq
	txresp -hdr "vary: version" -body "New variant A"
	rxreq
	txresp -hdr "vary: version" -body "New variant B"
} -start

varnish v1 -vcl+backend {} -start

client c1 {
	txreq -hdr "version: a"
	rxresp
	expect resp.body == "Variant A"
} -run

client c2 {
	txreq -hdr "version: b"
	rxresp
	expect resp.body == "Variant B"
} -run

varnish v1 -cliok "ban req.http.version == a"

# Should this remove a single variant from cache ?
client c3 {
	txreq -hdr "version: a"
	rxresp
	expect resp.body == "New variant A"
} -run

# It does not seem to be the case ..
client c4 {
	txreq -hdr "version: b"
	rxresp
	expect resp.body == "Variant B"
} -run

@nigoroll (Member) commented Dec 2, 2024

FTR, here is a vtc that demonstrates the "bug" being fixed with this PR:

Thank you. This is what I had in mind when I got confused during bugwash.

It should be noted that the VTC currently fails:

**** c4    c-l|New variant B
**** c4    bodylen = 13
**   c4    === expect resp.body == "Variant B"
---- c4    EXPECT resp.body (New variant B) == "Variant B" failed

So the question really is what the scope of a req.* ban is: Is it the hash or the object? We currently implement the former, but I lean towards the latter: my implicit understanding has always been that a req.* ban applies if a matching request would otherwise find the object, which is clearly not the case for the demonstration VTC.

@walid-git (Member) commented:

It should be noted that the VTC currently fails:

**** c4    c-l|New variant B
**** c4    bodylen = 13
**   c4    === expect resp.body == "Variant B"
---- c4    EXPECT resp.body (New variant B) == "Variant B" failed

Yes, it fails on master but passes with this PR. We could probably consider adding it to our test suite, as part of this PR.

Commit: "Fix potential case when varnish stops responding with a large number of bans and variations"
@sobolevsv (Author) commented:

@walid-git Thank you for the good example of what else this PR fixes. I added your test to the PR.

@nigoroll (Member) commented Jan 13, 2025

Bugwash: because phk's proposal covers both cases, we want to take that route. The proposal basically is:

(15:36:24) phk:     if (counter < threshold)
(15:36:27) phk:         ban-check
(15:36:38) phk:         counter++
(15:36:42) phk:     vary check
(15:36:53) phk:     if counter == threshold
(15:37:01) phk:         ban-check

For threshold == 0, we get the POLA semantics at the expense of potentially longer object lifetimes. For this case, ban_cutoff can be used to avoid holding onto long-tail req-banned objects for too long.

For threshold > 0, we get the old semantics (one ban applies to all variants).

Proposed parameter names were mandatory_ban_community_service, ban_variants, ban_vary_limit, max_ban_checks(_per_request).

@walid-git volunteered to write up a doc proposal.

@bsdphk (Contributor) commented Jan 13, 2025

I've checked ancient notes: The current behavior of also banning non-matching variants was intentional.

The worry was that a matching req might never arrive, needlessly holding the variant in cache until TTL, which might in principle be forever, so it was decided that req-processed bans (= all of them; this was before the lurker) would take out all variants.

This had something to do with ancient mobile phones and Norwegian newspapers, and today we might decide that it would be more useful to be able to ban specific variants.

However, changing the order of the checks now could massively change cache behavior for existing users (Vary: User-Agent!), so that would be a breaking change.

One way to deal with this is to introduce a new parameter which limits how many ban-checks a req has to do before the vary check, so the code looks like this:

if counter < threshold:
    ban_check
    counter++
vary_check
if not ban_checked:
    ban_check

A sensible value for threshold solves the problem this ticket was opened on.

As a side effect, setting the threshold to zero changes the behavior so that req-processed bans only act on vary-matching hits.
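
Putting the two comments together, a sketch of the threshold logic in C, with the same illustrative types and helpers as the earlier sketches (the final parameter name was still under discussion):

/*
 * Sketch of phk's proposal: up to `threshold` variants are ban-checked
 * as "community service"; after that, the request only hunts for a
 * Vary match, and the matched variant is still ban-checked before use.
 */
struct objcore *
lookup_sketch_threshold(struct objhead *oh, const struct req *req,
    unsigned threshold)
{
        struct objcore *oc;
        unsigned counter = 0;

        /* Lck_Lock(&oh->mtx) in the real code */
        for (oc = oh->objcs; oc != NULL; oc = oc->next) {
                int ban_checked = 0;

                if (counter < threshold) {
                        counter++;
                        ban_checked = 1;
                        if (ban_check_object(oc, req))
                                continue;       /* banned: skip variant */
                }
                if (!vary_matches(oc, req))
                        continue;               /* not our variant */
                if (!ban_checked && ban_check_object(oc, req))
                        continue;               /* matched, but banned */
                break;                          /* usable variant found */
        }
        /* Lck_Unlock(&oh->mtx) */
        return (oc);
}

With threshold set to zero this reduces to the new semantics discussed above (req-processed bans only act on vary-matching hits); with a sufficiently large threshold it behaves like the current code.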

walid-git added commits to walid-git/varnish-cache referencing this pull request (Jan 15, 2025); one commit message reads:

This will control the number of object variants that we evaluate
against the ban list before looking for a vary match.

Refs: varnishcache#4236