Backoffs and 429s on /keys/claim over federation causes UTDs #17267
Comments
I think this is a duplicate of #8917
Also: it might be better described as "backoff period" than "retry period", since synapse doesn't itself initiate any retries for
I've seen this happen again but this time between element.io <--> matrix.org, where the client on element.io saw in response to /keys/claim: Rate limiting this endpoint feels suboptimal...
matrix-org/matrix-spec-proposals#4081 would be a solution to this because then the sending server could just serve up the fallback key if it cannot talk to the recipient server.
@kegsay pretty sure this isn't a 429 but really a 503 (see the
Crypto are not actively working on this because the best solution would be to do matrix-org/matrix-spec-proposals#4081
Another one, element.io <-> matrix.org failing with a different error:
This happened again in a large E2EE room. The failure mode was subtly different though because
It's unclear to me how this is different from element-hq/element-meta#2154, which covers the implementation of MSC4081. Closing for now, unless someone can clarify |
Description
Debugging a UTD (rageshake), and the cause of this appears to be `/keys/claim` failing with the error shown under "Relevant log output" below.
This happens again 40 minutes later, which feels very wrong if the retry period is 40mins+.
A long retry period like this will cause UTDs because the sender cannot claim an OTK for one or more of the recipient's devices.
The error message originates here, which is called from here for claiming keys. This calls through to the transport layer, which does `post_json`, which shows it can throw `NotRetryingDestination`.
It seems to be thrown in `get_retry_limiter` here. The retry interval controls the duration, which seems to be persisted. This is loaded here and is modified according to:
The retry multiplier is a configurable value and the retry interval defaults to the min: `self.retry_interval = self.destination_min_retry_interval_ms`. So what's matrix.org's config?
Steps to reproduce
I'm guessing:
Homeserver
matrix.org
Synapse Version
Whatever was running on May 30
Installation Method
I don't know
Database
postgres
Workers
Multiple workers
Platform
?
Configuration
No response
Relevant log output
client-side: `2024-05-30 09:29:57.003 Element[7064:2329509] [MXCryptoSDK] DEBUG receive_keys_claim_response ... failures={"connecteu.rs": Object {"message": String("Not ready for retry"), "status": Number(503)}, "sw1v.org": Object {"message": String("Not ready for retry"), "status": Number(503)}}`
Anything else that would be useful to know?
Proposed solution here would be to ignore the backoff for `/keys/claim` requests, as if they fail it will definitely cause a UTD. If we don't want to do that, having a suitably low retry period (capped in the order of minutes) could be a viable alternative.
Alternatively, I had assumed that Synapse cleared backoffs when the other HS sent something to matrix.org..? Surely this would have happened here?
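To illustrate why a multiplicative backoff can reach 40 minutes after only a few failures, and what a cap in the order of minutes would change, here is a minimal Python sketch. It is not Synapse's code: `BackoffPolicy`, `schedule`, and every numeric value below are hypothetical and chosen only for illustration; matrix.org's actual configuration is not known here.

```python
from dataclasses import dataclass
from typing import List

MINUTE_MS = 60_000


@dataclass
class BackoffPolicy:
    """Hypothetical multiplicative backoff policy (not Synapse's implementation)."""

    min_interval_ms: int
    multiplier: float
    max_interval_ms: int

    def next_interval(self, current_ms: int) -> int:
        # With no existing backoff, start at the minimum interval.
        if not current_ms:
            return self.min_interval_ms
        # Otherwise grow the interval by the multiplier, up to the cap.
        return min(int(current_ms * self.multiplier), self.max_interval_ms)


def schedule(policy: BackoffPolicy, failures: int) -> List[int]:
    """Return the backoff interval (ms) in force after each consecutive failure."""
    intervals = []
    current = 0
    for _ in range(failures):
        current = policy.next_interval(current)
        intervals.append(current)
    return intervals


# Assumed, illustrative values only: 10 minute minimum, doubling, 1 week cap.
uncapped = BackoffPolicy(10 * MINUTE_MS, 2, 7 * 24 * 60 * MINUTE_MS)
# Assumed alternative in the spirit of "capped in the order of minutes".
capped = BackoffPolicy(1 * MINUTE_MS, 2, 5 * MINUTE_MS)

if __name__ == "__main__":
    for name, policy in (("uncapped", uncapped), ("capped", capped)):
        minutes = [ms / MINUTE_MS for ms in schedule(policy, 4)]
        print(name, minutes)
```

Under these assumed values, the third consecutive failure already produces a 40 minute window during which `/keys/claim` would not be attempted, whereas the capped policy never backs off for more than 5 minutes.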