-
Notifications
You must be signed in to change notification settings - Fork 385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MSC4162: One-Time Key Reset Endpoint #4162
base: main
Are you sure you want to change the base?
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,90 @@ | ||||||
## MSC4162: One-Time Key Reset Endpoint | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: Looks like this MSC is covering more than just reset. |
||||||
|
||||||
|
||||||
### Context | ||||||
One-time keys (OTKs) consist of two parts: a public part which is stored on the server and a private part which is stored on the client device. One-time keys can be uploaded to the server with the `/keys/upload` endpoint. They can then be consumed via the `/keys/claim` endpoint. The client is kept informed of the total number of OTKs stored on the server via the `/sync` endpoint, with the JSON key `device_one_time_keys_count`. It is critical that the set of keys on the client and server remain in-sync, otherwise an encrypted channel between two devices cannot be established. "In-sync" is defined by the set of public keys on the server having their matching private keys on the client. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please wrap lines to ~100 characters There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I feel like this is misleading. What we really mean is "the set of keys on the server is a subset of those on the client". It's not a problem if the client has extra keys (generally, we assume they've been claimed but not used, but in any case they are harmless, modulo the fact they take up slots in the client). |
||||||
|
||||||
OTKs can be "in-flight". This describes the scenario where a OTK has been claimed by another device via `/keys/claim`, but a message encrypted with this key has not been received yet. There are many valid reasons for why the message has not been received yet, but it is also a potential attack vector if the client is forced to remember all keys for all time. | ||||||
|
||||||
### Desync causes | ||||||
In practice, OTKs desync for a myriad of reasons, some of which are outlined below. | ||||||
|
||||||
#### Client issues | ||||||
|
||||||
- If the request to `/keys/upload` fails, the client must retry (including over restarts) the _same set of keys_ as it is not possible to know if the server received the request and has stored them already. If this doesn't happen, the server may return HTTP 400 errors indicating this. [[issue](https://github.com/matrix-org/matrix-rust-sdk/issues/1415)] | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This can probably be fixed without a whole new endpoint? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, but it outlines a common failure mode which causes OTKs to desync. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It would probably be helpful to break this down into "implementation bugs which could theoretically be fixed" vs "real protocol problems" (such as backup rollback) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't really follow this. Are you saying "clients might forget to re-upload keys if they do not get a successful response"? Why would that cause a 400? |
||||||
- If the client runs across multiple processes, both processes can upload different sets of OTKs in response to some trigger event (e.g being notified that a OTK has been consumed via `/sync`). [[issue](https://github.com/matrix-org/matrix-rust-sdk/issues/3110)] | ||||||
- Clients should only delete a OTK once the key has been successfully used. [[issue](https://github.com/matrix-org/matrix-rust-sdk/issues/1761)] | ||||||
Comment on lines
+15
to
+16
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These two feel like implementation bugs rather than spec concerns. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, but it outlines a common failure mode which causes OTKs to desync. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
- Client databases may become corrupted due to hardware failure. | ||||||
- Client databases may be rolled back (e.g full phone backups). | ||||||
|
||||||
#### Server issues | ||||||
- Server databases may become corrupted due to hardware failure. | ||||||
- Server databases may be rolled back (e.g during a bad upgrade process). | ||||||
|
||||||
#### Protocol issues | ||||||
- Clients only store a limited set of OTKs. When they delete excess OTKs they have no mechanism to tell the server to delete those old OTKs. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why are they deleting them? If it's just good hygiene, do we add expiry timestamps to the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Clients delete them to not be forced into storing an unbounded set of OTKs. Clients try to keep to a number (typically 50 or 100) and when the server says the count is < this the client uploads more. But it can't delete any key as there may be in-flight requests. Hence, you can cause clients to use an unbounded amount of disk space by repeatedly telling the client that the OTK count is 0. This is bad. JS SDK and Rust SDK have maximum bounds to the number of OTKs it will store, currently this is set to 5000. |
||||||
- Multiple users could hit `/keys/claim` at the same time as a client is uploading OTKs, which creates race conditions which can cause the same key to be sent to multiple users. [[issue](https://github.com/matrix-org/matrix-spec/issues/1124)] | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This feels like an implementation issue, though the spec may need to be clarified how the locking works. |
||||||
|
||||||
### Proposal | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The problem feels under-described: why is a reset required? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There's existing devices who have desynced OTKs. This is guaranteed to cause UTDs when those keys get used. It would be nice to fix that. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As I see it, there are basically three usecases for a full reset:
Our best idea for resolving all three situations is to blow away all the OTKs on the server and start again. It would be good if the MSC spelled this out. |
||||||
|
||||||
#### Reset endpoint | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is a list of why OTKs could get desync, and an end-point to reset and re-resync. But how technically do we detect that there is a desync? Unless I missed it I don't think it's noted here. Would a new end-point that would allow to check that client/server are in sync would be helpfull? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can't detect desyncs because the client and server are not continually checking (and they do need to continually check, as DB corruption / rollbacks can happen at arbitrary points in time). Only |
||||||
A new endpoint `POST /keys/reset` is added. It accepts the following request body: | ||||||
|
||||||
```js | ||||||
{ | ||||||
"all": true|false | ||||||
"key_ids": [ | ||||||
"<algorithm>:<key_id>", ... | ||||||
] | ||||||
} | ||||||
``` | ||||||
- If `all` is `true`, all stored one-time keys for the calling device are deleted on the server, and `key_ids` is ignored. | ||||||
- If `key_ids` is not empty, every specified key ID is deleted from the database. An example key ID is `signed_curve25519:AAAAHQ`. | ||||||
kegsay marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
The server will then return the following response: | ||||||
```js | ||||||
{ | ||||||
"device_one_time_keys_count": { | ||||||
"<algorithm>": 0 // or whatever the count value is after this operation is applied | ||||||
} | ||||||
} | ||||||
``` | ||||||
|
||||||
#### Claiming keys ordering | ||||||
In addition to these changes, the semantics around `/keys/claim` has been slightly modified. Currently, the semantics are simply: | ||||||
|
||||||
> Claims one-time keys for use in pre-key messages. | ||||||
|
||||||
This does not define a sort ordering (i.e _which_ key should be given out to the caller). This MSC proposes tweaking this wording to be: | ||||||
|
||||||
> Claims one-time keys for use in pre-key messages. The key claimed MUST be the lowest lexicographically sorted key ID for that algorithm. For example, given two key IDs of `AAAAAu` and `AAAAHg`, `AAAAAu` MUST be returned before `AAAAHg`. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think there have been discussion for the KeyId to be the public key and not an "increment" like now. Would it be possible to have conflicts here? or it's not important? What about just using a timestamp ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There needs to be a way to sort the keys and perhaps more importantly, to generate new keys > old keys. Base 64 public keys don't allow for this, so that would make sorting impossible. We'd probably need to use timestamps. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1 for timestamps here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Per element-hq/element-meta#2356 (comment): lexicographical ordering on key IDs turns out to be somewhat disastrous anyway |
||||||
|
||||||
### Rationale | ||||||
|
||||||
The primary motivation of this endpoint is to provide a way for a client and server to resync OTKs. To do this, a client would call `/keys/reset` with `all: true` to delete all OTKs on the server. Once this 200 OKs, the client is then safe to perform `/keys/upload` with a fresh set of OTKs. This functionality is particularly important for existing clients who currently have desynced OTKs. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wonder if we don't want something a bit different for fallback keys as they have a longer life time (option to not reset them?). As it may interfer with #4081 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fallback keys are not one-time keys, so wouldn't be reset here. Fallback keys can be replaced in |
||||||
|
||||||
A secondary motivation of this endpoint is to provide a way for clients to tell servers to delete OTKs which the client has decided to delete (e.g due to low disk space). To do this, a client would call `/keys/reset` with `key_ids: [ the, keys, being, deleted, from, the, client ]`. Once this 200 OKs, the client is then safe to perform the deletion. | ||||||
|
||||||
The introduction of a dedicated endpoint, as opposed to allowing OTKs to replace each other when performing `/keys/upload`, is to make it an explicit decision for clients to reset OTKs. This removes the risk of hiding client bugs, as the client will only know of the desync when `/keys/upload` returns an HTTP 400. | ||||||
|
||||||
Finally, the motivation to change `/keys/claim` allows clients to detect and delete in-flight OTKs. They can do this by performing the following: | ||||||
- Upload OTKs A,B,C,D. | ||||||
- 2 users claim OTKs for this device. The server MUST return A then B. | ||||||
- Client receives an encrypted message encrypted with B. | ||||||
- At this point, the client knows that A is in-flight as A comes before B. It can then wait to see if an encrypted message will be sent using that key, or it can eventually give up (e.g after 30 days) and then delete the key using `/keys/reset`. | ||||||
|
||||||
### Alternatives | ||||||
- Allow `/keys/upload` to replace OTKs which have a matching key ID. This risks hiding client and server bugs. It also does nothing to help desync causes due to other issues e.g database issues. | ||||||
- Do nothing. Clients and servers will continue to desync over time, causing more Olm sessions to be unable to be established, causing unable to decrypt errors. | ||||||
|
||||||
### Security Considerations | ||||||
|
||||||
- If an access token is compromised, a malicious attacker can reset OTKs for that device. This would force the fallback key to be used when establishing Olm sessions. | ||||||
|
||||||
### Unstable prefix | ||||||
|
||||||
Whilst in development, the path shall be `/_matrix/client/unstable/org.matrix.msc4162/keys/reset`. There is no proposed unstable prefix for the changed semantics to `/keys/claim` due to the change being backwards compatible and valid for servers to do today. | ||||||
|
||||||
### Appendices | ||||||
|
||||||
- [Meta-issue](https://github.com/element-hq/element-meta/issues/2406) describing the general problem of OTK desyncing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation requirements: