MSC4081: Eagerly sharing fallback keys with federated servers #4081

Open
wants to merge 10 commits into
base: main
from
218 changes: 218 additions & 0 deletions proposals/4081-claim-fallback-keys-on-network-failure.md
@@ -0,0 +1,218 @@
# MSC4081: Eagerly sharing fallback keys with federated servers

*Abstract: This MSC aims to increase the robustness of the Olm session setup protocol over federation.
With this MSC, transient network failures over federation will not cause undecryptable messages due to
failing to claim OTKs.*

In order for clients to establish secure communication channels between devices, they need to "claim" one-time keys
(OTKs) that were previously uploaded by the device they wish to talk to. One-time keys, as the name suggests, must
only be used once. However, this presents several problems:
- what happens when the device does not upload more keys and the uploaded keys are all used up? (key exhaustion)
- what happens if the OTK cannot be claimed due to transient network failures?

[MSC2732](https://github.com/matrix-org/matrix-spec-proposals/pull/2732) introduced the concept of "fallback keys"
which can be claimed when OTKs are exhausted. Fallback keys provide weaker security properties than one-time keys,
specifically impacting forward secrecy, which protects past sessions against future compromises of keys or
passwords. The risk is that if the private part of the fallback key is exposed, an attacker may use the key to
decrypt earlier sessions. This can be mitigated by creating a new fallback key as soon as the old one has been used
(and hence later deleting the private key, with some lag time to account for slow networks).

For reference, https://crypto.stackexchange.com/a/52825 is a good explanation of why OTKs are preferable
to fallback keys, where they are available. (The question is about Signal rather than Olm, however the principles
are much the same. Signal uses the terms "prekey" to refer to "fallback key" and "one-time prekey" to refer to
OTK.)

## Proposal

Currently, fallback keys are _only_ used on key exhaustion, not due to transient network failures. This MSC
proposes to change the semantics to allow fallback keys to be returned by the `/keys/claim` endpoint if the server
the target device is on is unreachable. In order for servers to return fallback keys during the network failure,
the fallback keys must be cached _in advance_ on the claiming user's homeserver.

### Extend `/_matrix/client/v3/keys/upload` request

Clients have to opt in to this process when uploading fallback keys. To allow this, we extend the [`POST
/_matrix/client/v3/keys/upload`](https://spec.matrix.org/v1.9/client-server-api/#post_matrixclientv3keysupload)
endpoint with a new request body parameter, `eager_share_fallback_keys`, as follows (bold is new):

| Name | Type | Description |
|-----------------------|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `device_keys` | `DeviceKeys` | Identity keys for the device. May be absent if no new identity keys are required.
| `fallback_keys` | `OneTimeKeys` | The public key which should be used if the device’s one-time keys are exhausted, **or if the user's homeserver is unreachable**. [etc]
| `one_time_keys` | `OneTimeKeys` | One-time public keys for “pre-key” messages. The names of the properties should be in the format `<algorithm>:<key_id>`. The format of the key is determined by the key algorithm. May be absent if no new one-time keys are required.
| **`eager_share_fallback_keys`** | **`boolean`** | **Whether the `fallback_keys` should immediately be sent to other homeservers that have a user who shares a room with this user. Omitting this property is the same as setting it to `false`.**
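
As a rough illustration, a client opting in might build the request body as sketched below. This is only a sketch: the helper name `build_upload_body` is hypothetical, and the key material is a placeholder rather than a real signed key.

```python
import json


def build_upload_body(fallback_keys, eager_share=False):
    """Build a /keys/upload request body (hypothetical helper, not part of any SDK)."""
    body = {"fallback_keys": fallback_keys}
    # Omitting the property is the same as false, so only send it when opting in.
    if eager_share:
        body["eager_share_fallback_keys"] = True
    return body


body = build_upload_body(
    {
        "signed_curve25519:AAAAHg": {
            "fallback": True,
            "key": "<base64 public key>",  # placeholder value
            "signatures": {},  # placeholder: a real upload carries a signature
        }
    },
    eager_share=True,
)
print(json.dumps(body, indent=2))
```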

### Extend `m.device_list_update` EDU

This MSC proposes adding a new key `fallback_keys` to the [`m.device_list_update`
EDU](https://spec.matrix.org/v1.9/server-server-api/#definition-mdevice_list_update). We change the spec wording as
follows:

> Servers must send `m.device_list_update` EDUs to all the servers who share a room with a given local user, and
> must be sent whenever that user’s device list changes (i.e. for new or deleted devices, when that user joins a
> room which contains servers which are not already receiving updates for that user’s device list, or changes in
> device information such as the device’s human-readable name **or, if the client has opted into eager sharing of
> fallback keys, the fallback keys**).

A new property `fallback_keys` is added to the body of the `m.device_list_update` EDU, as shown below (bold is new):

| Name | Type | Description |
|-----------------------|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `deleted` | `boolean` | True if the server is announcing that this device has been deleted.
| `device_display_name` | `string` | The public human-readable name of this device. Will be absent if the device has no name.
| `device_id` | `string` | Required: The ID of the device whose details are changing.
| `keys` | `DeviceKeys` | The updated identity keys (if any) for this device. May be absent if the device has no E2E keys defined.
| `prev_id` | `[integer]` | The `stream_ids` of any prior `m.device_list_update` EDUs sent for this user which have not been referred to already in an EDU’s `prev_id` field. If the receiving server does not recognise any of the `prev_ids`, it means an EDU has been lost and the server should query a snapshot of the device list via `/user/keys/query` in order to correctly interpret future `m.device_list_update` EDUs. May be missing or empty for the first EDU in a sequence.
| `stream_id` | `integer` | Required: An ID sent by the server for this update, unique for a given `user_id`. Used to identify any gaps in the sequence of `m.device_list_update` EDUs broadcast by a server.
| `user_id` | `string` | Required: The user ID who owns this device.
| **`fallback_keys`** | **`{string: KeyObject}`** | **The fallback keys for this device, if set, and if the client has opted in to eager sharing. This is the same as the most recent `fallback_keys` uploaded by this device via [`POST /_matrix/client/v3/keys/upload`](https://spec.matrix.org/v1.9/client-server-api/#post_matrixclientv3keysupload).**


An example of an EDU with the new property:
```js
{
  "content": {
    "device_display_name": "Mobile",
    "device_id": "QBUAZIFURK",
    "keys": {
      "algorithms": [
        "m.olm.v1.curve25519-aes-sha2",
        "m.megolm.v1.aes-sha2"
      ],
      "device_id": "JLAFKJWSCS",
      "keys": {
        "curve25519:JLAFKJWSCS": "3C5BFWi2Y8MaVvjM8M22DBmh24PmgR0nPvJOIArzgyI",
        "ed25519:JLAFKJWSCS": "lEuiRJBit0IG6nUf5pUzWTUEsRVVe/HJkoKuEww9ULI"
      },
      "signatures": {
        "@john:example.com": {
          "ed25519:JLAFKJWSCS": "dSO80A01XiigH3uBiDVx/EjzaoycHcjq9lfQX0uWsqxl2giMIiSPR8a4d291W1ihKJL/a+myXS367WT6NAIcBA"
        }
      },
      "user_id": "@john:example.com"
    },
    "prev_id": [
      5
    ],
    "stream_id": 6,
    "user_id": "@john:example.com",
    "fallback_keys": {
      "signed_curve25519:AAAAHg": {
        "fallback": true,
        "key": "zKbLg+NrIjpnagy+pIY6uPL4ZwEG2v+8F9lmgsnlZzs",
        "signatures": {
          "@john:example.com": {
            "ed25519:JLAFKJWSCS": "FLWxXqGbwrb8SM3Y795eB6OA8bwBcoMZFXBqnTn58AYWZSqiD45tlBVcDa2L7RwdKXebW/VzDlnfVJ+9jok1Bw"
          }
        }
      }
    }
  },
  "edu_type": "m.device_list_update"
}
```
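
To make the receiving side concrete, here is a minimal sketch of how a homeserver might keep a cache of fallback keys up to date from incoming EDUs. The function name and the in-memory dict cache are hypothetical. Leaving the cache untouched when an EDU carries no `fallback_keys` is an assumption of this sketch, not something the proposal specifies, and `prev_id` gap handling is omitted.

```python
def cache_fallback_keys(cache, edu):
    """Update a {(user_id, device_id): fallback_keys} cache from one EDU.

    Hypothetical helper: a real server would persist this and also handle
    gaps in the prev_id / stream_id sequence.
    """
    content = edu["content"]
    cache_key = (content["user_id"], content["device_id"])
    if content.get("deleted"):
        # The device is gone, so its cached fallback keys must go too.
        cache.pop(cache_key, None)
    elif "fallback_keys" in content:
        # Replace whatever was cached with the most recent fallback keys.
        cache[cache_key] = content["fallback_keys"]
    # Assumption: an EDU without fallback_keys leaves the cache unchanged.
    return cache
```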

### Changed semantics for `/keys/claim`

[`POST /_matrix/client/v3/keys/claim`](https://spec.matrix.org/v1.9/client-server-api/#post_matrixclientv3keysclaim) can
now respond with a cached fallback key if the remote server is unreachable. "Unreachable" includes:
- unable to connect to the server
- the sending server is backing off the remote server
- the remote server responded with an error code such as 429 Too Many Requests.
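
The changed behaviour can be sketched roughly as follows. Everything here is illustrative: `RemoteUnreachable`, `claim_key` and the `claim_remote` callable are hypothetical names, and the sketch assumes the federation layer maps all of the "unreachable" conditions listed above onto a single exception.

```python
class RemoteUnreachable(Exception):
    """Hypothetical error covering connection failure, backoff, or e.g. a 429."""


def claim_key(user_id, device_id, claim_remote, fallback_cache):
    """Claim a key for a remote device, using the cached fallback key on failure.

    claim_remote: callable performing the federated claim (hypothetical).
    fallback_cache: {(user_id, device_id): key object} filled from EDUs.
    Returns None if the remote is unreachable and nothing is cached.
    """
    try:
        # Normal path: the remote server hands out an OTK (or its fallback key).
        return claim_remote(user_id, device_id)
    except RemoteUnreachable:
        # New behaviour: serve the eagerly shared fallback key instead.
        return fallback_cache.get((user_id, device_id))
```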

### Changed semantics for rotating fallback keys

As a reminder, clients SHOULD upload a new fallback key when they realise it has been "used".

The definition of when a fallback key is "used" is changed by this MSC. Previously, a fallback key was "used"
_if it is claimed by another device_. When this happens, the client is told this via `/sync`, by removing the
algorithm from the `device_unused_fallback_key_types` array. This is no longer a useful mechanism, as the key is
sent eagerly over federation.

Therefore, we change the definition of "used" to be "when the device receives and successfully decrypts an initial
pre-key to-device event which uses that key". As soon as such an event is received, a new fallback key should be
created and uploaded via `/keys/upload`. (As above, this will then trigger `m.device_list_update` EDUs.)

We also recommend that the fallback key is **rotated periodically** _even if the key isn't "used"_,
e.g. once per week. This reduces the risk of the key being used without the client knowing about it (for example,
due to a networking problem). Some clients [already do this](https://github.com/matrix-org/matrix-rust-sdk/pull/3151).
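
A client-side rotation check following these two rules might look like the sketch below. The function name and the use of plain Unix timestamps are assumptions made for illustration; `used` stands for "this device has decrypted an initial pre-key message that used the current fallback key".

```python
ROTATION_PERIOD_SECONDS = 7 * 24 * 60 * 60  # "once per week", as suggested above


def should_rotate(uploaded_at, now, used):
    """Decide whether to create and upload a new fallback key (hypothetical helper).

    uploaded_at, now: Unix timestamps in seconds.
    used: True once a pre-key message using this fallback key was decrypted.
    """
    # Rotate as soon as the key is known to be used, or once it is a week old.
    return used or (now - uploaded_at) >= ROTATION_PERIOD_SECONDS
```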

Once a new key has been uploaded, the private part of the old key should be scheduled for deletion. This cannot
happen immediately, since there may be other messages in flight which rely on the old key. This was also true of
the original fallback keys implementation
([MSC2732](https://github.com/matrix-org/matrix-spec-proposals/pull/2732)), however there could now be a much more
significant delay between the old key being used to encrypt a message and that message being received at the
recipient, and MSC2732's recommendation (the lesser of "as soon as the new key is used" and 1 hour) is inadequate.
We therefore recommend significantly increasing the period for which an old fallback key is kept on the client, to
30 days after the key was replaced, but making sure that at least one old fallback key is kept at all
Member

I think this means that after 30d of netsplit, a user on a server which has cached an old fallback key will no longer be able to establish an Olm session to you?

This feels like a pretty major limitation which should be called out - and communicated to the sending user when it happens?

I still like the idea of warning the user in general what users they can’t communicate with (due to no OTKs, or due to expired fallback keys), so the user can go and complain and get the problem solved.

Member

@richvdh richvdh Feb 29, 2024

I think it's worth saying that, after 30 days of netsplit, it's a good bet that any to-device messages you send aren't going to be delivered for a while anyway. (In other words: what good does obtaining a fallback key do if you then can't actually send anything you encrypt with it?)

Still, you're not wrong. It's also worth saying that this 30 days is entirely under the client's control; so if you are working in an environment where you expect your homeserver to go incommunicado for 3 months, perhaps you can configure your client to keep old fallback keys that long.

As for detecting and reporting the situation to the user: yes, that might be nice, if only so that it can end up in a rageshake (I can hear @pmaier1's voice in my head saying "users don't want to be bothered with this sort of technical detail!"). The problem is, how can we detect it? The problem is that Alice (who is using the fallback key) has no way of knowing that Bob has expired that key, because they are netsplit. Timestamping the keys doesn't help, because an old key could also mean that Bob hasn't used his client for 30 days. We could maybe detect the situation once Bob comes back online, but that doesn't help the user get the problem solved in the first place.

IMHO this starts to get into questions about "how long is it reasonable for clients/servers to be offline/unreachable, and still expect messages to get delivered". FB Messenger, for example, actually logs out any devices that are unused after 30 days. We don't necessarily need to go that far, but "forever" seems an unreasonable expectation; if we actually set some expectations then we could work towards having sensible UX when that time expires.

Member Author

> IMHO this starts to get into questions about "how long is it reasonable for clients/servers to be offline/unreachable, and still expect messages to get delivered". FB Messenger, for example, actually logs out any devices that are unused after 30 days. We don't necessarily need to go that far, but "forever" seems an unreasonable expectation; if we actually set some expectations then we could work towards having sensible UX when that time expires.

💯 - if we actually bothered to do this then maybe we could clear out the to_device_inbox at some point...

I disagree that "I still like the idea of warning the user in general what users they can’t communicate with (due to no OTKs, or due to expired fallback keys), so the user can go and complain and get the problem solved." is a useful property to be trying to preserve here. If I'm on a HS and it is down, then you cannot talk to me no matter what you try to do. In some cases there may be an alternative way of contacting me, to which I can either A) thank you for being a real-life PagerDuty and restart the server or B) shrug and say I don't actually control it, as it's on $homeserver I don't control.

times. (Since we recommend rotating keys every week, normally there will be several old keys on the
client. However, if a user does not use their client for a month, there could be a backlog of messages for the most
recent old key; this is why we always keep at least one.)

## Comparisons with X3DH (Signal)

X3DH is very similar to Matrix's key agreement protocol. Due to this similarity, it is worth researching what X3DH
does with respect to OTKs.

> To perform an X3DH key agreement with Bob, Alice contacts the server and fetches a "prekey bundle" containing the following values:
>
> - Bob's identity key IKB
> - Bob's signed prekey SPKB
> - Bob's prekey signature Sig(IKB, Encode(SPKB))
> - (Optionally) Bob's one-time prekey OPKB

https://signal.org/docs/specifications/x3dh/#sending-the-initial-message


Signal uses the terms "prekey" to refer to "fallback key" and "one-time prekey" to refer to OTK. In X3DH, one-time
keys are optional. If they are exhausted, the protocol simply continues without one. If they are present, an additional
DH operation is performed.

This optionality makes the protocol robust to OTK exhaustion and transient network failures (e.g. failing to reach
the database that holds the OTKs, since Signal is not federated).

## Security Considerations

1. Ultra secure clients may be unhappy that fallback keys are being returned and not one-time keys, because they
dislike the slightly weaker security properties fallback keys provide. Since fallback keys are marked as such
with `fallback: true`, such clients can detect this situation and act accordingly (e.g. by refusing to send a
message, or by retrying later).

2. A malicious actor who can control network conditions (but not the servers themselves) can force a client to use
a fallback key by temporarily preventing two homeservers from communicating. Previously, the only way such an
actor could force a client to use a fallback key would be to claim all the OTKs before the client had a chance
to upload more. Therefore, this MSC increases the ways attackers can force clients to use fallback
keys. Fallback keys weaken forward secrecy. It is assumed that "most" sessions will be set up using OTKs and not
the fallback key. If this assumption holds, forcing use of a fallback key does nothing to compromise those
sessions. This means this attack is only useful for _active attacks_, where an attacker wants to compromise
_sessions that have yet to be established_, and wants to force those sessions to be set up with the fallback
key.

3. By sending the fallback key eagerly, an attacker would have access to the public key for a longer period of time
than before. Without this MSC, the fallback key remains on the uploader's homeserver until a federated user
requests it. At that point, the client is notified via `/sync` that the fallback key has been used and hence
should be rotated. With this MSC, the client would not be notified when the fallback key is used on the remote
server, because this MSC is robust to network partitions. Instead, the user will be notified when they receive a
to-device event encrypted with the fallback key. If having access to the public part of the fallback key _for an
extended period of time_ is useful for an attacker, then this MSC decreases security.

We are not aware of any scenario where having access to the public key for a longer period of time is a security
risk. If there is a risk, other decentralised systems such as Bitcoin, Ethereum and libp2p which all rely on
long-lived public keys as addresses would also be vulnerable. Furthermore, the user's own homeserver has access
to the fallback key today. If access to the key for an extended time is a security risk, and the user does not
trust their own homeserver (not unreasonable given this is for E2EE) then any concerns _are already present
today_, just not over federation.
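
For the first consideration above, a strict client could inspect the claimed keys before using them. The sketch below is illustrative only: the helper names are hypothetical, and `claimed` is assumed to be the `{"<algorithm>:<key_id>": key object}` mapping returned for one device.

```python
def is_fallback_key(key_object):
    """True if the key object is marked with `fallback: true`."""
    return bool(key_object.get("fallback", False))


def keys_acceptable(claimed, allow_fallback):
    """Decide whether a client should use the claimed keys (hypothetical policy).

    A client that sets allow_fallback=False would refuse to send, or retry
    later, whenever any of the claimed keys is a fallback key.
    """
    return allow_fallback or not any(is_fallback_key(k) for k in claimed.values())
```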

## Alternatives

1. Do nothing. In this scenario, if the remote server is unreachable when the client calls `/keys/claim`, the
message will not be encrypted for that device, and the end user will be unable to decrypt the message. What's
worse, this will persist until the client decides to retry the `/keys/claim` endpoint, which could be seconds or
much longer. As a data point, Matrix Rust SDK currently uses [15
seconds](https://github.com/matrix-org/matrix-rust-sdk/issues/2804) and this is seen as very low.

2. Clients could remember that they were unable to claim keys for a given device, and retry periodically. The main
problem with this approach (other than increased complexity in the client) is that it requires the sending
client to still be online when the remote server comes online, and to notice that has happened. There may be
Member Author

It basically rules out asynchronous communication when initially talking to someone, as both HSes need to be online at exactly the same time. This feels very suboptimal and rules out the ability to run E2EE Matrix under certain network conditions.

Member

> both HSes need to be online at exactly the same time.

Whatever we do, both HSes need to be online at the same time because one has to make an HTTP request to the other. I guess you mean that the sending client, and both HSes, have to be online. In which case, yes I agree with you and this is a succinct summary of why this approach is insufficient.

Member Author

Yes that's what I meant, sorry.

other benefits to such an approach, but we feel that this MSC nevertheless represents an achievable, incremental
improvement in reliability.