From ef28b251981f6185c73f956302771a9114749dd0 Mon Sep 17 00:00:00 2001 From: Kegan Dougal <7190048+kegsay@users.noreply.github.com> Date: Tue, 21 Nov 2023 17:50:03 +0000 Subject: [PATCH 01/10] MSC4081: Claim fallback key on network failure --- ...-claim-fallback-keys-on-network-failure.md | 75 +++++++++++++++++++ 1 file changed, 75 insertions(+) create mode 100644 proposals/4081-claim-fallback-keys-on-network-failure.md diff --git a/proposals/4081-claim-fallback-keys-on-network-failure.md b/proposals/4081-claim-fallback-keys-on-network-failure.md new file mode 100644 index 00000000000..ee1c75ffc90 --- /dev/null +++ b/proposals/4081-claim-fallback-keys-on-network-failure.md @@ -0,0 +1,75 @@ +# MSC4081: Claim fallback key on network failures + +In order for clients to establish secure communication channels between devices, they need to "claim" one-time keys (OTKs) that were previously uploaded by the device they wish to talk to. One-time keys, as the name suggests, must only be used once. However, this presents several problems: + - what happens when the device does not upload more keys and the uploaded keys are all used up? (key exhaustion) + - what happens if the OTK cannot be claimed due to transient network failiures. + +[MSC2732](https://github.com/matrix-org/matrix-spec-proposals/pull/2732) introduced the concept of "fallback keys" which can be claimed when OTKs are exhausted. Fallback keys provide weaker security properties than one-time keys, specifically impacting forward secrecy, which protects past sessions against future compromises of keys or passwords. The risk is that if the private part of the fallback key is exposed, an attacker may use the key to decrypt earlier sessions. This can be mitigated by cycling the fallback key (and hence deleting the private key) once it has been used, with some lag time to account for slow networks. + +## Proposal + +Currently, fallback keys are _only_ claimed on key exhaustion, not due to transient network failures. This MSC proposes to change the semantics to allow fallback keys to be returned by the `/keys/claim` endpoint if the server the target device is on is unreachable. In order for servers to return fallback keys during the network failure, the fallback keys must be cached _in advance_ on the claiming user's homeserver. This MSC proposes adding a new key `fallback_keys` to the `m.device_list_update` EDU. This MSC proposes changing the spec wording (bold is new): + +> Servers must send `m.device_list_update` EDUs to all the servers who share a room with a given local user, and must be sent whenever that user’s device list changes (i.e. for new or deleted devices, when that user joins a room which contains servers which are not already receiving updates for that user’s device list, or changes in device information such as the device’s human-readable name **or fallback key**). + +The following key/values are added to the `DeviceKeys` object definition (bold is new): + +| Name | Type | Description | +|------------------|-------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| algorithms | [string] | Required: The encryption algorithms supported by this device. | +| device_id | string | Required: The ID of the device these keys belong to. Must match the device ID used when logging in. | +| keys | {string: string} | Required: Public identity keys. The names of the properties should be in the format :. The keys themselves should be encoded as specified by the key algorithm. | +| signatures | Signatures | Required: Signatures for the device key object. A map from user ID, to a map from : to the signature. The signature is calculated using the process described at Signing JSON. | +| user_id | string | Required: The ID of the user the device belongs to. Must match the user ID used when logging in. | +| **fallback_key** | **{string: KeyObject}** | **The fallback key for this device, if set. The format of this object is identical to the /keys/claim response for a single device. This replaces any previously sent fallback key.** | + +An example of the new field: +```js +{ + // ... + "fallback_key": { + "signed_curve25519:AAAAHg": { + "key": "zKbLg+NrIjpnagy+pIY6uPL4ZwEG2v+8F9lmgsnlZzs", + "signatures": { + "@alice:example.com": { + "ed25519:JLAFKJWSCS": "FLWxXqGbwrb8SM3Y795eB6OA8bwBcoMZFXBqnTn58AYWZSqiD45tlBVcDa2L7RwdKXebW/VzDlnfVJ+9jok1Bw" + } + } + } + } +} +``` + +As a reminder, clients SHOULD rotate their fallback key when they realise it has been used, with some lag time to account for federation. As per MSC2732, 1 hour is recommended. When clients change their fallback key, a new `m.device_list_update` EDU MUST be sent. + +This proposal has no client-side changes. + +## Comparisons with X3DH (Signal) + +X3DH is very similar to Matrix's key agreement protocol. Due to this similarity, it is worth researching what X3DH does with respect to OTKs. + +> To perform an X3DH key agreement with Bob, Alice contacts the server and fetches a "prekey bundle" containing the following values: +> +> - Bob's identity key IKB +> - Bob's signed prekey SPKB +> - Bob's prekey signature Sig(IKB, Encode(SPKB)) +> - (Optionally) Bob's one-time prekey OPKB + +https://signal.org/docs/specifications/x3dh/#sending-the-initial-message + + +Signal uses the terms "prekey" to refer to "fallback key" and "one-time prekey" to refer to OTK. In X3DH, one-time keys are optional. If they are exhausted, the protocol simply continues without it. If they are present, an additional DH operation is performed. + +This optionality makes the protocol robust to OTK exhaustion and transient network failures (e.g to a database to claim OTKs as Signal is not federated). + +## Security Considerations + +Ultra secure clients may be unhappy that fallback keys are being returned and not one-time keys, because they dislike the slightly weaker security properties fallback keys provide. This could be resolved by adding a flag to the `/keys/claim` endpoint to state whether returning a fallback key is acceptable to the client or not. If this flag is not set/missing, fallback keys would not be returned in place of OTKs, meaning this MSC would be entirely opt-in, and hence require client-side changes. However, a malicious server can trivially ignore this flag and return the fallback key anyway, and the client would not be able to detect this. For this reason, it feels like security theater to add this flag. + +A malicious actor who can control network conditions can force a client to use a fallback key by temporarily preventing two homeservers from communicating. Previously, the only way a malicious actor could force a client to use a fallback key would be to claim all the OTKs before the client had a chance to upload more. Therefore, this MSC increases the ways attackers can force clients to use fallback keys. Fallback keys weaken forward secrecy. It is assumed that "most" sessions will be set up using OTKs and not the fallback key. If this assumption holds, forcing use of a fallback key does nothing to compromise those sessions. This means this attack is only useful for _active attacks_, where an attacker wants to compromise _sessions that have yet to be established_, and wants to force those sessions to be set up with the fallback key. + +## Alternatives + +Do nothing. In this scenario, if the remote server is unreachable when the client calls `/keys/claim`, the message will not be encrypted for that device, and the end user will be unable to decrypt the message. What's worse, this will persist until the client decides to retry the `/keys/claim` endpoint, which could be seconds or much longer. As a data point, Matrix Rust SDK currently uses [15 seconds](https://github.com/matrix-org/matrix-rust-sdk/issues/2804) and this is seen as very low. + + From 770f66059be66d1ba5894a9ac5c0bfb85e38c0e2 Mon Sep 17 00:00:00 2001 From: Kegan Dougal <7190048+kegsay@users.noreply.github.com> Date: Wed, 22 Nov 2023 10:33:37 +0000 Subject: [PATCH 02/10] Formatting; mention increased access to fallback key --- ...-claim-fallback-keys-on-network-failure.md | 76 +++++++++++++++---- 1 file changed, 63 insertions(+), 13 deletions(-) diff --git a/proposals/4081-claim-fallback-keys-on-network-failure.md b/proposals/4081-claim-fallback-keys-on-network-failure.md index ee1c75ffc90..cf251b7906e 100644 --- a/proposals/4081-claim-fallback-keys-on-network-failure.md +++ b/proposals/4081-claim-fallback-keys-on-network-failure.md @@ -1,16 +1,30 @@ # MSC4081: Claim fallback key on network failures -In order for clients to establish secure communication channels between devices, they need to "claim" one-time keys (OTKs) that were previously uploaded by the device they wish to talk to. One-time keys, as the name suggests, must only be used once. However, this presents several problems: +In order for clients to establish secure communication channels between devices, they need to "claim" one-time keys +(OTKs) that were previously uploaded by the device they wish to talk to. One-time keys, as the name suggests, must +only be used once. However, this presents several problems: - what happens when the device does not upload more keys and the uploaded keys are all used up? (key exhaustion) - - what happens if the OTK cannot be claimed due to transient network failiures. + - what happens if the OTK cannot be claimed due to transient network failures. -[MSC2732](https://github.com/matrix-org/matrix-spec-proposals/pull/2732) introduced the concept of "fallback keys" which can be claimed when OTKs are exhausted. Fallback keys provide weaker security properties than one-time keys, specifically impacting forward secrecy, which protects past sessions against future compromises of keys or passwords. The risk is that if the private part of the fallback key is exposed, an attacker may use the key to decrypt earlier sessions. This can be mitigated by cycling the fallback key (and hence deleting the private key) once it has been used, with some lag time to account for slow networks. +[MSC2732](https://github.com/matrix-org/matrix-spec-proposals/pull/2732) introduced the concept of "fallback keys" +which can be claimed when OTKs are exhausted. Fallback keys provide weaker security properties than one-time keys, +specifically impacting forward secrecy, which protects past sessions against future compromises of keys or passwords. +The risk is that if the private part of the fallback key is exposed, an attacker may use the key to decrypt earlier +sessions. This can be mitigated by cycling the fallback key (and hence deleting the private key) once it has been +used, with some lag time to account for slow networks. ## Proposal -Currently, fallback keys are _only_ claimed on key exhaustion, not due to transient network failures. This MSC proposes to change the semantics to allow fallback keys to be returned by the `/keys/claim` endpoint if the server the target device is on is unreachable. In order for servers to return fallback keys during the network failure, the fallback keys must be cached _in advance_ on the claiming user's homeserver. This MSC proposes adding a new key `fallback_keys` to the `m.device_list_update` EDU. This MSC proposes changing the spec wording (bold is new): +Currently, fallback keys are _only_ claimed on key exhaustion, not due to transient network failures. This MSC +proposes to change the semantics to allow fallback keys to be returned by the `/keys/claim` endpoint if the server +the target device is on is unreachable. In order for servers to return fallback keys during the network failure, +the fallback keys must be cached _in advance_ on the claiming user's homeserver. This MSC proposes adding a new +key `fallback_keys` to the `m.device_list_update` EDU. This MSC proposes changing the spec wording (bold is new): -> Servers must send `m.device_list_update` EDUs to all the servers who share a room with a given local user, and must be sent whenever that user’s device list changes (i.e. for new or deleted devices, when that user joins a room which contains servers which are not already receiving updates for that user’s device list, or changes in device information such as the device’s human-readable name **or fallback key**). +> Servers must send `m.device_list_update` EDUs to all the servers who share a room with a given local user, and +> must be sent whenever that user’s device list changes (i.e. for new or deleted devices, when that user joins a +> room which contains servers which are not already receiving updates for that user’s device list, or changes in +> device information such as the device’s human-readable name **or fallback key**). The following key/values are added to the `DeviceKeys` object definition (bold is new): @@ -40,13 +54,16 @@ An example of the new field: } ``` -As a reminder, clients SHOULD rotate their fallback key when they realise it has been used, with some lag time to account for federation. As per MSC2732, 1 hour is recommended. When clients change their fallback key, a new `m.device_list_update` EDU MUST be sent. +As a reminder, clients SHOULD rotate their fallback key when they realise it has been used, with some lag time +to account for federation. As per MSC2732, 1 hour is recommended. When clients change their fallback key, a new +`m.device_list_update` EDU MUST be sent. This proposal has no client-side changes. ## Comparisons with X3DH (Signal) -X3DH is very similar to Matrix's key agreement protocol. Due to this similarity, it is worth researching what X3DH does with respect to OTKs. +X3DH is very similar to Matrix's key agreement protocol. Due to this similarity, it is worth researching what X3DH +does with respect to OTKs. > To perform an X3DH key agreement with Bob, Alice contacts the server and fetches a "prekey bundle" containing the following values: > @@ -58,18 +75,51 @@ X3DH is very similar to Matrix's key agreement protocol. Due to this similarity, https://signal.org/docs/specifications/x3dh/#sending-the-initial-message -Signal uses the terms "prekey" to refer to "fallback key" and "one-time prekey" to refer to OTK. In X3DH, one-time keys are optional. If they are exhausted, the protocol simply continues without it. If they are present, an additional DH operation is performed. +Signal uses the terms "prekey" to refer to "fallback key" and "one-time prekey" to refer to OTK. In X3DH, one-time +keys are optional. If they are exhausted, the protocol simply continues without it. If they are present, an additional +DH operation is performed. -This optionality makes the protocol robust to OTK exhaustion and transient network failures (e.g to a database to claim OTKs as Signal is not federated). +This optionality makes the protocol robust to OTK exhaustion and transient network failures (e.g to a database to +claim OTKs as Signal is not federated). ## Security Considerations -Ultra secure clients may be unhappy that fallback keys are being returned and not one-time keys, because they dislike the slightly weaker security properties fallback keys provide. This could be resolved by adding a flag to the `/keys/claim` endpoint to state whether returning a fallback key is acceptable to the client or not. If this flag is not set/missing, fallback keys would not be returned in place of OTKs, meaning this MSC would be entirely opt-in, and hence require client-side changes. However, a malicious server can trivially ignore this flag and return the fallback key anyway, and the client would not be able to detect this. For this reason, it feels like security theater to add this flag. - -A malicious actor who can control network conditions can force a client to use a fallback key by temporarily preventing two homeservers from communicating. Previously, the only way a malicious actor could force a client to use a fallback key would be to claim all the OTKs before the client had a chance to upload more. Therefore, this MSC increases the ways attackers can force clients to use fallback keys. Fallback keys weaken forward secrecy. It is assumed that "most" sessions will be set up using OTKs and not the fallback key. If this assumption holds, forcing use of a fallback key does nothing to compromise those sessions. This means this attack is only useful for _active attacks_, where an attacker wants to compromise _sessions that have yet to be established_, and wants to force those sessions to be set up with the fallback key. +Ultra secure clients may be unhappy that fallback keys are being returned and not one-time keys, because they +dislike the slightly weaker security properties fallback keys provide. This could be resolved by adding a flag to +the `/keys/claim` endpoint to state whether returning a fallback key is acceptable to the client or not. If this +flag is not set/missing, fallback keys would not be returned in place of OTKs, meaning this MSC would be entirely +opt-in, and hence require client-side changes. However, a malicious server can trivially ignore this flag and +return the fallback key anyway, and the client would not be able to detect this. For this reason, it feels like +security theater to add this flag. + +A malicious actor who can control network conditions can force a client to use a fallback key by temporarily +preventing two homeservers from communicating. Previously, the only way a malicious actor could force a client to +use a fallback key would be to claim all the OTKs before the client had a chance to upload more. Therefore, this +MSC increases the ways attackers can force clients to use fallback keys. Fallback keys weaken forward secrecy. It +is assumed that "most" sessions will be set up using OTKs and not the fallback key. If this assumption holds, +forcing use of a fallback key does nothing to compromise those sessions. This means this attack is only useful for +_active attacks_, where an attacker wants to compromise _sessions that have yet to be established_, and wants to +force those sessions to be set up with the fallback key. + +By sending the fallback key eagerly, an attacker would have access to the public key for a longer period of time than +before. Without this MSC, the fallback key remains on the uploader's homeserver until a federated user requests it. +At that point, the client is notified via `/sync` that the fallback key has been used and hence should be rotated. +With this MSC, the client would not be notified when the fallback key is used on the remote server, because this MSC +is robust to network partitions. Instead, the user will be notified when they receive a to-device event encrypted with +the fallback key. If having access to the public part of the fallback key +_for an extended period of time_ is useful for an attacker, then this MSC decreases security. The author is not aware +of any scenario where having access to the public key for a longer period of time is a security risk. If there is a +risk, other decentralised systems such as bitcoin, etheruem and libp2p which all rely on long-lived public keys as +addresses would also be vulnerable. Furthermore, the user's own homeserver has access to the fallback key today. If +access to the key for an extended time is a security risk, and the user does not trust their own homeserver (not +unreasonable given this is for E2EE) then any concerns _are already present today_, just not over federation. ## Alternatives -Do nothing. In this scenario, if the remote server is unreachable when the client calls `/keys/claim`, the message will not be encrypted for that device, and the end user will be unable to decrypt the message. What's worse, this will persist until the client decides to retry the `/keys/claim` endpoint, which could be seconds or much longer. As a data point, Matrix Rust SDK currently uses [15 seconds](https://github.com/matrix-org/matrix-rust-sdk/issues/2804) and this is seen as very low. +Do nothing. In this scenario, if the remote server is unreachable when the client calls `/keys/claim`, the message +will not be encrypted for that device, and the end user will be unable to decrypt the message. What's worse, this +will persist until the client decides to retry the `/keys/claim` endpoint, which could be seconds or much longer. +As a data point, Matrix Rust SDK currently uses [15 seconds](https://github.com/matrix-org/matrix-rust-sdk/issues/2804) +and this is seen as very low. From 7e8a59df1ba17ed3bd9613f87f35e84a9fdb25b8 Mon Sep 17 00:00:00 2001 From: kegsay <7190048+kegsay@users.noreply.github.com> Date: Wed, 22 Nov 2023 10:40:11 +0000 Subject: [PATCH 03/10] Update 4081-claim-fallback-keys-on-network-failure.md --- proposals/4081-claim-fallback-keys-on-network-failure.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/proposals/4081-claim-fallback-keys-on-network-failure.md b/proposals/4081-claim-fallback-keys-on-network-failure.md index cf251b7906e..8a9531ebca9 100644 --- a/proposals/4081-claim-fallback-keys-on-network-failure.md +++ b/proposals/4081-claim-fallback-keys-on-network-failure.md @@ -1,5 +1,9 @@ # MSC4081: Claim fallback key on network failures +*Abstract: This MSC aims to increase the robustness of the Olm session setup protocol over federation. +With this MSC, transient network failures over federation will not cause undecryptable messages due to +failing to claim OTKs.* + In order for clients to establish secure communication channels between devices, they need to "claim" one-time keys (OTKs) that were previously uploaded by the device they wish to talk to. One-time keys, as the name suggests, must only be used once. However, this presents several problems: From 9e9fa886a0912c923de387373601f91d4306e65a Mon Sep 17 00:00:00 2001 From: Kegan Dougal <7190048+kegsay@users.noreply.github.com> Date: Thu, 23 Nov 2023 11:03:51 +0000 Subject: [PATCH 04/10] Clarity around when fallback keys count as used --- ...4081-claim-fallback-keys-on-network-failure.md | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/proposals/4081-claim-fallback-keys-on-network-failure.md b/proposals/4081-claim-fallback-keys-on-network-failure.md index 8a9531ebca9..e3373364f27 100644 --- a/proposals/4081-claim-fallback-keys-on-network-failure.md +++ b/proposals/4081-claim-fallback-keys-on-network-failure.md @@ -15,7 +15,7 @@ which can be claimed when OTKs are exhausted. Fallback keys provide weaker secur specifically impacting forward secrecy, which protects past sessions against future compromises of keys or passwords. The risk is that if the private part of the fallback key is exposed, an attacker may use the key to decrypt earlier sessions. This can be mitigated by cycling the fallback key (and hence deleting the private key) once it has been -used, with some lag time to account for slow networks. +"used", with some lag time to account for slow networks. ## Proposal @@ -58,11 +58,20 @@ An example of the new field: } ``` -As a reminder, clients SHOULD rotate their fallback key when they realise it has been used, with some lag time +As a reminder, clients SHOULD rotate their fallback key when they realise it has been "used", with some lag time to account for federation. As per MSC2732, 1 hour is recommended. When clients change their fallback key, a new `m.device_list_update` EDU MUST be sent. -This proposal has no client-side changes. +The definition of when a fallback key is "used" also needs to change. Previously, a key is "used" +_if it is claimed by another device_. When this happens, the client is told this via `/sync`, either by reducing +the one-time key count by 1, or by removing the algorithm from the `device_unused_fallback_key_types`. This proposal +makes it impossible to know if the fallback key has been claimed by another device, as it is sent eagerly over +federation. Therefore, this changes the fallback key semantics to be "when the device receives and successfully +decrypts an initial pre-key to-device event which uses that key". As per the specification, this is identified as +`type: 0` messages. This will require client-side changes to change when new fallback keys get uploaded. + +Due to this change, it is recommended that the fallback key is _also cycled periodically_, e.g once per week. +This ensures that any sessions the client is unaware of will eventually expire. ## Comparisons with X3DH (Signal) From 84079ea77bd5dd90edccd3427cd967e34fb78740 Mon Sep 17 00:00:00 2001 From: Kegan Dougal <7190048+kegsay@users.noreply.github.com> Date: Thu, 23 Nov 2023 11:19:04 +0000 Subject: [PATCH 05/10] More clarity --- proposals/4081-claim-fallback-keys-on-network-failure.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/proposals/4081-claim-fallback-keys-on-network-failure.md b/proposals/4081-claim-fallback-keys-on-network-failure.md index e3373364f27..9efc5ab2935 100644 --- a/proposals/4081-claim-fallback-keys-on-network-failure.md +++ b/proposals/4081-claim-fallback-keys-on-network-failure.md @@ -64,14 +64,15 @@ to account for federation. As per MSC2732, 1 hour is recommended. When clients c The definition of when a fallback key is "used" also needs to change. Previously, a key is "used" _if it is claimed by another device_. When this happens, the client is told this via `/sync`, either by reducing -the one-time key count by 1, or by removing the algorithm from the `device_unused_fallback_key_types`. This proposal +the one-time key count by 1, or by removing the algorithm from the `device_unused_fallback_key_types` array. This proposal makes it impossible to know if the fallback key has been claimed by another device, as it is sent eagerly over -federation. Therefore, this changes the fallback key semantics to be "when the device receives and successfully +federation. Therefore, this changes the definition of "used" to be "when the device receives and successfully decrypts an initial pre-key to-device event which uses that key". As per the specification, this is identified as `type: 0` messages. This will require client-side changes to change when new fallback keys get uploaded. -Due to this change, it is recommended that the fallback key is _also cycled periodically_, e.g once per week. -This ensures that any sessions the client is unaware of will eventually expire. +Due to this change, it is recommended that the fallback key is also **cycled periodically** +_even if the key isn't "used"_, e.g once per week. This reduces the risk of >1 session being established with the same +key, but for some reason the client isn't able to detect it. ## Comparisons with X3DH (Signal) From c32227e0607123109c1be54f2ec423e59ced8647 Mon Sep 17 00:00:00 2001 From: Richard van der Hoff <1389908+richvdh@users.noreply.github.com> Date: Wed, 28 Feb 2024 18:00:17 +0000 Subject: [PATCH 06/10] Apply suggestions from code review --- proposals/4081-claim-fallback-keys-on-network-failure.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/proposals/4081-claim-fallback-keys-on-network-failure.md b/proposals/4081-claim-fallback-keys-on-network-failure.md index 9efc5ab2935..4fec4e0f18a 100644 --- a/proposals/4081-claim-fallback-keys-on-network-failure.md +++ b/proposals/4081-claim-fallback-keys-on-network-failure.md @@ -23,7 +23,7 @@ Currently, fallback keys are _only_ claimed on key exhaustion, not due to transi proposes to change the semantics to allow fallback keys to be returned by the `/keys/claim` endpoint if the server the target device is on is unreachable. In order for servers to return fallback keys during the network failure, the fallback keys must be cached _in advance_ on the claiming user's homeserver. This MSC proposes adding a new -key `fallback_keys` to the `m.device_list_update` EDU. This MSC proposes changing the spec wording (bold is new): +key `fallback_keys` to the [`m.device_list_update` EDU](https://spec.matrix.org/v1.9/server-server-api/#definition-mdevice_list_update). This MSC proposes changing the spec wording (bold is new): > Servers must send `m.device_list_update` EDUs to all the servers who share a room with a given local user, and > must be sent whenever that user’s device list changes (i.e. for new or deleted devices, when that user joins a @@ -106,8 +106,8 @@ opt-in, and hence require client-side changes. However, a malicious server can t return the fallback key anyway, and the client would not be able to detect this. For this reason, it feels like security theater to add this flag. -A malicious actor who can control network conditions can force a client to use a fallback key by temporarily -preventing two homeservers from communicating. Previously, the only way a malicious actor could force a client to +A malicious actor who can control network conditions (but not the servers themselves) can force a client to use a fallback key by temporarily +preventing two homeservers from communicating. Previously, the only way such an actor could force a client to use a fallback key would be to claim all the OTKs before the client had a chance to upload more. Therefore, this MSC increases the ways attackers can force clients to use fallback keys. Fallback keys weaken forward secrecy. It is assumed that "most" sessions will be set up using OTKs and not the fallback key. If this assumption holds, From ea992d0fe487a25dabf5b125b8a164745a9438b1 Mon Sep 17 00:00:00 2001 From: Richard van der Hoff Date: Wed, 28 Feb 2024 21:54:37 +0000 Subject: [PATCH 07/10] address review comments --- ...-claim-fallback-keys-on-network-failure.md | 222 ++++++++++++------ 1 file changed, 149 insertions(+), 73 deletions(-) diff --git a/proposals/4081-claim-fallback-keys-on-network-failure.md b/proposals/4081-claim-fallback-keys-on-network-failure.md index 4fec4e0f18a..f0749c0138a 100644 --- a/proposals/4081-claim-fallback-keys-on-network-failure.md +++ b/proposals/4081-claim-fallback-keys-on-network-failure.md @@ -1,4 +1,4 @@ -# MSC4081: Claim fallback key on network failures +# MSC4081: Eagerly sharing fallback keys with federated servers *Abstract: This MSC aims to increase the robustness of the Olm session setup protocol over federation. With this MSC, transient network failures over federation will not cause undecryptable messages due to @@ -10,69 +10,141 @@ only be used once. However, this presents several problems: - what happens when the device does not upload more keys and the uploaded keys are all used up? (key exhaustion) - what happens if the OTK cannot be claimed due to transient network failures. -[MSC2732](https://github.com/matrix-org/matrix-spec-proposals/pull/2732) introduced the concept of "fallback keys" -which can be claimed when OTKs are exhausted. Fallback keys provide weaker security properties than one-time keys, -specifically impacting forward secrecy, which protects past sessions against future compromises of keys or passwords. -The risk is that if the private part of the fallback key is exposed, an attacker may use the key to decrypt earlier -sessions. This can be mitigated by cycling the fallback key (and hence deleting the private key) once it has been -"used", with some lag time to account for slow networks. +[MSC2732](https://github.com/matrix-org/matrix-spec-proposals/pull/2732) introduced the concept of "fallback keys" +which can be claimed when OTKs are exhausted. Fallback keys provide weaker security properties than one-time keys, +specifically impacting forward secrecy, which protects past sessions against future compromises of keys or +passwords. The risk is that if the private part of the fallback key is exposed, an attacker may use the key to +decrypt earlier sessions. This can be mitigated by creating a new fallback key as soon as the old one has been used +(and hence later deleting the private key, with some lag time to account for slow networks). + +For reference, https://crypto.stackexchange.com/a/52825 is a good explanation of why OTKs are preferable +to fallback keys, where they are available. (The question is about Signal rather than Olm, however the principles +are much the same. Signal uses the terms "prekey" to refer to "fallback key" and "one-time prekey" to refer to +OTK.) ## Proposal -Currently, fallback keys are _only_ claimed on key exhaustion, not due to transient network failures. This MSC +Currently, fallback keys are _only_ used on key exhaustion, not due to transient network failures. This MSC proposes to change the semantics to allow fallback keys to be returned by the `/keys/claim` endpoint if the server the target device is on is unreachable. In order for servers to return fallback keys during the network failure, -the fallback keys must be cached _in advance_ on the claiming user's homeserver. This MSC proposes adding a new -key `fallback_keys` to the [`m.device_list_update` EDU](https://spec.matrix.org/v1.9/server-server-api/#definition-mdevice_list_update). This MSC proposes changing the spec wording (bold is new): +the fallback keys must be cached _in advance_ on the claiming user's homeserver. + +### Extend `/_matrix/client/v3/keys/upload` request + +Clients have to opt in to this process when uploading fallback keys. To allow this, we extend the [`POST +/_matrix/client/v3/keys/upload`](https://spec.matrix.org/v1.9/client-server-api/#post_matrixclientv3keysupload) +endpoint with a new request body parameter, `eager_share_fallback_keys`, as follows (bold is new): + +| Name | Type | Description | +|-----------------------|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `device_keys` | `DeviceKeys` | Identity keys for the device. May be absent if no new identity keys are required. +| `fallback_keys` | `OneTimeKeys` | The public key which should be used if the device’s one-time keys are exhausted, **or if the user's homeserver is unreachable**. [etc] +| `one_time_keys` | `OneTimeKeys` | One-time public keys for “pre-key” messages. The names of the properties should be in the format :. The format of the key is determined by the key algorithm. May be absent if no new one-time keys are required. +| **`eager_share_fallback_keys`** | **`boolean`** | **Whether the `fallback_keys` should immediately be sent to other homeservers which have a user which share a room with this user. Omitting this property is the same as setting it to `false`. + +### Extend `m.device_list_update` EDU + +This MSC proposes adding a new key `fallback_keys` to the [`m.device_list_update` +EDU](https://spec.matrix.org/v1.9/server-server-api/#definition-mdevice_list_update). We change the spec wording as +follows: > Servers must send `m.device_list_update` EDUs to all the servers who share a room with a given local user, and > must be sent whenever that user’s device list changes (i.e. for new or deleted devices, when that user joins a > room which contains servers which are not already receiving updates for that user’s device list, or changes in -> device information such as the device’s human-readable name **or fallback key**). +> device information such as the device’s human-readable name **or, if the client has opted into eager sharing of +> fallback keys, the fallback keys**). -The following key/values are added to the `DeviceKeys` object definition (bold is new): +A new property `fallback_keys` is added to the body of the `m.device_list_update` EDU, as shown below (bold is new): -| Name | Type | Description | -|------------------|-------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| algorithms | [string] | Required: The encryption algorithms supported by this device. | -| device_id | string | Required: The ID of the device these keys belong to. Must match the device ID used when logging in. | -| keys | {string: string} | Required: Public identity keys. The names of the properties should be in the format :. The keys themselves should be encoded as specified by the key algorithm. | -| signatures | Signatures | Required: Signatures for the device key object. A map from user ID, to a map from : to the signature. The signature is calculated using the process described at Signing JSON. | -| user_id | string | Required: The ID of the user the device belongs to. Must match the user ID used when logging in. | -| **fallback_key** | **{string: KeyObject}** | **The fallback key for this device, if set. The format of this object is identical to the /keys/claim response for a single device. This replaces any previously sent fallback key.** | +| Name | Type | Description | +|-----------------------|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `deleted` | `boolean` | True if the server is announcing that this device has been deleted. +| `device_display_name` | `string` | The public human-readable name of this device. Will be absent if the device has no name. +| `device_id` | `string` | Required: The ID of the device whose details are changing. +| `keys` | `DeviceKeys` | The updated identity keys (if any) for this device. May be absent if the device has no E2E keys defined. +| `prev_id` | `[integer]` | The `stream_ids` of any prior `m.device_list_update` EDUs sent for this user which have not been referred to already in an EDU’s `prev_id` field. If the receiving server does not recognise any of the `prev_ids`, it means an EDU has been lost and the server should query a snapshot of the device list via `/user/keys/query` in order to correctly interpret future `m.device_list_update` EDUs. May be missing or empty for the first EDU in a sequence. +| `stream_id` | `integer` | Required: An ID sent by the server for this update, unique for a given `user_id`. Used to identify any gaps in the sequence of m.device_list_update EDUs broadcast by a server. +| `user_id` | `string` | Required: The user ID who owns this device. +| **`fallback_keys`** | **`{string: KeyObject}`** | **The fallback keys for this device, if set, and if the client has opted in to eager sharing. This is the same as the most recent `fallback_keys` uploaded by this device via [`POST /_matrix/client/v3/keys/upload`](https://spec.matrix.org/v1.9/client-server-api/#post_matrixclientv3keysupload).** -An example of the new field: + +An example of an EDU with the new property: ```js { - // ... - "fallback_key": { + "content": { + "device_display_name": "Mobile", + "device_id": "QBUAZIFURK", + "keys": { + "algorithms": [ + "m.olm.v1.curve25519-aes-sha2", + "m.megolm.v1.aes-sha2" + ], + "device_id": "JLAFKJWSCS", + "keys": { + "curve25519:JLAFKJWSCS": "3C5BFWi2Y8MaVvjM8M22DBmh24PmgR0nPvJOIArzgyI", + "ed25519:JLAFKJWSCS": "lEuiRJBit0IG6nUf5pUzWTUEsRVVe/HJkoKuEww9ULI" + }, + "signatures": { + "@alice:example.com": { + "ed25519:JLAFKJWSCS": "dSO80A01XiigH3uBiDVx/EjzaoycHcjq9lfQX0uWsqxl2giMIiSPR8a4d291W1ihKJL/a+myXS367WT6NAIcBA" + } + }, + "user_id": "@alice:example.com" + }, + "prev_id": [ + 5 + ], + "stream_id": 6, + "user_id": "@john:example.com", + "fallback_keys": { "signed_curve25519:AAAAHg": { + "fallback": true, "key": "zKbLg+NrIjpnagy+pIY6uPL4ZwEG2v+8F9lmgsnlZzs", "signatures": { - "@alice:example.com": { + "@johh:example.com": { "ed25519:JLAFKJWSCS": "FLWxXqGbwrb8SM3Y795eB6OA8bwBcoMZFXBqnTn58AYWZSqiD45tlBVcDa2L7RwdKXebW/VzDlnfVJ+9jok1Bw" } } } } + }, + "edu_type": "m.device_list_update" } ``` -As a reminder, clients SHOULD rotate their fallback key when they realise it has been "used", with some lag time -to account for federation. As per MSC2732, 1 hour is recommended. When clients change their fallback key, a new -`m.device_list_update` EDU MUST be sent. +### Changed semantics for `/keys/claim` -The definition of when a fallback key is "used" also needs to change. Previously, a key is "used" -_if it is claimed by another device_. When this happens, the client is told this via `/sync`, either by reducing -the one-time key count by 1, or by removing the algorithm from the `device_unused_fallback_key_types` array. This proposal -makes it impossible to know if the fallback key has been claimed by another device, as it is sent eagerly over -federation. Therefore, this changes the definition of "used" to be "when the device receives and successfully -decrypts an initial pre-key to-device event which uses that key". As per the specification, this is identified as -`type: 0` messages. This will require client-side changes to change when new fallback keys get uploaded. +[`POST /_matrix/client/v3/keys/claim`](https://spec.matrix.org/v1.9/client-server-api/#post_matrixclientv3keysclaim) can +now respond with a cached fallback key if the remote server is unreachable. -Due to this change, it is recommended that the fallback key is also **cycled periodically** -_even if the key isn't "used"_, e.g once per week. This reduces the risk of >1 session being established with the same -key, but for some reason the client isn't able to detect it. +### Changed semantics for rotating fallback keys + +As a reminder, clients SHOULD upload a new fallback key when they realise it has been "used". + +The definition of when a fallback key is "used" is changed by this MSC. Previously, a fallback key is "used" +_if it is claimed by another device_. When this happens, the client is told this via `/sync`, by removing the +algorithm from the `device_unused_fallback_key_types` array. This is no longer a useful mechanism, as the key is +sent eagerly over federation. + +Therefore, we change the definition of "used" to be "when the device receives and successfully decrypts an initial +pre-key to-device event which uses that key". As soon as such an event is received, a new fallback key should be +created and uploaded via `/keys/upload`. (As above, this will then trigger `m.device_list_update` EDUs.) + +We also add a recommendation that the fallback key is also **rotated periodically** _even if the key isn't "used"_, +e.g once per week. This reduces the risk of the key being used without the client knowing about it (such as a +networking problem). + +Once a new key has been uploaded, the private part of the old key should be scheduled for deletion. This cannot +happen immediately, since there may be other messages in flight which rely on the old key. This was also true of +the original fallback keys implementation +([MSC2732](https://github.com/matrix-org/matrix-spec-proposals/pull/2732)), however there could now be a much more +significant delay between the old key being used to encrypt a message and that message being received at the +recipient, and MSC2732's recommendation (the lesser of "as soon as the new key is used" and 1 hour) is inadequate +We therefore recommend significantly increasing the period for which an old fallback key is kept on the client, to +30 days after the key was replaced, but making sure that at least one old fallback key is kept at all +times. (Since we recommend rotating keys every week, normally there will be several old keys on the +client. However, if a user does not use their client for a month, there could be a backlog of messages for the most +recent old key; this is why we always keep at least one.) ## Comparisons with X3DH (Signal) @@ -98,42 +170,46 @@ claim OTKs as Signal is not federated). ## Security Considerations -Ultra secure clients may be unhappy that fallback keys are being returned and not one-time keys, because they -dislike the slightly weaker security properties fallback keys provide. This could be resolved by adding a flag to -the `/keys/claim` endpoint to state whether returning a fallback key is acceptable to the client or not. If this -flag is not set/missing, fallback keys would not be returned in place of OTKs, meaning this MSC would be entirely -opt-in, and hence require client-side changes. However, a malicious server can trivially ignore this flag and -return the fallback key anyway, and the client would not be able to detect this. For this reason, it feels like -security theater to add this flag. - -A malicious actor who can control network conditions (but not the servers themselves) can force a client to use a fallback key by temporarily -preventing two homeservers from communicating. Previously, the only way such an actor could force a client to -use a fallback key would be to claim all the OTKs before the client had a chance to upload more. Therefore, this -MSC increases the ways attackers can force clients to use fallback keys. Fallback keys weaken forward secrecy. It -is assumed that "most" sessions will be set up using OTKs and not the fallback key. If this assumption holds, -forcing use of a fallback key does nothing to compromise those sessions. This means this attack is only useful for -_active attacks_, where an attacker wants to compromise _sessions that have yet to be established_, and wants to -force those sessions to be set up with the fallback key. - -By sending the fallback key eagerly, an attacker would have access to the public key for a longer period of time than -before. Without this MSC, the fallback key remains on the uploader's homeserver until a federated user requests it. -At that point, the client is notified via `/sync` that the fallback key has been used and hence should be rotated. -With this MSC, the client would not be notified when the fallback key is used on the remote server, because this MSC -is robust to network partitions. Instead, the user will be notified when they receive a to-device event encrypted with -the fallback key. If having access to the public part of the fallback key -_for an extended period of time_ is useful for an attacker, then this MSC decreases security. The author is not aware -of any scenario where having access to the public key for a longer period of time is a security risk. If there is a -risk, other decentralised systems such as bitcoin, etheruem and libp2p which all rely on long-lived public keys as -addresses would also be vulnerable. Furthermore, the user's own homeserver has access to the fallback key today. If -access to the key for an extended time is a security risk, and the user does not trust their own homeserver (not -unreasonable given this is for E2EE) then any concerns _are already present today_, just not over federation. +1. Ultra secure clients may be unhappy that fallback keys are being returned and not one-time keys, because they + dislike the slightly weaker security properties fallback keys provide. Since fallback keys are marked as such + with `fallback: true`, such clients can detect this situation and act accordingly (eg by refusing to send a + message, or by retrying later). + +2. A malicious actor who can control network conditions (but not the servers themselves) can force a client to use + a fallback key by temporarily preventing two homeservers from communicating. Previously, the only way such an + actor could force a client to use a fallback key would be to claim all the OTKs before the client had a chance + to upload more. Therefore, this MSC increases the ways attackers can force clients to use fallback + keys. Fallback keys weaken forward secrecy. It is assumed that "most" sessions will be set up using OTKs and not + the fallback key. If this assumption holds, forcing use of a fallback key does nothing to compromise those + sessions. This means this attack is only useful for _active attacks_, where an attacker wants to compromise + _sessions that have yet to be established_, and wants to force those sessions to be set up with the fallback + key. + +3. By sending the fallback key eagerly, an attacker would have access to the public key for a longer period of time + than before. Without this MSC, the fallback key remains on the uploader's homeserver until a federated user + requests it. At that point, the client is notified via `/sync` that the fallback key has been used and hence + should be rotated. With this MSC, the client would not be notified when the fallback key is used on the remote + server, because this MSC is robust to network partitions. Instead, the user will be notified when they receive a + to-device event encrypted with the fallback key. If having access to the public part of the fallback key _for an + extended period of time_ is useful for an attacker, then this MSC decreases security. + + We are not aware of any scenario where having access to the public key for a longer period of time is a security + risk. If there is a risk, other decentralised systems such as bitcoin, etheruem and libp2p which all rely on + long-lived public keys as addresses would also be vulnerable. Furthermore, the user's own homeserver has access + to the fallback key today. If access to the key for an extended time is a security risk, and the user does not + trust their own homeserver (not unreasonable given this is for E2EE) then any concerns _are already present + today_, just not over federation. ## Alternatives -Do nothing. In this scenario, if the remote server is unreachable when the client calls `/keys/claim`, the message -will not be encrypted for that device, and the end user will be unable to decrypt the message. What's worse, this -will persist until the client decides to retry the `/keys/claim` endpoint, which could be seconds or much longer. -As a data point, Matrix Rust SDK currently uses [15 seconds](https://github.com/matrix-org/matrix-rust-sdk/issues/2804) -and this is seen as very low. - - +1. Do nothing. In this scenario, if the remote server is unreachable when the client calls `/keys/claim`, the + message will not be encrypted for that device, and the end user will be unable to decrypt the message. What's + worse, this will persist until the client decides to retry the `/keys/claim` endpoint, which could be seconds or + much longer. As a data point, Matrix Rust SDK currently uses [15 + seconds](https://github.com/matrix-org/matrix-rust-sdk/issues/2804) and this is seen as very low. + +2. Clients could remember that they were unable to claim keys for a given device, and retry periodically. The main + problem with this approach (other than increased complexity in the client) is that it requires the sending + client to still be online when the remote server comes online, and to notice that has happened. There may be + other benefits to such an approach, but we feel that this MSC nevertheless represents an achievable, incremental + improvement in reliability. From 5262868f865f34fc7402c96206b7cf79eddaeabf Mon Sep 17 00:00:00 2001 From: Richard van der Hoff <1389908+richvdh@users.noreply.github.com> Date: Thu, 29 Feb 2024 13:16:39 +0000 Subject: [PATCH 08/10] Apply suggestions from code review Co-authored-by: kegsay <7190048+kegsay@users.noreply.github.com> --- proposals/4081-claim-fallback-keys-on-network-failure.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/proposals/4081-claim-fallback-keys-on-network-failure.md b/proposals/4081-claim-fallback-keys-on-network-failure.md index f0749c0138a..d8750c15adb 100644 --- a/proposals/4081-claim-fallback-keys-on-network-failure.md +++ b/proposals/4081-claim-fallback-keys-on-network-failure.md @@ -40,7 +40,7 @@ endpoint with a new request body parameter, `eager_share_fallback_keys`, as foll | `device_keys` | `DeviceKeys` | Identity keys for the device. May be absent if no new identity keys are required. | `fallback_keys` | `OneTimeKeys` | The public key which should be used if the device’s one-time keys are exhausted, **or if the user's homeserver is unreachable**. [etc] | `one_time_keys` | `OneTimeKeys` | One-time public keys for “pre-key” messages. The names of the properties should be in the format :. The format of the key is determined by the key algorithm. May be absent if no new one-time keys are required. -| **`eager_share_fallback_keys`** | **`boolean`** | **Whether the `fallback_keys` should immediately be sent to other homeservers which have a user which share a room with this user. Omitting this property is the same as setting it to `false`. +| **`eager_share_fallback_keys`** | **`boolean`** | **Whether the `fallback_keys` should immediately be sent to other homeservers which have a user which share a room with this user. Omitting this property is the same as setting it to `false`.** ### Extend `m.device_list_update` EDU @@ -85,11 +85,11 @@ An example of an EDU with the new property: "ed25519:JLAFKJWSCS": "lEuiRJBit0IG6nUf5pUzWTUEsRVVe/HJkoKuEww9ULI" }, "signatures": { - "@alice:example.com": { + "@john:example.com": { "ed25519:JLAFKJWSCS": "dSO80A01XiigH3uBiDVx/EjzaoycHcjq9lfQX0uWsqxl2giMIiSPR8a4d291W1ihKJL/a+myXS367WT6NAIcBA" } }, - "user_id": "@alice:example.com" + "user_id": "@john:example.com" }, "prev_id": [ 5 From e8e5b8588eeea6c330bb8f4ec4a51d39c9299b7b Mon Sep 17 00:00:00 2001 From: Kegan Dougal <7190048+kegsay@users.noreply.github.com> Date: Fri, 17 May 2024 13:01:50 +0100 Subject: [PATCH 09/10] Update 4081-claim-fallback-keys-on-network-failure.md --- proposals/4081-claim-fallback-keys-on-network-failure.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposals/4081-claim-fallback-keys-on-network-failure.md b/proposals/4081-claim-fallback-keys-on-network-failure.md index d8750c15adb..694b0da55b0 100644 --- a/proposals/4081-claim-fallback-keys-on-network-failure.md +++ b/proposals/4081-claim-fallback-keys-on-network-failure.md @@ -132,7 +132,7 @@ created and uploaded via `/keys/upload`. (As above, this will then trigger `m.de We also add a recommendation that the fallback key is also **rotated periodically** _even if the key isn't "used"_, e.g once per week. This reduces the risk of the key being used without the client knowing about it (such as a -networking problem). +networking problem). Some clients [already do this](https://github.com/matrix-org/matrix-rust-sdk/pull/3151). Once a new key has been uploaded, the private part of the old key should be scheduled for deletion. This cannot happen immediately, since there may be other messages in flight which rely on the old key. This was also true of From 94040521a0dd17450e8482fc9012656049ec39ca Mon Sep 17 00:00:00 2001 From: Kegan Dougal <7190048+kegsay@users.noreply.github.com> Date: Fri, 7 Jun 2024 12:13:08 +0100 Subject: [PATCH 10/10] Give examples of "unreachable" servers --- proposals/4081-claim-fallback-keys-on-network-failure.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/proposals/4081-claim-fallback-keys-on-network-failure.md b/proposals/4081-claim-fallback-keys-on-network-failure.md index 694b0da55b0..80a5be8677c 100644 --- a/proposals/4081-claim-fallback-keys-on-network-failure.md +++ b/proposals/4081-claim-fallback-keys-on-network-failure.md @@ -115,7 +115,10 @@ An example of an EDU with the new property: ### Changed semantics for `/keys/claim` [`POST /_matrix/client/v3/keys/claim`](https://spec.matrix.org/v1.9/client-server-api/#post_matrixclientv3keysclaim) can -now respond with a cached fallback key if the remote server is unreachable. +now respond with a cached fallback key if the remote server is unreachable. "Unreachable" includes: + - unable to connect to the server + - the sending server is backing off the remote server + - the remote server responded with an error code such as 429 Too Many Requests. ### Changed semantics for rotating fallback keys