Recover from issues with corrupt files on upgrade to 0.14.3 (#1160) #1162

timbru · 2023-11-23T16:12:07Z

This should fix the issues with corrupt files (#1160), but it still needs more manual testing. Unfortunately, it is not easy to test this in an automated way at the moment: our upgrade test code assumes that values can be read before they are put into a memstore (which only accepts valid JSON) for testing. This needs some thought, or a very thorough review and manual testing.

There is a happy case test, but that one essentially does nothing - it covers the case where an upgrade is done and there were no corrupt files.

test-resources/migrations/v0_14_2/tasks/pending/1700751070000-sweep_login_cache

ximon18 · 2023-11-28T15:43:53Z

Cargo.lock

@@ -1081,7 +1081,7 @@ dependencies = [

 [[package]]
 name = "krill"
-version = "0.15.0-dev"
+version = "0.14.3"


Shouldn't this be left as 0.15.0-dev?

Perhaps it would be cleaner to make this change in a separate PR. If you prefer, I can do that. But this is going to be in a bug fix release, so 0.14.3 is more appropriate than 0.15.0.

ximon18 · 2023-11-28T15:43:58Z

Cargo.toml

@@ -1,7 +1,7 @@
 [package]
 # Note: some of these values are also used when building Debian packages below.
 name = "krill"
-version = "0.15.0-dev"
+version = "0.14.3"


Shouldn't this be left as 0.15.0-dev?

Perhaps it would be cleaner to make this change in a separate PR. If you prefer, I can do that. But this is going to be in a bug fix release, so 0.14.3 is more appropriate than 0.15.0.

ximon18 · 2023-11-28T15:46:49Z

src/upgrades/mod.rs

+            "test-resources/migrations/v0_14_2/",
+            &[
+                "ca_objects",
+                "cas", // testbed/command-6.json was intentionally removed


"Removed" - is that right? I don't see it in earlier test-resources/migrations/xxx/cas/ directories so it wasn't removed compared to them, but it is missing in the sequence as there is command-5.json and command-7.json but no command-6.json. If so I would say "is intentionally missing".
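For what it's worth, a gap like this can be spotted mechanically. Below is a minimal sketch, assuming the files follow the command-N.json naming seen in this directory. The function name `missing_command_versions` is made up for illustration and is not part of Krill.

```rust
/// Return the version numbers missing from a set of `command-N.json` names.
/// Names that do not match the pattern are ignored.
pub fn missing_command_versions(file_names: &[&str]) -> Vec<u64> {
    // Extract N from every name of the form `command-N.json`.
    let mut seen: Vec<u64> = file_names
        .iter()
        .filter_map(|name| {
            name.strip_prefix("command-")?
                .strip_suffix(".json")?
                .parse::<u64>()
                .ok()
        })
        .collect();
    seen.sort_unstable();
    seen.dedup();

    // Every number between the lowest and highest seen version should exist.
    match (seen.first().copied(), seen.last().copied()) {
        (Some(first), Some(last)) => (first..=last)
            .filter(|v| seen.binary_search(v).is_err())
            .collect(),
        _ => Vec::new(),
    }
}
```

Given command-5.json and command-7.json, this reports version 6 as missing, which is the situation the code comment is trying to describe.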

ximon18 · 2023-11-28T15:53:56Z

src/upgrades/pre_0_14_3/mod.rs

+
+    for running_key in task_store
+        .list_keys(&running_scope)
+        .map_err(|e| UpgradeError::Custom(format!("Cannot read running tasks: {}", e)))?


Given the description on this fn, why can this fail?

ximon18 · 2023-11-28T15:54:15Z

src/upgrades/pre_0_14_3/mod.rs

+
+    for pending_key in task_store
+        .list_keys(&pending_scope)
+        .map_err(|e| UpgradeError::Custom(format!("Cannot read pending tasks: {}", e)))?


Given the description on this fn, why can this fail?

E.g. can't you call .wipe() like you do in upgrade_status() below?

ximon18 · 2023-11-28T22:19:41Z

src/upgrades/mod.rs

+                pre_0_14_3::upgrade_status(config)?;
+
+                migrated = pre_0_14_3::upgrade_agg::<CertAuth>(CASERVER_NS, config)? || migrated;
+                // note: ca_objects is not affected by #1160, as it replaces existing objects and in that


Does this comment refer to the line above?

ximon18 · 2023-11-29T06:50:41Z

src/upgrades/pre_0_14_3/mod.rs

+        if task_store.get(&pending_key).is_err() {
+            warn!("Pending task could not be parsed. Dropping: {}", pending_key);
+            task_store.delete(&pending_key).map_err(|e| {
+                UpgradeError::Custom(format!("Cannot delete corrupt task: {}. Error: {}", pending_key, e))


Rather than error out here, if this fails could you not try and wipe the whole store as is done in upgrade_status()?
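To make the suggestion concrete, here is a rough sketch of "delete the corrupt entries, and fall back to wiping the store if a delete fails". `ToyStore`, `Cleanup` and `delete_corrupt_or_wipe` are hypothetical stand-ins, not Krill's KeyValueStore API. The fallback relies on the doc comment's claim that missing tasks are re-added at start up.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory stand-in for the task store (not Krill's API).
#[derive(Default)]
pub struct ToyStore {
    pub entries: BTreeMap<String, String>,
    /// Keys for which `delete` is made to fail, to simulate an I/O error.
    pub undeletable: Vec<String>,
}

impl ToyStore {
    pub fn delete(&mut self, key: &str) -> Result<(), String> {
        if self.undeletable.iter().any(|k| k == key) {
            return Err(format!("cannot delete {}", key));
        }
        self.entries.remove(key);
        Ok(())
    }

    pub fn wipe(&mut self) -> Result<(), String> {
        self.entries.clear();
        Ok(())
    }
}

/// What the cleanup ended up doing.
#[derive(Debug, PartialEq)]
pub enum Cleanup {
    Nothing,
    DeletedCorrupt(usize),
    WipedAll,
}

/// Delete corrupt entries one by one; if any delete fails, wipe the store.
pub fn delete_corrupt_or_wipe(
    store: &mut ToyStore,
    is_corrupt: impl Fn(&str) -> bool,
) -> Result<Cleanup, String> {
    let corrupt: Vec<String> = store
        .entries
        .iter()
        .filter(|(_, value)| is_corrupt(value.as_str()))
        .map(|(key, _)| key.clone())
        .collect();

    if corrupt.is_empty() {
        return Ok(Cleanup::Nothing);
    }
    for key in &corrupt {
        if store.delete(key).is_err() {
            // Targeted delete failed: fall back to removing everything.
            store.wipe()?;
            return Ok(Cleanup::WipedAll);
        }
    }
    Ok(Cleanup::DeletedCorrupt(corrupt.len()))
}
```

An error is then only returned when the wipe itself fails, which seems closer to the intent described on this fn.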

ximon18 · 2023-11-29T06:54:49Z

src/upgrades/pre_0_14_3/mod.rs

+pub fn upgrade_status(config: &Config) -> UpgradeResult<()> {
+    if StatusStore::create(&config.storage_uri, STATUS_NS).is_err() {
+        let status_kv_store = KeyValueStore::create(&config.storage_uri, STATUS_NS)?;
+        status_kv_store.wipe()?;


I just discovered that the Disk implementation of clear() (which is called by wipe()) swallows any error that occurs, so calling wipe() here can appear to succeed but actually leave the store files present on disk, in their corrupt form. Won't that be a problem?
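One way to guard against this would be to propagate errors while clearing and then check that the directory really is empty. The sketch below uses only the standard library on a plain directory. `clear_dir_strict`, `is_empty_dir` and `wipe_and_verify` are illustrative names, and this is not how the Disk implementation is actually written.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Remove everything below `dir`, returning the first error instead of
/// swallowing it.
pub fn clear_dir_strict(dir: &Path) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            fs::remove_dir_all(&path)?;
        } else {
            fs::remove_file(&path)?;
        }
    }
    Ok(())
}

/// True if `dir` contains no entries.
pub fn is_empty_dir(dir: &Path) -> io::Result<bool> {
    Ok(fs::read_dir(dir)?.next().is_none())
}

/// Clear the directory and fail loudly if anything was left behind.
pub fn wipe_and_verify(dir: &Path) -> io::Result<()> {
    clear_dir_strict(dir)?;
    if is_empty_dir(dir)? {
        Ok(())
    } else {
        Err(io::Error::new(
            io::ErrorKind::Other,
            "store not empty after wipe",
        ))
    }
}
```

With a check like this, a wipe that silently leaves corrupt files in place would surface as an upgrade error rather than as a confusing failure later on.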

ximon18 · 2023-11-29T06:56:45Z

src/upgrades/pre_0_14_3/mod.rs

+}
+
+/// Check the task store for corrupted tasks due to issue #1160, and if present
+/// drop them in place. This is safe to do because missing tasks are re-added at start up.


Why, here and in the warning messages below, do you use the term "drop"? "Drop" has a specific meaning in Rust, but that isn't the meaning that applies here. Rather, items are being removed/deleted from the store, and in the most common (currently the only) end-user scenario this will be deletion of files on disk. Wouldn't "delete" be a word that operators will understand better than "drop"?

ximon18 · 2023-11-29T06:58:40Z

src/upgrades/mod.rs

@@ -769,6 +777,29 @@ pub fn prepare_upgrade_data_migrations(
                pre_0_14_0::UpgradeAggregateStoreTrustAnchorProxy::upgrade(TA_PROXY_SERVER_NS, mode, config)?;

                Ok(Some(UpgradeReport::new(aspa_configs, true, versions)))
+            } else if versions.from < KrillVersion::release(0, 14, 3) {
+                // Check all possibly affected stores for corrupted files resulting
+                // from issue #1160 and fix them if needed.


This comment, to me at least, is misleading. Nothing will be "fixed": if problematic store items are found they are deleted, which to me is quite different from fixing them.

Correction: deletion only happens in the upgrade_tasks() and upgrade_status() calls. I see that the upgrade_agg() call, for example, does attempt a sort of fix via replay (though it replays only as far as it can, which presumably also amounts to "deleting" the bits that it can't replay).
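To illustrate what I mean by "replay as far as it can": the toy sketch below rebuilds state from a sequence of stored changes and stops at the first one that cannot be read. It is purely illustrative and assumes nothing about how upgrade_agg() is really implemented. The "aggregate" here is just a running sum, and `replay_until_corrupt` is a made-up name.

```rust
/// Replay stored changes in order, stopping at the first unreadable one.
/// Returns the rebuilt state and the number of changes that were applied.
/// Each element is either a readable change (`Ok`) or a read error (`Err`).
pub fn replay_until_corrupt(changes: &[Result<i64, String>]) -> (i64, usize) {
    let mut state = 0i64;
    let mut applied = 0usize;
    for change in changes {
        match change {
            Ok(delta) => {
                state += delta;
                applied += 1;
            }
            // Everything from the first corrupt change onwards is ignored,
            // even if later changes are readable.
            Err(_) => break,
        }
    }
    (state, applied)
}
```

Note that any readable change after the corrupt one is effectively discarded, which is why I would describe this as a partial recovery rather than a fix.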

timbru · 2023-11-29T14:26:58Z

After discussion, we will close this. It is only useful under very specific circumstances: a 0.14.0, 0.14.1 or 0.14.2 installation runs into the issue that a tempfile is not used, a file ends up half-written due to a hard server crash or a full disk, and the user then upgrades to this version. On the other hand, it could cause issues in other situations. So it was decided that it's best not to attempt any automated recovery here. Just let users update to 0.14.3, and talk to us if they were hit by the issue in 0.14.0-0.14.2.
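For context on the underlying issue: writing to a temporary file and renaming it into place is the usual way to avoid half-written files. The sketch below uses only the standard library and is not the actual 0.14.3 change. `write_atomically` and the `.tmp` suffix are illustrative assumptions.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Write `bytes` to `target` via a sibling temporary file, so that readers
/// see either the old content or the complete new content.
pub fn write_atomically(target: &Path, bytes: &[u8]) -> io::Result<()> {
    // Build "<file name>.tmp" next to the target, so both paths live on the
    // same filesystem.
    let mut tmp_name = target
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "target has no file name"))?
        .to_os_string();
    tmp_name.push(".tmp");
    let tmp_path = target.with_file_name(tmp_name);

    let mut tmp = File::create(&tmp_path)?;
    tmp.write_all(bytes)?;
    // Flush file content to disk before the rename makes it visible.
    tmp.sync_all()?;
    drop(tmp);

    // On POSIX systems a rename within one filesystem replaces the target
    // atomically.
    fs::rename(&tmp_path, target)
}
```

If the process dies or the disk fills up before the rename, only the temporary file is left incomplete and the previous version of the target stays intact.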

Recover from issues with corrupt files on upgrade to 0.14.3 (#1160)

eb518c8

timbru requested a review from ximon18 November 23, 2023 16:14


timbru closed this Nov 29, 2023
