Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recover from issues with corrupt files on upgrade to 0.14.3 (#1160) #1162

Closed
wants to merge 1 commit into from

Conversation

timbru
Copy link
Contributor

@timbru timbru commented Nov 23, 2023

This should fix issues. It still needs more manual testing. Unfortunately, it's not easy to test this in automated way currently as our upgrade test code assumes that values can be read before putting them into a memstore (which only accepts valid json) for testing. This needs some thought - or a very thorough review and manual testing.

There is a happy case test, but that one essentially does nothing - it covers the case where an upgrade is done and there were no corrupt files.

@timbru timbru requested a review from ximon18 November 23, 2023 16:14
@@ -1081,7 +1081,7 @@ dependencies = [

[[package]]
name = "krill"
version = "0.15.0-dev"
version = "0.14.3"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be left as 0.15.0-dev?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it would be more clean to make this change in a separate PR. If you prefer I can do that. But, this is going to be in a bug fix release, so 0.14.3 is more appropriate than 0.15.0.

@@ -1,7 +1,7 @@
[package]
# Note: some of these values are also used when building Debian packages below.
name = "krill"
version = "0.15.0-dev"
version = "0.14.3"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be left as 0.15.0-dev?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it would be more clean to make this change in a separate PR. If you prefer I can do that. But, this is going to be in a bug fix release, so 0.14.3 is more appropriate than 0.15.0.

"test-resources/migrations/v0_14_2/",
&[
"ca_objects",
"cas", // testbed/command-6.json was intentionally removed
Copy link
Member

@ximon18 ximon18 Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Removed" - is that right? I don't see it in earlier test-resources/migrations/xxx/cas/ directories so it wasn't removed compared to them, but it is missing in the sequence as there is command-5.json and command-7.json but no command-6.json. If so I would say "is intentionally missing".


for running_key in task_store
.list_keys(&running_scope)
.map_err(|e| UpgradeError::Custom(format!("Cannot read running tasks: {}", e)))?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the description on this fn, why can this fail?


for pending_key in task_store
.list_keys(&pending_scope)
.map_err(|e| UpgradeError::Custom(format!("Cannot read pending tasks: {}", e)))?
Copy link
Member

@ximon18 ximon18 Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the description on this fn, why can this fail?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g. can't you call .wipe() like you do in upgrade_status() below?

pre_0_14_3::upgrade_status(config)?;

migrated = pre_0_14_3::upgrade_agg::<CertAuth>(CASERVER_NS, config)? || migrated;
// note: ca_objects is not affected by #1160, as it replaces existing objects and in that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this comment refer to the line above?

if task_store.get(&pending_key).is_err() {
warn!("Pending task could not be parsed. Dropping: {}", pending_key);
task_store.delete(&pending_key).map_err(|e| {
UpgradeError::Custom(format!("Cannot delete corrupt task: {}. Error: {}", pending_key, e))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than error out here, if this fails could you not try and wipe the whole store as is done in upgrade_status()?

pub fn upgrade_status(config: &Config) -> UpgradeResult<()> {
if StatusStore::create(&config.storage_uri, STATUS_NS).is_err() {
let status_kv_store = KeyValueStore::create(&config.storage_uri, STATUS_NS)?;
status_kv_store.wipe()?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just discovered that the Disk implementation of clear() (which is called by wipe()) swallows any error that occurs, so calling wipe() here can appear to succeed but actually leave the store files present on disk, in their corrupt form. Won't that be a problem?

}

/// Check the task store for corrupted tasks due to issue #1160, and if present
/// drop them in place. This is safe to do because missing tasks are re-added at start up.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why, here and in the warning messages below, do you use the term "drop"? "Drop" has specific meaning in Rust but that isn't the meaning that applies here, rather items are being removed/deleted from the store, and in the most common (currently the only) end-user scenarion this will be deletion of files on disk. Wouldn't delete be a word that operators will understand better than drop?

@@ -769,6 +777,29 @@ pub fn prepare_upgrade_data_migrations(
pre_0_14_0::UpgradeAggregateStoreTrustAnchorProxy::upgrade(TA_PROXY_SERVER_NS, mode, config)?;

Ok(Some(UpgradeReport::new(aspa_configs, true, versions)))
} else if versions.from < KrillVersion::release(0, 14, 3) {
// Check all possibly affected stores for corrupted files resulting
// from issue #1160 and fix them if needed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment, to me at least, is misleading. Nothing will be "fixed", if problematic store items are found they are deleted which to me is quite different than fixing them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction: Deletion only happens in the upgrade_tasks() and upgrade_status() calls, I see that the upgrade_agg() call for example does attempt a sort of fix via replay (though replay as far as it can which presumably is also "deleting" the bits that it can't).

@timbru
Copy link
Contributor Author

timbru commented Nov 29, 2023

After discussion - we will close this. It's only useful to do this under very specific circumstances: i.e. there is a 0.14.0/.1/.2 installation that runs into the issue that a tempfile is not used, and a resulting file is half-written due to a hard server crash or disk full situation, and then the user upgrades to this. Otoh it could cause issues in other situations. So, it was decided that it's best not to try any automated recover here. Just let users update to 0.14.3, and talk to use if they get hit by the issue in 0.14.0-0.14.2

@timbru timbru closed this Nov 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants