-
-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Recover from issues with corrupt files on upgrade to 0.14.3 (#1160) #1162
Conversation
test-resources/migrations/v0_14_2/tasks/pending/1700751070000-sweep_login_cache
Show resolved
Hide resolved
@@ -1081,7 +1081,7 @@ dependencies = [ | |||
|
|||
[[package]] | |||
name = "krill" | |||
version = "0.15.0-dev" | |||
version = "0.14.3" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be left as 0.15.0-dev?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it would be more clean to make this change in a separate PR. If you prefer I can do that. But, this is going to be in a bug fix release, so 0.14.3 is more appropriate than 0.15.0.
@@ -1,7 +1,7 @@ | |||
[package] | |||
# Note: some of these values are also used when building Debian packages below. | |||
name = "krill" | |||
version = "0.15.0-dev" | |||
version = "0.14.3" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be left as 0.15.0-dev?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it would be more clean to make this change in a separate PR. If you prefer I can do that. But, this is going to be in a bug fix release, so 0.14.3 is more appropriate than 0.15.0.
"test-resources/migrations/v0_14_2/", | ||
&[ | ||
"ca_objects", | ||
"cas", // testbed/command-6.json was intentionally removed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Removed" - is that right? I don't see it in earlier test-resources/migrations/xxx/cas/ directories so it wasn't removed compared to them, but it is missing in the sequence as there is command-5.json and command-7.json but no command-6.json. If so I would say "is intentionally missing".
|
||
for running_key in task_store | ||
.list_keys(&running_scope) | ||
.map_err(|e| UpgradeError::Custom(format!("Cannot read running tasks: {}", e)))? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the description on this fn, why can this fail?
|
||
for pending_key in task_store | ||
.list_keys(&pending_scope) | ||
.map_err(|e| UpgradeError::Custom(format!("Cannot read pending tasks: {}", e)))? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the description on this fn, why can this fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E.g. can't you call .wipe()
like you do in upgrade_status()
below?
pre_0_14_3::upgrade_status(config)?; | ||
|
||
migrated = pre_0_14_3::upgrade_agg::<CertAuth>(CASERVER_NS, config)? || migrated; | ||
// note: ca_objects is not affected by #1160, as it replaces existing objects and in that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this comment refer to the line above?
if task_store.get(&pending_key).is_err() { | ||
warn!("Pending task could not be parsed. Dropping: {}", pending_key); | ||
task_store.delete(&pending_key).map_err(|e| { | ||
UpgradeError::Custom(format!("Cannot delete corrupt task: {}. Error: {}", pending_key, e)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than error out here, if this fails could you not try and wipe the whole store as is done in upgrade_status()
?
pub fn upgrade_status(config: &Config) -> UpgradeResult<()> { | ||
if StatusStore::create(&config.storage_uri, STATUS_NS).is_err() { | ||
let status_kv_store = KeyValueStore::create(&config.storage_uri, STATUS_NS)?; | ||
status_kv_store.wipe()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just discovered that the Disk implementation of clear()
(which is called by wipe()
) swallows any error that occurs, so calling wipe()
here can appear to succeed but actually leave the store files present on disk, in their corrupt form. Won't that be a problem?
} | ||
|
||
/// Check the task store for corrupted tasks due to issue #1160, and if present | ||
/// drop them in place. This is safe to do because missing tasks are re-added at start up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why, here and in the warning messages below, do you use the term "drop"? "Drop" has specific meaning in Rust but that isn't the meaning that applies here, rather items are being removed/deleted from the store, and in the most common (currently the only) end-user scenarion this will be deletion of files on disk. Wouldn't delete be a word that operators will understand better than drop?
@@ -769,6 +777,29 @@ pub fn prepare_upgrade_data_migrations( | |||
pre_0_14_0::UpgradeAggregateStoreTrustAnchorProxy::upgrade(TA_PROXY_SERVER_NS, mode, config)?; | |||
|
|||
Ok(Some(UpgradeReport::new(aspa_configs, true, versions))) | |||
} else if versions.from < KrillVersion::release(0, 14, 3) { | |||
// Check all possibly affected stores for corrupted files resulting | |||
// from issue #1160 and fix them if needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment, to me at least, is misleading. Nothing will be "fixed", if problematic store items are found they are deleted which to me is quite different than fixing them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correction: Deletion only happens in the upgrade_tasks()
and upgrade_status()
calls, I see that the upgrade_agg()
call for example does attempt a sort of fix via replay (though replay as far as it can which presumably is also "deleting" the bits that it can't).
After discussion - we will close this. It's only useful to do this under very specific circumstances: i.e. there is a 0.14.0/.1/.2 installation that runs into the issue that a tempfile is not used, and a resulting file is half-written due to a hard server crash or disk full situation, and then the user upgrades to this. Otoh it could cause issues in other situations. So, it was decided that it's best not to try any automated recover here. Just let users update to 0.14.3, and talk to use if they get hit by the issue in 0.14.0-0.14.2 |
This should fix issues. It still needs more manual testing. Unfortunately, it's not easy to test this in automated way currently as our upgrade test code assumes that values can be read before putting them into a memstore (which only accepts valid json) for testing. This needs some thought - or a very thorough review and manual testing.
There is a happy case test, but that one essentially does nothing - it covers the case where an upgrade is done and there were no corrupt files.