Skip to content
This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

Slow/failed start-up after crash on v1.16.18 #34062

Closed
bji opened this issue Nov 14, 2023 · 1 comment
Closed

Slow/failed start-up after crash on v1.16.18 #34062

bji opened this issue Nov 14, 2023 · 1 comment
Labels
community Community contribution stale [bot only] Added to stale content; results in auto-close after a week.

Comments

@bji
Copy link
Contributor

bji commented Nov 14, 2023

Problem

Validator crashed due to #33740. On attempt to restart, the validator went into the same bad state that I've seen before after this crash, where a ton of logs of the form:

[2023-11-14T01:54:30.623212419Z WARN  solana_vote_program::vote_state] A79u1awz7CqnxmNYEVtzWwSzup3eKPNW6w2Jrd56oZ3y dropped vote VoteStateUpdate { lockouts: [Lockout { slot: 229873452, confirmation_count: 31 }, Lockout { slot: 229873453, confirmation_count: 30 }, Lockout { slot: 229873454, confirmation_count: 29 }, Lockout { slot: 229873455, confirmation_count: 28 }, Lockout { slot: 229873456, confirmation_count: 27 }, Lockout { slot: 229873457, confirmation_count: 26 }, Lockout { slot: 229873458, confirmation_count: 25 }, Lockout { slot: 229873459, confirmation_count: 24 }, Lockout { slot: 229873460, confirmation_count: 23 }, Lockout { slot: 229873461, confirmation_count: 22 }, Lockout { slot: 229873462, confirmation_count: 21 }, Lockout { slot: 229873463, confirmation_count: 20 }, Lockout { slot: 229873464, confirmation_count: 19 }, Lockout { slot: 229873465, confirmation_count: 18 }, Lockout { slot: 229873466, confirmation_count: 17 }, Lockout { slot: 229873467, confirmation_count: 16 }, Lockout { slot: 229873468, confirmation_count: 15 }, Lockout { slot: 229873469, confirmation_count: 14 }, Lockout { slot: 229873470, confirmation_count: 13 }, Lockout { slot: 229873471, confirmation_count: 12 }, Lockout { slot: 229873472, confirmation_count: 11 }, Lockout { slot: 229873473, confirmation_count: 10 }, Lockout { slot: 229873481, confirmation_count: 9 }, Lockout { slot: 229873482, confirmation_count: 8 }, Lockout { slot: 229873483, confirmation_count: 7 }, Lockout { slot: 229873484, confirmation_count: 6 }, Lockout { slot: 229873485, confirmation_count: 5 }, Lockout { slot: 229873486, confirmation_count: 4 }, Lockout { slot: 229873487, confirmation_count: 3 }, Lockout { slot: 229873488, confirmation_count: 2 }, Lockout { slot: 229873489, confirmation_count: 1 }], root: Some(229873445), hash: 9N9wFVpTHTJv1Q1fRoRyhr5CbytkMKVadzeKFMbvHejK, timestamp: Some(1699926493) } failed to match hash 9N9wFVpTHTJv1Q1fRoRyhr5CbytkMKVadzeKFMbvHejK Y15c1bxBTuFo9s1UJb4HMF9JR97EvdsMgaPg2c6CBAj

are logged. That's potential issue 1.

The solution I have used in this case is to delete the rocksdb from my ledger, download a fresh snapshot, and restart.

After doing that, the validator went into a mode where it would never get to the "voting" phase. Instead it logged like this periodically:

[2023-11-14T18:06:14.733317350Z INFO  solana_core::cluster_slot_state_verifier] check_slot_agrees_with_cluster() slot: 230006728, root: 230006574, slot_state_update: DuplicateConfirmed(DuplicateConfirmedState { duplicate_confirmed_hash: 7dZqqGgf1TCaasmxX2eAiZiPxUdNYGPewXBC1xTjH6w9, bank_status: Frozen(7dZqqGgf1TCaasmxX2eAiZiPxUdNYGPewXBC1xTjH6w9) })
[2023-11-14T18:06:14.733363116Z INFO  solana_core::replay_stage] Waiting to switch vote to 230006729,
                            resetting to slot None for now,
                            switch proof stake: 0,
                            threshold stake: 153264047423939550,
                            total stake: 403326440589314600
[2023-11-14T18:06:14.733418090Z INFO  solana_core::replay_stage] Haven't landed a vote, so skipping my leader slot

That's potential issue 2.

The full logs for almost 1 day leading up to these issues:

https://s3.us-west-1.amazonaws.com/shinobi-systems.com/BLADE.2014.11.14.log.gz

(not sure why I named the file 2014.11.14, it's from 2023.11.14, some serious fumble fingering there)

Proposed Solution

@bji bji added the community Community contribution label Nov 14, 2023
@bji
Copy link
Contributor Author

bji commented Nov 14, 2023

Some points of interest from within the log file:

Line 919481, YESTERDAY'S crash (issue #33740):

[2023-11-14T01:48:44.265809110Z INFO  solana_validator] Starting validator with: ArgsOs {

Line 1910826, validator starting up and going into the POTENTIAL ISSUE 1 mode:

[2023-11-14T01:57:47.176418508Z INFO  solana_validator] Starting validator with: ArgsOs {

Line 2043435, validator starting up after being restarted because it was in the POTENTIAL ISSUE 1 mode. A possibly faulty attempt had been made to delete the ledger before start-up. But after this the validator was still in heavy "dropped vote VoteStateUpdate" mode:

[2023-11-14T02:03:48.684757437Z INFO  solana_validator] Starting validator with: ArgsOs {

Line 2410319, restarted again because of the continued "VoteStateUpdate" logs, but this time, DEFINITELY deleted rocksdb before the restart:

[2023-11-14T02:14:51.383772737Z INFO  solana_validator] Starting validator with: ArgsOs {

Line 10533669, TODAY'S crash (issue #33740), after which the restarted validator was back in the bad "VoteStateUpdate" logging mode (POTENTIAL ISSUE 1):

[2023-11-14T17:47:07.603584201Z INFO  solana_validator] Starting validator with: ArgsOs {

Line 10637901: DOWNLOADED FRESH SNAPSHOTS then restarted validator, it went into the POTENTIAL ISSUE 2 mode, just sat there never making it to "voting". I am not sure if I deleted rocksdb right before this start-up.

[2023-11-14T17:54:58.232062722Z INFO  solana_validator] Starting validator with: ArgsOs {

Line 10917972: Gave up on the "never getting to voting" start-up and restarted again. Possibly deleted rocksdb first:

[2023-11-14T18:06:47.467202892Z INFO  solana_validator] Starting validator with: ArgsOs {

Line10918441: Final restart, this time I definitely cleared out rocksdb and downloaded fresh snapshots before starting up. This restart finally worked.

[2023-11-14T18:10:07.151966666Z INFO  solana_validator] Starting validator with: ArgsOs {

@github-actions github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Nov 14, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
community Community contribution stale [bot only] Added to stale content; results in auto-close after a week.
Projects
None yet
Development

No branches or pull requests

1 participant