Add fast sync algorithm. #2763

Closed
wants to merge 50 commits
Changes from 8 commits

50 commits
f2c9704
Add fast-sync engine.
shamil-gadelshin May 9, 2024
eca1522
Add raw block importer.
shamil-gadelshin May 9, 2024
00c3d65
node: Add fast-sync algorithm.
shamil-gadelshin Mar 29, 2024
8a5e758
node: Modify CLI configuration to support fast-sync.
shamil-gadelshin Mar 29, 2024
011f4a5
Integrate fast-sync algorithm into dsn-sync
shamil-gadelshin May 13, 2024
042ae07
Apply review suggestions.
shamil-gadelshin May 16, 2024
f8c4600
Change peer source for fast-sync.
shamil-gadelshin May 17, 2024
ae590ca
Update fast sync method.
shamil-gadelshin May 17, 2024
1fb2c0e
Only sync state from peers that have a higher best block number than …
nazar-pc May 21, 2024
14408db
Import lists of blocks together instead of individually to leverage v…
nazar-pc May 21, 2024
90b0ec9
Remove incorrect of last block from second last segment and add TODOs
nazar-pc May 21, 2024
9685fe9
Merge pull request #2779 from subspace/add-fast-sync-algorithm-fixes
shamil-gadelshin May 22, 2024
9d384f5
Refactor fast-sync.
shamil-gadelshin May 21, 2024
34b5a14
Move fast-sync out of DSN-sync scope.
shamil-gadelshin May 22, 2024
7e71097
Modify sync CLI arguments.
shamil-gadelshin May 22, 2024
c6f0240
Simplify fast-sync-engine.
shamil-gadelshin May 14, 2024
08510d2
Merge branch 'main' into add-fast-sync-algorithm
shamil-gadelshin May 23, 2024
f9d544f
Inline both `import_raw_block` and `create_raw_block` into the place …
nazar-pc May 23, 2024
70de7e5
Merge pull request #2785 from subspace/inline-import-raw-block
shamil-gadelshin May 24, 2024
5938323
Remove peer-management elements from fast-sync.
shamil-gadelshin May 24, 2024
72820c2
Add some comments.
shamil-gadelshin May 24, 2024
9a8778d
Revert unnecessary changes to DSN sync code
nazar-pc May 27, 2024
81b011d
Remove `FastSyncResult` that is no longer necessary
nazar-pc May 27, 2024
17601c2
Remove unnecessary retrieval of last segment header
nazar-pc May 27, 2024
f335461
Rename `download_segment_headers` to `sync_segment_headers` and simpl…
nazar-pc May 27, 2024
83228fc
Make fast sync not assume block size
nazar-pc May 27, 2024
73c0d8d
Merge pull request #2795 from subspace/revert-dsn-sync-changes
shamil-gadelshin May 28, 2024
ac4db9e
Toggle skip_proof variable for state downloading.
shamil-gadelshin May 28, 2024
96c20cf
Merge pull request #2796 from subspace/fast-sync-improvements
nazar-pc May 28, 2024
7e416f1
Make fast sync import block atomically, simplify code
nazar-pc May 27, 2024
3771a0a
Update Subspace block import to be better suited for custom sync mech…
nazar-pc May 27, 2024
1ad08d9
Small corrections and cleanups
nazar-pc May 27, 2024
2005b4b
Merge pull request #2797 from subspace/atomic-fast-sync-block-import
nazar-pc May 28, 2024
fa89440
Remove unnecessary `Mutex` and unused generics
nazar-pc May 27, 2024
350dfbc
Simplify fast sync engine API, add support for `fork_id`
nazar-pc May 27, 2024
f247877
Simplify networking in fast sync engine, simplify generics further
nazar-pc May 27, 2024
3b5ab82
Merge pull request #2798 from subspace/fast-sync-cleanup
nazar-pc May 28, 2024
95d17ce
Correct check for fast sync start and fix sync mode resetting after (…
nazar-pc May 28, 2024
5b39dd1
Simplify block import without resorting to `apply_block` low-level API
nazar-pc May 28, 2024
0edf1f7
Do not finalize blocks that do not exist
nazar-pc May 29, 2024
ae8a54c
Fix condition for clearing block gap
nazar-pc May 29, 2024
ccea056
Print info after fast sync
nazar-pc May 29, 2024
c14a7b2
Merge pull request #2799 from subspace/simplify-fast-sync
nazar-pc May 29, 2024
7b39a9d
Merge branch 'main' into add-fast-sync-algorithm
shamil-gadelshin May 30, 2024
0dafb27
Merge branch 'main' into add-fast-sync-algorithm
shamil-gadelshin May 30, 2024
c175ca8
Update fast-sync after latest polkadot-sdk
shamil-gadelshin May 30, 2024
1940fe4
Rename fast-sync -> snap-sync
shamil-gadelshin May 30, 2024
677c662
Refactor fast-sync.
shamil-gadelshin May 30, 2024
2e0a3d7
Remove redundant `ConfigSyncMode` type
shamil-gadelshin May 30, 2024
b216a18
Remove substrate sync option.
shamil-gadelshin May 30, 2024
2 changes: 2 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default.

@@ -210,6 +210,7 @@ fn main() -> Result<(), Error> {
sync_from_dsn: true,
is_timekeeper: false,
timekeeper_cpu_cores: Default::default(),
fast_sync_enabled: false,
};

let partial_components = subspace_service::new_partial::<PosTable, RuntimeApi>(
34 changes: 27 additions & 7 deletions crates/subspace-node/src/commands/run/consensus.rs
@@ -8,7 +8,7 @@ use sc_cli::{
TransactionPoolParams, RPC_DEFAULT_PORT,
};
use sc_informant::OutputFormat;
use sc_network::config::{MultiaddrWithPeerId, NonReservedPeerMode, SetConfig};
use sc_network::config::{MultiaddrWithPeerId, NonReservedPeerMode, SetConfig, SyncMode};
use sc_service::{BlocksPruning, Configuration, PruningMode};
use sc_storage_monitor::StorageMonitorParams;
use sc_telemetry::TelemetryEndpoints;
@@ -162,6 +162,8 @@ enum StatePruningMode {
Archive,
/// Keep only the data of finalized blocks.
ArchiveCanonical,
/// Keep the data of the last number of finalized blocks.
Number(u32),
}

impl FromStr for StatePruningMode {
@@ -171,17 +173,21 @@ impl FromStr for StatePruningMode {
match input {
"archive" => Ok(Self::Archive),
"archive-canonical" => Ok(Self::ArchiveCanonical),
_ => Err("Invalid state pruning mode specified".to_string()),
n => n
.parse()
.map_err(|_| "Invalid block pruning mode specified".to_string())
.map(Self::Number),
}
}
}

impl fmt::Display for StatePruningMode {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
f.write_str(match self {
Self::Archive => "archive",
Self::ArchiveCanonical => "archive-canonical",
})
match self {
Self::Archive => f.write_str("archive"),
Self::ArchiveCanonical => f.write_str("archive-canonical"),
Self::Number(n) => f.write_str(n.to_string().as_str()),
}
}
}
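
For illustration only, here is a hypothetical round-trip through the extended parser above (this assumes StatePruningMode also derives Debug and PartialEq, which is not shown in this diff):

    assert_eq!("archive".parse::<StatePruningMode>(), Ok(StatePruningMode::Archive));
    assert_eq!("1000".parse::<StatePruningMode>(), Ok(StatePruningMode::Number(1000)));
    assert!("bogus".parse::<StatePruningMode>().is_err());
    assert_eq!(StatePruningMode::Number(1000).to_string(), "1000");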

@@ -256,6 +262,7 @@ impl PruningOptions {
match self.state_pruning {
StatePruningMode::Archive => PruningMode::ArchiveAll,
StatePruningMode::ArchiveCanonical => PruningMode::ArchiveCanonical,
StatePruningMode::Number(num) => PruningMode::blocks_pruning(num),
}
}

@@ -397,6 +404,10 @@ pub(super) struct ConsensusChainOptions {

#[clap(flatten)]
timekeeper_options: TimekeeperOptions,

/// Experimental support of state-only sync using DSN.
#[arg(long, default_value_t = false)]
fast_sync: bool,

Member:
I don't think this is the right API, it doesn't scale.

Substrate CLI had this, which we have removed since we only supported full sync:

      --sync <SYNC_MODE>
          Blockchain syncing mode.
          
          [default: full]

          Possible values:
          - full:        Full sync. Download and verify all blocks
          - fast:        Download blocks without executing them. Download latest state with proofs
          - fast-unsafe: Download blocks without executing them. Download latest state without proofs
          - warp:        Prove finality and download the latest state

I think we should bring --sync back and this particular mode can be called "fast" (even though it will be a bit confusing for those familiar with Substrate). I'm also not sure it is a good idea to have state pruning defined by a number since, IIRC, it will prune state regardless of finality, which means the node will likely fail to restart in some cases due to state being unavailable at the point of archiver initialization, which is problematic.

This is why I mentioned in the past that we should probably call it "farmer sync" and not support defining the state pruning mode explicitly; we should manually prune things when we need to, but not sooner (hence paritytech/polkadot-sdk#1570 from September of last year).

So for now I'd vote for --sync farmer and not allowing the user to specify --blocks-pruning or --state-pruning when it is used. Not sure what to do about state pruning yet, though. Did you confirm that it doesn't remove non-finalized state?

Contributor Author:
Makes sense. I will add this later either within this PR or in a separate one.
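
To make the proposal above concrete, here is a minimal, hypothetical sketch of what a re-introduced --sync flag could look like with clap; the enum name, variants, and default are illustrative and not the API that eventually landed:

    use clap::{Parser, ValueEnum};

    /// Hypothetical syncing modes for the consensus node.
    #[derive(Debug, Copy, Clone, ValueEnum)]
    enum ChainSyncMode {
        /// Download and verify all blocks from genesis.
        Full,
        /// Download the latest state first, then follow the chain normally.
        Snap,
    }

    #[derive(Debug, Parser)]
    struct SyncOptions {
        /// Blockchain syncing mode.
        #[arg(long, value_enum, default_value = "full")]
        sync: ChainSyncMode,
    }

    fn main() {
        let options = SyncOptions::parse();
        println!("selected sync mode: {:?}", options.sync);
    }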

}

pub(super) struct PrometheusConfiguration {
@@ -440,6 +451,7 @@ pub(super) fn create_consensus_chain_configuration(
sync_from_dsn,
storage_monitor,
mut timekeeper_options,
fast_sync,
} = consensus_node_options;

let transaction_pool;
@@ -584,7 +596,14 @@ pub(super) fn create_consensus_chain_configuration(
chain_spec: Box::new(chain_spec),
informant_output_format: OutputFormat { enable_color },
};
let consensus_chain_config = Configuration::from(consensus_chain_config);
let mut consensus_chain_config = Configuration::from(consensus_chain_config);
// TODO: revisit SyncMode change after https://github.com/paritytech/polkadot-sdk/issues/4407
if fast_sync {
consensus_chain_config.network.sync_mode = SyncMode::LightState {
skip_proofs: true,
storage_chain_mode: false,
};
}

let pot_external_entropy =
derive_pot_external_entropy(&consensus_chain_config, pot_external_entropy)?;
@@ -653,6 +672,7 @@ pub(super) fn create_consensus_chain_configuration(
sync_from_dsn,
is_timekeeper: timekeeper_options.timekeeper,
timekeeper_cpu_cores: timekeeper_options.timekeeper_cpu_cores,
fast_sync_enabled: fast_sync,
},
dev,
pot_external_entropy,
2 changes: 2 additions & 0 deletions crates/subspace-service/Cargo.toml
@@ -29,6 +29,7 @@ pallet-transaction-payment-rpc = { git = "https://github.com/subspace/polkadot-s
parity-scale-codec = "3.6.9"
parking_lot = "0.12.2"
prometheus-client = "0.22.2"
prost = "0.12"
sc-basic-authorship = { git = "https://github.com/subspace/polkadot-sdk", rev = "808269708cf5375526755797e8f9a9986016727d" }
sc-chain-spec = { git = "https://github.com/subspace/polkadot-sdk", rev = "808269708cf5375526755797e8f9a9986016727d" }
sc-client-api = { git = "https://github.com/subspace/polkadot-sdk", rev = "808269708cf5375526755797e8f9a9986016727d" }
@@ -52,6 +53,7 @@ sc-telemetry = { git = "https://github.com/subspace/polkadot-sdk", rev = "808269
sc-tracing = { git = "https://github.com/subspace/polkadot-sdk", rev = "808269708cf5375526755797e8f9a9986016727d" }
sc-transaction-pool = { git = "https://github.com/subspace/polkadot-sdk", rev = "808269708cf5375526755797e8f9a9986016727d" }
sc-transaction-pool-api = { git = "https://github.com/subspace/polkadot-sdk", rev = "808269708cf5375526755797e8f9a9986016727d" }
sc-utils = { git = "https://github.com/subspace/polkadot-sdk", rev = "808269708cf5375526755797e8f9a9986016727d" }
schnorrkel = "0.11.4"
sp-api = { git = "https://github.com/subspace/polkadot-sdk", rev = "808269708cf5375526755797e8f9a9986016727d" }
sp-blockchain = { git = "https://github.com/subspace/polkadot-sdk", rev = "808269708cf5375526755797e8f9a9986016727d" }
4 changes: 3 additions & 1 deletion crates/subspace-service/src/config.rs
@@ -1,5 +1,5 @@
use crate::dsn::DsnConfig;
use crate::sync_from_dsn::DsnSyncPieceGetter;
use crate::sync_from_dsn::import_blocks::DsnSyncPieceGetter;
use sc_chain_spec::ChainSpec;
use sc_network::config::{
MultiaddrWithPeerId, NetworkConfiguration, NodeKeyConfig, SetConfig, SyncMode, TransportConfig,
@@ -269,6 +270,8 @@ pub struct SubspaceConfiguration {
pub is_timekeeper: bool,
/// CPU cores that timekeeper can use
pub timekeeper_cpu_cores: HashSet<usize>,
/// Enables state-only sync using DSN.
pub fast_sync_enabled: bool,
}

impl Deref for SubspaceConfiguration {
67 changes: 54 additions & 13 deletions crates/subspace-service/src/lib.rs
@@ -74,7 +74,7 @@ use sc_proof_of_time::source::gossip::pot_gossip_peers_set_config;
use sc_proof_of_time::source::{PotSlotInfo, PotSourceWorker};
use sc_proof_of_time::verifier::PotVerifier;
use sc_service::error::Error as ServiceError;
use sc_service::{Configuration, NetworkStarter, SpawnTasksParams, TaskManager};
use sc_service::{ClientExt, Configuration, NetworkStarter, SpawnTasksParams, TaskManager};
use sc_subspace_block_relay::{
build_consensus_relay, BlockRelayConfigurationError, NetworkWrapper,
};
@@ -228,18 +228,19 @@ pub type FullClient<RuntimeApi> = sc_service::TFullClient<Block, RuntimeApi, Run
pub type FullBackend = sc_service::TFullBackend<Block>;
pub type FullSelectChain = sc_consensus::LongestChain<FullBackend, Block>;

struct SubspaceExtensionsFactory<PosTable, Client, DomainBlock> {
struct SubspaceExtensionsFactory<PosTable, Block: BlockT, Client, DomainBlock> {
kzg: Kzg,
client: Arc<Client>,
backend: Arc<FullBackend>,
pot_verifier: PotVerifier,
executor: Arc<RuntimeExecutor>,
domains_executor: Arc<sc_domains::RuntimeExecutor>,
fast_sync_state: Arc<Mutex<Option<NumberFor<Block>>>>,
_pos_table: PhantomData<(PosTable, DomainBlock)>,
}

impl<PosTable, Block, Client, DomainBlock> ExtensionsFactory<Block>
for SubspaceExtensionsFactory<PosTable, Client, DomainBlock>
for SubspaceExtensionsFactory<PosTable, Block, Client, DomainBlock>
where
PosTable: Table,
Block: BlockT,
@@ -269,6 +270,7 @@ where
exts.register(PotExtension::new({
let client = Arc::clone(&self.client);
let pot_verifier = self.pot_verifier.clone();
let fast_sync_state = self.fast_sync_state.clone();

Box::new(
move |parent_hash, slot, proof_of_time, quick_verification| {
@@ -298,6 +300,7 @@ where
return false;
}
};

let parent_pre_digest = match extract_pre_digest(&parent_header) {
Ok(parent_pre_digest) => parent_pre_digest,
Err(error) => {
@@ -318,6 +321,22 @@ where
return false;
}

// Check for fast-sync state
{
if let Some(state_block_number) = fast_sync_state.lock().as_ref() {
let parent_block_number = *parent_header.number();
if parent_block_number < *state_block_number {
debug!(
%parent_block_number,
%state_block_number,
"Skipped PoT verification because of the fast sync state block."
);

return true;
}
}
}

Member:
Why are we skipping PoT verification here? The PR description "PoT check update to support fast-sync" is not sufficient to understand why this is required, and I do not remember discussing this.

Contributor Author:
It's yet another special case for fast-sync. Did you mean something else?

Member:
I can guess from the code that this is some kind of special case, but I don't understand what that special case is, what issues it causes, and why you decided to fix it this way out of a set of possibilities. I can sometimes infer those things, but you'll have to help me understand it in this case.

Contributor Author:
AFAIR PoT verification requires a runtime parameters check, and it will fail without the existing state. This snippet disables the check that would fail for this reason for blocks before the fast-sync target block.

Member:
But we always import blocks with state; I'm not sure I'm following what you're saying.

Member:
So the reason you had to add it here is the side effect of incorrectly importing two raw blocks instead of one (the first one being without state). This is also the reason a more advanced version of #2767 failed: there existed a block without state that should not have been there. The fact that that extra block is present will also cause the archiver to fail to restart if interrupted right after insertion of the first block into the database.

Basically, insertion of two raw blocks is incorrect and should be fixed.

Once fixed, the failure will happen at the beginning of the function when attempting to read the parent header, and it should be handled there accordingly.

I checked the code and I believe we should never have invalid votes included in the block, and the parent hash always comes from the runtime's state and not from user input, thus it should be safe to return true from this function in that special case.

Member:
Fixed in #2779
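
For readers following this thread, a rough sketch of the approach described above, with the missing parent header handled at the top of the verification callback; this is an assumed shape, not the actual code from #2779:

    // Inside the PoT verification closure; client implements HeaderBackend<Block>.
    let parent_header = match client.header(parent_hash) {
        Ok(Some(parent_header)) => parent_header,
        Ok(None) => {
            // Below the snap-sync target the node intentionally has no headers or
            // state; the parent hash comes from the runtime rather than from user
            // input, so skipping PoT verification here is considered safe.
            return true;
        }
        Err(error) => {
            debug!(%error, "Failed to retrieve parent header during PoT verification");
            return false;
        }
    };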


let pot_parameters = match client.runtime_api().pot_parameters(parent_hash) {
Ok(pot_parameters) => pot_parameters,
Err(error) => {
@@ -423,6 +442,8 @@ where
pub sync_target_block_number: Arc<AtomicU32>,
/// Telemetry
pub telemetry: Option<Telemetry>,
/// The first block with state enabled by fast-sync.
pub fast_sync_state: Arc<Mutex<Option<NumberFor<Block>>>>,
}

type PartialComponents<RuntimeApi> = sc_service::PartialComponents<
Expand Down Expand Up @@ -490,17 +511,21 @@ where

let executor = Arc::new(executor);

let fast_sync_state = Arc::<Mutex<Option<NumberFor<Block>>>>::default();
client
.execution_extensions()
.set_extensions_factory(SubspaceExtensionsFactory::<PosTable, _, DomainBlock> {
kzg: kzg.clone(),
client: Arc::clone(&client),
pot_verifier: pot_verifier.clone(),
executor: executor.clone(),
domains_executor: Arc::new(domains_executor),
backend: backend.clone(),
_pos_table: PhantomData,
});
.set_extensions_factory(
SubspaceExtensionsFactory::<PosTable, Block, _, DomainBlock> {
kzg: kzg.clone(),
client: Arc::clone(&client),
pot_verifier: pot_verifier.clone(),
executor: executor.clone(),
domains_executor: Arc::new(domains_executor),
backend: backend.clone(),
fast_sync_state: fast_sync_state.clone(),
_pos_table: PhantomData,
},
);

let telemetry = telemetry.map(|(worker, telemetry)| {
task_manager
@@ -598,6 +623,7 @@ where
pot_verifier,
sync_target_block_number,
telemetry,
fast_sync_state,
};

Ok(PartialComponents {
@@ -709,8 +735,19 @@ where
pot_verifier,
sync_target_block_number,
mut telemetry,
fast_sync_state,
} = other;

// Clear block gap after fast sync on reruns.
// Substrate detects a gap and inserts it on each sync.
if config.fast_sync_enabled {
let finalized_hash_existed = client.info().finalized_hash != client.info().genesis_hash;
if finalized_hash_existed {
debug!(client_info=?client.info(), "Clear block gap after fast-sync.");
client.clear_block_gap();
}
}

Member:
This feels like a completely random place to clear gap sync. Can it be moved right next to the code that triggers gap sync in the first place?

Contributor Author:
We need this operation before the regular sync starts. What place do you advise?

Member:
Is this suggestion viable?:

Can it be moved right next to the code that triggers gap sync in the first place?

Contributor Author:
I prefer to clear the gap before driving any networking futures. This is why it's placed in the initialization code, far away from the sync code.

Member:
I don't think I understand the rationale here. If gap sync is triggered by some specific behavior, then we should clear gap sync after we trigger that behavior. If the behavior is importing blocks in a specific order, then you will absolutely drive the network prior to gap sync, because that is how you end up with a block you can physically import in the first place.
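
As an aside, the rerun condition in the snippet under discussion boils down to the following check (a sketch assuming sp_blockchain::Info; the helper name is hypothetical):

    use sp_blockchain::Info;
    use sp_runtime::traits::Block as BlockT;

    /// Snap sync has already completed at least once iff something other than
    /// genesis has been finalized; only then is the recorded block gap cleared.
    fn should_clear_block_gap<Block: BlockT>(fast_sync_enabled: bool, info: &Info<Block>) -> bool {
        fast_sync_enabled && info.finalized_hash != info.genesis_hash
    }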


let offchain_indexing_enabled = config.offchain_worker.indexing_enabled;
let (node, bootstrap_nodes) = match config.subspace_networking {
SubspaceNetworking::Reuse {
@@ -806,7 +843,7 @@ where
}
};

let import_queue_service = import_queue.service();
let import_queue_service = Arc::new(tokio::sync::Mutex::new(import_queue.service()));

Member:
Do you just need the import queue service in multiple places? If so, you can call import_queue.service() more than once and get two service instances with handles to the same import queue. Or is there a different reason for this?

Contributor Author:
At some point there were three instances passed across several methods, which seemed awkward. I managed to work with two instances, but I left the shared ref anyway.

Member:
You're still sharing it this way; what is the benefit of Arc<Mutex<T>> compared to just T here? Seems like unnecessary complexity to me.

Contributor Author:
Am I correct that you prefer passing two instances of service through methods instead?

Member:
Absolutely. It is like having an Arc and, instead of accessing it from two places directly, wrapping it in an additional Arc<Mutex<...>>. I simply do not understand why you would even do that.
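
A minimal sketch of the alternative suggested in this thread, with hypothetical variable names: ImportQueue::service() is called once per consumer, and each returned handle feeds the same underlying queue, so no Arc<Mutex<...>> wrapper is needed:

    // Two independent handles to the same import queue, one per sync mechanism.
    let import_queue_service_for_dsn_sync = import_queue.service();
    let import_queue_service_for_snap_sync = import_queue.service();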

let network_wrapper = Arc::new(NetworkWrapper::default());
let block_relay = Some(
build_consensus_relay(
@@ -890,6 +927,7 @@ where
);

network_wrapper.set(network_service.clone());

if config.sync_from_dsn {
let dsn_sync_piece_getter = config.dsn_piece_getter.unwrap_or_else(|| {
Arc::new(PieceProvider::new(
@@ -916,6 +954,9 @@ where
sync_target_block_number,
pause_sync,
dsn_sync_piece_getter,
config.fast_sync_enabled,
fast_sync_state,
sync_service.clone(),
);
task_manager
.spawn_handle()
Expand Down
Loading