
Add fast sync algorithm. #2763

Closed

shamil-gadelshin wants to merge 50 commits from the add-fast-sync-algorithm branch
Conversation

shamil-gadelshin
Contributor

This PR adds the last part of the fast-sync algorithm. The algorithm is described in the related spec and in the doc comment for the FastSyncer struct. The PR contains the new algorithm, its integration with the existing DSN sync, and the CLI and config parameter changes required to enable the algorithm. Although, to the best of my knowledge, the algorithm works as expected, it should not be used until we develop a related solution for fraud proofs; otherwise fast-sync will fail because of missing state after reaching a fraud proof.

CLI and config changes

  • new fast-sync parameter
  • starting with SyncMode::LightState when fast-sync is enabled
  • modification of the state-pruning parameters to support N blocks pruning (required by SyncMode::LightState)

Notable changes for fast sync integration

  • PoT check update to support fast-sync
  • introducing a block-gap cleanup on start when fast-sync is enabled (Substrate detects the gap and re-inserts it on each start)
  • the Reconstructor is passed to DSN sync from the fast-sync algorithm to preserve its correct inner state

Code contributor checklist:

Member

@nazar-pc nazar-pc left a comment

Left a bunch of comments and questions, will have to re-review afterwards. Sorry for the delay.

Comment on lines 114 to 117
/// - 2. add the last block the second last segment (as raw - without checks and execution),
/// - 3. download the state for the first block of the last segment,
/// - add the first block of the last segment as raw,
/// - download state for it,
Member

These steps do not correspond to the latest version of the specification

Contributor Author

I will answer below, in reply to a similar comment.

Comment on lines 154 to 159
let second_last_segment_blocks = download_and_reconstruct_blocks(
second_last_segment_index,
self.piece_getter,
&mut reconstructor,
)
.await?;
Member

@nazar-pc nazar-pc May 15, 2024

This is not how the algorithm should work according to the spec, though. You're supposed to check the segment header and only download the second last segment in case it is actually necessary, which is most of the time, but not always.

Contributor Author

The spec version will likely work; however, at the implementation level I have doubts. In the absolute majority of cases we will go into the coded branch, so do we need this special case, which complicates the code and requires support but will have a negligible effect? More importantly, we don't have the testing infrastructure to verify the code, so the spec's theoretical version IMO has more drawbacks than benefits.

Member

The only thing that needs to be done is an if branch. If the last block in the second last segment was only partially included, then we download that segment and feed it into the reconstructor; otherwise nothing needs to be downloaded or fed into the reconstructor, and we can proceed with the last segment only. This is a very simple check, and I don't understand why we should intentionally not implement the spec and describe a different version in the function signature.

And such things are already tested in the archiver; you can easily confirm logically how it will behave without actually running the code.

I do not think I understand what drawbacks there are in following the spec.
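A minimal sketch of the check being discussed, using a simplified stand-in type rather than the actual subspace-core-primitives definitions:

// Simplified stand-in for the archived-block progress recorded in a segment
// header (the real type lives in subspace-core-primitives); illustration only.
enum ArchivedBlockProgress {
    // The last block of the segment was archived completely.
    Complete,
    // Only a prefix of the last block fit into the segment.
    Partial(u32),
}

// The second last segment only needs to be downloaded and fed into the
// reconstructor when its last block spills over into the last segment.
fn need_second_last_segment(progress: &ArchivedBlockProgress) -> bool {
    matches!(progress, ArchivedBlockProgress::Partial(_))
}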

Contributor Author

The reason is that we won't test this code, and it can fail even if it seems logical, which has happened multiple times with other simple changes. It's likely possible to emulate this case when we have a completely archived last block; I checked the current blockchain and we don't have one at the moment. I added a TODO with a reference to the spec difference.

Member

I'm sorry, but it makes no sense to me. If we are so afraid to make changes, we'll not make progress at any meaningful rate. Writing code that is intentionally wrong against the spec, and even strictly speaking broken in the general case, makes even less sense to me.

@@ -806,7 +843,7 @@ where
}
};

let import_queue_service = import_queue.service();
let import_queue_service = Arc::new(tokio::sync::Mutex::new(import_queue.service()));
Member

Do you just need the import queue service in multiple places? If so, you can call import_queue.service() more than once and get two service instances with handles to the same import queue. Or is there a different reason for this?
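For illustration, the alternative being suggested here is roughly the following (variable names are hypothetical):

// import_queue.service() can be called more than once; each call returns an
// independent handle to the same underlying import queue, so no shared
// Arc<Mutex<_>> wrapper is required.
let import_queue_service_for_fast_sync = import_queue.service();
let import_queue_service_for_dsn_sync = import_queue.service();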

Contributor Author

At some point there were three instances passed across several methods, which seemed awkward. I managed to work with two instances, but I left the shared ref anyway.

Member

You're still sharing it this way; what is the benefit of Arc<Mutex<T>> compared to just T here? It seems like unnecessary complexity to me.

Contributor Author

Am I correct that you prefer passing two instances of the service through methods instead?

Member

Absolutely. It is like having an Arc and, instead of accessing it from two places directly, wrapping it in an additional Arc<Mutex<T>>. I simply do not understand why you would even do that.

Comment on lines 111 to 112
client.clone(),
import_queue_service,
Member

Why owned values? I think references should still work just fine.

Contributor Author

They are both Arc references

Member

I can see that, but the question above remains

Contributor Author

There are type requirements down the stack.

Member

Which types? They were not required before; what has changed in a way that makes it mandatory?

Comment on lines 293 to 343
let mut reconstructor = Reconstructor::new().map_err(|error| error.to_string())?;

pause_sync.store(true, Ordering::Release);

let finalized_hash_existed = client.info().finalized_hash != client.info().genesis_hash;

if fast_sync_enabled && finalized_hash_existed {
debug!("Fast sync detected existing finalized hash.");
}

// Run fast-sync first.
#[allow(clippy::collapsible_if)] // clippy error: the if statement is not the same
if fast_sync_enabled && !finalized_hash_existed {
if notifications.next().await.is_some() {
let fast_syncer = FastSyncer::new(
&segment_headers_store,
node,
piece_getter,
client.clone(),
import_queue_service.clone(),
network_service.clone(),
fast_sync_state,
sync_service,
);

let fast_sync_result = fast_syncer.sync().await;

match fast_sync_result {
Ok(fast_sync_result) => {
if fast_sync_result.skipped {
debug!("Fast sync was skipped.");
} else {
last_processed_block_number = fast_sync_result.last_imported_block_number;
last_processed_segment_index = fast_sync_result.last_imported_segment_index;
reconstructor = fast_sync_result.reconstructor;

debug!(%last_processed_block_number, %last_processed_segment_index, "Fast sync finished.");
}
}
Err(err) => {
error!("Fast sync failed: {err}");
}
}

notifications_sender
.try_send(NotificationReason::WentOnlineSubspace)
.map_err(|_| {
sc_service::Error::Other("Can't send sync notification reason.".into())
})?;
}
}
Member

As discussed before, this doesn't seem to belong to full sync from DSN; right now it is a one-off separate sync that happens before DSN sync. DSN sync is not a continuation of this sync, because we will have new enough state for regular sync to kick in and take it from there, so there is no need to pass the reconstructor through everything.

Contributor Author

I added the shared reconstructor when I encountered an error. Why do you think there is no need to pass the reconstructor? I would appreciate more details here. There are two more reasons to place it here: fast-sync IS DSN sync, and the shared sync-pause management. However, I'm open to suggestions here.

Member

I added the shared reconstructor when I encountered an error

You know the next natural question I'll have would be "which error did you encounter exactly"?

Why do you think there is no need to pass the reconstructor? I would appreciate more details here.

Everything worked fine before, so it is not on me to prove that with the new code added this is not necessary; you should convince me that something that didn't exist before is now necessary (which is very possible, I just don't see it right now).

There are two more reasons to place it here: fast-sync IS DSN sync, and the shared sync-pause management. However, I'm open to suggestions here.

Arguably it is not. Fast sync's primary goal is to import a single block from DSN: the very first block of the last segment. Even if the node shuts down right after and DSN sync fails after restart, we can still continue regular Substrate sync successfully. The fact that we import other blocks from the last segment is an optimization (we already have them, so there is no need to throw them away), not a requirement.

And I already mentioned in one of the prior discussions that if we download the source pieces of the last segment sequentially, we can continue downloading the last segment and import blocks concurrently, before we've got the whole segment. It is important to distinguish between the goal and core feature we are trying to build and the "nice to haves" on top.

Contributor Author

You know the next natural question I'll have would be "which error did you encounter exactly"?

It's hard to reproduce on the fly, but AFAIR the new reconstructor will result in a missing block on the edge of two segments (the last one imported by fast sync and the remaining segment obtained through DSN sync).

Member

DSN sync is designed to resume from any point; it'll just have to potentially download some of the data twice. But as articulated a few times already, we don't need to call DSN sync after fast sync in the first place, because regular Substrate sync can pick up from there already.

}
}

while let Some(reason) = notifications.next().await {
Member

You have removed sync pausing when a notification is received; this is incorrect and will cause issues.

Contributor Author

I restored the pause. Please share the possible issues.

- refactor comments, error messages, expect messages
- restore pause for DSN-sync
- add comments
- other minor changes
@shamil-gadelshin shamil-gadelshin requested a review from nazar-pc May 16, 2024 12:52

Comment on lines 163 to 167
if blocks_in_second_last_segment < 2 {
return Err(Error::Other(
"Unexpected block array length for the second last segment.".into(),
));
}
Member

I think I understand now what you meant to do here; it is just that the code doesn't actually say so and doesn't check the actual invariant you care about.

The only thing (and nothing else) that we actually care about in fast sync is to import the first block of the last archived segment; this is why the spec talks about that and not other things. There is no requirement for 2 blocks to be present, especially not in the second last segment.

The edge case, though, is that we could hypothetically have (not practically, but it is still worth supporting IMHO) a block larger than a segment. In that case we may be unable to reconstruct the first block of the last segment even if we download the two last segments.

So what the code should actually do is very close to what the spec says:

  • once segment headers are known, check the header of the second last segment and see if its last block was archived fully or partially
  • if fully, we only need to download the last segment and nothing else
  • if we have a partial block, though, then we need to check whether the third last segment ended with the same partial block number, because if it did, we need to keep looking back to the beginning of that block so that we can get the full first block of the last segment regardless of where it starts

With the above description it should be clear that this check doesn't make sense. For example, it will fail in the perfectly fine situation where the second last segment contains the end of one block and the beginning of another (the first block of the last segment). In this case we have 0 blocks reconstructed from the second last segment alone, yet we are able to successfully reconstruct the first block of the last segment, which is all the specification actually requires.
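A rough sketch of that lookback, reusing the simplified ArchivedBlockProgress idea from earlier and reducing each segment header to a (segment index, last archived block number, archived completely?) tuple ordered oldest to newest. This only selects which segments to download; it is not the reconstruction itself and not the PR code.

// Decide which segments must be downloaded to reconstruct the block that
// crosses into the last segment, walking backwards while that block is still
// only partially archived.
fn segments_to_download(headers: &[(u64, u32, bool)]) -> Vec<u64> {
    let mut iter = headers.iter().rev();

    // The last segment is always needed.
    let &(last_index, _, _) = iter.next().expect("at least one segment header");
    let mut needed = vec![last_index];

    // The partially archived block (if any) shared with earlier segments.
    let mut boundary_block: Option<u32> = None;

    for &(index, last_block, archived_completely) in iter {
        match boundary_block {
            // Second last segment: only needed if its last block is partial.
            None => {
                if archived_completely {
                    break;
                }
                boundary_block = Some(last_block);
                needed.push(index);
            }
            // Earlier segments: needed while they end with that same partial block.
            Some(block) => {
                if archived_completely || last_block != block {
                    break;
                }
                needed.push(index);
            }
        }
    }

    // Return in oldest-to-newest order.
    needed.reverse();
    needed
}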

Contributor Author

Both checks were introduced because of the indexing operations (to fail with an understandable error). However, as you pointed out, we don't support this case, so I will remove the check and consider indexing infallible in this case.

Comment on lines 185 to 189
if blocks_in_last_segment < 2 {
return Err(Error::Other(
"Unexpected block array length for the last segment.".into(),
));
}
Member

This check is unnecessary; the only thing we need is a single block. Other blocks can be imported even via regular Substrate sync without issues.

Here is the invariant from the spec you should check instead:

Suggested change
if blocks_in_last_segment < 2 {
return Err(Error::Other(
"Unexpected block array length for the last segment.".into(),
));
}
if blocks.is_empty() {
return Err(Error::Other(
"Failed to reconstruct first block of the last segment".into(),
));
}

Contributor Author

Both checks were introduced because of the indexing operations (to fail with an understandable error). However, as you pointed out, we don't support this case, so I will remove the check and consider indexing infallible.

Comment on lines 324 to 338
// Check for fast-sync state
{
if let Some(state_block_number) = fast_sync_state.lock().as_ref() {
let parent_block_number = *parent_header.number();
if parent_block_number < *state_block_number {
debug!(
%parent_block_number,
%state_block_number,
"Skipped PoT verification because of the fast sync state block."
);

return true;
}
}
}
Member

I can guess from the code that this is some kind of special case, but I don't understand what that special case is, what issues it causes, and why you decided to fix it this way out of a set of possibilities. I can sometimes infer those things, but you'll have to help me understand it in this case.

Comment on lines 741 to 749
// Clear block gap after fast sync on reruns.
// Substrate detects a gap and inserts on each sync.
if config.fast_sync_enabled {
let finalized_hash_existed = client.info().finalized_hash != client.info().genesis_hash;
if finalized_hash_existed {
debug!(client_info=?client.info(), "Clear block gap after fast-sync.");
client.clear_block_gap();
}
}
Member

Is this suggestion viable: can it be moved right next to the code that triggers gap sync in the first place?




@shamil-gadelshin shamil-gadelshin requested a review from nazar-pc May 17, 2024 12:05

Comment on lines 324 to 338

Member

But we always import blocks with state; I'm not sure I'm following what you're saying.

Comment on lines 741 to 749

Member

I don't think I understand the rationale here. If gap sync is triggered by some specific behavior, then we should clear the gap right after we trigger that behavior. If the behavior is importing blocks in a specific order, then you will absolutely drive the network prior to gap sync, because that is how you end up with a block you can physically import in the first place.




@nazar-pc nazar-pc mentioned this pull request May 21, 2024
Member

@nazar-pc nazar-pc left a comment

Fixed a few key issues in #2779, except for the non-atomic state import, and didn't implement support for blocks that cover multiple segments, because the code will have to be refactored anyway and it can be done then.

#2779 needs to be merged into this PR, then further adjustments and cleanups need to be added, after which we'll squash numerous commits into one, maybe a couple, with clear additions according to spec and minimal changes elsewhere.

Comment on lines 287 to 292
self.import_deconstructed_block(
block_bytes,
current_block_number,
BlockOrigin::NetworkInitialSync,
)
.await?;
Member

You should schedule all the blocks together rather than one by one.

Scheduling them one by one means issues with verification concurrency and a huge amount of error logs, one for every single block, in case things go wrong. Check how the DSN sync implementation schedules block import.

UPD: Fixed in #2779
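A hedged illustration of the batching idea: segment_blocks and to_incoming_block are assumed names (the latter standing in for whatever builds an sc_consensus::IncomingBlock from the reconstructed bytes), not APIs from this PR.

// Collect every reconstructed block of the segment first...
let incoming_blocks: Vec<_> = segment_blocks
    .into_iter()
    .map(|(block_number, block_bytes)| to_incoming_block(block_number, block_bytes))
    .collect();

// ...then schedule the whole batch with a single call, so verification can be
// pipelined and a failure does not produce a separate error log per block.
import_queue_service.import_blocks(BlockOrigin::NetworkInitialSync, incoming_blocks);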


let blocks_in_second_last_segment = second_last_segment_blocks.len();
debug!(
"Second last segment blocks downloaded (SegmentId={}): {}-{}",
Member

There is no SegmentId in the protocol; there are just segment indices. I'd use %second_last_segment_index instead of this. On a similar note, I'd use {}..={} instead of {}-{} for block numbers, since we're in Rust.
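For example, a sketch of the suggested wording; first_block_number and last_block_number stand in for whatever range the original format arguments carried:

debug!(
    %second_last_segment_index,
    "Second last segment blocks downloaded: {}..={}",
    first_block_number,
    last_block_number,
);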

Comment on lines 324 to 338
Member

Fixed in #2779

@nazar-pc
Member

Node restart after fast sync is not working yet.

I started node like this:

cargo run --bin subspace-node -- run --chain gemini-3h --fast-sync --blocks-pruning 1000 --state-pruning 1000 --base-path /path/to/node

It synced and was staying up to date successfully:

2024-05-22T11:31:47.460558Z  INFO Consensus: substrate: ✨ Imported #1623917 (0x042c…4f7f)    
2024-05-22T11:31:47.979928Z  INFO Consensus: substrate: 💤 Idle (22 peers), best: #1623917 (0x042c…4f7f), finalized #1591279 (0xbe73…2d82), ⬇ 43.2kiB/s ⬆ 42.5kiB/s    
2024-05-22T11:31:48.175238Z  INFO Consensus: substrate: ✨ Imported #1623918 (0x9279…b004)    
2024-05-22T11:31:49.298301Z  INFO Consensus: substrate: ✨ Imported #1623919 (0x7c57…0125)    
2024-05-22T11:31:52.980452Z  INFO Consensus: substrate: 💤 Idle (22 peers), best: #1623919 (0x7c57…0125), finalized #1591279 (0xbe73…2d82), ⬇ 42.7kiB/s ⬆ 30.2kiB/s    
2024-05-22T11:31:57.980742Z  INFO Consensus: substrate: 💤 Idle (23 peers), best: #1623919 (0x7c57…0125), finalized #1591279 (0xbe73…2d82), ⬇ 18.7kiB/s ⬆ 20.2kiB/s    
2024-05-22T11:32:02.981136Z  INFO Consensus: substrate: 💤 Idle (23 peers), best: #1623919 (0x7c57…0125), finalized #1591279 (0xbe73…2d82), ⬇ 26.9kiB/s ⬆ 27.0kiB/s    
2024-05-22T11:32:07.981621Z  INFO Consensus: substrate: 💤 Idle (23 peers), best: #1623919 (0x7c57…0125), finalized #1591279 (0xbe73…2d82), ⬇ 28.0kiB/s ⬆ 27.8kiB/s    
2024-05-22T11:32:12.981933Z  INFO Consensus: substrate: 💤 Idle (24 peers), best: #1623919 (0x7c57…0125), finalized #1591279 (0xbe73…2d82), ⬇ 26.4kiB/s ⬆ 27.2kiB/s    
2024-05-22T11:32:14.688759Z  INFO Consensus: substrate: ✨ Imported #1623920 (0x88a2…70df)    
2024-05-22T11:32:17.982270Z  INFO Consensus: substrate: 💤 Idle (22 peers), best: #1623920 (0x88a2…70df), finalized #1591279 (0xbe73…2d82), ⬇ 78.4kiB/s ⬆ 51.3kiB/s    
2024-05-22T11:32:21.485434Z  INFO Consensus: substrate: ✨ Imported #1623921 (0xd68b…a35c)    
2024-05-22T11:32:22.982559Z  INFO Consensus: substrate: 💤 Idle (22 peers), best: #1623921 (0xd68b…a35c), finalized #1591279 (0xbe73…2d82), ⬇ 56.0kiB/s ⬆ 38.3kiB/s    
2024-05-22T11:32:27.983010Z  INFO Consensus: substrate: 💤 Idle (22 peers), best: #1623921 (0xd68b…a35c), finalized #1591279 (0xbe73…2d82), ⬇ 26.2kiB/s ⬆ 28.5kiB/s    
2024-05-22T11:32:29.945513Z  INFO Consensus: substrate: ✨ Imported #1623922 (0x6386…edf0)    
2024-05-22T11:32:32.983468Z  INFO Consensus: substrate: 💤 Idle (22 peers), best: #1623922 (0x6386…edf0), finalized #1591279 (0xbe73…2d82), ⬇ 43.3kiB/s ⬆ 37.4kiB/s    
2024-05-22T11:32:37.984041Z  INFO Consensus: substrate: 💤 Idle (22 peers), best: #1623922 (0x6386…edf0), finalized #1591279 (0xbe73…2d82), ⬇ 22.6kiB/s ⬆ 24.1kiB/s    
2024-05-22T11:32:39.625676Z  INFO Consensus: substrate: ✨ Imported #1623923 (0xe655…f731)    

Then I shut it down and started again:

2024-05-22T11:32:59.497045Z  INFO Consensus: subspace_service::sync_from_dsn: Received notification to sync from DSN reason=WentOnlineSubstrate
2024-05-22T11:33:01.260306Z  WARN Consensus: sc_proof_of_time::source: Proof of time chain was extended from block import from_next_slot=9532958 to_next_slot=9532969
2024-05-22T11:33:01.260336Z  INFO Consensus: substrate: ✨ Imported #1623924 (0xf937…b7b4)    
2024-05-22T11:33:03.635817Z  INFO Consensus: substrate: 💤 Idle (1 peers), best: #1623924 (0xf937…b7b4), finalized #1591279 (0xbe73…2d82), ⬇ 8.4kiB/s ⬆ 2.3kiB/s    
2024-05-22T11:33:04.140823Z  WARN Consensus: sync: 💔 Error importing block 0x009fcaee0825caa5724de82d5943d23f8161b92df72e0c81ab792039078f712b: block has an unknown parent    
2024-05-22T11:33:04.386792Z  INFO Consensus: substrate: ✨ Imported #1623924 (0x1f9b…3977)    
2024-05-22T11:33:04.864512Z  WARN Consensus: sync: 💔 Error importing block 0x009fcaee0825caa5724de82d5943d23f8161b92df72e0c81ab792039078f712b: consensus error: Api called for an unknown Block: State already discarded for 0x1f9b639e56a6fecba2e357118de3b356fe89a90f76729c8266d485640a4c3977    
2024-05-22T11:33:05.279444Z  WARN Consensus: sync: 💔 Error importing block 0x009fcaee0825caa5724de82d5943d23f8161b92df72e0c81ab792039078f712b: consensus error: Api called for an unknown Block: State already discarded for 0x1f9b639e56a6fecba2e357118de3b356fe89a90f76729c8266d485640a4c3977    

Note that both blocks have the same number (it is interesting that I got them right away), but the very last block has its state discarded right away for some reason. Maybe the node decided it is not a canonical chain but failed to do a reorg or something? Possibly a side effect of light sync.

Subsequent restarts crash even earlier:

2024-05-22T11:33:51.463587Z  INFO Consensus: sc_service::client::client: 🔨 Initializing Genesis block/state (state: 0x4b08…e8c9, header-hash: 0x0c12…1c34)    
Error: SubstrateService(Other("Failed to build a full subspace node 1: Application(UnknownBlock(\"State already discarded for 0xf93768c99f12a725e4879b1839ef746d6c7bf1328356b646870c489ac01fb7b4\"))"))

Not a hard blocker here, but something to fix before this is really usable.

@shamil-gadelshin
Contributor Author

Node restart after fast sync is not working yet.

This could be because of the state/blocks pruning number. I haven't encountered this error for some time, but I use much greater pruning numbers. I will test the restart again.

@nazar-pc
Member

Any numbers except really-really large ones for state will be (IIRC). The issue here is more interesting though because the block hash for which state is discarded is literally the block that was just imported within the same second. This is why it looks concerning to me.

@shamil-gadelshin
Contributor Author

Any numbers except really-really large ones for state will be (IIRC).

Rephrase it, please.

The issue here is more interesting though because the block hash for which state is discarded is literally the block that was just imported within the same second. This is why it looks concerning to me.

We have multiple dependencies on the previous blocks (old state).

@nazar-pc
Member

Rephrase it, please.

IIRC we will need the state of older blocks up to the beginning of the archived segment that is currently being built, so any number we put for state pruning (until we can control it precisely and programmatically) would have to cover the max possible number of blocks that a segment could possibly contain.
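As a small illustration of that constraint (the constant is made up for the example, it is not a number from the spec):

// State pruning must keep at least as many recent blocks as one archived
// segment can possibly contain, otherwise the state needed at the segment
// boundary may already be gone. MAX_BLOCKS_PER_SEGMENT is illustrative only.
const MAX_BLOCKS_PER_SEGMENT: u32 = 100_000;

fn state_pruning_depth_is_safe(depth: u32) -> bool {
    depth >= MAX_BLOCKS_PER_SEGMENT
}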

We have multiple dependencies on the previous blocks (old state).

Sure, but there should be no way for the block we just imported to not have state half a second later when we want to import the very next block.

@shamil-gadelshin shamil-gadelshin requested a review from nazar-pc May 30, 2024 08:20
@shamil-gadelshin shamil-gadelshin force-pushed the add-fast-sync-algorithm branch from 1c4785b to 1940fe4 Compare May 30, 2024 10:50
Member

@nazar-pc nazar-pc left a comment

Left some nits; this will then need to be squashed, as we discussed previously.

Comment on lines 412 to 415
/// Full sync. Download and verify all blocks.
Full,
/// Download blocks from DSN.
Dsn,
Member

I'm not sure why we have these as separate modes; nothing in the code seems to make any distinction between them. Also, we don't need separate ConfigSyncMode and ChainSyncMode; just a ChainSyncMode that implements FromStr is sufficient.

Contributor Author

We use Snap and not Full, so we use Dsn implicitly. The two types, ChainSyncMode and ConfigSyncMode, were introduced to avoid adding a clap dependency to subspace-service.

Member

I didn't understand what you wanted to say with the first sentence; can you rephrase?

As for Clap, you don't need it. Enum only needs to implement FromStr trait IIRC.
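A sketch of the FromStr approach mentioned here; the variant set and doc comments follow the modes discussed in this thread, while the error type is an assumption:

use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChainSyncMode {
    /// Full sync. Download and verify all blocks.
    Full,
    /// Download blocks from DSN.
    Dsn,
    /// Download the recent state first, then continue like Dsn.
    Snap,
}

impl FromStr for ChainSyncMode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "full" => Ok(Self::Full),
            "dsn" => Ok(Self::Dsn),
            "snap" => Ok(Self::Snap),
            other => Err(format!("Unsupported sync mode: {other}")),
        }
    }
}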

Contributor Author

We use the Dsn value implicitly, without a code reference, as follows: we either use only Full (regular) sync, or the DSN sync option, where we can run Snap sync first.

As for Clap, you don't need it. Enum only needs to implement FromStr trait IIRC.

Indeed!

Member

We use the Dsn value implicitly, without a code reference, as follows: we either use only Full (regular) sync, or the DSN sync option, where we can run Snap sync first.

I'm sorry, it still makes no sense to me, at least it doesn't seem to answer the question I had originally.

Contributor Author

Previously, we had two sync modes, regular and DSN; the latter was enabled by --sync-from-dsn and started a dedicated task.
Now we have three modes: full, dsn, and snap. Snap starts in the same dedicated task as Dsn sync.
We start the dedicated task by checking !config.sync.is_full(). "Not Full" includes the Dsn and Snap sync modes. Snap sync can be activated by the Snap sync mode before the Dsn worker runs in the dedicated sync task.
Thus we use the Dsn sync mode implicitly, without referencing it in the code.

I hope it makes more sense now. Feel free to ask additional questions.
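A sketch of the dispatch being described, reusing the ChainSyncMode sketched above; spawn_dsn_sync_task and the exact wiring are placeholders, not the PR code:

impl ChainSyncMode {
    /// Full sync relies on regular Substrate sync only; every other mode
    /// spawns the dedicated DSN sync task.
    pub fn is_full(&self) -> bool {
        matches!(self, Self::Full)
    }
}

// Dsn is never matched on explicitly: it simply means "dedicated task, but
// skip the snap-sync step", which is why the code has no direct reference to it.
if !config.sync.is_full() {
    let snap_sync_enabled = config.sync == ChainSyncMode::Snap;
    spawn_dsn_sync_task(snap_sync_enabled);
}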

Member

I missed the is_full check somehow. You made Full sync the default, which will disable DSN sync if I understand correctly. I'd say we should simply remove the ability to disable sync from DSN. There should be just Full and Snap. There is no need to have Dsn as a third option; users will not be able to sync fully without DSN anyway (almost all nodes are pruned).

Contributor Author

Will domain operators be affected by this change? We're changing our previous semantics.

Member

What semantics? DSN sync was enabled before unless it was disabled explicitly with a CLI option, so it was opt-out. Now you made it opt-in instead, and I suggest returning to the previous behavior and removing the option that disables sync from DSN, so that it is always enabled.

@shamil-gadelshin shamil-gadelshin requested a review from nazar-pc May 30, 2024 11:12
@shamil-gadelshin
Contributor Author

@nazar-pc Please run the DSN-sync local test.

I experienced weird behavior once: "DSN sync finished" after piece requests and without actual block import.
I haven't reproduced it since.

@nazar-pc
Member

I experienced weird behavior once: "DSN sync finished" after piece requests and without actual block import.

Seems fine to me; there were no new blocks to import from DSN after Snap sync finished. This is actually expected and should happen most of the time.

@shamil-gadelshin
Contributor Author

I experienced weird behavior once: "DSN sync finished" after piece requests and without actual block import.

Seems fine to me; there were no new blocks to import from DSN after Snap sync finished. This is actually expected and should happen most of the time.

I meant a fresh start with DSN sync.

@nazar-pc
Member

I meant a fresh start with DSN sync.

Which pieces did it request then? I think it must have failed to acquire segment headers for some reason, but it's hard to say without logs.

Member

@nazar-pc nazar-pc left a comment

I think this is good to go; I'll open a squashed PR instead of 50 individual commits. Also, CI has failed; it looks like we have a flaky test in networking.

@nazar-pc nazar-pc deleted the add-fast-sync-algorithm branch June 3, 2024 09:44