
Add fast sync algorithm. #2763

Closed

shamil-gadelshin wants to merge 50 commits from the add-fast-sync-algorithm branch
Conversation

shamil-gadelshin
Contributor

This PR adds the last part of the fast-sync algorithm. The algorithm is described in the related spec and in the doc comment for the FastSyncer struct. The PR contains the new algorithm, its integration with the existing DSN sync, and the CLI and config parameter changes required to enable the algorithm. Although, to the best of my knowledge, the algorithm works as expected, it should not be used until we develop a related solution for fraud proofs; otherwise fast-sync will fail because of missing state after reaching a fraud proof.

CLI and config changes

  • new fast-sync parameter
  • starting with SyncMode::LightState when fast-sync is enabled
  • modification of the state-pruning parameters to support N blocks pruning (required by SyncMode::LightState)

Notable changes for fast sync integration

  • PoT check update to support fast-sync
  • introducing a block-gap cleanup on start when fast-sync is enabled (Substrate detects the gap and re-inserts it on each start)
  • the Reconstructor is passed to DSN sync from the fast-sync algorithm to preserve its correct inner state

Code contributor checklist:

Member

@nazar-pc nazar-pc left a comment

Left a bunch of comments and questions, will have to re-review afterwards. Sorry for the delay.

Comment on lines 114 to 117
/// - 2. add the last block the second last segment (as raw - without checks and execution),
/// - 3. download the state for the first block of the last segment,
/// - add the first block of the last segment as raw,
/// - download state for it,
Member

These steps do not correspond to the latest version of the specification

Contributor Author

I will answer below, in reply to a similar comment.

Comment on lines 154 to 159
let second_last_segment_blocks = download_and_reconstruct_blocks(
second_last_segment_index,
self.piece_getter,
&mut reconstructor,
)
.await?;
Member

@nazar-pc nazar-pc May 15, 2024

This is not how the algorithm should work according to the spec, though. You're supposed to check the segment header and only download the second last segment in case it is actually necessary, which is most of the time, but not always.

Contributor Author

The spec version will likely work; however, at the implementation level I have doubts. In the absolute majority of cases we will go into the coded branch, so do we need this special case, which complicates the code and requires support but will have a negligible effect? More importantly, we don't have the testing infrastructure to verify the code, so the spec's theoretical version IMO has more drawbacks than benefits.

Member

The only thing that needs to be done is an if branch. If the last block in the second last segment was only partially included, then we download that segment and feed it into the reconstructor; otherwise nothing needs to be downloaded or fed into the reconstructor, and we can proceed with the last segment only. This is a very simple check, and I don't understand why we should intentionally not implement the spec and describe a different version in the function signature.

And such things are already tested in the archiver; you can easily confirm logically how it will behave without actually running the code.

I do not think I understand what drawbacks there are in following the spec.
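A minimal sketch of the check being discussed, using a simplified stand-in type rather than the actual subspace-core-primitives definitions:

// Simplified stand-in for the archived-block progress recorded in a segment
// header (the real type lives in subspace-core-primitives); illustration only.
enum ArchivedBlockProgress {
    // The last block of the segment was archived completely.
    Complete,
    // Only a prefix of the last block fit into the segment.
    Partial(u32),
}

// The second last segment only needs to be downloaded and fed into the
// reconstructor when its last block spills over into the last segment.
fn need_second_last_segment(progress: &ArchivedBlockProgress) -> bool {
    matches!(progress, ArchivedBlockProgress::Partial(_))
}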

Contributor Author

The reason is that we won't test this code, and it can fail even if it seems logical, which has happened multiple times with other simple changes. It's likely possible to emulate this case when we have a completely archived last block; I checked the current blockchain and we don't have one at the moment. I added a TODO with a reference to the spec difference.

Member

I'm sorry, but it makes no sense to me. If we are so afraid to make changes, we'll not make progress at any meaningful rate. Writing code that is intentionally wrong against the spec, and even strictly speaking broken in the general case, makes even less sense to me.

@@ -806,7 +843,7 @@ where
}
};

let import_queue_service = import_queue.service();
let import_queue_service = Arc::new(tokio::sync::Mutex::new(import_queue.service()));
Member

Do you just need the import queue service in multiple places? If so, you can call import_queue.service() more than once and get two service instances with handles to the same import queue. Or is there a different reason for this?
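For illustration, the alternative being suggested here is roughly the following (variable names are hypothetical):

// import_queue.service() can be called more than once; each call returns an
// independent handle to the same underlying import queue, so no shared
// Arc<Mutex<_>> wrapper is required.
let import_queue_service_for_fast_sync = import_queue.service();
let import_queue_service_for_dsn_sync = import_queue.service();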

Contributor Author

At some point there were three instances passed across several methods, which seemed awkward. I managed to work with two instances, but I left the shared ref anyway.

Member

You're still sharing it this way; what is the benefit of Arc<Mutex<T>> compared to just T here? It seems like unnecessary complexity to me.

Contributor Author

Am I correct that you prefer passing two instances of the service through methods instead?

Member

Absolutely. It is like having an Arc and, instead of accessing it from two places directly, wrapping it in an additional Arc<Mutex<T>>. I simply do not understand why you would even do that.

Comment on lines 111 to 112
client.clone(),
import_queue_service,
Member

Why owned values? I think references should still work just fine.

Contributor Author

They are both Arc references

Member

I can see that, but the question above remains

Contributor Author

There are type requirements down the stack.

Member

Which types? They were not required before; what has changed in a way that makes it mandatory?

Comment on lines 293 to 343
let mut reconstructor = Reconstructor::new().map_err(|error| error.to_string())?;

pause_sync.store(true, Ordering::Release);

let finalized_hash_existed = client.info().finalized_hash != client.info().genesis_hash;

if fast_sync_enabled && finalized_hash_existed {
debug!("Fast sync detected existing finalized hash.");
}

// Run fast-sync first.
#[allow(clippy::collapsible_if)] // clippy error: the if statement is not the same
if fast_sync_enabled && !finalized_hash_existed {
if notifications.next().await.is_some() {
let fast_syncer = FastSyncer::new(
&segment_headers_store,
node,
piece_getter,
client.clone(),
import_queue_service.clone(),
network_service.clone(),
fast_sync_state,
sync_service,
);

let fast_sync_result = fast_syncer.sync().await;

match fast_sync_result {
Ok(fast_sync_result) => {
if fast_sync_result.skipped {
debug!("Fast sync was skipped.");
} else {
last_processed_block_number = fast_sync_result.last_imported_block_number;
last_processed_segment_index = fast_sync_result.last_imported_segment_index;
reconstructor = fast_sync_result.reconstructor;

debug!(%last_processed_block_number, %last_processed_segment_index, "Fast sync finished.");
}
}
Err(err) => {
error!("Fast sync failed: {err}");
}
}

notifications_sender
.try_send(NotificationReason::WentOnlineSubspace)
.map_err(|_| {
sc_service::Error::Other("Can't send sync notification reason.".into())
})?;
}
}
Member

As discussed before, this doesn't seem to belong to full sync from DSN; right now it is a one-off separate sync that happens before DSN sync. DSN sync is not a continuation of this sync, because we will have new enough state for regular sync to kick in and take it from there, so there is no need to pass the reconstructor through everything.

Contributor Author

I added the shared reconstructor when I encountered an error. Why do you think there is no need to pass the reconstructor? I would appreciate more details here. There are two more reasons to place it here: fast-sync IS DSN sync, and the shared sync-pause management. However, I'm open to suggestions here.

Member

I added the shared reconstructor when I encountered an error

You know the next natural question I'll have would be "which error did you encounter exactly"?

Why do you think there is no need to pass the reconstructor? I would appreciate more details here.

Everything worked fine before, so it is not on me to prove that with the new code added this is not necessary; you should convince me that something that didn't exist before is now necessary (which is very possible, I just don't see it right now).

There are two more reasons to place it here: fast-sync IS DSN sync, and the shared sync-pause management. However, I'm open to suggestions here.

Arguably it is not. Fast sync's primary goal is to import a single block from DSN: the very first block of the last segment. Even if the node shuts down right after and DSN sync fails after restart, we can still continue regular Substrate sync successfully. The fact that we import other blocks from the last segment is an optimization (we already have them, so there is no need to throw them away), not a requirement.

And I already mentioned in one of the prior discussions that if we download the source pieces of the last segment sequentially, we can continue downloading the last segment and import blocks concurrently, before we've got the whole segment. It is important to distinguish between the goal and core feature we are trying to build and the "nice to haves" on top.

Contributor Author

You know the next natural question I'll have would be "which error did you encounter exactly"?

It's hard to reproduce on the fly, but AFAIR the new reconstructor will result in a missing block on the edge of two segments (the last one imported by fast sync and the remaining segment obtained through DSN sync).

Member

DSN sync is designed to resume from any point; it'll just have to potentially download some of the data twice. But as articulated a few times already, we don't need to call DSN sync after fast sync in the first place, because regular Substrate sync can pick up from there already.

}
}

while let Some(reason) = notifications.next().await {
Member

You have removed sync pausing when a notification is received; this is incorrect and will cause issues.

Contributor Author

I restored the pause. Please share the possible issues.

- refactor comments, error messages, expect messages
- restore pause for DSN-sync
- add comments
- other minor changes
@shamil-gadelshin shamil-gadelshin requested a review from nazar-pc May 16, 2024 12:52

Comment on lines 163 to 167
if blocks_in_second_last_segment < 2 {
return Err(Error::Other(
"Unexpected block array length for the second last segment.".into(),
));
}
Member

I think I understand now what you meant to do here; it is just that the code doesn't actually say so and doesn't check the actual invariant you care about.

The only thing (and nothing else) that we actually care about in fast sync is to import the first block of the last archived segment; this is why the spec talks about that and not other things. There is no requirement for 2 blocks to be present, especially not in the second last segment.

The edge case, though, is that we could hypothetically have (not practically, but it is still worth supporting IMHO) a block larger than a segment. In that case we may be unable to reconstruct the first block of the last segment even if we download the two last segments.

So what the code should actually do is very close to what the spec says:

  • once segment headers are known, check the header of the second last segment and see if its last block was archived fully or partially
  • if fully, we only need to download the last segment and nothing else
  • if we have a partial block, though, then we need to check whether the third last segment ended with the same partial block number, because if it did, we need to keep looking back to the beginning of that block so that we can get the full first block of the last segment regardless of where it starts

With the above description it should be clear that this check doesn't make sense. For example, it will fail in the perfectly fine situation where the second last segment contains the end of one block and the beginning of another (the first block of the last segment). In this case we have 0 blocks reconstructed from the second last segment alone, yet we are able to successfully reconstruct the first block of the last segment, which is all the specification actually requires.
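A rough sketch of that lookback, reusing the simplified ArchivedBlockProgress idea from earlier and reducing each segment header to a (segment index, last archived block number, archived completely?) tuple ordered oldest to newest. This only selects which segments to download; it is not the reconstruction itself and not the PR code.

// Decide which segments must be downloaded to reconstruct the block that
// crosses into the last segment, walking backwards while that block is still
// only partially archived.
fn segments_to_download(headers: &[(u64, u32, bool)]) -> Vec<u64> {
    let mut iter = headers.iter().rev();

    // The last segment is always needed.
    let &(last_index, _, _) = iter.next().expect("at least one segment header");
    let mut needed = vec![last_index];

    // The partially archived block (if any) shared with earlier segments.
    let mut boundary_block: Option<u32> = None;

    for &(index, last_block, archived_completely) in iter {
        match boundary_block {
            // Second last segment: only needed if its last block is partial.
            None => {
                if archived_completely {
                    break;
                }
                boundary_block = Some(last_block);
                needed.push(index);
            }
            // Earlier segments: needed while they end with that same partial block.
            Some(block) => {
                if archived_completely || last_block != block {
                    break;
                }
                needed.push(index);
            }
        }
    }

    // Return in oldest-to-newest order.
    needed.reverse();
    needed
}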

Contributor Author

Both checks were introduced because of the indexing operations (to fail with an understandable error). However, as you pointed out, we don't support this case, so I will remove the check and consider indexing infallible in this case.

Comment on lines 185 to 189
if blocks_in_last_segment < 2 {
return Err(Error::Other(
"Unexpected block array length for the last segment.".into(),
));
}
Member

This check is unnecessary; the only thing we need is a single block. Other blocks can be imported even via regular Substrate sync without issues.

Here is the invariant from the spec you should check instead:

Suggested change
if blocks_in_last_segment < 2 {
return Err(Error::Other(
"Unexpected block array length for the last segment.".into(),
));
}
if blocks.is_empty() {
return Err(Error::Other(
"Failed to reconstruct first block of the last segment".into(),
));
}

Contributor Author

Both checks were introduced because of the indexing operations (to fail with an understandable error). However, as you pointed out, we don't support this case, so I will remove the check and consider indexing infallible.

Comment on lines 324 to 338
// Check for fast-sync state
{
if let Some(state_block_number) = fast_sync_state.lock().as_ref() {
let parent_block_number = *parent_header.number();
if parent_block_number < *state_block_number {
debug!(
%parent_block_number,
%state_block_number,
"Skipped PoT verification because of the fast sync state block."
);

return true;
}
}
}
Member

I can guess from the code that this is some kind of special case, but I don't understand what that special case is, what issues it causes, and why you decided to fix it this way out of a set of possibilities. I can sometimes infer those things, but you'll have to help me understand it in this case.

Comment on lines 741 to 749
// Clear block gap after fast sync on reruns.
// Substrate detects a gap and inserts on each sync.
if config.fast_sync_enabled {
let finalized_hash_existed = client.info().finalized_hash != client.info().genesis_hash;
if finalized_hash_existed {
debug!(client_info=?client.info(), "Clear block gap after fast-sync.");
client.clear_block_gap();
}
}
Member

Is this suggestion viable: can it be moved right next to the code that triggers gap sync in the first place?




@shamil-gadelshin shamil-gadelshin requested a review from nazar-pc May 17, 2024 12:05

Comment on lines 324 to 338

Member

But we always import blocks with state; I'm not sure I'm following what you're saying.

Comment on lines 741 to 749

Member

I don't think I understand the rationale here. If gap sync is triggered by some specific behavior, then we should clear the gap right after we trigger that behavior. If the behavior is importing blocks in a specific order, then you will absolutely drive the network prior to gap sync, because that is how you end up with a block you can physically import in the first place.




@nazar-pc nazar-pc mentioned this pull request May 21, 2024
Member

@nazar-pc nazar-pc left a comment

Fixed a few key issues in #2779, except for the non-atomic state import, and didn't implement support for blocks that cover multiple segments, because the code will have to be refactored anyway and it can be done then.

#2779 needs to be merged into this PR, then further adjustments and cleanups need to be added, after which we'll squash numerous commits into one, maybe a couple, with clear additions according to spec and minimal changes elsewhere.

Comment on lines 287 to 292
self.import_deconstructed_block(
block_bytes,
current_block_number,
BlockOrigin::NetworkInitialSync,
)
.await?;
Member

You should schedule all the blocks together rather than one by one.

Scheduling them one by one means issues with verification concurrency and a huge amount of error logs, one for every single block, in case things go wrong. Check how the DSN sync implementation schedules block import.

UPD: Fixed in #2779
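A hedged illustration of the batching idea: segment_blocks and to_incoming_block are assumed names (the latter standing in for whatever builds an sc_consensus::IncomingBlock from the reconstructed bytes), not APIs from this PR.

// Collect every reconstructed block of the segment first...
let incoming_blocks: Vec<_> = segment_blocks
    .into_iter()
    .map(|(block_number, block_bytes)| to_incoming_block(block_number, block_bytes))
    .collect();

// ...then schedule the whole batch with a single call, so verification can be
// pipelined and a failure does not produce a separate error log per block.
import_queue_service.import_blocks(BlockOrigin::NetworkInitialSync, incoming_blocks);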


let blocks_in_second_last_segment = second_last_segment_blocks.len();
debug!(
"Second last segment blocks downloaded (SegmentId={}): {}-{}",
Member

There is no SegmentId in the protocol; there are just segment indices. I'd use %second_last_segment_index instead of this. On a similar note, I'd use {}..={} instead of {}-{} for block numbers, since we're in Rust.
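For example, a sketch of the suggested wording; first_block_number and last_block_number stand in for whatever range the original format arguments carried:

debug!(
    %second_last_segment_index,
    "Second last segment blocks downloaded: {}..={}",
    first_block_number,
    last_block_number,
);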

Comment on lines 324 to 338
Member

Fixed in #2779

@nazar-pc
Member

Node restart after fast sync is not working yet.

I started node like this:

cargo run --bin subspace-node -- run --chain gemini-3h --fast-sync --blocks-pruning 1000 --state-pruning 1000 --base-path /path/to/node

It synced and was staying up to date successfully:

2024-05-22T11:31:47.460558Z  INFO Consensus: substrate: ✨ Imported #1623917 (0x042c…4f7f)    
2024-05-22T11:31:47.979928Z  INFO Consensus: substrate: 💤 Idle (22 peers), best: #1623917 (0x042c…4f7f), finalized #1591279 (0xbe73…2d82), ⬇ 43.2kiB/s ⬆ 42.5kiB/s    
2024-05-22T11:31:48.175238Z  INFO Consensus: substrate: ✨ Imported #1623918 (0x9279…b004)    
2024-05-22T11:31:49.298301Z  INFO Consensus: substrate: ✨ Imported #1623919 (0x7c57…0125)    
2024-05-22T11:31:52.980452Z  INFO Consensus: substrate: 💤 Idle (22 peers), best: #1623919 (0x7c57…0125), finalized #1591279 (0xbe73…2d82), ⬇ 42.7kiB/s ⬆ 30.2kiB/s    
2024-05-22T11:31:57.980742Z  INFO Consensus: substrate: 💤 Idle (23 peers), best: #1623919 (0x7c57…0125), finalized #1591279 (0xbe73…2d82), ⬇ 18.7kiB/s ⬆ 20.2kiB/s    
2024-05-22T11:32:02.981136Z  INFO Consensus: substrate: 💤 Idle (23 peers), best: #1623919 (0x7c57…0125), finalized #1591279 (0xbe73…2d82), ⬇ 26.9kiB/s ⬆ 27.0kiB/s    
2024-05-22T11:32:07.981621Z  INFO Consensus: substrate: 💤 Idle (23 peers), best: #1623919 (0x7c57…0125), finalized #1591279 (0xbe73…2d82), ⬇ 28.0kiB/s ⬆ 27.8kiB/s    
2024-05-22T11:32:12.981933Z  INFO Consensus: substrate: 💤 Idle (24 peers), best: #1623919 (0x7c57…0125), finalized #1591279 (0xbe73…2d82), ⬇ 26.4kiB/s ⬆ 27.2kiB/s    
2024-05-22T11:32:14.688759Z  INFO Consensus: substrate: ✨ Imported #1623920 (0x88a2…70df)    
2024-05-22T11:32:17.982270Z  INFO Consensus: substrate: 💤 Idle (22 peers), best: #1623920 (0x88a2…70df), finalized #1591279 (0xbe73…2d82), ⬇ 78.4kiB/s ⬆ 51.3kiB/s    
2024-05-22T11:32:21.485434Z  INFO Consensus: substrate: ✨ Imported #1623921 (0xd68b…a35c)    
2024-05-22T11:32:22.982559Z  INFO Consensus: substrate: 💤 Idle (22 peers), best: #1623921 (0xd68b…a35c), finalized #1591279 (0xbe73…2d82), ⬇ 56.0kiB/s ⬆ 38.3kiB/s    
2024-05-22T11:32:27.983010Z  INFO Consensus: substrate: 💤 Idle (22 peers), best: #1623921 (0xd68b…a35c), finalized #1591279 (0xbe73…2d82), ⬇ 26.2kiB/s ⬆ 28.5kiB/s    
2024-05-22T11:32:29.945513Z  INFO Consensus: substrate: ✨ Imported #1623922 (0x6386…edf0)    
2024-05-22T11:32:32.983468Z  INFO Consensus: substrate: 💤 Idle (22 peers), best: #1623922 (0x6386…edf0), finalized #1591279 (0xbe73…2d82), ⬇ 43.3kiB/s ⬆ 37.4kiB/s    
2024-05-22T11:32:37.984041Z  INFO Consensus: substrate: 💤 Idle (22 peers), best: #1623922 (0x6386…edf0), finalized #1591279 (0xbe73…2d82), ⬇ 22.6kiB/s ⬆ 24.1kiB/s    
2024-05-22T11:32:39.625676Z  INFO Consensus: substrate: ✨ Imported #1623923 (0xe655…f731)    

Then I shut it down and started again:

2024-05-22T11:32:59.497045Z  INFO Consensus: subspace_service::sync_from_dsn: Received notification to sync from DSN reason=WentOnlineSubstrate
2024-05-22T11:33:01.260306Z  WARN Consensus: sc_proof_of_time::source: Proof of time chain was extended from block import from_next_slot=9532958 to_next_slot=9532969
2024-05-22T11:33:01.260336Z  INFO Consensus: substrate: ✨ Imported #1623924 (0xf937…b7b4)    
2024-05-22T11:33:03.635817Z  INFO Consensus: substrate: 💤 Idle (1 peers), best: #1623924 (0xf937…b7b4), finalized #1591279 (0xbe73…2d82), ⬇ 8.4kiB/s ⬆ 2.3kiB/s    
2024-05-22T11:33:04.140823Z  WARN Consensus: sync: 💔 Error importing block 0x009fcaee0825caa5724de82d5943d23f8161b92df72e0c81ab792039078f712b: block has an unknown parent    
2024-05-22T11:33:04.386792Z  INFO Consensus: substrate: ✨ Imported #1623924 (0x1f9b…3977)    
2024-05-22T11:33:04.864512Z  WARN Consensus: sync: 💔 Error importing block 0x009fcaee0825caa5724de82d5943d23f8161b92df72e0c81ab792039078f712b: consensus error: Api called for an unknown Block: State already discarded for 0x1f9b639e56a6fecba2e357118de3b356fe89a90f76729c8266d485640a4c3977    
2024-05-22T11:33:05.279444Z  WARN Consensus: sync: 💔 Error importing block 0x009fcaee0825caa5724de82d5943d23f8161b92df72e0c81ab792039078f712b: consensus error: Api called for an unknown Block: State already discarded for 0x1f9b639e56a6fecba2e357118de3b356fe89a90f76729c8266d485640a4c3977    

Note that both blocks have the same number (it is interesting that I got them right away), but the very last block has its state discarded right away for some reason. Maybe the node decided it is not a canonical chain but failed to do a reorg or something? Possibly a side effect of light sync.

Subsequent restarts crash even earlier:

2024-05-22T11:33:51.463587Z  INFO Consensus: sc_service::client::client: 🔨 Initializing Genesis block/state (state: 0x4b08…e8c9, header-hash: 0x0c12…1c34)    
Error: SubstrateService(Other("Failed to build a full subspace node 1: Application(UnknownBlock(\"State already discarded for 0xf93768c99f12a725e4879b1839ef746d6c7bf1328356b646870c489ac01fb7b4\"))"))

Not a hard blocker here, but something to fix before this is really usable.

@shamil-gadelshin
Contributor Author

Node restart after fast sync is not working yet.

This could be because of the state/blocks pruning number. I haven't encountered this error for some time, but I use much greater pruning numbers. I will test the restart again.

@nazar-pc
Member

Any numbers except really-really large ones for state will be (IIRC). The issue here is more interesting though because the block hash for which state is discarded is literally the block that was just imported within the same second. This is why it looks concerning to me.

@shamil-gadelshin
Contributor Author

Any numbers except really-really large ones for state will be (IIRC).

Rephrase it, please.

The issue here is more interesting though because the block hash for which state is discarded is literally the block that was just imported within the same second. This is why it looks concerning to me.

We have multiple dependencies on the previous blocks (old state).

@nazar-pc
Member

Rephrase it, please.

IIRC we will need the state of older blocks up to the beginning of the archived segment that is currently being built, so any number we put for state pruning (until we can control it precisely and programmatically) would have to cover the max possible number of blocks that a segment could possibly contain.
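As a small illustration of that constraint (the constant is made up for the example, it is not a number from the spec):

// State pruning must keep at least as many recent blocks as one archived
// segment can possibly contain, otherwise the state needed at the segment
// boundary may already be gone. MAX_BLOCKS_PER_SEGMENT is illustrative only.
const MAX_BLOCKS_PER_SEGMENT: u32 = 100_000;

fn state_pruning_depth_is_safe(depth: u32) -> bool {
    depth >= MAX_BLOCKS_PER_SEGMENT
}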

We have multiple dependencies on the previous blocks (old state).

Sure, but there should be no way for the block we just imported to not have state half a second later when we want to import the very next block.

@shamil-gadelshin shamil-gadelshin requested a review from nazar-pc May 30, 2024 08:20
@shamil-gadelshin shamil-gadelshin force-pushed the add-fast-sync-algorithm branch from 1c4785b to 1940fe4 Compare May 30, 2024 10:50
Member

@nazar-pc nazar-pc left a comment

Left some nits; this will then need to be squashed, as we discussed previously.

Comment on lines 412 to 415
/// Full sync. Download and verify all blocks.
Full,
/// Download blocks from DSN.
Dsn,
Member

I'm not sure why we have these as separate modes; nothing in the code seems to make any distinction between them. Also, we don't need separate ConfigSyncMode and ChainSyncMode; just a ChainSyncMode that implements FromStr is sufficient.

Contributor Author

We use Snap and not Full, so we use Dsn implicitly. The two types, ChainSyncMode and ConfigSyncMode, were introduced to avoid adding a clap dependency to subspace-service.

Member

I didn't understand what you wanted to say with the first sentence; can you rephrase?

As for Clap, you don't need it. Enum only needs to implement FromStr trait IIRC.
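A sketch of the FromStr approach mentioned here; the variant set and doc comments follow the modes discussed in this thread, while the error type is an assumption:

use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChainSyncMode {
    /// Full sync. Download and verify all blocks.
    Full,
    /// Download blocks from DSN.
    Dsn,
    /// Download the recent state first, then continue like Dsn.
    Snap,
}

impl FromStr for ChainSyncMode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "full" => Ok(Self::Full),
            "dsn" => Ok(Self::Dsn),
            "snap" => Ok(Self::Snap),
            other => Err(format!("Unsupported sync mode: {other}")),
        }
    }
}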

Contributor Author

We use the Dsn value implicitly, without a code reference, as follows: we either use only Full (regular) sync, or the DSN sync option, where we can run Snap sync first.

As for Clap, you don't need it. Enum only needs to implement FromStr trait IIRC.

Indeed!

Member

We use the Dsn value implicitly, without a code reference, as follows: we either use only Full (regular) sync, or the DSN sync option, where we can run Snap sync first.

I'm sorry, it still makes no sense to me, at least it doesn't seem to answer the question I had originally.

Contributor Author

Previously, we had two sync modes, regular and DSN; the latter was enabled by --sync-from-dsn and started a dedicated task.
Now we have three modes: full, dsn, and snap. Snap starts in the same dedicated task as Dsn sync.
We start the dedicated task by checking !config.sync.is_full(). "Not Full" includes the Dsn and Snap sync modes. Snap sync can be activated by the Snap sync mode before the Dsn worker runs in the dedicated sync task.
Thus we use the Dsn sync mode implicitly, without referencing it in the code.

I hope it makes more sense now. Feel free to ask additional questions.
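A sketch of the dispatch being described, reusing the ChainSyncMode sketched above; spawn_dsn_sync_task and the exact wiring are placeholders, not the PR code:

impl ChainSyncMode {
    /// Full sync relies on regular Substrate sync only; every other mode
    /// spawns the dedicated DSN sync task.
    pub fn is_full(&self) -> bool {
        matches!(self, Self::Full)
    }
}

// Dsn is never matched on explicitly: it simply means "dedicated task, but
// skip the snap-sync step", which is why the code has no direct reference to it.
if !config.sync.is_full() {
    let snap_sync_enabled = config.sync == ChainSyncMode::Snap;
    spawn_dsn_sync_task(snap_sync_enabled);
}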

Member

I missed the is_full check somehow. You made Full sync the default, which will disable DSN sync if I understand correctly. I'd say we should simply remove the ability to disable sync from DSN. There should be just Full and Snap. There is no need to have Dsn as a third option; users will not be able to sync fully without DSN anyway (almost all nodes are pruned).

Contributor Author

Will domain operators be affected by this change? We're changing our previous semantics.

Member

What semantics? DSN sync was enabled before unless it was disabled explicitly with a CLI option, so it was opt-out. Now you made it opt-in instead, and I suggest returning to the previous behavior and removing the option that disables sync from DSN, so that it is always enabled.

@shamil-gadelshin shamil-gadelshin requested a review from nazar-pc May 30, 2024 11:12
@shamil-gadelshin
Contributor Author

@nazar-pc Please run the DSN-sync local test.

I experienced weird behavior once: "DSN sync finished" after piece requests and without actual block import.
I haven't reproduced it since.

@nazar-pc
Member

I experienced weird behavior once: "DSN sync finished" after piece requests and without actual block import.

Seems fine to me; there were no new blocks to import from DSN after Snap sync finished. This is actually expected and should happen most of the time.

@shamil-gadelshin
Contributor Author

I experienced weird behavior once: "DSN sync finished" after piece requests and without actual block import.

Seems fine to me; there were no new blocks to import from DSN after Snap sync finished. This is actually expected and should happen most of the time.

I meant a fresh start with DSN sync.

@nazar-pc
Member

I meant a fresh start with DSN sync.

Which pieces did it request then? I think it must have failed to acquire segment headers for some reason, but it's hard to say without logs.

Member

@nazar-pc nazar-pc left a comment

I think this is good to go; I'll open a squashed PR instead of 50 individual commits. Also, CI has failed; it looks like we have a flaky test in networking.

@nazar-pc nazar-pc deleted the add-fast-sync-algorithm branch June 3, 2024 09:44