Add fast sync algorithm. #2763
Conversation
Left a bunch of comments and questions, will have to re-review again afterwards. Sorry for the delay.
/// - 2. add the last block the second last segment (as raw - without checks and execution),
/// - 3. download the state for the first block of the last segment,
/// - add the first block of the last segment as raw,
/// - download state for it,
These steps do not correspond to the latest version of the specification
Will answer below, under a similar comment.
let second_last_segment_blocks = download_and_reconstruct_blocks(
    second_last_segment_index,
    self.piece_getter,
    &mut reconstructor,
)
.await?;
This is not how the algorithm should work according to the spec, though. You're supposed to check the segment header and only download the second last segment in case it is actually necessary, which it is most of the time, but not always.
The spec version will likely work; however, on the implementation level I have doubts. In the absolute majority of cases we will go into the coded branch - do we need a special case that complicates the code and requires support, but will have a negligible effect? More importantly, we don't have the testing infrastructure to verify the code, so the spec's theoretical version IMO has more drawbacks than benefits.
The only thing that needs to be done is an `if` branch. If the last block in the second last segment was included partially, then we download the segment and feed it into the reconstructor; otherwise nothing needs to be downloaded or fed into the reconstructor, and we can proceed with the last segment only. This is a very simple check and I don't understand why we should intentionally not implement the spec and describe a different version in the function signature.
And such things are already tested in the archiver, you can easily logically confirm how it will behave without actually running the code.
I do not think I understand what the drawbacks there are in following the spec.
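For illustration, a minimal sketch of that `if` branch, assuming a `SegmentHeadersStore`-style lookup and a `partial_archived()` accessor on the last archived block (names here are taken from this discussion, not necessarily this PR's exact API):

```rust
// Only the second last segment header needs to be consulted: if its last
// archived block is partial, the first block of the last segment starts
// there and the segment must be downloaded and fed into the reconstructor.
let second_last_segment_header = self
    .segment_headers_store
    .get_segment_header(second_last_segment_index)
    .ok_or_else(|| Error::Other("Missing second last segment header.".into()))?;

if second_last_segment_header
    .last_archived_block()
    .partial_archived()
    .is_some()
{
    download_and_reconstruct_blocks(
        second_last_segment_index,
        self.piece_getter,
        &mut reconstructor,
    )
    .await?;
}
// Otherwise nothing needs to be downloaded or fed into the reconstructor;
// proceed with the last segment only.
```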
The reason is that we won't test this code, and it can fail even if it seems logical, which has happened multiple times with other simple changes. It's likely possible to emulate this case when we have a complete last archived block; I checked the current blockchain - we don't have one currently. I added a TODO with a reference to the spec difference.
I'm sorry, but it makes no sense to me. If we are so afraid to make changes, we'll not make progress at any meaningful rate. Writing code that is intentionally wrong against the spec, and strictly speaking even broken in the general case, makes even less sense to me.
crates/subspace-service/src/lib.rs
Outdated
@@ -806,7 +843,7 @@ where
 }
 };

-let import_queue_service = import_queue.service();
+let import_queue_service = Arc::new(tokio::sync::Mutex::new(import_queue.service()));
Do you just need the import queue service in multiple places? If so, you can call `import_queue.service()` more than once and get two service instances with handles to the same import queue. Or is there a different reason for this?
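For example (a sketch based on the behavior described above, where `service()` hands out independent handles to the same queue):

```rust
// Two independent handles to the same underlying import queue, instead of
// one handle wrapped in Arc<Mutex<_>>.
let import_queue_service_for_fast_sync = import_queue.service();
let import_queue_service_for_dsn_sync = import_queue.service();
```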
At some point, there were three instances passed across several methods, which seemed awkward. I managed to work with two instances, but I left the shared ref anyway.
You're still sharing it this way; what is the benefit of `Arc<Mutex<T>>` compared to just `T` here? Seems like unnecessary complexity to me.
Am I correct that you prefer passing two instances of the service through methods instead?
Absolutely. It is like having an `Arc` and, instead of accessing it from two places directly, wrapping it in an additional `Arc<Mutex<T>>`. I simply do not understand why you would even do that.
client.clone(),
import_queue_service,
Why owned values? I think references should still work just fine.
They are both `Arc` references.
I can see that, but the question above remains
There are type requirements down the stack.
Which types? They were not required before, what has changed in a way that makes it mandatory?
let mut reconstructor = Reconstructor::new().map_err(|error| error.to_string())?;

pause_sync.store(true, Ordering::Release);

let finalized_hash_existed = client.info().finalized_hash != client.info().genesis_hash;

if fast_sync_enabled && finalized_hash_existed {
    debug!("Fast sync detected existing finalized hash.");
}

// Run fast-sync first.
#[allow(clippy::collapsible_if)] // clippy error: the if statement is not the same
if fast_sync_enabled && !finalized_hash_existed {
    if notifications.next().await.is_some() {
        let fast_syncer = FastSyncer::new(
            &segment_headers_store,
            node,
            piece_getter,
            client.clone(),
            import_queue_service.clone(),
            network_service.clone(),
            fast_sync_state,
            sync_service,
        );

        let fast_sync_result = fast_syncer.sync().await;

        match fast_sync_result {
            Ok(fast_sync_result) => {
                if fast_sync_result.skipped {
                    debug!("Fast sync was skipped.");
                } else {
                    last_processed_block_number = fast_sync_result.last_imported_block_number;
                    last_processed_segment_index = fast_sync_result.last_imported_segment_index;
                    reconstructor = fast_sync_result.reconstructor;

                    debug!(%last_processed_block_number, %last_processed_segment_index, "Fast sync finished.");
                }
            }
            Err(err) => {
                error!("Fast sync failed: {err}");
            }
        }

        notifications_sender
            .try_send(NotificationReason::WentOnlineSubspace)
            .map_err(|_| {
                sc_service::Error::Other("Can't send sync notification reason.".into())
            })?;
    }
}
As discussed before, this doesn't seem to belong to full sync from DSN; right now it is a one-off separate sync that happens before DSN sync. DSN sync is not a continuation of this sync, because we will have a new enough state for regular sync to kick in and take it from there - no need to pass the reconstructor through everything.
I added the shared reconstructor when I encountered an error. Why do you think there is no need to pass the reconstructor? I'd appreciate more details here. There are two more reasons to place it here: fast-sync IS DSN-sync, and shared sync pause management. However, I'm open to suggestions here.
> I added the shared reconstructor when I encountered an error

You know the next natural question I'll have would be "which error did you encounter exactly"?

> Why do you think there is no need to pass the reconstructor? I appreciate more details here.

Everything worked fine before, so it is not on me to prove that with the new code added this is not necessary; you should convince me that something that didn't exist before is now necessary (which is very possible, I just don't see it right now).

> There are two more reasons to place it here: fast-sync IS DSN-sync, shared sync pause management. However, I'm open to suggestions here.

Arguably it is not. Fast sync's primary goal is to import a single block from DSN: the very first block of the last segment. Even if the node shuts down right after and DSN sync fails after restart, we can still continue regular Substrate sync successfully. The fact that we import other blocks from the last segment is an optimization (because we already have them, no need to throw them away), not a requirement.

And I already mentioned in one of the prior discussions that if we download source pieces of the last segment sequentially, we can continue downloading the last segment and import blocks concurrently before we've got the whole segment. It is important to distinguish what is the goal and core feature we are trying to build and what is "nice to have" on top.
> You know the next natural question I'll have would be "which error did you encounter exactly"?

It's hard to reproduce it on the fly, but AFAIR a new reconstructor will result in a missing block on the edge of the two segments (the last one imported by fast sync and the remaining segments obtained through DSN sync).
DSN sync is designed to resume from any point; it'll just have to potentially download some of the data twice. But, as articulated a few times already, we don't need to call DSN sync after fast sync in the first place, because regular Substrate sync can pick up from there already.
    }
}

while let Some(reason) = notifications.next().await {
You have removed sync pausing when a notification is received; this is incorrect and will cause issues.
I restored the pause. Please share the possible issues.
- refactor comments, error messages, expect messages
- restore pause for DSN-sync
- add comments
- other minor changes
if blocks_in_second_last_segment < 2 {
    return Err(Error::Other(
        "Unexpected block array length for the second last segment.".into(),
    ));
}
I think I understand now what you meant to do here; it is just that the code doesn't actually talk about it and doesn't check the actual invariant you care about.

The only thing (and nothing else) that we actually care about in fast sync is to import the first block of the last archived segment; this is why the spec talks about that and not other things. There is no requirement for 2 blocks to be present, especially not in the second last segment.

The edge case, though, is that we could hypothetically have (not practically, but it is still worth supporting IMHO) a block larger than a segment. In that case we may be unable to reconstruct the first block of the last segment even if we download the two last segments.

So what the code should actually do is very close to what the spec says:
- once segment headers are known, check the header of the second last segment and see if the last block there was archived fully or partially
- if fully, we only need to download the last segment and nothing else
- if we have a partial block, though, then we need to check whether the third last segment ended with the same partial block number, because if it did, we need to keep looking for the beginning of that block, such that we can reconstruct the full first block of the last segment regardless of where it starts

With the above description it should be clear that this check doesn't make sense. For example, it will fail in the perfectly fine situation where the second last segment contains the end of one block and the beginning of another (the first block of the last segment). In this case we have 0 blocks reconstructed from the second last segment alone, yet we are able to successfully reconstruct the first block of the last segment, which is what the specification actually requires.
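A schematic sketch of that header walk (plain `u64` segment indices and an illustrative store API; these are not this PR's actual types):

```rust
// Returns the ordered list of segments to download so that the first block
// of the last segment can be fully reconstructed, walking back while the
// block we are chasing spills over from earlier segments.
fn segments_to_download(
    store: &SegmentHeadersStore,
    last_segment_index: u64,
) -> Result<Vec<u64>, Error> {
    let mut segments = vec![last_segment_index];
    let mut segment_index = last_segment_index;
    // Number of the partial block we are chasing backwards, if any.
    let mut target_block = None;

    while let Some(prev) = segment_index.checked_sub(1) {
        let last_archived = store
            .get_segment_header(prev)
            .ok_or_else(|| Error::Other("Missing segment header.".into()))?
            .last_archived_block();

        let continues = match (last_archived.partial_archived(), target_block) {
            // Second last segment ends with a fully archived block:
            // the last segment alone is sufficient.
            (None, _) => false,
            // Second last segment ends with a partial block: chase it.
            (Some(_), None) => {
                target_block = Some(last_archived.number);
                true
            }
            // An earlier segment ends with the same partial block
            // (a block larger than a segment): keep walking back.
            (Some(_), Some(number)) => last_archived.number == number,
        };

        if !continues {
            break;
        }

        segments.push(prev);
        segment_index = prev;
    }

    segments.reverse();
    Ok(segments)
}
```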
Both checks were introduced because of the indexing operations (to fail with an understandable error). However, as you pointed out, we don't support this case - I will remove the check instead, considering indexing infallible in this case.
if blocks_in_last_segment < 2 {
    return Err(Error::Other(
        "Unexpected block array length for the last segment.".into(),
    ));
}
This check is unnecessary; the only thing we need is a single block. Other blocks can be imported even via regular Substrate sync without issues. Here is the invariant from the spec you should check instead:
-if blocks_in_last_segment < 2 {
-    return Err(Error::Other(
-        "Unexpected block array length for the last segment.".into(),
-    ));
-}
+if blocks.is_empty() {
+    return Err(Error::Other(
+        "Failed to reconstruct first block of the last segment".into(),
+    ));
+}
Both checks were introduced because of the indexing operations (to fail with an understandable error). However, as you pointed out, we don't support this case - I will remove the check instead, considering indexing infallible.
crates/subspace-service/src/lib.rs
Outdated
// Check for fast-sync state
{
    if let Some(state_block_number) = fast_sync_state.lock().as_ref() {
        let parent_block_number = *parent_header.number();
        if parent_block_number < *state_block_number {
            debug!(
                %parent_block_number,
                %state_block_number,
                "Skipped PoT verification because of the fast sync state block."
            );

            return true;
        }
    }
}
I can guess from the code that this is some kind of special case, but I don't understand what that special case is, what issues it causes, and why you decided to fix it this way out of a set of possibilities. I can sometimes infer those things, but you'll have to help me understand it in this case.
crates/subspace-service/src/lib.rs
Outdated
// Clear block gap after fast sync on reruns.
// Substrate detects a gap and inserts on each sync.
if config.fast_sync_enabled {
    let finalized_hash_existed = client.info().finalized_hash != client.info().genesis_hash;
    if finalized_hash_existed {
        debug!(client_info=?client.info(), "Clear block gap after fast-sync.");
        client.clear_block_gap();
    }
}
Is this suggestion viable: can it be moved right next to the code that triggers gap sync in the first place?
crates/subspace-service/src/lib.rs
Outdated
But we always import blocks with state; I'm not sure I'm following what you're saying.
crates/subspace-service/src/lib.rs
Outdated
I don't think I understand the rationale here. If gap sync is triggered by some specific behavior, then we should clear the gap after we trigger that behavior. If the behavior is importing blocks in a specific order, then you will absolutely trigger it prior to gap sync, because that is how you end up with a block you can physically import in the first place.
…block number of the state we're trying to download
…erification concurrency and better error handling
Fixed a few key issues in #2779, except non-atomic state import, and didn't implement support for blocks that cover multiple segments, because the code will have to be refactored anyway and it can be done then.
#2779 needs to be merged into this PR, then further adjustments and cleanups need to be added, after which we'll squash the numerous commits into one, maybe a couple, with clear additions according to the spec and minimal changes elsewhere.
self.import_deconstructed_block(
    block_bytes,
    current_block_number,
    BlockOrigin::NetworkInitialSync,
)
.await?;
You should schedule all the blocks together rather than one by one.
Scheduling them one by one means issues with verification concurrency and a huge amount of error logs for every single block in case things go wrong. Check how the DSN sync implementation schedules block import.
UPD: Fixed in #2779
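A rough sketch of the batch variant (the `deconstruct_block` helper and the surrounding variable names are hypothetical, used only for illustration):

```rust
// Build all IncomingBlocks first...
let incoming_blocks = downloaded_blocks
    .into_iter()
    .map(|(block_number, block_bytes)| {
        // Hypothetical helper decoding raw bytes into an IncomingBlock.
        deconstruct_block(block_bytes, block_number)
    })
    .collect::<Result<Vec<_>, Error>>()?;

// ...then schedule the whole batch in one call, so the import queue can
// verify blocks concurrently and failures are reported once.
self.import_queue_service
    .import_blocks(BlockOrigin::NetworkInitialSync, incoming_blocks);
```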
let blocks_in_second_last_segment = second_last_segment_blocks.len();
debug!(
    "Second last segment blocks downloaded (SegmentId={}): {}-{}",
There is no `SegmentId` in the protocol, there are just segment indices. I'd use `%second_last_segment_index` instead of this. On a similar note, I'd use `{}..={}` instead of `{}-{}` for block numbers since we're in Rust.
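That is, something along these lines (the block-number variable names are assumed):

```rust
debug!(
    %second_last_segment_index,
    "Second last segment blocks downloaded: {}..={}",
    first_block_number, last_block_number,
);
```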
crates/subspace-service/src/lib.rs
Outdated
Fixed in #2779
Add fast sync algorithm fixes
Node restart after fast sync is not working yet. I started the node like this:
It synced and was staying up to date successfully:
Then I shut it down and started it again:
Note that both blocks have the same number (it is interesting that I got them right away), but the very last block has its state discarded right away for some reason. Maybe the node decided it is not the canonical chain, but failed to do a reorg or something? Possibly a side effect of light sync. Subsequent restarts crash even earlier:
Not a hard blocker here, but something to fix before this is really usable.
Could be because of the state/blocks pruning number. I didn't encounter this error for some time, but I use much greater pruning numbers. I will test the restart again.
Any numbers except really-really large ones for state will be a problem (IIRC). The issue here is more interesting, though, because the block hash for which the state is discarded is literally the block that was just imported within the same second. This is why it looks concerning to me.
Rephrase it, please.
We have multiple dependencies on the previous blocks (old state).
IIRC we will need the state of older blocks up to the beginning of the archived segment that is currently being built, so any number we put as state pruning (until we can control it precisely and programmatically) would have to cover the max possible number of blocks that a segment could possibly contain.
Sure, but there should be no way for the block we just imported to not have state half a second later when we want to import the very next block.
Atomic fast sync block import
Fast sync cleanup
…optional) fast sync
Simplify fast sync
# Conflicts:
#	crates/subspace-service/Cargo.toml
Force-pushed from 1c4785b to 1940fe4
Left some nits; this will then need to be squashed as we discussed previously.
/// Full sync. Download and verify all blocks.
Full,
/// Download blocks from DSN.
Dsn,
I'm not sure why we have these as separate modes; nothing in the code seems to make any distinction between them. Also, we don't need separate `ConfigSyncMode` and `ChainSyncMode`; just a `ChainSyncMode` that implements `FromStr` is sufficient.
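A minimal sketch of that suggestion (the `Snap` variant comes from the discussion below; the exact variant set, doc comments, and error type are assumptions):

```rust
use std::str::FromStr;

/// Sync mode, parseable directly from a CLI string without a clap dependency.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChainSyncMode {
    /// Full sync. Download and verify all blocks.
    Full,
    /// Download blocks from DSN.
    Dsn,
    /// Download the state of the last archived block and continue from there.
    Snap,
}

impl FromStr for ChainSyncMode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "full" => Ok(Self::Full),
            "dsn" => Ok(Self::Dsn),
            "snap" => Ok(Self::Snap),
            other => Err(format!("Unknown sync mode: {other}")),
        }
    }
}
```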
We use `Snap` and not `Full`, so we use `Dsn` implicitly. The two types `ChainSyncMode` and `ConfigSyncMode` were introduced to avoid adding a `clap` dependency to `subspace-service`.
I didn't understand what you wanted to say with the first sentence, can you rephrase?
As for Clap, you don't need it. The enum only needs to implement the `FromStr` trait, IIRC.
We use the `Dsn` value implicitly, without a code reference, as follows: we either use only `Full` (regular) sync, or the DSN sync option when we can run `Snap` sync first.

> As for Clap, you don't need it. Enum only needs to implement FromStr trait IIRC.

Indeed!
> We use Dsn value implicitly without a code reference as following: we either use only Full (regular) sync or DSN sync option when we can run Snap sync first.

I'm sorry, it still makes no sense to me; at least it doesn't seem to answer the question I had originally.
Previously, we had two sync modes - regular and DSN; the latter was enabled by `--sync-from-dsn` and started a dedicated task.

Now, we have three modes: full, dsn, and snap. Snap starts in the same dedicated task as Dsn sync.

We start the dedicated task by checking `!config.sync.is_full()`. "Not Full" includes the Dsn and Snap sync modes. Snap sync can be activated by the `Snap` sync mode before the Dsn worker in the dedicated sync task.

Thus we use the Dsn sync mode implicitly, without referencing it in the code.

I hope it makes more sense now. Feel free to ask additional questions.
I missed the `is_full` check somehow. You made `Full` sync the default, which will disable DSN sync if I understand correctly. I'd say we should simply remove the ability to disable sync from DSN. There should be just `Full` and `Snap`. No need to have `Dsn` as a third option; users will not be able to sync fully without DSN anyway (almost all nodes are pruned).
Will domain operators be affected by this change? We're changing our previous semantics.
What semantics? DSN sync was enabled before unless it was disabled explicitly with a CLI option, so it was opt-out. Now you made it opt-in instead, and I suggest we return to the previous behavior and remove the option that disables sync from DSN, so it is always enabled.
@nazar-pc Please run the DSN-sync local test. I experienced weird behavior once: "DSN sync finished" after piece requests and without an actual block import.
Seems fine to me; there were no new blocks to import from DSN after Snap sync finished. This is actually expected and should happen most of the time.
I meant a fresh start with DSN-sync.
Which pieces did it request then? I think it must have failed to acquire segment headers for some reason, but it's hard to say without logs.
I think this is good to go; I'll open a squashed PR instead of 50 individual commits. Also, CI has failed - looks like we have a flaky test in networking.
This PR adds the last part of the fast-sync algorithm. The algorithm is described in the related spec and in the comment on the `FastSyncer` struct. The PR contains the new algorithm, its integration within the existing DSN sync, and the CLI and config parameter changes required to enable the algorithm. Although, to the best of my knowledge, the algorithm works as expected, it should not be used until we develop a related solution for fraud proofs, otherwise fast-sync will fail because of the missing state after reaching a fraud proof.

CLI and config changes
- `fast-sync` parameter
- `SyncMode::LightState` when `fast-sync` is enabled
- `state-pruning` parameters to support N blocks pruning (required by `SyncMode::LightState`)

Notable changes for fast sync integration
- `block-gap` clean on start when fast-sync enabled (Substrate detects it and inserts it repeatedly)
- `Reconstructor` is passed to DSN sync from the fast-sync algorithm to preserve the correct inner state

Code contributor checklist: