-
Notifications
You must be signed in to change notification settings - Fork 311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster status cache #3796
base: master
Are you sure you want to change the base?
Faster status cache #3796
Conversation
571fca8
to
2db0759
Compare
d41386e
to
fc67d2f
Compare
2ff1d2b
to
d9fdc54
Compare
1309759
to
70974b7
Compare
Remove the global RwLock around the status cache, and introduce more granular RwLocks per-blockhash and per-slot. Additionally, change the internal hash tables from std HashMap to Dashmap, so that operations at the blockhash and slot level can be done only holding read locks.
70974b7
to
4661809
Compare
@@ -3459,26 +3459,26 @@ impl Bank { | |||
} | |||
|
|||
/// Forget all signatures. Useful for benchmarking. | |||
#[cfg(feature = "dev-context-only-utils")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
// Store forks in a single chunk of memory to avoid another lookup. | ||
pub type ForkStatus<T> = Vec<(Slot, T)>; | ||
// The number of shards to use for ::slot_deltas. We're going to store at most MAX_CACHE_ENTRIES | ||
// slots, MAX * 4 gives us a low load factor guaranteeing that collisions are vey rare. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// slots, MAX * 4 gives us a low load factor guaranteeing that collisions are vey rare. | |
// slots, MAX * 4 gives us a low load factor guaranteeing that collisions are very rare. |
fn push(&self, item: T) { | ||
self.vec.push(item); | ||
} | ||
|
||
fn iter(&self) -> impl Iterator<Item = (usize, &T)> { | ||
self.vec.iter() | ||
} | ||
} | ||
|
||
impl<T> IntoIterator for ConcurrentVec<T> { | ||
type Item = T; | ||
type IntoIter = boxcar::IntoIter<T>; | ||
|
||
fn into_iter(self) -> Self::IntoIter { | ||
self.vec.into_iter() | ||
} | ||
} | ||
|
||
impl<'a, T> IntoIterator for &'a ConcurrentVec<T> { | ||
type Item = (usize, &'a T); | ||
type IntoIter = boxcar::Iter<'a, T>; | ||
|
||
fn into_iter(self) -> Self::IntoIter { | ||
self.vec.iter() | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe excepting serialize and deserialize, can't we just implement Deref
to get access to the inner for all of this other stuff?
|
||
// Store forks in a single chunk of memory to avoid another hash lookup. Avoid allocations in the | ||
// case a tx only lands in one (the vast majority) or two forks. | ||
pub type ForkStatus<T> = SmallVec<[(Slot, T); 2]>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have stats on how many are on only 1 fork, 2 forks, 3+ forks?
Curious what the performance difference is, if for example 95+% are on a single fork - is it worth doubling the memory usage here for SmallVec<2> instead of SmallVec<1>?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
anecdotally, more than 2 forks seems exceedingly rare.
more scientifically, I'm seeing ~11k forks over the last 7 days (a little more than 1 per minute)
panic!("Blockhash must exist if it exists in self.slot_deltas, slot: {slot}") | ||
}; | ||
|
||
// cache_txs is self.blockhash_cache[blockhash] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.blockhash_cache
would probably be a good rename 😄
forks.retain(|(fork_slot, _key)| *fork_slot != slot); | ||
if forks.is_empty() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be over/prematurely optimizing... but this strikes me as a bit of an odd way to do this.
AFAIK, there is no chance of duplicate fork slots in the status cache. Seems we could check the len == 1, remove entry in that case. If != 1, then we can do retain.
Actually, retaining seems a bit odd to me because let's say we have 4 forks and we remove entry at 0. we shuffle everything over by 1. They're not huge, so copying over may be relatively cheap, but probably could just do something like this:
// if all the slots have been cleared or purged, we don't need to track this tx
// anymore
if forks.len() == 1 {
cache_tx_entry.remove();
} else {
forks
.iter()
.position(|(fork_slot, _)| *fork_slot == slot)
.map(|index| forks.swap_remove(index));
}
|
||
// Get the cache entry for this blockhash. | ||
let (max_slot, key_index, hash_map) = | ||
self.cache.entry(*transaction_blockhash).or_insert_with(|| { | ||
let key_index = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we returned from this block (key_index, key_slice)
would that perform a copy, or would compiler optimize it to effectively what you have now?
Thinking about how we can limit the scope of MaybeUninit
so easier to maintain. Not opposed to what you have if there's any performance cost to my suggestion.
Basically change the inner spot where you're currently doing ptr copy to something like this:
// Grab the key slice.
let key_index = (*key_index).min(max_key_index);
let key_slice = {
let mut key_slice = MaybeUninit::<[u8; CACHED_KEY_SIZE]>::uninit();
unsafe {
ptr::copy_nonoverlapping(
key.as_ref()[key_index..key_index + CACHED_KEY_SIZE].as_ptr(),
key_slice.as_mut_ptr() as *mut u8,
CACHED_KEY_SIZE,
);
key_slice.assume_init()
}
};
// Insert the slot and tx result into the cache entry associated with
// this blockhash and keyslice.
let mut forks = txs.entry(key_slice).or_default();
forks.push((slot, res.clone()));
(key_index, key_slice)
now all the unsafe is in one place.
IF this introduces extra copies, let's just keep it as is. I think the compiler should be smart enough to do what we want, but probably putting too much faith in it 😢
} else { | ||
// Only take the write lock if this is the first time we've seen | ||
// this blockhash in this slot. | ||
let (_key_index, txs) = &mut *fork_entry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think we don't need mut
here? txs
doesn't actually need to be mutable, the mutation is happening only at fork_entry
level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. One minor suggestion
|
||
check_results.extend(sanitized_txs.iter().zip(lock_results).map( | ||
|(sanitized_tx, lock_result)| { | ||
let sanitized_tx = sanitized_tx.borrow(); | ||
if lock_result.is_ok() | ||
&& self.is_transaction_already_processed(sanitized_tx, &rcache) | ||
&& self.is_transaction_already_processed(sanitized_tx, &self.status_cache) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably don't even need to pass status cache anymore
|
||
// Store forks in a single chunk of memory to avoid another hash lookup. Avoid allocations in the | ||
// case a tx only lands in one (the vast majority) or two forks. | ||
pub type ForkStatus<T> = SmallVec<[(Slot, T); 2]>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
anecdotally, more than 2 forks seems exceedingly rare.
more scientifically, I'm seeing ~11k forks over the last 7 days (a little more than 1 per minute)
/// Get the statuses for all the root slots | ||
/// Get the statuses for all the root slots. | ||
/// | ||
/// This is never called concurrently with add_root(), and for a slot to be a root there must be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No concurrent calls with this right now. Not sure we can really prevent them from being added though.
self.cache.retain(|_, (fork, _, _)| *fork > min); | ||
self.cache.retain(|_key, value| { | ||
let (max_slot, _, _) = &**value; | ||
max_slot.load(Ordering::Relaxed) > min |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noting for myself.
retain
will grab the write lock on the shard for checking the closure. This guarantees that we have unique access to this self.cache
entry.
Concurrent access from insert
will have completed, or will be held up wiating for this to complete on the shard; if removed, then insert would be possibly re-adding entry (that's fine).
I'm like 99% sure that Relaxed ordering here (and in insert
) is fine because we have guaranteed the unique access via the write-lock on shard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few more comments on this round, but overall looks correct to me.
I hate to do anything to contribute to delaying this PR, but how about an idea that would delay it but also simplify it. You've refactored the data structure that stores this information using a concurrent
The former is really important (prevent double spends!) but requires only a presence test. The latter is less risky, requires actual data storage for result How about we first split these into two cache implementations, actually disable this write unless the validator is in RPC mode, and then eliminate |
Sounds like something that you could do after merging the current PR! 😋 I intentionally decided against changing any of the logic, to make reviews and auditing easier, and so that if we must backport to 2.1 at some point (larger blocks), the diff isn't too large. I do agree with you that the change you're proposing must be done though. |
This PR removes the global RwLock around the status cache, and introduces more granular RwLocks per-blockhash and per-slot. Additionally, it changes the internal hash tables from std HashMap to Dashmap, so that operations at the blockhash and slot level can be done only holding read locks.
This is not the final design of A Performant Status Cache - which, I think, can make check and update go straight to 0 - but it's a good incremental improvement.
Results are pretty good: check_transactions is ~6x faster, and update_transaction_statuses is ~2.5x faster.