Remove the Blockstore thread pool used for fetching Entries #34768
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #34768 +/- ##
=========================================
Coverage 81.7% 81.7%
=========================================
Files 834 825 -9
Lines 224232 223106 -1126
=========================================
- Hits 183351 182498 -853
+ Misses 40881 40608 -273 |
Marco from Triton chimed in with utilization of several nodes in the public mnb RPC pool:
AND
These numbers support my earlier comment about the thread pool being over-provisioned, even for an RPC node. If we don't rip this thread pool out completely, it certainly seems that capping the size of the pool, instead of allowing it to scale with the number of threads on the machine, would yield a win.
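For illustration only, the alternative mentioned here (capping the pool rather than removing it) could look roughly like the sketch below. The cap value and the helper name `capped_pool_size` are hypothetical; nothing in this thread proposes a specific number, and this is not code from the PR.

```rust
use std::thread;

/// Hypothetical upper bound on pool size; chosen only for this example.
const MAX_POOL_THREADS: usize = 4;

/// Pick a pool size that never exceeds `cap`, but is always at least 1.
fn capped_pool_size(available: usize, cap: usize) -> usize {
    available.min(cap).max(1)
}

fn main() {
    // Number of hardware threads the OS reports; fall back to 1 on error.
    let available = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    println!(
        "machine threads: {available}, pool threads: {}",
        capped_pool_size(available, MAX_POOL_THREADS)
    );
}
```

With a cap like this, a 64-thread machine would get the same small pool as an 8-thread one, instead of 64 mostly idle workers.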
another one from a busy RPC node in our shared pool
Ok @t-nelson - I've let this one run long enough that I feel pretty good about it so formally marked it ready for review + requested review from ya.
hell yeah!
Looks pretty good to me. A couple of nits and one more important question about what happens in the error case
nice!
looks like we have some collect() to temporaries in there that might be removable in a follow up
For the sake of paper-trail, we might be able to shave a mem-copy as well. Currently, on the fetch path, we have the following copies from rocksdb to
Namely, I think we could save the copy in step 1). Instead, we could have
🐄 |
…abs#34768)

There are several cases for fetching entries from the Blockstore:
- Fetching entries for block replay
- Fetching entries for CompletedDataSetService
- Fetching entries to service RPC getBlock requests

All of these operations occur in a different calling thread. However, the current implementation utilizes a shared thread pool within the Blockstore function. There are several problems with this:
- The thread pool is shared between all of the listed cases, despite block replay being the most critical. These other services shouldn't be able to interfere with block replay
- The thread pool is overprovisioned for the average use; thread utilization on both regular validators and RPC nodes shows that many of the threads see very little activity. But the mere existence of these threads introduces "accounting" overhead
- rocksdb exposes an API to fetch multiple items at once, potentially with some parallelization under the hood. Using parallelization in both our API and the underlying rocksdb is overkill, and we're doing more damage than good.

This change removes that thread pool completely, and instead fetches all of the desired entries in a single call. This has been observed to cause a minor degradation in the time spent within the Blockstore get_slot_entries_with_shred_info() function. Namely, some buffer copying and deserialization that previously occurred in parallel now occur serially. However, the metric that tracks the amount of time spent replaying blocks (inclusive of fetch) is unchanged. Thus, despite spending marginally more time to fetch/copy/deserialize with only a single thread, the gains from not thrashing everything else with the pool keep us at parity.
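As a rough illustration of the approach described in this commit message, the sketch below gathers the keys for every completed range, issues one batched lookup, and regroups the results serially on the calling thread. `MockShredStore`, its `multi_get`, and `fetch_completed_ranges` are hypothetical stand-ins built on a `HashMap`; this is not the Blockstore or rocksdb API, and deserialization into entries is left out.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical stand-in for the shred storage: shred index -> payload bytes.
struct MockShredStore {
    shreds: HashMap<u64, Vec<u8>>,
}

impl MockShredStore {
    /// Batched lookup: one call, many keys, results in key order.
    fn multi_get(&self, keys: &[u64]) -> Vec<Option<&[u8]>> {
        keys.iter()
            .map(|key| self.shreds.get(key).map(Vec::as_slice))
            .collect()
    }
}

/// Fetch every shred covered by `completed_ranges` with a single batched
/// call, then concatenate each range's payloads on the calling thread.
/// Returns `None` if any shred in any range is missing.
fn fetch_completed_ranges(
    store: &MockShredStore,
    completed_ranges: &[Range<u64>],
) -> Option<Vec<Vec<u8>>> {
    // Flatten all ranges into one key list so the store is queried once.
    let keys: Vec<u64> = completed_ranges.iter().flat_map(|r| r.clone()).collect();
    let mut payloads = store.multi_get(&keys).into_iter();

    // Results come back in key order, so walk them range by range.
    completed_ranges
        .iter()
        .map(|range| {
            let mut data = Vec::new();
            for _ in range.clone() {
                // First `?`: iterator exhausted; second `?`: shred missing.
                data.extend_from_slice(payloads.next()??);
            }
            Some(data)
        })
        .collect()
}

fn main() {
    let store = MockShredStore {
        shreds: (0u64..5).map(|i| (i, vec![i as u8; 2])).collect(),
    };
    // Two completed ranges: shreds 0..2 and shreds 2..5.
    let ranges = fetch_completed_ranges(&store, &[0..2, 2..5]);
    println!("{ranges:?}");
}
```

The regrouping loop is the part that used to be spread across pool threads, one completed range per thread; here it simply runs in order on the caller.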
Problem
Retrieving entries from the Blockstore is necessary for several use cases:
ReplayStage
CompletedDataSetService
getBlock
While these operations all occur in different threads, the implementation of the Blockstore function utilizes a thread pool to parallelize the operation of fetching / deshredding / deserializing shreds into entries. That thread pool is currently set to scale with the number of cores on the machine. Several problems with this:
< 0.1% usage
ReplayStage is greedy and tries to replay entries as soon as possible; this means that the block is fetched in many small chunks as opposed to one call that fetches the entire block at once. Many small chunks spread over time means there is no need for large "parallelization"
Contributes to anza-xyz#35
Summary of Changes
Remove the thread pool that the Entry fetch method had been using. This method previously parallelized over completed ranges (a completed range is a range of shreds that should be deserialized into a Vec<Entry> together), giving one completed range to one rayon thread. Now, the method looks up all completed ranges it might want to fetch with a single call to rocksdb multi_get().
Performance Impact
Background for several metrics of interest which are on a per-slot basis:
replay-slot-stats.fetch_entries: The total amount of time spent fetching entries
replay-slot-stats.num_execute_batches: The total number of calls ReplayStage makes to blockstore_processor::confirm_slot() that result in transactions getting executed. This is the same number of times that the Blockstore method to fetch entries is getting called
replay-slot-stats.confirmation_time_us: The total amount of time spent within blockstore_processor::confirm_slot(); this is inclusive of both fetch entry time as well as everything else (i.e. actual tx execution)
Summarizing some key points from some of the comments below:
replay-slot-stats.fetch_entries. Average numbers on my node would suggest an increase from ~3.5k us to ~5.5k us
replay-slot-stats.num_execute_batches has been observed to have an average value of ~70 over the past two weeks
~2 ms / 70 ~= 29 us per call to fetch entries
confirm_slot() (and thus to fetch entries). So, in isolation, it might appear that we've "delayed" being able to complete a block by ~29 us (on average)
replay-slot-stats.confirmation_time_us looks pretty consistent for my node before and after making the change to remove the thread pool. This value staying consistent would suggest that while fetch time is growing marginally, we are at least breaking even by avoiding the thread pool
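The per-call figure above is just the per-slot increase divided by the average number of fetch calls per slot. A small sketch of that arithmetic, using the ~2 ms and ~70 values quoted above (the function name is made up for this example):

```rust
/// Average extra fetch time per call, in microseconds, given the per-slot
/// increase in fetch time (microseconds) and the number of calls per slot.
fn per_call_overhead_us(per_slot_increase_us: f64, calls_per_slot: f64) -> f64 {
    per_slot_increase_us / calls_per_slot
}

fn main() {
    // ~2 ms (2,000 us) more fetch time per slot, spread over ~70 calls.
    let overhead = per_call_overhead_us(2_000.0, 70.0);
    println!("~{overhead:.0} us per call to fetch entries");
}
```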