Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

program cache: reduce contention #1192

Merged
merged 2 commits into from
May 6, 2024

Conversation

alessandrod
Copy link

Before this change we used to take the write lock to extract(). This means that even in the ideal case (all programs are already cached), the cache was contended by all batches and all operations were serialized.

With this change we now take the write lock only when we store a new entry in the cache, and take the read lock to extract(). This means that in the common case where most/all programs are cached, there is no contention and all batches progress in parallel.

This improves node replay perf by 20-25% on current mnb traffic.

Before:
Screenshot 2024-05-05 at 10 31 51 am

After:

...can't show because there's nothing to show :P But can show a zoomed in profile of replenish.

Before:
Screenshot 2024-05-05 at 6 26 29 pm

After:
Screenshot 2024-05-05 at 6 28 15 pm

you can see that locking on write() goes from taking 80% of replenish, to 0.4%

@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 79.41176% with 7 lines in your changes are missing coverage. Please review.

Project coverage is 82.1%. Comparing base (b0cfffb) to head (b688cac).

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #1192     +/-   ##
=========================================
- Coverage    82.1%    82.1%   -0.1%     
=========================================
  Files         886      886             
  Lines      236372   236369      -3     
=========================================
- Hits       194204   194185     -19     
- Misses      42168    42184     +16     

@alessandrod alessandrod requested review from Lichtso and pgarg66 May 5, 2024 12:06
/// It is possible that multiple TX batches from different slots need different versions of a
/// program. The deployment slot of a program is only known after load tho,
/// so all loads for a given program key are serialized.
loading_entries: Mutex<HashMap<Pubkey, (Slot, std::thread::ThreadId)>>,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the replacement of the index structure I am working on I also experimented with pulling cooperative_loading_lock out of the SecondLevel. We might be able to take this even further because when we use the account last-modified-slot we could also parallelize the loading of missing programs. Basically the "The deployment slot of a program is only known after load tho" part could be improved then.

@Lichtso
Copy link

Lichtso commented May 5, 2024

Do you plan on backporting this to v1.18? Because I also need to add a parameter to finish_cooperative_loading_task() to fix the recompilation phase issue that happened last Friday. Thus, these will cause merge conflicts.

@alessandrod
Copy link
Author

alessandrod commented May 6, 2024

Do you plan on backporting this to v1.18? Because I also need to add a parameter to finish_cooperative_loading_task() to fix the recompilation phase issue that happened last Friday. Thus, these will cause merge conflicts.

This is patch 1 of N in my attempt to improve skip rate on one of our nodes which struggles in replay during leader slots. I plan to try to backport some of the patches, but not this one because of how tricky the program cache logic is. And also because we still don't have multi threaded tests for the cache 😭. I've been running a node with this for 3 days and is still making roots, but def don't want to risk it.

If your fix is ready I don't mind merging after yours as I imagine the conflicts on this PR will be easy to reconcile. Or I'm happy to merge first, up to you really!

Before this change we used to take the write lock to extract(). This
means that even in the ideal case (all programs are already cached),
the cache was contended by all batches and all operations were
serialized.

With this change we now take the write lock only when we store a new
entry in the cache, and take the read lock to extract(). This means
that in the common case where most/all programs are cached, there is no
contention and all batches progress in parallel.

This improves node replay perf by 20-25% on current mnb traffic.
@Lichtso Lichtso merged commit 10e5086 into anza-xyz:master May 6, 2024
38 checks passed
@ryoqun
Copy link
Member

ryoqun commented May 8, 2024

as this pr is similar to mine (#1037), i confirmed this improves favorably unified scheduler performance as well. thanks for fixing the last write-lock contention source in the program cache. :)

i measured the perf before/after the merged commit (10e5086). as a quick context, this bench-marking was conducted by replaying a short slice of mb ledgers using the same minimized snapshot: (repro steps: solana-labs#35286 (comment))

As said above, I saw consistent performance improvement in my testing as well, although i couldn't see 20%-25%. I think this is due to being solana-ledger-tool (offline; ample of free cpu cores) vs solana-validator (online; system is kind of overloaded... meh)

before(at aa6c69a)

--block-verification-method blockstore-processor:

ledger processed in 32 seconds, 572 ms
ledger processed in 32 seconds, 747 ms
ledger processed in 32 seconds, 571 ms

--block-verification-method unified-scheduler:

ledger processed in 18 seconds, 164 ms
ledger processed in 17 seconds, 930 ms
ledger processed in 18 seconds, 111 ms

after(at 10e5086)

--block-verification-method blockstore-processor:

ledger processed in 32 seconds, 393 ms
ledger processed in 32 seconds, 333 ms
ledger processed in 32 seconds, 343 ms

--block-verification-method unified-scheduler:

ledger processed in 17 seconds, 187 ms
ledger processed in 17 seconds, 220 ms
ledger processed in 16 seconds, 871 ms

similar to #1037, unified scheduler saw some drastic improvement (5-6% faster), while blockstore processor saw small gain while being consistent(<1% faster)

@@ -401,49 +400,51 @@ impl<FG: ForkGraph> TransactionBatchProcessor<FG> {
&program_cache,
));
}
// Submit our last completed loading task.
if let Some((key, program)) = program_to_store.take() {
loaded_programs_for_txs.as_mut().unwrap().loaded_missing = true;
Copy link

@Lichtso Lichtso May 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alessandrod I think you accidentally deleted this line here while moving the code. It was added in #1037 which was open around the same time as this PR, so most likely a merge conflict that was not resolved properly.

When I investigated the validator node of @buffalojoec (which went out of disk space as it was not configured to rotate log files) I discovered that the program cache was growing far beyond the capacity as it was not evicting anymore.

Can you open a PR to add the line back in?

I will add a metric for the number of loaded entries (complementary to the evictions metric we already have) recorded in ProgramCache::evict_using_2s_random_selection() so that this doesn't happen again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants