Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce SchedulingStateMachine for unified scheduler [NO-MERGE&REVIEW-ONLY-MODE] #35286

Closed
wants to merge 31 commits into from

Conversation

ryoqun
Copy link
Contributor

@ryoqun ryoqun commented Feb 22, 2024

Problem

The unified scheduler is still single-threaded, because the most important piece of code is still missing: the scheduling code itself.

Summary of Changes

Implement it with cleanest way possible and most documented way possible. FINALLY, TIME HAS COME.

Note that there's more work to be done for general availability of unified scheduler after this pr, but it is arguably the most important PR.

numbers

steps
  1. Download replaying-bench-mainnet-beta-2024-02.tar.zst: https://drive.google.com/file/d/1Jc1cd3pHKkaaT1yJrGeBAjBMJpkKS0hc/view?usp=sharing
  2. untar
  3. run solana-ledger-tool --ledger ./path/to/replaying-bench-mainnet-beta-2024-02 verify ...
(A) --block-verification-method blockstore-processor with this pr (reference):
[2024-02-22T15:24:06.957483120Z INFO  solana_ledger::blockstore_processor] ledger processed in 28 seconds, 198 ms, 761 µs and 24 ns. root slot is 249712363, 36 banks: 249712363, 249712364, 249712365, 249712366, 249712367, 249712368, 249712369, 249712370, 249712371, 249712372, 249712373, 249712374, 249712375, 249712378, 249712379, 249712380, 249712381, 249712382, 249712383, 249712384, 249712385, 249712386, 249712387, 249712388, 249712389, 249712390, 249712391, 249712392, 249712393, 249712394, 249712395, 249712396, 249712397, 249712398, 249712399, 249712400
(B) --block-verification-method unified-scheduler with this pr:

~1.3x faster than (A):

[2024-02-22T15:22:46.686416459Z INFO  solana_ledger::blockstore_processor] ledger processed in 20 seconds, 621 ms, 124 µs and 512 ns. root slot is 249712363, 36 banks: 249712363, 249712364, 249712365, 249712366, 249712367, 249712368, 249712369, 249712370, 249712371, 249712372, 249712373, 249712374, 249712375, 249712378, 249712379, 249712380, 249712381, 249712382, 249712383, 249712384, 249712385, 249712386, 249712387, 249712388, 249712389, 249712390, 249712391, 249712392, 249712393, 249712394, 249712395, 249712396, 249712397, 249712398, 249712399, 249712400
(C) (fyi) #33070 (all extra opt is applied):

~1.8x faster than (A), ~1.3x faster than (B):

[2024-02-22T15:16:51.304579110Z INFO  solana_ledger::blockstore_processor] ledger processed in 15 seconds, 399 ms, 459 µs and 572 ns. root slot is 249712363, 36 banks: 249712363, 249712364, 249712365, 249712366, 249712367, 249712368, 249712369, 249712370, 249712371, 249712372, 249712373, 249712374, 249712375, 249712378, 249712379, 249712380, 249712381, 249712382, 249712383, 249712384, 249712385, 249712386, 249712387, 249712388, 249712389, 249712390, 249712391, 249712392, 249712393, 249712394, 249712395, 249712396, 249712397, 249712398, 249712399, 249712400

context

extracted from #33070

@ryoqun ryoqun force-pushed the scheduling-state-machine branch 3 times, most recently from 3a9a5b4 to 51d4dc8 Compare February 22, 2024 13:16
@ryoqun ryoqun changed the title Introduce SchedulingStateMachine Introduce SchedulingStateMachine for functional unified scheduler Feb 22, 2024
@ryoqun ryoqun changed the title Introduce SchedulingStateMachine for functional unified scheduler Introduce SchedulingStateMachine for unified scheduler Feb 22, 2024
@ryoqun ryoqun force-pushed the scheduling-state-machine branch 2 times, most recently from 2f92ee2 to 730afef Compare February 22, 2024 14:36
@ryoqun ryoqun requested a review from apfitzge February 22, 2024 15:28
@ryoqun ryoqun marked this pull request as ready for review February 22, 2024 15:28
@@ -1023,7 +1062,7 @@ mod tests {
.result,
Ok(_)
);
scheduler.schedule_execution(&(good_tx_after_bad_tx, 0));
scheduler.schedule_execution(&(good_tx_after_bad_tx, 1));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is due to task_index is started to be assert!()-ed....

Copy link

codecov bot commented Feb 22, 2024

Codecov Report

Attention: Patch coverage is 98.77384% with 9 lines in your changes are missing coverage. Please review.

Project coverage is 81.6%. Comparing base (a780ffb) to head (c9d1648).
Report is 18 commits behind head on master.

❗ Current head c9d1648 differs from pull request most recent head d072efd. Consider uploading reports for the commit d072efd to get more accurate results

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #35286     +/-   ##
=========================================
- Coverage    81.8%    81.6%   -0.2%     
=========================================
  Files         837      834      -3     
  Lines      225922   225589    -333     
=========================================
- Hits       184897   184206    -691     
- Misses      41025    41383    +358     

Copy link
Contributor

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did an initial scan through the code, appreciate the well thought out comments! I will have to give this another pass, as well as getting through the tests.

unified-scheduler-logic/src/lib.rs Outdated Show resolved Hide resolved
unified-scheduler-logic/src/lib.rs Outdated Show resolved Hide resolved
unified-scheduler-logic/src/lib.rs Outdated Show resolved Hide resolved
unified-scheduler-logic/src/lib.rs Outdated Show resolved Hide resolved
unified-scheduler-logic/src/lib.rs Show resolved Hide resolved
unified-scheduler-logic/src/lib.rs Outdated Show resolved Hide resolved
unified-scheduler-logic/src/lib.rs Outdated Show resolved Hide resolved
unified-scheduler-logic/src/lib.rs Outdated Show resolved Hide resolved
fn default() -> Self {
Self {
usage: PageUsage::default(),
blocked_tasks: VecDeque::with_capacity(1024),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain how you chose this number? At first glance, it seems quite high as the default allocation for number of blocked-tasks; at least for current blocks on MNB we have maybe several hundred non-vote transactions. So this seems like a huge overkill to me since for replay at max we probably have a couple hundred blocked tasks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. i temporarily reduced it and document about why: 5887a91

unified-scheduler-logic/src/lib.rs Show resolved Hide resolved
@ryoqun
Copy link
Contributor Author

ryoqun commented Feb 23, 2024

@apfitzge thanks for reviewing. I'll later address them.

I've added benchmark steps in the pr description.

also, I'll share some on-cpu flame-graph by perf to make sure I'm not crazy. The charts was from different solana-ledger-tool invocations timed manually. so, only the proportion of the spent walltime is what it's comparable.

the scheduler thread of rebase pr

image

the schedule thread of this pr

image

some random remarks

  • SchedulingStateMachine::{de,}schedule_task() and crossbeam select is what it's absolutely worth to burn cycles.
  • the rebase pr doesn't have any extra overhead, very close to the ideal runtime profile of the scheduler thread
  • this pr has bunch of overhead by dropping the task in thread and accumulating the timings. (this is why i want to introduce the accumulator thread)

@ryoqun
Copy link
Contributor Author

ryoqun commented Feb 23, 2024

also, off-cpu flame graph shows there's no idling other then the expected crossbeam select:

image

as a quick refresher of this unusual off-cpu flame graph, you can see some lock-contension bolow (there's known one in solana-runtime, which is fixed at the rebase pr, but not yet in this pr):

image

@apfitzge
Copy link
Contributor

apfitzge commented Feb 23, 2024

I think I finally have a reasonably good understanding of how your scheduling works (much easier with trimmed down PR! ❤️ ).
As I'm thinking on this, I think this state-machine and the prio-graph I've used in the new scheduler for block-production are similar in many ways, although there are a few differences. Just gonna jot some thoughts here.

Similarities

  • Use FIFO fed order (in replay case this is record order, for production this is priority-order from a queue)
  • Keep track of access-kind per account in "use"
  • Pop unblocked txs

Differences

  • PG essentially creates a linked-list of transaction dependencies to only next transaction.
  • US keeps track of current access, and all blocked transactions/access type as an ordered list per transaction
  • Blocked tx resolve time: PG resolves at insert time, US resolves at deschedule time.
  • "use" differs since the US is used in a stateful manner, while PG is created on-demand
  • PG has a mechanism for allowing user-defined ordering for "unblocked" txs

Example

Think visual examples always serve better than words here. Let's take a really simple example with 3 conflicting write transactions.

Prio-graph will effectively store a linked list, edges from 1 to 2 to 3.

graph LR;
Tx1 --> Tx2 --> Tx3
Loading

US will initially store access of Tx1, and list w/ order (Tx2, Tx3). Only when Tx1 is descheduled will the conflict between 2 and 3 be "realized".

graph LR;
Tx1 --> Tx2["Tx2(0, w)"] & Tx3["Tx3(1, w)"];
Loading

Closing

Not sure anything I commented here has any actual value, just wanted to share my thoughts that these 2 approaches are remarkably similar but different in their implmentation details due to different uses (stateful vs lazy).

edit: I actually made a branch in prio-graph to use this AccessList (Page?) pattern instead of directly tracking edges. Interestingly I see cases where the old approach does significantly better, but also cases where an approach similiar to yours does significantly better - interesting to see criterion regressions of 500% and also improvements of 50% in same bench run 😆

I am somewhat curious to try plugging in prio-graph to your benches to see how it compares for exactly the same benches you have. Do you have any benches on just the scheduler logic?

@ryoqun
Copy link
Contributor Author

ryoqun commented Feb 26, 2024

I think I finally have a reasonably good understanding of how your scheduling works (much easier with trimmed down PR! ❤️ ).
As I'm thinking on this, I think this state-machine and the prio-graph I've used in the new scheduler for block-production are similar in many ways, although there are a few differences. Just gonna jot some thoughts here.
....

thanks for great comparison write-up. I'll check out prio-graph impl later with those good guides in mind. :) (EDIT: I wrote my thoughts here: #35286 (comment))

I am somewhat curious to try plugging in prio-graph to your benches to see how it compares for exactly the same benches you have. Do you have any benches on just the scheduler logic?

yeah: #33070 (comment)

my dev box:

bench_tx_throughput     time:   [57.525 ms 58.245 ms 58.958 ms]
                        change: [-1.3868% -0.1358% +1.2895%] (p = 0.84 > 0.05)
                        No change in performance detected.

this is processing 60000 txes with 100 accounts (and 50% conflicts). 970ns per this arb-like (= heavy) tx.

and this is the basis for 100ns in the doc comment (assuming normal tx should touch ~10 accounts; 1/10 of benched tx).

that said, I have very dirty changes (ryoqun#17 (comment)) with this results:

bench_tx_throughput     time:   [33.742 ms 34.205 ms 34.665 ms]
                        change: [+0.4371% +2.6652% +4.9342%] (p = 0.02 < 0.05)
                        Change within noise threshold.

That's ~1.7x faster. and 570ns per tx.

Comment on lines 651 to 654
/// Closure (`page_loader`) is used to delegate the (possibly multi-threaded)
/// implementation of [`Page`] look-up by [`pubkey`](Pubkey) to callers. It's the caller's
/// responsibility to ensure the same instance is returned from the closure, given a particular
/// pubkey.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am probably missing some context, but I am not following the purpose of this closure, or why the AddressBook is not owned by the internal scheduler.
When would this ever do something other than just loading from AddressBook, which can already be done in a multi-threaded way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hehe, nice point. hope this clarifies why i don't add dashmap to solana-unified-scheduler-logic's Cargo.toml...: 821dee2

@ryoqun ryoqun force-pushed the scheduling-state-machine branch from 1d99c44 to 83e514a Compare February 28, 2024 15:30
@ryoqun ryoqun requested a review from apfitzge February 28, 2024 15:35
@ryoqun
Copy link
Contributor Author

ryoqun commented Feb 28, 2024

sans the big rename Page => UsageQueue and my take on comparison with prio-graph, i think i should have addressed all review comments so far. so, I'm requesting another review round. not sure you did go in depth for its algo code and my unsafe-related bla bla talks.. lol

Comment on lines 469 to 474
/// Scheduler's internal data for each address ([`Pubkey`](`solana_sdk::pubkey::Pubkey`)). Very
/// opaque wrapper type; no methods just with [`::clone()`](Clone::clone) and
/// [`::default()`](Default::default).
#[derive(Debug, Clone, Default)]
pub struct Page(Arc<TokenCell<PageInner>>);
const_assert_eq!(mem::size_of::<Page>(), 8);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arc has a runtime cost for cloning so I'm actually a bit surprised to see it used here for such a critical item - though perhaps it allows a simpler implementation. Do you know how much of your insertion time is spent doing Arc-clone? I imagine its' small relative to the Pubkey hashing?

Still, I wonder if you have thought about or have maybe even tested using an indexable storage for the inner pages, and then using index-references as the stored kind on the tasks, to avoid Arc-cloning.

To be clear I'm saying some sort of set up like:

pubkey_to_index: DashMap<Pubkey, usize>, // not sure if dashmap or locked hashmap? Just some translation that maps Pubkey -> index.
inner_pages: Vec<Option<InnerPage>>, // use a fixed-capacity vector (or slice) so we never add more pages past some pre-determined limit 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to block on this, just sharing my thought on perf here. It's almost certainly a micro-optimization; but in this sensitive code, any small bit can help!

Copy link
Contributor Author

@ryoqun ryoqun Feb 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arc has a runtime cost for cloning so I'm actually a bit surprised to see it used here for such a critical item - though perhaps it allows a simpler implementation.

Yeah, I used Arc here for simplicity and to ship this pr into master asap (just surpassed the 1.5k loc... lol).

Do you know how much of your insertion time is spent doing Arc-clone?

not directly. but, I can indirectly guess it'll take 2-3ns per Arc::clone() when the instance isn't contended from an unrelated bench.

I imagine its' small relative to the Pubkey hashing?

I haven't tested. but I presume yes. Even more, I'm planning to switch to Rc from Arc. in that case, it's definitely smaller than the hashing.

As I indirectly indicated earlier (#35286 (comment) and #35286 (comment)), that optimization will allow us to trim down to merely a single inc (without the lock prefix) in x86-64 asm terms. i.e. possibly auto-vectorizable plain old 1 ALU instruction. And, mem access is 1 pop with no branching. And there's no work to need to be done for insertion into the scheduling state machine as it's not relevant. If you're referring to the insertions happening with the batched cloning in the case of being blocked in try_lock_for_task(), index-references or Rc will need 1 mem copy of 1 word-size for each address no matter what.

Still, I wonder if you have thought about or have maybe even tested using an indexable storage for the inner pages, and then using index-references as the stored kind on the tasks, to avoid Arc-cloning.

To be clear I'm saying some sort of set up like:

...

Yeah, i've thought about it. But, I concluded this pr's approach should be faster with gut sense. the index approach introduces another pointer dereference indirection, which i'd like to avoid. However, not benchmarked it.

I don't want to block on this, just sharing my thought on perf here. It's almost certainly a micro-optimization; but in this sensitive code, any small bit can help!

Thanks for sharing your thoughts. happy to share mine. I hope I've squeezed as much as possible.

@apfitzge
Copy link
Contributor

sans the big rename Page => UsageQueue and my take on comparison with prio-graph, i think i should have addressed all review comments so far. so, I'm requesting another review round. not sure you did go in depth for its algo code and my unsafe-related bla bla talks.. lol

I was still going to do another round with deep dive. With these big ones I always try to read comments and get high-level overview, then come back for a deeper dive.
I did look into the actually scheduling logic, which I didn't have too many comments on as it was quite familiar to my work, although done differently internally.

Copy link
Contributor

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another short round of comments - I still need to take another round to go through the tests.

let result_with_timings = result_with_timings.as_mut().unwrap();
Self::accumulate_result_with_timings(result_with_timings, executed_task);
},
recv(dummy_receiver(state_machine.has_unblocked_task())) -> dummy => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dummy receiver pattern seems really odd to me, though I see how it is working (mostly? need to cargo expand since not 100% sure WHEN state_machine,has_unblocked_task() is resolved).

edit:
So the has_unblocked_task is resolved before receiver selection, which is what I was guessing. Even if we stick with this dummy-receiver pattern, I'd actually much prefer we make this very clear since I don't think it's a very obvious thing:

let dummy_receiver = dummy_receiver(state_machine.has_unblocked_task());
select!{
    // ...
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this in mind, I'm trying to figure out when we could have an unblocked task. We can only have an unblocked task if we just received a finished task, right?
Since on receiving new tasks, we'll immediately attempt to schedule and will schedule it if it is unblocked, that seems to be the case to me.

Why even have this recv for the never case that there is no unblocked task? That's just a wasted select, right? Let's say both our real channels have actual important data - a finished task and a new task. This macro could randomly (because current impl is always random) select this dummy receiver, which will just do nothing useful. At least my initial intuition is that this is wasted time, lmk what you think.

I know you had made a branch for a biased select - is the order here i.e. finished -> dummy -> new the biased order you want?
It could be overkill for the current MNB load we have, but I wonder if you considered just running a hot loop here if we are actively scheduling; not sure if using try_recvs in hot loop would cause more contention on the channels, you're certainly more familiar with crossbeam internals.
But basically I am asking about something along the lines of:

// Blocking recv for an OpenChannel to begin our scheduling session - no other messages can come in until then.
let message = new_task_receiver.recv();
{
    // Code for initialization, checking it's actually OpenSubChannel.
}

while !is_finished {
    if let Ok(executed_task) = finished_task_receiver.try_recv() {
        // Handle the executed task.
        continue; // Move to next iteration without checking other channels.
    }
    if state_machine.has_unblocked_task() {
        // Handle unblocked task.
        continue; // Move to next iteration without checking for new task.
    }
    if let Ok(message) = new_task_receiver.try_recv() {
        // Handle new task message.
    }
}

EDIT: I am an idiot - for some reason I thought never was always just disconnected...not it could never be selected. Anyway, my initial thought on the oddity of the pattern still stands. So not going to delete my comment, and still request your thoughts.

Copy link
Contributor Author

@ryoqun ryoqun Feb 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dummy receiver pattern seems really odd to me, though I see how it is working

.........

Anyway, my initial thought on the oddity of the pattern still stands. So not going to delete my comment, and still request your thoughts.

Hope I explained the justification clearly: 89b7773. I think you can get used to it. this pattern is simplest and no redundant code and most performant. :)

I'd actually much prefer we make this very clear since I don't think it's a very obvious thing:

let dummy_receiver = dummy_receiver(state_machine.has_unblocked_task());
select!{
    // ...
}

This is done in the commit along with a comment. As I ranted in the comment, ideally, I want to move the dummy_receiver() invocation back into the select!...

I know you had made a branch for a biased select - is the order here i.e. finished -> dummy -> new the biased order you want?

Yes, this is correct.

It could be overkill for the current MNB load we have, but I wonder if you considered just running a hot loop here if we are actively scheduling; not sure if using try_recvs in hot loop would cause more contention on the channels, you're certainly more familiar with crossbeam internals.

As I explicit commented on in the commit 89b7773. it's too early to take the busy loop path and i haven't fully grokked the perf implication of busy looping. Casual tests showed an improved latency indeed, though.

But basically I am asking about something along the lines of:

...

By the way, busy looping can be done with select{,_biased}! pretty easily like this:

loop {
    select{_biased}! {
        recv(...) => ....,
        ...,
        default => continue, // there's special-cased the `default` selector in `select{,_biased}!`
    }
    ...
    if is_finished {
        break;
    }
}

I'm hesitant with manual busy (or non-busy) select construction, considering I'm planning to adding new channel specially for blocked tasks to fast-lane to process them.

))
.unwrap();
session_ending = false;
loop {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never exit this loop except for panic? I think we may have already talked about this in a previous PR, and you're just planning on error-handling in a follow-up. Please correct me if I'm wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never exit this loop except for panic? I think we may have already talked about this in a previous PR, and you're just planning on error-handling in a follow-up.

This understanding is completely aligns with what i said previously. The proper scheduler management code is complicated, flavored to my own taste (expecting back-and-force review session), (i.e. yet another beast by itself). so, i wanted to focus on the logic with this pr.

unified-scheduler-pool/src/lib.rs Outdated Show resolved Hide resolved
/// behavior when used with [`Token`].
#[must_use]
pub(super) unsafe fn assume_exclusive_mutating_thread() -> Self {
thread_local! {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a summary, I think Token and TokenCell were introduced because we have data within an Arc that we want to mutate, and these were added to give some additional safety constraints.

As far as I can tell, right now the mutations occur only in the scheduling thread so this seems fine. These token cells do not protect us against multiple threads accessing the same data if it were called from multiple threads - which requires some care from developers here. I am hopeful the naming of this function and unsafeness is probably sufficient warning!

WRT the ShortCounter Token-gating; the reason we can't or do not want to use an AtomicU32 is simply because we want to have checked operations to handle overflow explicitly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WRT the ShortCounter Token-gating; the reason we can't or do not want to use an AtomicU32 is simply because we want to have checked operations to handle overflow explicitly?

AtomicU32 is slow, because it always emits atomic operations, necessitating the given cpu core to go to the global bus ;) by the way, std::cell::Cell<ShortCounter> can be used without runtime cost here, thanks to ShortCounter: Copy. But i wanted to dog-food TokenCell and to use TokenCell's more concise api than Cell, which is made possible due to its additional constraints than Cell.

Also, note that blocked_page_count will be removed for upcoming opt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a summary, ...

As far as I can tell, ...

also, this is correct. thanks for wrapping up your understanding.

@apfitzge apfitzge self-requested a review February 28, 2024 21:00
@@ -7462,7 +7462,9 @@ dependencies = [
name = "solana-unified-scheduler-logic"
version = "1.19.0"
dependencies = [
"assert_matches",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, this will soon be not needed as par the progress at: rust-lang/rust#82775

@ryoqun
Copy link
Contributor Author

ryoqun commented Feb 29, 2024

sans the big rename Page => UsageQueue and my take on comparison with prio-graph, i think i should have addressed all review comments so far. so, I'm requesting another review round. not sure you did go in depth for its algo code and my unsafe-related bla bla talks.. lol

I just finished the big renaming. still, i have yet to do the quoted homework of prio-graph. but, i think I've addressed all new review comments so far again. Always thanks for quite timely review!

@willhickey
Copy link
Contributor

This repository is no longer in use. Please re-open this pull request in the agave repo: https://github.com/anza-xyz/agave

@willhickey willhickey closed this Mar 3, 2024
@ryoqun ryoqun reopened this Mar 7, 2024
@ryoqun ryoqun force-pushed the scheduling-state-machine branch from f245874 to 4fcb360 Compare March 7, 2024 05:19
@ryoqun ryoqun added the noCI Suppress CI on this Pull Request label Mar 7, 2024
Comment on lines 216 to 235
/// Returns a mutable reference with its lifetime bound to the mutable reference of the
/// given token.
///
/// In this way, any additional reborrow can never happen at the same time across all
/// instances of [`TokenCell<V>`] conceptually owned by the instance of [`Token<V>`] (a
/// particular thread), unless previous borrow is released. After the release, the used
/// singleton token should be free to be reused for reborrows.
pub(super) fn borrow_mut<'t>(&self, _token: &'t mut Token<V>) -> &'t mut V {
unsafe { &mut *self.0.get() }
}
}

// Safety: Once after a (`Send`-able) `TokenCell` is transferred to a thread from other
// threads, access to `TokenCell` is assumed to be only from the single thread by proper use of
// Token. Thereby, implementing `Sync` can be thought as safe and doing so is needed for the
// particular implementation pattern in the unified scheduler (multi-threaded off-loading).
//
// In other words, TokenCell is technically still `!Sync`. But there should be no
// legalized usage which depends on real `Sync` to avoid undefined behaviors.
unsafe impl<V> Sync for TokenCell<V> {}
Copy link
Contributor Author

@ryoqun ryoqun Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as per #35286 (review):

However. I think we should get a 2nd opionion on all the unsafe code.

@alessandrod This mod contains bunch of unsafes and these highlighted two is the most important ones. I think I've extensively doc-ed it. Could you review whether my justification does make sense at all? I appointed you assuming you know Rust better than me from various memory mangling at vm. :) Happy to comment-in-source more to fill the inner workings of unified scheduler itself for general context. That said, feel free to defer to someone else. Thanks in advance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🫡

@ryoqun ryoqun requested a review from alessandrod March 7, 2024 05:43
@ryoqun ryoqun requested a review from apfitzge March 7, 2024 07:31
@ryoqun
Copy link
Contributor Author

ryoqun commented Mar 7, 2024

think i'm happy with the test cases, minus indexing ones (see comment).

thanks for in-depth look into tests. I've addressed all, requesting another review req. btw, please r+ on anza-xyz#129...

However. I think we should get a 2nd opionion on all the unsafe code.

kk. I've done this as well: #35286 (comment)

Copy link
Contributor

@alessandrod alessandrod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished my first pass at this. I really like the general structure and direction! Being completely new to this new scheduler code I've struggled with some of the terminology and I've left some nits about that. Also see the TokenCell issue.

unified-scheduler-logic/src/lib.rs Show resolved Hide resolved
unified-scheduler-logic/src/lib.rs Show resolved Hide resolved
//! [`::schedule_task()`](SchedulingStateMachine::schedule_task) while maintaining the account
//! readonly/writable lock rules. Those returned runnable tasks are guaranteed to be safe to
//! execute in parallel. Lastly, `SchedulingStateMachine` should be notified about the completion
//! of the exeuction via [`::deschedule_task()`](SchedulingStateMachine::deschedule_task), so that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if the caller doesn't call deschedule_task()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kitten dies. 😿 well, it's really bad condition, i haven't thought about in depth. it will easily lead to dead locks.

well, i wonder this edge case should be called out in this summary doc.. fyi, I explicit mentioned this case in the deschedule_task() doc comment: 250edde

unified-scheduler-logic/src/lib.rs Show resolved Hide resolved
/// `solana-unified-scheduler-pool`.
pub struct SchedulingStateMachine {
unblocked_task_queue: VecDeque<Task>,
active_task_count: ShortCounter,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments on what these counters count would be helpful

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it a bit confusing that this is called active_task_count, because it also
includes blocked tasks, which are arguably inactive? Maybe current_task_count?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hehe, i agree tastelessness of active here. hmm, current_task_count is a bit ambiguous. how about in_progress_task_count? (or not_(yet)_handled_task_count; don't like this much because of implication of being a counter of tasks which has been resolved not to process).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find in_progress as confusing, since it implies that something is progressing while blocked tasks technically aren't progressing. @apfitzge as the native speaker, wdyt would be the best term here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about pending_task_count, or managed_task_count, then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... or runnable_task_count?


while let Some((requested_usage, task_with_unblocked_queue)) = unblocked_task_from_queue
{
if let Some(task) = task_with_unblocked_queue.try_unblock(&mut self.count_token) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens to the task when try_unblock returns None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another nice question. hope this helps your understanding: ee5c8b6

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain how this works? usage_queue.unlock() pops the task from the usage queue. task.try_unblock() consumes self and returns None. How does the task reappear?

Copy link
Contributor Author

@ryoqun ryoqun Mar 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's cloned here #35286 (comment) to all of blocked usage_queues.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. This is why I dislike hiding Arcs with type aliases :P

unified-scheduler-logic/src/lib.rs Show resolved Hide resolved
unified-scheduler-logic/src/lib.rs Show resolved Hide resolved
unified-scheduler-logic/src/lib.rs Show resolved Hide resolved
/// instances of [`TokenCell<V>`] conceptually owned by the instance of [`Token<V>`] (a
/// particular thread), unless previous borrow is released. After the release, the used
/// singleton token should be free to be reused for reborrows.
pub(super) fn borrow_mut<'t>(&self, _token: &'t mut Token<V>) -> &'t mut V {
Copy link
Contributor

@alessandrod alessandrod Mar 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the intent, but this is akin to a transmute and can be made
segfault trivially

struct T {
	v: Vec<u8>,
};

let mut token = unsafe { Token::<T>::assume_exclusive_mutating_thread() };
let c = {
	let cell = TokenCell::new(T { v: vec![42] });
	cell.borrow_mut(&mut token)
};
c.v.push(43);

Copy link
Contributor Author

@ryoqun ryoqun Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alessandrod oh, quite nice catch.. hope this fixes segfault and no more ub?: 001b10e (EDIT: force-pushed with updated fix: 02567b1)

seems i'm too short-sighted only with aliasing-based ub.. ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uggg, actually, i need with_borrow_mut().. I'll soon push this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uggg, actually, i need with_borrow_mut().. I'll soon push this.

done: 02567b1

Copy link

@anza-team anza-team left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution! This repository is no longer in use. Please re-open this pull request in the agave repo: https://github.com/anza-xyz/agave

@anza-team anza-team closed this Mar 18, 2024
@ryoqun
Copy link
Contributor Author

ryoqun commented Mar 18, 2024

I really like the general structure and direction!

❤️

Being completely new to this new scheduler code I've struggled with some of the terminology and I've left some nits about that.

these feedback is quite appreciated. I'll address in turn to fill the missing contexts.

Also see the TokenCell issue.

as this is the most important review comment, I've address it firstly, and requested review for it...

@ryoqun
Copy link
Contributor Author

ryoqun commented Mar 18, 2024

gg

@ryoqun ryoqun reopened this Mar 18, 2024
Copy link

@anza-team anza-team left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution! This repository is no longer in use. Please re-open this pull request in the agave repo: https://github.com/anza-xyz/agave

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
noCI Suppress CI on this Pull Request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants