In-memory hashsort for commitment keys #13521
base: main
Conversation
@@ -1048,6 +1061,10 @@ func (t *Updates) TouchPlainKey(key string, val []byte, fn func(c *KeyUpdate, va
		}
		t.keys[key] = struct{}{}
	}
case ModeInMemory:
	if _, ok := t.keys[key]; !ok {
		t.keys[key] = struct{}{}
What if we insert the hashed key into the list now and sort in bg?
What is bg?
If you hash the key now you will lose the benefits of parallelization, since each insert happens serially.
The time span from TouchPlainKey until HashSort is called is quite long, and it's better to use it for hashing. That way, by the time we call HashSort, all our keys are already hashed and sorted; this is the behavior with ETL.
bg = background
At least put it into some hashing queue where workers pick it up, and then just use sort.Sort over the finished list of <hashed_key, plain_key> pairs.
keysWithHash := hashKeysConcurrently(keys, t.hasher) // hash plain keys concurrently

// sort keys lexicographically by hash
slices.SortFunc(keysWithHash, func(a, b *keyWithHash) int {
I'd suppose we sort [start, end] after each range is hashed, so the final sort is similar to quicksort's last pass: sorting a partially ordered set.
IMO it's better to hash as soon as we see a key and sort the list in the background after insertion. That way the amortized n log(n) sort time is spent during collection, not during the HashSort call made by Process or some other caller. With etl, the keys are actually sorted in the background before Load is called.
So if the sort is done on insert, we just iterate through an already sorted list.
> i'd suppose sort [start,end] after range is hashed, so this sort will be similar to quicksort last touch - sort partially ordered set.

Can you elaborate on the algorithm? So let's say each worker takes a chunk, hashes the plain keys, and finally sorts the chunk by hashed key. Now we have n sorted chunks. Then I suppose we need to combine these chunks?
"quicksort last touch" and "combine these chunks" - is the same thing.
FYI: etl
does same: it using heap
to merge sorted files:
erigon/erigon-lib/etl/collector.go
Line 298 in 586af97
func mergeSortFiles(logPrefix string, providers []dataProvider, loadFunc simpleLoadFunc, args TransformArgs, buf Buffer) (err error) { |
Well, there is no point in re-implementing ETL then. In this PR I'm just distributing the hashing of the keys and calling the library function slices.SortFunc. The results I got slightly improve over the results with ETL, so I don't consider it necessary to implement this feature (at least not for now).
The point was that you are already splitting the whole plain-key list into chunks. You hash each plain key in a chunk; my addition was to sort the chunk at that moment, so you end up with a list of chunks where each chunk is an ordered list of hashed keys. Then you just have to merge the chunks into one sorted list, which we iterate over in the HashSort call.
Currently both modes, ModeDirect and ModeUpdate, use ETL for sorting plain keys by hash as a preprocessing step before processing commitment updates. This PR introduces a new mode, ModeInMemory, which sorts plain keys by their hash in-memory using slices.SortFunc. Prior to sorting, the keys are hashed concurrently by a number of worker goroutines. With the concurrent hashing, the benchmark tests yield neck-and-neck performance compared to ModeDirect, with ModeInMemory sometimes faster.