
In-memory hashsort for commitment keys #13521

Open · wants to merge 4 commits into main
Conversation

antonis19 (Member):
Currently both modes ModeDirect and ModeUpdate use ETL for sorting plain keys by hash as a preprocessing step before processing commitment updates. This PR introduces a new mode, ModeInMemory, which sorts plain keys by their hash in-memory using slices.SortFunc.

Prior to sorting, the keys are hashed concurrently by a number of worker goroutines. With concurrent hashing, the benchmark tests yield neck-and-neck performance compared to ModeDirect, with ModeInMemory sometimes faster.
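A minimal sketch of the approach described above, with a worker-per-chunk split and sha256 standing in for the commitment hasher actually used in the PR. The keyWithHash and hashKeysConcurrently names mirror the diff; the chunking scheme and worker count here are illustrative assumptions, not the PR's exact implementation:

```go
package main

import (
	"bytes"
	"crypto/sha256" // stand-in for the PR's real key hasher
	"fmt"
	"runtime"
	"slices"
	"sync"
)

// keyWithHash pairs a plain key with its hash (name mirrors the PR).
type keyWithHash struct {
	plainKey string
	hash     []byte
}

// hashKeysConcurrently splits keys into chunks and hashes each chunk
// in its own goroutine, roughly one chunk per CPU.
func hashKeysConcurrently(keys []string) []*keyWithHash {
	out := make([]*keyWithHash, len(keys))
	workers := runtime.NumCPU()
	chunk := (len(keys) + workers - 1) / workers
	var wg sync.WaitGroup
	for start := 0; start < len(keys); start += chunk {
		end := min(start+chunk, len(keys))
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			for i := start; i < end; i++ {
				h := sha256.Sum256([]byte(keys[i]))
				out[i] = &keyWithHash{plainKey: keys[i], hash: h[:]}
			}
		}(start, end)
	}
	wg.Wait()
	return out
}

func main() {
	keys := []string{"addr3", "addr1", "addr2"}
	kws := hashKeysConcurrently(keys)
	// sort keys lexicographically by hash, as ModeInMemory does
	slices.SortFunc(kws, func(a, b *keyWithHash) int {
		return bytes.Compare(a.hash, b.hash)
	})
	for _, kw := range kws {
		fmt.Printf("%x -> %s\n", kw.hash[:4], kw.plainKey)
	}
}
```

The hashing dominates the preprocessing cost, so parallelizing it while leaving the sort to a single slices.SortFunc call is enough to match ETL in the benchmarks the author mentions.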

@antonis19 antonis19 changed the title In-memory hashsort for commitment In-memory hashsort for commitment keys Jan 21, 2025
@antonis19 antonis19 requested review from awskii and mh0lt January 21, 2025 18:39
@@ -1048,6 +1061,10 @@ func (t *Updates) TouchPlainKey(key string, val []byte, fn func(c *KeyUpdate, va
}
t.keys[key] = struct{}{}
}
case ModeInMemory:
if _, ok := t.keys[key]; !ok {
t.keys[key] = struct{}{}
Member:
what if insert hashed key to list now and sort in bg?

Member Author:
What is bg?

Member Author:
If you hash the key now you will lose the benefits of parallelization, since each insert happens serially.

Member:
The time span from TouchPlainKey until HashSort is called is quite big, and it's better to use it for hashing, so that by the time we do HashSort all our keys are already hashed and sorted. This is the behavior with ETL.

Member:
bg=background

Member:
At least put it into some hashing queue where workers pick keys up, and then just use sort.Sort over the finished list of <hashed_key, plain_key> pairs.

erigon-lib/commitment/commitment.go (outdated review thread, resolved)
keysWithHash := hashKeysConcurrently(keys, t.hasher) // hash plainkeys concurrently

// sort keys lexicographically by hash
slices.SortFunc(keysWithHash, func(a, b *keyWithHash) int {
Member:
I'd suggest sorting [start, end] after each range is hashed, so the final sort becomes similar to the last pass of quicksort: sorting a partially ordered set.

IMO it's better to hash each key as soon as we see it and sort the list in the background after insertion. That way the amortized sort time is still n log(n), but it is spent during collection rather than during the HashSort call made by Process or some other caller. With etl, the keys are effectively sorted in the background before Load is called.

Member:
So if the sort is done on insert, we just iterate through an already sorted list.

Member Author:
i'd suppose sort [start,end] after range is hashed, so this sort will be similar to quicksort last touch - sort partially ordered set.

Can you elaborate on the algorithm? So let's say each worker takes a chunk and hashes the plain keys, and finally sorts the chunk by hashed key. Now we have n sorted chunks. Then I suppose we need to combine these chunks?

Collaborator:
"quicksort last touch" and "combine these chunks" are the same thing.
FYI: etl does the same: it uses a heap to merge the sorted files:

func mergeSortFiles(logPrefix string, providers []dataProvider, loadFunc simpleLoadFunc, args TransformArgs, buf Buffer) (err error) {

Member Author:
Well, there is no point in re-implementing ETL then. In this PR I'm just distributing the hashing of the keys and calling the library function slices.SortFunc. The results I got are slightly better than the results with ETL, so I don't consider it necessary to implement this feature (at least not for now).

Member:
The point was that you are splitting the whole plain key list into chunks. Then you hash each plain key in a chunk; my suggestion was to also sort the chunk at that moment. You then get a list of chunks, where each chunk is an ordered list of hashed keys. Then you just have to combine the chunks into one sorted list, which we iterate over in the HashSort call.
