Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accelerate hash table iterator with value prefetching #1568

Merged
merged 1 commit into from
Jan 23, 2025

Conversation

NadavGigi
Copy link
Contributor

@NadavGigi NadavGigi commented Jan 15, 2025

This PR builds upon the previous entry prefetching optimization to further enhance performance by implementing value prefetching for hashtable iterators.

Implementation

Modified hashtableInitIterator to accept a new flags parameter, allowing control over iterator behavior.
Implemented conditional value prefetching within hashtableNext based on the new HASHTABLE_ITER_PREFETCH_VALUES flag.
When the flag is set, hashtableNext now calls prefetchBucketValues at the start of each new bucket, preemptively loading the values of filled entries into the CPU cache.
The actual prefetching of values is performed using type-specific callback functions implemented in server.c:

  • For robj the hashtableObjectPrefetchValue callback is used to prefetch the value if not embeded.

This implementation is specifically focused on main database iterations at this stage. Applying it to hashtables that hold other object types should not be problematic, but its performance benefits for those cases will need to be proven through testing and benchmarking.

Performance

Setup:

  • 64cores Graviton 3 Amazon EC2 instance.
  • 50 mil keys with different value sizes.
  • Running valkey server over RAM file system.
  • crc checksum and comperssion off.

Action

  • save command.

Results

The results regarding the duration of “save” command was taken from “info all” command.

+--------------------+------------------+------------------+ 
| Prefetching        | Value size (byte)| Time (seconds)   | 
+--------------------+------------------+------------------+ 
| No                 | 100              | 20.112279        | 
| Yes                | 100              | 12.758519        | 
| No                 | 40               | 16.945366        | 
| Yes                | 40               | 10.902022        |
| No                 | 20               | 9.817000         | 
| Yes                | 20               | 9.626821         |
| No                 | 10               | 9.71510          | 
| Yes                | 10               | 9.510565         |
+--------------------+------------------+------------------+

The results largely align with our expectations, showing significant improvements for larger values (100 bytes and 40 bytes) that are stored outside the robj. For smaller values (20 bytes and 10 bytes) that are embedded within the robj, we see almost no improvement, which is as expected.

However, the small improvement observed even for these embedded values is somewhat surprising. Given that we are not actively prefetching these embedded values, this minor performance gain was not anticipated.

perf record on save command without value prefetching:

                --99.98%--rdbSaveDb
                          |          
                          |--91.38%--rdbSaveKeyValuePair
                          |          |          
                          |          |--42.72%--rdbSaveRawString
                          |          |          |          
                          |          |          |--26.69%--rdbWriteRaw
                          |          |          |          |          
                          |          |          |           --25.75%--rioFileWrite.lto_priv.0
                          |          |          |          
                          |          |           --15.41%--rdbSaveLen
                          |          |                     |          
                          |          |                     |--7.58%--rdbWriteRaw
                          |          |                     |          |          
                          |          |                     |           --7.08%--rioFileWrite.lto_priv.0
                          |          |                     |                     |          
                          |          |                     |                      --6.54%--_IO_fwrite
                          |          |                     |                                         
                          |          |                     |          
                          |          |                      --7.42%--rdbWriteRaw.constprop.1
                          |          |                                |          
                          |          |                                 --7.18%--rioFileWrite.lto_priv.0
                          |          |                                           |          
                          |          |                                            --6.73%--_IO_fwrite
                          |          |                                                            
                          |          |          
                          |          |--40.44%--rdbSaveStringObject
                          |          |          
                          |           --7.62%--rdbSaveObjectType
                          |                     |          
                          |                      --7.39%--rdbWriteRaw.constprop.1
                          |                                |          
                          |                                 --7.04%--rioFileWrite.lto_priv.0
                          |                                           |          
                          |                                            --6.59%--_IO_fwrite
                          |                                                               
                          |          
                           --7.33%--hashtableNext.constprop.1
                                     |          
                                      --6.28%--prefetchNextBucketEntries.lto_priv.0

perf record on save command with value prefetching:

               rdbSaveRio
               |          
                --99.93%--rdbSaveDb
                          |          
                          |--79.81%--rdbSaveKeyValuePair
                          |          |          
                          |          |--66.79%--rdbSaveRawString
                          |          |          |          
                          |          |          |--42.31%--rdbWriteRaw
                          |          |          |          |          
                          |          |          |           --40.74%--rioFileWrite.lto_priv.0
                          |          |          |          
                          |          |           --23.37%--rdbSaveLen
                          |          |                     |          
                          |          |                     |--11.78%--rdbWriteRaw
                          |          |                     |          |          
                          |          |                     |           --11.03%--rioFileWrite.lto_priv.0
                          |          |                     |                     |          
                          |          |                     |                      --10.30%--_IO_fwrite
                          |          |                     |                                |          
                          |          |                     |          
                          |          |                      --10.98%--rdbWriteRaw.constprop.1
                          |          |                                |          
                          |          |                                 --10.44%--rioFileWrite.lto_priv.0
                          |          |                                           |          
                          |          |                                            --9.74%--_IO_fwrite
                          |          |                                                      |          
                          |          |          
                          |          |--11.33%--rdbSaveObjectType
                          |          |          |          
                          |          |           --10.96%--rdbWriteRaw.constprop.1
                          |          |                     |          
                          |          |                      --10.51%--rioFileWrite.lto_priv.0
                          |          |                                |          
                          |          |                                 --9.75%--_IO_fwrite
                          |          |                                           |          
                          |          |          
                          |           --0.77%--rdbSaveStringObject
                          |          
                           --18.39%--hashtableNext
                                     |          
                                     |--10.04%--hashtableObjectPrefetchValue
                                     |
                                      --6.06%--prefetchNextBucketEntries        

Conclusions:

The prefetching strategy appears to be working as intended, shifting the performance bottleneck from data access to I/O operations.
The significant reduction in rdbSaveStringObject time suggests that string objects(which are the values) are being accessed more efficiently.

@NadavGigi NadavGigi force-pushed the prefetch_values branch 2 times, most recently from 1da6b36 to 35f6b6e Compare January 15, 2025 16:15
@NadavGigi NadavGigi changed the title Improving iterator using prefetch values Accelerate hash table iterator with value prefetching Jan 15, 2025
Copy link

codecov bot commented Jan 15, 2025

Codecov Report

Attention: Patch coverage is 95.00000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 71.00%. Comparing base (7fc958d) to head (57a387a).
Report is 2 commits behind head on unstable.

Files with missing lines Patch % Lines
src/module.c 0.00% 2 Missing ⚠️
src/acl.c 75.00% 1 Missing ⚠️
src/server.c 92.85% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1568      +/-   ##
============================================
+ Coverage     70.80%   71.00%   +0.19%     
============================================
  Files           121      121              
  Lines         65132    65139       +7     
============================================
+ Hits          46118    46249     +131     
+ Misses        19014    18890     -124     
Files with missing lines Coverage Δ
src/aof.c 80.30% <100.00%> (+0.07%) ⬆️
src/cluster.c 89.21% <100.00%> (ø)
src/cluster_legacy.c 87.08% <100.00%> (-0.28%) ⬇️
src/db.c 89.57% <100.00%> (ø)
src/debug.c 52.40% <100.00%> (ø)
src/hashtable.c 78.08% <100.00%> (+0.73%) ⬆️
src/kvstore.c 95.11% <100.00%> (-0.08%) ⬇️
src/latency.c 80.92% <100.00%> (ø)
src/object.c 82.04% <100.00%> (ø)
src/pubsub.c 96.82% <100.00%> (ø)
... and 8 more

... and 7 files with indirect coverage changes

src/acl.c Outdated Show resolved Hide resolved
Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting idea. I just skimmed though, not a full review.

To summarize, this is a second level of pre-fetching. Theoretically, we could do even more levels, right? It could even be generalized to any number of levels. We could use an array of prefetch callbacks.... (not saying we should)

We should probably also consider the bigger picture where this kind of callbacks can be used not only for the iterator, but also for the hashtableIncrementalFind that's already used with IO threads, and with other operations like Scan.

A hypothetical use case: Imagine we have a sorted set where we want to find the 10 members sorted by score, ZRANGE. We could find the starting element and also prefetch in up to 10 steps the next elements.

Have you been thinking about more prefetching ideas? Can you share your ideas? It's useful to understand the bigger picture better before we add many small optimizations for specific scenarios.

src/acl.c Outdated Show resolved Hide resolved
@NadavGigi
Copy link
Contributor Author

NadavGigi commented Jan 19, 2025

Interesting idea. I just skimmed though, not a full review.

To summarize, this is a second level of pre-fetching. Theoretically, we could do even more levels, right? It could even be generalized to any number of levels. We could use an array of prefetch callbacks.... (not saying we should)

We should probably also consider the bigger picture where this kind of callbacks can be used not only for the iterator, but also for the hashtableIncrementalFind that's already used with IO threads, and with other operations like Scan.

A hypothetical use case: Imagine we have a sorted set where we want to find the 10 members sorted by score, ZRANGE. We could find the starting element and also prefetch in up to 10 steps the next elements.

Have you been thinking about more prefetching ideas? Can you share your ideas? It's useful to understand the bigger picture better before we add many small optimizations for specific scenarios.

@zuiderkwast Thank you for your thoughtful feedback and for taking the time to review my PR. I apologize for not sharing my broader ideas earlier.

Let me address these and share my current thoughts on the matter:

  1. Multi-level Prefetching:
    Your observation about the potential for multiple levels of prefetching is intriguing. While my current implementation focuses on a second level, I'm not entirely sure how well extending this to multiple levels would work in practice.

    I think that your levels idea will happen anyway with this PR when we have nested hashtables. With other data structures, each one should be considered individually if it can benefit from prefetching or not. The perefetching of deeper levels should be the caller responsible and by the iterator.

    Anyway, notice that with each level of prefetching, we will have to prefetch more buckets and entries ahead of the current bucket from which we're returning entries.

  2. Broader Applications:
    I agree that we should consider wider applications of this prefetching mechanism including the ZRANGE usecase you suggested. I have already started thinking about it and will share the insights soon.

Current Ideas and Roadmap:
I'm approaching these optimizations by porting techniques that have proven effective in dict implementations, while being mindful that direct application may not always yield the same benefits in hashtable contexts. Here's my current thinking where each of these optimizations builds upon the previous ones :

  1. Entries prefetching - Done
  2. Value prefetching for db iterations. Prefetching the internal fields of robj and hashTypeEntry - in PR.
  3. Expanding the value prefetching to other data stcructures. - Implementation
  4. second-level prefetching mechanism. This mechanism operates at the data structure level and is triggered when receiving client commands, before their actual execution. It leverages the existing incremental find API through a callback system. For instance, when processing a command like 'hget myhash f1 v1', a specific callback is invoked. This callback utilizes the incremental find API to initiate prefetching of relevant data structures, potentially improving performance by reducing latency during command execution.
  5. Using prefetch (similar to what you suggested in hashtableIncrementalFindInit) to speed up sinterGenericCommand and sunionDiffGenericCommand. - Implementation.

@zuiderkwast
Copy link
Contributor

I apologize for not sharing my broader ideas earlier.

@NadavGigi No need to apologize! I didn't ask before. :D

1. Multi-level Prefetching:
Your observation about the potential for multiple levels of prefetching is intriguing. While my current implementation focuses on a second level, I'm not entirely sure how well extending this to multiple levels would work in practice.

Yes, intriguing, that was the intention of my comment. 😄 We should base the decisions on actual use cases and facts.

The ZRANGE idea I had isn't even in the hashtable code. It is the skiplist code. We could definitely do some prefetching in that kind of code. And in the quicklist implementation for lists we could do it.

4. second-level prefetching mechanism. This mechanism operates at the data structure level and is triggered when receiving client commands, before their actual execution. It leverages the existing incremental find API through a callback system. For instance, when processing a command like 'hget myhash f1 v1', a specific callback is invoked. This callback utilizes the incremental find API to initiate prefetching of relevant data structures, potentially improving performance by reducing latency during command execution.

This is interesting! Should we add this callback information in the commands' JSON files? Feel free to open an issue to explain it, or a PR if you already have some code.

5. Using prefetch (similar to what you suggested in hashtableIncrementalFindInit) to speed up sinterGenericCommand and sunionDiffGenericCommand. - Implementation.

Cool!

It seems to me that we should prioritize SCAN next, don't you think? If users find out that KEYS is faster than SCAN, maybe they switch to using KEYS. 🙀 And we use scan for expiration and other internal jobs too.

It seems easy to add prefetching to scan. We can just prefetch the next bucket. Maybe it doesn't speed up SCAN 0 COUNT 10 much, but it might speed up SCAN 0 COUNT 100...

@NadavGigi NadavGigi force-pushed the prefetch_values branch 3 times, most recently from c9d2294 to 1e4823f Compare January 20, 2025 19:43
Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it! Just some nits.

src/server.c Outdated Show resolved Hide resolved
src/hashtable.c Outdated Show resolved Hide resolved
src/hashtable.c Outdated Show resolved Hide resolved
src/hashtable.c Show resolved Hide resolved
Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@madolson @xbasel do you still want to review this, or OK to merge?

src/hashtable.h Outdated Show resolved Hide resolved
@NadavGigi
Copy link
Contributor Author

I apologize for not sharing my broader ideas earlier.

@NadavGigi No need to apologize! I didn't ask before. :D

Haha, I guess I was being a bit too formal there!

  1. Multi-level Prefetching:
    Your observation about the potential for multiple levels of prefetching is intriguing. While my current implementation focuses on a second level, I'm not entirely sure how well extending this to multiple levels would work in practice.

Yes, intriguing, that was the intention of my comment. 😄 We should base the decisions on actual use cases and facts.

The ZRANGE idea I had isn't even in the hashtable code. It is the skiplist code. We could definitely do some prefetching in that kind of code. And in the quicklist implementation for lists we could do it.

Great Idea! it can be added to the prefetching tasks list. When working on sunionDiffGenericCommand i will try to make a general batch prefetching function that will might be good for all.

  1. second-level prefetching mechanism. This mechanism operates at the data structure level and is triggered when receiving client commands, before their actual execution. It leverages the existing incremental find API through a callback system. For instance, when processing a command like 'hget myhash f1 v1', a specific callback is invoked. This callback utilizes the incremental find API to initiate prefetching of relevant data structures, potentially improving performance by reducing latency during command execution.

This is interesting! Should we add this callback information in the commands' JSON files? Feel free to open an issue to explain it, or a PR if you already have some code.

  1. Using prefetch (similar to what you suggested in hashtableIncrementalFindInit) to speed up sinterGenericCommand and sunionDiffGenericCommand. - Implementation.

Cool!

It seems to me that we should prioritize SCAN next, don't you think? If users find out that KEYS is faster than SCAN, maybe they switch to using KEYS. 🙀 And we use scan for expiration and other internal jobs too.

It seems easy to add prefetching to scan. We can just prefetch the next bucket. Maybe it doesn't speed up SCAN 0 COUNT 10 much, but it might speed up SCAN 0 COUNT 100...

Sounds great!
Myaybe we can come up with something like:
Start by prefetching next bucket and then prefetching the next bucket entries when we reach the middle of the current bucket we iterating.
I dont know yet how it will be prioritized but i will let you know as soon as possible.

@xbasel
Copy link
Member

xbasel commented Jan 21, 2025

xbasel

I'd like to have another look. Should be quick.

Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just minor comments about wording.

src/hashtable.c Outdated Show resolved Hide resolved
src/server.c Outdated Show resolved Hide resolved
src/hashtable.c Outdated Show resolved Hide resolved
src/hashtable.c Outdated Show resolved Hide resolved
@madolson
Copy link
Member

@NadavGigi Another remind to not force push, just keep pushing little commits, we can squash them at the end.

@zuiderkwast zuiderkwast merged commit f251078 into valkey-io:unstable Jan 23, 2025
50 checks passed
@zuiderkwast zuiderkwast added release-notes This issue should get a line item in the release notes performance labels Jan 23, 2025
@NadavGigi NadavGigi deleted the prefetch_values branch January 23, 2025 11:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance release-notes This issue should get a line item in the release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants