Accelerate hash table iterator with value prefetching #1568

NadavGigi · 2025-01-15T15:54:36Z

This PR builds upon the previous entry prefetching optimization to further enhance performance by implementing value prefetching for hashtable iterators.

Implementation

Modified hashtableInitIterator to accept a new flags parameter, allowing control over iterator behavior.
Implemented conditional value prefetching within hashtableNext based on the new HASHTABLE_ITER_PREFETCH_VALUES flag.
When the flag is set, hashtableNext now calls prefetchBucketValues at the start of each new bucket, preemptively loading the values of filled entries into the CPU cache.
The actual prefetching of values is performed using type-specific callback functions implemented in server.c:

For robj the hashtableObjectPrefetchValue callback is used to prefetch the value if not embeded.

This implementation is specifically focused on main database iterations at this stage. Applying it to hashtables that hold other object types should not be problematic, but its performance benefits for those cases will need to be proven through testing and benchmarking.

Performance

Setup:

64cores Graviton 3 Amazon EC2 instance.
50 mil keys with different value sizes.
Running valkey server over RAM file system.
crc checksum and comperssion off.

Action

save command.

Results

The results regarding the duration of “save” command was taken from “info all” command.

+--------------------+------------------+------------------+ 
| Prefetching        | Value size (byte)| Time (seconds)   | 
+--------------------+------------------+------------------+ 
| No                 | 100              | 20.112279        | 
| Yes                | 100              | 12.758519        | 
| No                 | 40               | 16.945366        | 
| Yes                | 40               | 10.902022        |
| No                 | 20               | 9.817000         | 
| Yes                | 20               | 9.626821         |
| No                 | 10               | 9.71510          | 
| Yes                | 10               | 9.510565         |
+--------------------+------------------+------------------+

The results largely align with our expectations, showing significant improvements for larger values (100 bytes and 40 bytes) that are stored outside the robj. For smaller values (20 bytes and 10 bytes) that are embedded within the robj, we see almost no improvement, which is as expected.

However, the small improvement observed even for these embedded values is somewhat surprising. Given that we are not actively prefetching these embedded values, this minor performance gain was not anticipated.

perf record on save command without value prefetching:

                --99.98%--rdbSaveDb
                          |          
                          |--91.38%--rdbSaveKeyValuePair
                          |          |          
                          |          |--42.72%--rdbSaveRawString
                          |          |          |          
                          |          |          |--26.69%--rdbWriteRaw
                          |          |          |          |          
                          |          |          |           --25.75%--rioFileWrite.lto_priv.0
                          |          |          |          
                          |          |           --15.41%--rdbSaveLen
                          |          |                     |          
                          |          |                     |--7.58%--rdbWriteRaw
                          |          |                     |          |          
                          |          |                     |           --7.08%--rioFileWrite.lto_priv.0
                          |          |                     |                     |          
                          |          |                     |                      --6.54%--_IO_fwrite
                          |          |                     |                                         
                          |          |                     |          
                          |          |                      --7.42%--rdbWriteRaw.constprop.1
                          |          |                                |          
                          |          |                                 --7.18%--rioFileWrite.lto_priv.0
                          |          |                                           |          
                          |          |                                            --6.73%--_IO_fwrite
                          |          |                                                            
                          |          |          
                          |          |--40.44%--rdbSaveStringObject
                          |          |          
                          |           --7.62%--rdbSaveObjectType
                          |                     |          
                          |                      --7.39%--rdbWriteRaw.constprop.1
                          |                                |          
                          |                                 --7.04%--rioFileWrite.lto_priv.0
                          |                                           |          
                          |                                            --6.59%--_IO_fwrite
                          |                                                               
                          |          
                           --7.33%--hashtableNext.constprop.1
                                     |          
                                      --6.28%--prefetchNextBucketEntries.lto_priv.0

perf record on save command with value prefetching:

               rdbSaveRio
               |          
                --99.93%--rdbSaveDb
                          |          
                          |--79.81%--rdbSaveKeyValuePair
                          |          |          
                          |          |--66.79%--rdbSaveRawString
                          |          |          |          
                          |          |          |--42.31%--rdbWriteRaw
                          |          |          |          |          
                          |          |          |           --40.74%--rioFileWrite.lto_priv.0
                          |          |          |          
                          |          |           --23.37%--rdbSaveLen
                          |          |                     |          
                          |          |                     |--11.78%--rdbWriteRaw
                          |          |                     |          |          
                          |          |                     |           --11.03%--rioFileWrite.lto_priv.0
                          |          |                     |                     |          
                          |          |                     |                      --10.30%--_IO_fwrite
                          |          |                     |                                |          
                          |          |                     |          
                          |          |                      --10.98%--rdbWriteRaw.constprop.1
                          |          |                                |          
                          |          |                                 --10.44%--rioFileWrite.lto_priv.0
                          |          |                                           |          
                          |          |                                            --9.74%--_IO_fwrite
                          |          |                                                      |          
                          |          |          
                          |          |--11.33%--rdbSaveObjectType
                          |          |          |          
                          |          |           --10.96%--rdbWriteRaw.constprop.1
                          |          |                     |          
                          |          |                      --10.51%--rioFileWrite.lto_priv.0
                          |          |                                |          
                          |          |                                 --9.75%--_IO_fwrite
                          |          |                                           |          
                          |          |          
                          |           --0.77%--rdbSaveStringObject
                          |          
                           --18.39%--hashtableNext
                                     |          
                                     |--10.04%--hashtableObjectPrefetchValue
                                     |
                                      --6.06%--prefetchNextBucketEntries

Conclusions:

The prefetching strategy appears to be working as intended, shifting the performance bottleneck from data access to I/O operations.
The significant reduction in rdbSaveStringObject time suggests that string objects(which are the values) are being accessed more efficiently.

codecov · 2025-01-15T16:29:37Z

Codecov Report

Attention: Patch coverage is 95.00000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 71.00%. Comparing base (7fc958d) to head (57a387a).
Report is 2 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/module.c	0.00%	2 Missing ⚠️
src/acl.c	75.00%	1 Missing ⚠️
src/server.c	92.85%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #1568      +/-   ##
============================================
+ Coverage     70.80%   71.00%   +0.19%     
============================================
  Files           121      121              
  Lines         65132    65139       +7     
============================================
+ Hits          46118    46249     +131     
+ Misses        19014    18890     -124

Files with missing lines	Coverage Δ
src/aof.c	`80.30% <100.00%> (+0.07%)`	⬆️
src/cluster.c	`89.21% <100.00%> (ø)`
src/cluster_legacy.c	`87.08% <100.00%> (-0.28%)`	⬇️
src/db.c	`89.57% <100.00%> (ø)`
src/debug.c	`52.40% <100.00%> (ø)`
src/hashtable.c	`78.08% <100.00%> (+0.73%)`	⬆️
src/kvstore.c	`95.11% <100.00%> (-0.08%)`	⬇️
src/latency.c	`80.92% <100.00%> (ø)`
src/object.c	`82.04% <100.00%> (ø)`
src/pubsub.c	`96.82% <100.00%> (ø)`
... and 8 more

... and 7 files with indirect coverage changes

src/acl.c

zuiderkwast

Interesting idea. I just skimmed though, not a full review.

To summarize, this is a second level of pre-fetching. Theoretically, we could do even more levels, right? It could even be generalized to any number of levels. We could use an array of prefetch callbacks.... (not saying we should)

We should probably also consider the bigger picture where this kind of callbacks can be used not only for the iterator, but also for the hashtableIncrementalFind that's already used with IO threads, and with other operations like Scan.

A hypothetical use case: Imagine we have a sorted set where we want to find the 10 members sorted by score, ZRANGE. We could find the starting element and also prefetch in up to 10 steps the next elements.

Have you been thinking about more prefetching ideas? Can you share your ideas? It's useful to understand the bigger picture better before we add many small optimizations for specific scenarios.

src/acl.c

NadavGigi · 2025-01-19T13:52:02Z

Interesting idea. I just skimmed though, not a full review.

To summarize, this is a second level of pre-fetching. Theoretically, we could do even more levels, right? It could even be generalized to any number of levels. We could use an array of prefetch callbacks.... (not saying we should)

We should probably also consider the bigger picture where this kind of callbacks can be used not only for the iterator, but also for the hashtableIncrementalFind that's already used with IO threads, and with other operations like Scan.

A hypothetical use case: Imagine we have a sorted set where we want to find the 10 members sorted by score, ZRANGE. We could find the starting element and also prefetch in up to 10 steps the next elements.

Have you been thinking about more prefetching ideas? Can you share your ideas? It's useful to understand the bigger picture better before we add many small optimizations for specific scenarios.

@zuiderkwast Thank you for your thoughtful feedback and for taking the time to review my PR. I apologize for not sharing my broader ideas earlier.

Let me address these and share my current thoughts on the matter:

Multi-level Prefetching:
Your observation about the potential for multiple levels of prefetching is intriguing. While my current implementation focuses on a second level, I'm not entirely sure how well extending this to multiple levels would work in practice.

I think that your levels idea will happen anyway with this PR when we have nested hashtables. With other data structures, each one should be considered individually if it can benefit from prefetching or not. The perefetching of deeper levels should be the caller responsible and by the iterator.

Anyway, notice that with each level of prefetching, we will have to prefetch more buckets and entries ahead of the current bucket from which we're returning entries.
Broader Applications:
I agree that we should consider wider applications of this prefetching mechanism including the ZRANGE usecase you suggested. I have already started thinking about it and will share the insights soon.

Current Ideas and Roadmap:
I'm approaching these optimizations by porting techniques that have proven effective in dict implementations, while being mindful that direct application may not always yield the same benefits in hashtable contexts. Here's my current thinking where each of these optimizations builds upon the previous ones :

Entries prefetching - Done
Value prefetching for db iterations. Prefetching the internal fields of robj and hashTypeEntry - in PR.
Expanding the value prefetching to other data stcructures. - Implementation
second-level prefetching mechanism. This mechanism operates at the data structure level and is triggered when receiving client commands, before their actual execution. It leverages the existing incremental find API through a callback system. For instance, when processing a command like 'hget myhash f1 v1', a specific callback is invoked. This callback utilizes the incremental find API to initiate prefetching of relevant data structures, potentially improving performance by reducing latency during command execution.
Using prefetch (similar to what you suggested in hashtableIncrementalFindInit) to speed up sinterGenericCommand and sunionDiffGenericCommand. - Implementation.

zuiderkwast · 2025-01-19T18:59:17Z

I apologize for not sharing my broader ideas earlier.

@NadavGigi No need to apologize! I didn't ask before. :D

1. Multi-level Prefetching:
Your observation about the potential for multiple levels of prefetching is intriguing. While my current implementation focuses on a second level, I'm not entirely sure how well extending this to multiple levels would work in practice.

Yes, intriguing, that was the intention of my comment. 😄 We should base the decisions on actual use cases and facts.

The ZRANGE idea I had isn't even in the hashtable code. It is the skiplist code. We could definitely do some prefetching in that kind of code. And in the quicklist implementation for lists we could do it.

4. second-level prefetching mechanism. This mechanism operates at the data structure level and is triggered when receiving client commands, before their actual execution. It leverages the existing incremental find API through a callback system. For instance, when processing a command like 'hget myhash f1 v1', a specific callback is invoked. This callback utilizes the incremental find API to initiate prefetching of relevant data structures, potentially improving performance by reducing latency during command execution.

This is interesting! Should we add this callback information in the commands' JSON files? Feel free to open an issue to explain it, or a PR if you already have some code.

5. Using prefetch (similar to what you suggested in hashtableIncrementalFindInit) to speed up sinterGenericCommand and sunionDiffGenericCommand. - Implementation.

Cool!

It seems to me that we should prioritize SCAN next, don't you think? If users find out that KEYS is faster than SCAN, maybe they switch to using KEYS. 🙀 And we use scan for expiration and other internal jobs too.

It seems easy to add prefetching to scan. We can just prefetch the next bucket. Maybe it doesn't speed up SCAN 0 COUNT 10 much, but it might speed up SCAN 0 COUNT 100...

zuiderkwast

I like it! Just some nits.

src/server.c

src/hashtable.c

zuiderkwast

LGTM.

@madolson @xbasel do you still want to review this, or OK to merge?

src/hashtable.h

NadavGigi · 2025-01-21T12:55:25Z

I apologize for not sharing my broader ideas earlier.

@NadavGigi No need to apologize! I didn't ask before. :D

Haha, I guess I was being a bit too formal there!

Multi-level Prefetching:
Your observation about the potential for multiple levels of prefetching is intriguing. While my current implementation focuses on a second level, I'm not entirely sure how well extending this to multiple levels would work in practice.

Yes, intriguing, that was the intention of my comment. 😄 We should base the decisions on actual use cases and facts.

The ZRANGE idea I had isn't even in the hashtable code. It is the skiplist code. We could definitely do some prefetching in that kind of code. And in the quicklist implementation for lists we could do it.

Great Idea! it can be added to the prefetching tasks list. When working on sunionDiffGenericCommand i will try to make a general batch prefetching function that will might be good for all.

second-level prefetching mechanism. This mechanism operates at the data structure level and is triggered when receiving client commands, before their actual execution. It leverages the existing incremental find API through a callback system. For instance, when processing a command like 'hget myhash f1 v1', a specific callback is invoked. This callback utilizes the incremental find API to initiate prefetching of relevant data structures, potentially improving performance by reducing latency during command execution.

This is interesting! Should we add this callback information in the commands' JSON files? Feel free to open an issue to explain it, or a PR if you already have some code.

Using prefetch (similar to what you suggested in hashtableIncrementalFindInit) to speed up sinterGenericCommand and sunionDiffGenericCommand. - Implementation.

Cool!

It seems to me that we should prioritize SCAN next, don't you think? If users find out that KEYS is faster than SCAN, maybe they switch to using KEYS. 🙀 And we use scan for expiration and other internal jobs too.

It seems easy to add prefetching to scan. We can just prefetch the next bucket. Maybe it doesn't speed up SCAN 0 COUNT 10 much, but it might speed up SCAN 0 COUNT 100...

Sounds great!
Myaybe we can come up with something like:
Start by prefetching next bucket and then prefetching the next bucket entries when we reach the middle of the current bucket we iterating.
I dont know yet how it will be prioritized but i will let you know as soon as possible.

xbasel · 2025-01-21T13:01:38Z

xbasel

I'd like to have another look. Should be quick.

madolson

Just minor comments about wording.

src/hashtable.c

src/server.c

src/hashtable.c

Signed-off-by: NadavGigi <[email protected]>

madolson · 2025-01-22T16:11:20Z

@NadavGigi Another remind to not force push, just keep pushing little commits, we can squash them at the end.

NadavGigi force-pushed the prefetch_values branch 2 times, most recently from 1da6b36 to 35f6b6e Compare January 15, 2025 16:15

NadavGigi changed the title ~~Improving iterator using prefetch values~~ Accelerate hash table iterator with value prefetching Jan 15, 2025

madolson reviewed Jan 15, 2025

View reviewed changes

src/acl.c Outdated Show resolved Hide resolved

zuiderkwast reviewed Jan 16, 2025

View reviewed changes

src/acl.c Outdated Show resolved Hide resolved

NadavGigi force-pushed the prefetch_values branch 3 times, most recently from c9d2294 to 1e4823f Compare January 20, 2025 19:43

zuiderkwast reviewed Jan 20, 2025

View reviewed changes

src/server.c Outdated Show resolved Hide resolved

src/hashtable.c Outdated Show resolved Hide resolved

src/hashtable.c Outdated Show resolved Hide resolved

src/hashtable.c Show resolved Hide resolved

NadavGigi force-pushed the prefetch_values branch from 1e4823f to 4da9437 Compare January 21, 2025 12:07

zuiderkwast approved these changes Jan 21, 2025

View reviewed changes

src/hashtable.h Outdated Show resolved Hide resolved

NadavGigi force-pushed the prefetch_values branch from 4da9437 to 618ba26 Compare January 21, 2025 12:58

madolson reviewed Jan 22, 2025

View reviewed changes

src/hashtable.c Outdated Show resolved Hide resolved

src/server.c Outdated Show resolved Hide resolved

src/hashtable.c Outdated Show resolved Hide resolved

src/hashtable.c Outdated Show resolved Hide resolved

xbasel approved these changes Jan 22, 2025

View reviewed changes

NadavGigi force-pushed the prefetch_values branch from 618ba26 to 0096df0 Compare January 22, 2025 12:00

Prefetching values for optimized iteration

57a387a

Signed-off-by: NadavGigi <[email protected]>

NadavGigi force-pushed the prefetch_values branch from 0096df0 to 57a387a Compare January 22, 2025 12:10

madolson approved these changes Jan 22, 2025

View reviewed changes

zuiderkwast merged commit f251078 into valkey-io:unstable Jan 23, 2025
50 checks passed

zuiderkwast added release-notes This issue should get a line item in the release notes performance labels Jan 23, 2025

NadavGigi deleted the prefetch_values branch January 23, 2025 11:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Accelerate hash table iterator with value prefetching #1568

Accelerate hash table iterator with value prefetching #1568

NadavGigi commented Jan 15, 2025 •

edited

Loading

codecov bot commented Jan 15, 2025 •

edited

Loading

zuiderkwast left a comment

NadavGigi commented Jan 19, 2025 •

edited

Loading

zuiderkwast commented Jan 19, 2025

zuiderkwast left a comment

zuiderkwast left a comment

NadavGigi commented Jan 21, 2025

xbasel commented Jan 21, 2025

madolson left a comment

madolson commented Jan 22, 2025

Accelerate hash table iterator with value prefetching #1568

Accelerate hash table iterator with value prefetching #1568

Conversation

NadavGigi commented Jan 15, 2025 • edited Loading

Implementation

Performance

Setup:

Action

Results

codecov bot commented Jan 15, 2025 • edited Loading

Codecov Report

zuiderkwast left a comment

Choose a reason for hiding this comment

NadavGigi commented Jan 19, 2025 • edited Loading

zuiderkwast commented Jan 19, 2025

zuiderkwast left a comment

Choose a reason for hiding this comment

zuiderkwast left a comment

Choose a reason for hiding this comment

NadavGigi commented Jan 21, 2025

xbasel commented Jan 21, 2025

madolson left a comment

Choose a reason for hiding this comment

madolson commented Jan 22, 2025

NadavGigi commented Jan 15, 2025 •

edited

Loading

codecov bot commented Jan 15, 2025 •

edited

Loading

NadavGigi commented Jan 19, 2025 •

edited

Loading