Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

accounts-db: Sampled LRU eviction #3943

Merged
merged 1 commit into from
Jan 16, 2025

Conversation

vadorovsky
Copy link
Member

@vadorovsky vadorovsky commented Dec 5, 2024

Instead of rigid LRU eviction, perform sampled LRU eviction.

Sampled LRU eviction takes K random keys and picks the one with the lowest update timestamp. This way, even though the evicted element is not always the oldest in the whole cache, removes the necessity of maintaining a queue and reduces the contention caused by locking the queue.

The K parameter is configurable, but the best performing value so far was 8 and it's used as the default one.

The new eviction mechanism results in performance improvement in the cache benchmarks:

read_only_accounts_cache_eviction_lo_hi/load/8
                        time:   [1.3057 µs 1.3065 µs 1.3076 µs]
                        change: [-63.644% -60.958% -57.959%] (p = 0.00 < 0.05)
                        Performance has improved.
read_only_accounts_cache_eviction_lo_hi/store/8
                        time:   [2.1751 µs 2.1809 µs 2.1856 µs]
                        change: [-79.874% -79.642% -79.420%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) low mild
read_only_accounts_cache_eviction_lo_hi/load/16
                        time:   [2.3593 µs 2.3614 µs 2.3630 µs]
                        change: [-56.035% -52.626% -48.887%] (p = 0.00 < 0.05)
                        Performance has improved.
read_only_accounts_cache_eviction_lo_hi/store/16
                        time:   [3.1320 µs 3.1436 µs 3.1558 µs]
                        change: [-85.141% -84.976% -84.813%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) low severe
  5 (5.00%) low mild
  1 (1.00%) high severe
read_only_accounts_cache_eviction_lo_hi/load/32
                        time:   [1.5572 µs 1.5614 µs 1.5662 µs]
                        change: [-85.886% -84.973% -83.975%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  8 (8.00%) high mild
  1 (1.00%) high severe
read_only_accounts_cache_eviction_lo_hi/store/32
                        time:   [3.3965 µs 3.4077 µs 3.4184 µs]
                        change: [-90.858% -90.724% -90.585%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) low severe
  2 (2.00%) low mild
read_only_accounts_cache_eviction_lo_hi/load/64
                        time:   [2.2669 µs 2.5165 µs 2.7736 µs]
                        change: [-91.867% -91.083% -90.189%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
read_only_accounts_cache_eviction_lo_hi/store/64
                        time:   [6.6097 µs 6.9530 µs 7.2983 µs]
                        change: [-91.090% -90.788% -90.417%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 26 outliers among 100 measurements (26.00%)
  3 (3.00%) low severe
  7 (7.00%) low mild
  6 (6.00%) high mild
  10 (10.00%) high severe

read_only_accounts_cache_eviction_hi/load/8
                        time:   [2.1554 µs 2.1580 µs 2.1600 µs]
                        change: [-44.674% -39.965% -34.699%] (p = 0.00 < 0.05)
                        Performance has improved.
read_only_accounts_cache_eviction_hi/store/8
                        time:   [2.7438 µs 2.7580 µs 2.7689 µs]
                        change: [-80.454% -80.144% -79.805%] (p = 0.00 < 0.05)
                        Performance has improved.
read_only_accounts_cache_eviction_hi/load/16
                        time:   [1.9593 µs 1.9604 µs 1.9619 µs]
                        change: [-64.954% -61.744% -58.367%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
read_only_accounts_cache_eviction_hi/store/16
                        time:   [2.7436 µs 2.7581 µs 2.7697 µs]
                        change: [-84.959% -84.721% -84.489%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
read_only_accounts_cache_eviction_hi/load/32
                        time:   [2.5226 µs 2.5260 µs 2.5301 µs]
                        change: [-79.214% -77.822% -76.252%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) low mild
read_only_accounts_cache_eviction_hi/store/32
                        time:   [4.0885 µs 4.1015 µs 4.1148 µs]
                        change: [-89.677% -89.496% -89.310%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) low severe
  4 (4.00%) low mild
read_only_accounts_cache_eviction_hi/load/64
                        time:   [4.1474 µs 4.3543 µs 4.5721 µs]
                        change: [-86.307% -85.336% -84.364%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
read_only_accounts_cache_eviction_hi/store/64
                        time:   [5.8232 µs 6.3214 µs 6.7945 µs]
                        change: [-93.624% -93.167% -92.638%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

The InfluxDB metrics report:

  • A slight improvement in load times in general and less spikes.
  • A significant improvement in store times and less spikes.
  • Longer evict times, but that's expected and we are willing to accept it as a trade-off.

lru2

Ref #4272

@vadorovsky
Copy link
Member Author

vadorovsky commented Dec 5, 2024

@alessandrod @brooksprumo Marked as a draft,I still want to take some time for more benchmark runs and figuring out how that behaves on the actual MB validator. And also on ensuring about the choice of K parameter. The benchmark is not the best representation of the real workload, but I still expect the overall "numbers go lower" tendency.

Feel free to comment on the code already.

@brooksprumo
Copy link

brooksprumo commented Dec 5, 2024

behaves on the actual MB validator

Yeah, this will be the valuable one.

A thing to also check is if we should also remove the code the conditionally doesn't update the last accessed time. Because if we're randomly sampling, we want those last accessed times to be more accurate. We'll have to see if there's perf issue with that though, or if it is fast since we don't have to maintain the lru queue anymore.

@vadorovsky
Copy link
Member Author

behaves on the actual MB validator

Yeah, this will be the valuable one.

A thing to also check is if we should also remove the code the conditionally doesn't update the last accessed time. Because if we're randomly sampling, we want those last accessed times to be more accurate. We'll have to see if there's perf issue with that though, or if it is fast since we don't have to maintain the lru queue anymore.

Good point. I will benchmark that, theoretically the unconditional update of the timestamp shouldn't change much.

            if let Some(entry) = self.cache.get(&pubkey) {
                if entry.slot == slot {
                    let update_lru = entry.ms_since_last_update() >= self.ms_to_skip_lru_update;
                    if update_lru {
                        entry
                            .last_update_time
                            .store(ReadOnlyAccountCacheEntry::timestamp(), Ordering::Release);
                    }
                    [...]
            }

self.cache.get returns Ref, which acquires a read lock. Updating the timestamp without write lock works, because it's an atomic field. So since we got rid of the queue, we don't have to write lock anything.

@vadorovsky vadorovsky force-pushed the sampled-lru-eviction branch 4 times, most recently from ef1a7d2 to afbf23f Compare December 10, 2024 10:33
@vadorovsky vadorovsky force-pushed the sampled-lru-eviction branch 7 times, most recently from 1a5ce85 to e8597e4 Compare December 26, 2024 07:13
@brooksprumo brooksprumo self-requested a review December 30, 2024 18:37
@vadorovsky vadorovsky force-pushed the sampled-lru-eviction branch 6 times, most recently from e13717b to 142cc80 Compare January 2, 2025 09:33
Copy link

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea looks good now. I'll do a deep dive on the code/impl itself next pass.

accounts-db/src/read_only_accounts_cache.rs Outdated Show resolved Hide resolved
accounts-db/src/read_only_accounts_cache.rs Show resolved Hide resolved
@vadorovsky vadorovsky force-pushed the sampled-lru-eviction branch 2 times, most recently from e61d65b to ed25318 Compare January 13, 2025 14:53
@vadorovsky vadorovsky force-pushed the sampled-lru-eviction branch 9 times, most recently from f4d4d7d to 7fb5818 Compare January 15, 2025 09:40
Copy link

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The evict code looks good. I need to do another pass on the tests still.

accounts-db/src/read_only_accounts_cache.rs Show resolved Hide resolved
accounts-db/src/accounts_db.rs Outdated Show resolved Hide resolved
accounts-db/src/read_only_accounts_cache.rs Outdated Show resolved Hide resolved
accounts-db/src/read_only_accounts_cache.rs Outdated Show resolved Hide resolved
@brooksprumo brooksprumo self-requested a review January 15, 2025 15:35
@vadorovsky vadorovsky force-pushed the sampled-lru-eviction branch 3 times, most recently from 62ff8de to 0714d61 Compare January 15, 2025 19:54
Copy link

@alessandrod alessandrod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

almost there! left a few comments

accounts-db/src/accounts_db.rs Show resolved Hide resolved
accounts-db/src/accounts_db.rs Show resolved Hide resolved
accounts-db/src/read_only_accounts_cache.rs Outdated Show resolved Hide resolved
accounts-db/src/read_only_accounts_cache.rs Outdated Show resolved Hide resolved
accounts-db/src/read_only_accounts_cache.rs Outdated Show resolved Hide resolved
accounts-db/src/read_only_accounts_cache.rs Outdated Show resolved Hide resolved
accounts-db/src/read_only_accounts_cache.rs Show resolved Hide resolved
accounts-db/src/read_only_accounts_cache.rs Show resolved Hide resolved
accounts-db/src/read_only_accounts_cache.rs Show resolved Hide resolved
@vadorovsky vadorovsky force-pushed the sampled-lru-eviction branch 3 times, most recently from 8c940b5 to e3b24e0 Compare January 16, 2025 08:38
@vadorovsky vadorovsky force-pushed the sampled-lru-eviction branch from e3b24e0 to 582c165 Compare January 16, 2025 08:59
Instead of rigid LRU eviction, perform sampled LRU eviction.

Sampled LRU eviction takes K random keys and picks the one with the
lowest update timestamp. This way, even though the evicted element is
not always the oldest in the whole cache, removes the necessity of
maintaining a queue and reduces the contention caused by locking the
queue.

The K parameter is configurable, but the best performing value so far
was 8 and it's used as the default one.

The new eviction mechanism results in performance improvement in the
cache eviction benchmarks:

```
read_only_accounts_cache_eviction_lo_hi/load/8
                        time:   [1.3057 µs 1.3065 µs 1.3076 µs]
                        change: [-63.644% -60.958% -57.959%] (p = 0.00 < 0.05)
                        Performance has improved.
read_only_accounts_cache_eviction_lo_hi/store/8
                        time:   [2.1751 µs 2.1809 µs 2.1856 µs]
                        change: [-79.874% -79.642% -79.420%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) low mild
read_only_accounts_cache_eviction_lo_hi/load/16
                        time:   [2.3593 µs 2.3614 µs 2.3630 µs]
                        change: [-56.035% -52.626% -48.887%] (p = 0.00 < 0.05)
                        Performance has improved.
read_only_accounts_cache_eviction_lo_hi/store/16
                        time:   [3.1320 µs 3.1436 µs 3.1558 µs]
                        change: [-85.141% -84.976% -84.813%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) low severe
  5 (5.00%) low mild
  1 (1.00%) high severe
read_only_accounts_cache_eviction_lo_hi/load/32
                        time:   [1.5572 µs 1.5614 µs 1.5662 µs]
                        change: [-85.886% -84.973% -83.975%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  8 (8.00%) high mild
  1 (1.00%) high severe
read_only_accounts_cache_eviction_lo_hi/store/32
                        time:   [3.3965 µs 3.4077 µs 3.4184 µs]
                        change: [-90.858% -90.724% -90.585%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) low severe
  2 (2.00%) low mild
read_only_accounts_cache_eviction_lo_hi/load/64
                        time:   [2.2669 µs 2.5165 µs 2.7736 µs]
                        change: [-91.867% -91.083% -90.189%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
read_only_accounts_cache_eviction_lo_hi/store/64
                        time:   [6.6097 µs 6.9530 µs 7.2983 µs]
                        change: [-91.090% -90.788% -90.417%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 26 outliers among 100 measurements (26.00%)
  3 (3.00%) low severe
  7 (7.00%) low mild
  6 (6.00%) high mild
  10 (10.00%) high severe

read_only_accounts_cache_eviction_hi/load/8
                        time:   [2.1554 µs 2.1580 µs 2.1600 µs]
                        change: [-44.674% -39.965% -34.699%] (p = 0.00 < 0.05)
                        Performance has improved.
read_only_accounts_cache_eviction_hi/store/8
                        time:   [2.7438 µs 2.7580 µs 2.7689 µs]
                        change: [-80.454% -80.144% -79.805%] (p = 0.00 < 0.05)
                        Performance has improved.
read_only_accounts_cache_eviction_hi/load/16
                        time:   [1.9593 µs 1.9604 µs 1.9619 µs]
                        change: [-64.954% -61.744% -58.367%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
read_only_accounts_cache_eviction_hi/store/16
                        time:   [2.7436 µs 2.7581 µs 2.7697 µs]
                        change: [-84.959% -84.721% -84.489%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
read_only_accounts_cache_eviction_hi/load/32
                        time:   [2.5226 µs 2.5260 µs 2.5301 µs]
                        change: [-79.214% -77.822% -76.252%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) low mild
read_only_accounts_cache_eviction_hi/store/32
                        time:   [4.0885 µs 4.1015 µs 4.1148 µs]
                        change: [-89.677% -89.496% -89.310%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) low severe
  4 (4.00%) low mild
read_only_accounts_cache_eviction_hi/load/64
                        time:   [4.1474 µs 4.3543 µs 4.5721 µs]
                        change: [-86.307% -85.336% -84.364%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
read_only_accounts_cache_eviction_hi/store/64
                        time:   [5.8232 µs 6.3214 µs 6.7945 µs]
                        change: [-93.624% -93.167% -92.638%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
```
@vadorovsky vadorovsky force-pushed the sampled-lru-eviction branch from 582c165 to 565acd7 Compare January 16, 2025 10:30
Copy link

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Thanks for all the work you've done here! I'm excited for this to merge!

@vadorovsky vadorovsky merged commit 1b99961 into anza-xyz:master Jan 16, 2025
30 checks passed
@vadorovsky vadorovsky deleted the sampled-lru-eviction branch January 16, 2025 14:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants