Extend hash length #227

Merged: 1 commit, Nov 13, 2024

@michaelmckinsey1 (Collaborator) commented Nov 12, 2024

Summary

Previous analysis in #53 (June '23) aimed to avoid hash collisions for up to O(1000) profiles, since reader performance was the bottleneck at the time. With improvements to reader performance such as #170 and #185 (June '24), we can now read O(10000) profiles in reasonable time. Hashes in develop are 8-10 digits long ( ceil(log_10(16^n - 1)) = ceil(9.6) = 10 for n = 8 truncated hex characters ). Results in #53 suggest that hash collisions become probable around O(100000) profiles at 10 digits.

Increasing the truncated hex length to 11 results in hashes that are 11-14 digits long. The only downside of extending the hash length is that longer hashes are less convenient for data analysis.
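
The digit counts above can be checked with a short Python sketch. The helper names `truncated_md5_int` and `max_decimal_digits` are hypothetical and are not taken from the Thicket code base; the sketch only assumes that a hash is the MD5 hex digest truncated to a fixed number of hex characters and then converted to an integer.

```python
import hashlib


def truncated_md5_int(text: str, hex_len: int) -> int:
    """Hypothetical helper: MD5 hex digest cut to `hex_len` characters, as an int."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return int(digest[:hex_len], 16)


def max_decimal_digits(hex_len: int) -> int:
    """Number of decimal digits in the largest value of `hex_len` hex characters."""
    return len(str(16 ** hex_len - 1))


for n in (8, 11):
    print(f"hex length {n}: up to {max_decimal_digits(n)} decimal digits")
# hex length 8: up to 10 decimal digits
# hex length 11: up to 14 decimal digits
```

Smaller hash values have fewer digits, which is why the lengths are quoted as ranges rather than a single number.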

Analysis

KDE plots of the collisions observed when truncating the MD5 hash before converting it to an integer. Each plot uses 1000 trials; the sample size is the number of profiles, where each "profile" is a randomly generated unique string of length 32. An empty plot indicates no collisions (we can expect the collision rate to be negligible).

[Images: KDE plots for hash length = 8 (develop), hash length = 9, hash length = 10, and hash length = 11]
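
A minimal sketch of the kind of experiment described above, not the script actually used to produce the plots. The function names, the alphanumeric alphabet, and the seed handling are all assumptions made for illustration; only the overall procedure (unique random strings of length 32, MD5, truncation, integer conversion, repeated trials) comes from the description.

```python
import hashlib
import random
import string


def random_unique_profiles(count, rng, length=32):
    """Generate `count` distinct random strings standing in for profiles."""
    alphabet = string.ascii_letters + string.digits  # assumed alphabet
    profiles = set()
    while len(profiles) < count:
        profiles.add("".join(rng.choices(alphabet, k=length)))
    return list(profiles)


def count_collisions(profiles, hex_len):
    """Count profiles whose truncated MD5 integer duplicates an earlier one."""
    hashes = {
        int(hashlib.md5(p.encode("utf-8")).hexdigest()[:hex_len], 16)
        for p in profiles
    }
    return len(profiles) - len(hashes)


def run_trials(num_profiles, hex_len, trials, seed=0):
    """Repeat the experiment and return the collision count of each trial."""
    rng = random.Random(seed)
    return [
        count_collisions(random_unique_profiles(num_profiles, rng), hex_len)
        for _ in range(trials)
    ]


if __name__ == "__main__":
    # Small illustrative run; the analysis above used 1000 trials per setting.
    print(run_trials(num_profiles=1000, hex_len=8, trials=10))
```

The per-trial collision counts returned by `run_trials` are the kind of data a KDE plot could then summarize.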

All of these collision rates are pretty low, but we can see that collisions are possible at O(10000) profiles with a hash length of 8. A hash length of 10 makes collisions negligible up to ~50,000 profiles, and a length of 11 up to ~200,000 profiles. Although these are estimates, bumping up the hash length makes sense now that we can read more files into Thicket.
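
As a rough cross-check of these estimates, the standard birthday-problem approximation can be evaluated directly. This is a sketch that assumes truncated MD5 values are uniformly distributed over 16^n possibilities; it is not part of the PR, and the function names are hypothetical.

```python
import math


def expected_colliding_pairs(num_profiles: int, hex_len: int) -> float:
    """Expected number of colliding pairs among uniformly random hashes."""
    space = 16 ** hex_len
    return num_profiles * (num_profiles - 1) / (2 * space)


def collision_probability(num_profiles: int, hex_len: int) -> float:
    """Birthday approximation: P(at least one collision) ~ 1 - exp(-pairs)."""
    return -math.expm1(-expected_colliding_pairs(num_profiles, hex_len))


for n_profiles, hex_len in ((10_000, 8), (50_000, 10), (200_000, 11)):
    p = collision_probability(n_profiles, hex_len)
    print(f"{n_profiles} profiles, hex length {hex_len}: p = {p:.4%}")
```

Under the uniform-hash assumption, this gives roughly 1.2% for 10,000 profiles at length 8, and roughly 0.11% for both 50,000 profiles at length 10 and 200,000 profiles at length 11.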

@michaelmckinsey1 self-assigned this Nov 12, 2024
@michaelmckinsey1 added the area-thicket, priority-normal, status-ready-for-review, and type-bug labels Nov 12, 2024
@michaelmckinsey1 marked this pull request as ready for review November 12, 2024 23:35
@slabasan merged commit 1e58bc5 into LLNL:develop Nov 13, 2024
4 checks passed