[FEA] Use bloom filters in Parquet reader to filter row groups with equality predicates #17164

Open
mhaseeb123 opened this issue Oct 24, 2024 · 0 comments
Labels: cuco (cuCollections related issue), cuIO (cuIO issue), feature request (New feature or request), improvement (Improvement / enhancement to an existing function), libcudf (Affects libcudf C++/CUDA code)

mhaseeb123 (Member) commented Oct 24, 2024

Is your feature request related to a problem? Please describe.
In the Parquet reader, we can use cuco::bloom_filter_ref with a custom cuco::bloom_filter_policy to filter row groups when the filter expression contains an equality predicate. A row group whose bloom filter reports that the literal is definitely absent can be skipped, potentially reducing I/O.

The custom cuco::bloom_filter_policy would need to implement Arrow's logic for generating the bit pattern and for selecting the bloom filter block for a given key. The same policy could also be used in the future to write our own bloom filters to Parquet on the writer side.
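As a rough illustration of the logic such a policy has to reproduce, here is a minimal host-side sketch of the split-block bloom filter described in the Parquet format specification, which Arrow follows. It is not the cuco or Arrow implementation: the names `make_mask`, `block_index`, and `SplitBlockBloomFilter` are hypothetical, and the key's 64-bit hash (xxHash64 with seed 0 in the specification) is assumed to be computed elsewhere and passed in.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Salt constants from the Parquet split-block bloom filter specification.
constexpr std::array<std::uint32_t, 8> kSalt = {
  0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
  0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

// One filter block: eight 32-bit words (256 bits).
using Block = std::array<std::uint32_t, 8>;

// Bit pattern: exactly one bit set in each of the eight words, derived from
// the lower 32 bits of the key's hash.
inline Block make_mask(std::uint32_t key)
{
  Block mask{};
  for (std::size_t i = 0; i < mask.size(); ++i) {
    std::uint32_t const y = key * kSalt[i];   // wraps modulo 2^32
    mask[i] = std::uint32_t{1} << (y >> 27);  // top 5 bits select the bit
  }
  return mask;
}

// Block selection: the upper 32 bits of the hash scaled into [0, num_blocks).
inline std::size_t block_index(std::uint64_t hash, std::size_t num_blocks)
{
  return static_cast<std::size_t>(((hash >> 32) * static_cast<std::uint64_t>(num_blocks)) >> 32);
}

// Hypothetical host-side filter; takes precomputed 64-bit hashes.
struct SplitBlockBloomFilter {
  std::vector<Block> blocks;

  explicit SplitBlockBloomFilter(std::size_t num_blocks) : blocks(num_blocks)
  {
    assert(num_blocks > 0);
  }

  void insert(std::uint64_t hash)
  {
    Block& block     = blocks[block_index(hash, blocks.size())];
    Block const mask = make_mask(static_cast<std::uint32_t>(hash));
    for (std::size_t i = 0; i < block.size(); ++i) { block[i] |= mask[i]; }
  }

  // Returns false only if the key was definitely never inserted.
  bool may_contain(std::uint64_t hash) const
  {
    Block const& block = blocks[block_index(hash, blocks.size())];
    Block const mask   = make_mask(static_cast<std::uint32_t>(hash));
    for (std::size_t i = 0; i < block.size(); ++i) {
      if ((block[i] & mask[i]) == 0) { return false; }
    }
    return true;
  }
};
```

Because both the block index and the mask depend only on the hash, a reader and a writer that share this logic produce and interpret the same bitset, which is what makes bitwise validation against Arrow possible.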

Describe the solution you'd like
Use cuco::bloom_filter with a custom cuco::bloom_filter_policy to implement Arrow's bloom filter (BF) logic in the Parquet reader to filter row groups.

Additional context
The 1:1 Arrow BF policy may be implemented directly in cuco, or implemented in cudf first and upstreamed later to expose it to the broader RAPIDS ecosystem.

Associated Subtasks

| Task | PRs | Notes |
| --- | --- | --- |
| Implement a cuco::bloom_filter_policy to mimic Arrow BF policy | NVIDIA/cuCollections#625 | NVIDIA/cuCollections#633 adds bitset validation against Arrow impl |
| Add support to read and deserialize BF bitset from Parquet files | #17289 | NVIDIA/cuCollections#642 and ✅ #17393 to support cudf types in Bloom Filter |
| Use cuco::bloom_filter with the read BF bitset and policy in Parquet reader<br>* check min/max stats and bloom filter simultaneously to prune column chunks<br>* identify which columns have equality conditions<br>* read the bloom filters only for the relevant column chunks | #17289 | #17587 simplifies Stats and Bloomfilter AST expression transformers using ast::tree<br>NVIDIA/cuCollections#654 updates arrow_filter_policy to not rely on xxhash64's member types to be consistent with STL |
| Measure number of filtered row groups and return as a part of table_with_metadata | #17594 | #17587 simplifies AST expression converter using ast::tree<br>rapidsai/rapids-cmake#735 bumps cuco to include changes from NVIDIA/cuCollections#654 |

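
The pruning and measurement subtasks above could be sketched roughly as follows. This is a simplified host-side illustration, not the libcudf implementation: `ColumnChunkInfo`, `PruneResult`, `prune_row_groups`, and the counter names are hypothetical, the column is assumed to hold 64-bit integers, and the bloom filter is abstracted as a callback that returns false only when the literal is definitely absent.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical metadata for the predicate column's chunk in one row group.
struct ColumnChunkInfo {
  std::int64_t min_value;
  std::int64_t max_value;
  // Empty when the chunk has no bloom filter.
  std::function<bool(std::int64_t)> bloom_may_contain;
};

// Hypothetical result: surviving row groups plus counters that a reader
// could surface to the caller alongside the table.
struct PruneResult {
  std::vector<std::size_t> surviving_row_groups;
  std::size_t num_filtered_by_stats = 0;
  std::size_t num_filtered_by_bloom = 0;
};

// Prune row groups for the predicate `col == literal` in a single pass,
// consulting min/max stats first and the bloom filter second.
inline PruneResult prune_row_groups(std::vector<ColumnChunkInfo> const& chunks,
                                    std::int64_t literal)
{
  PruneResult result;
  for (std::size_t rg = 0; rg < chunks.size(); ++rg) {
    ColumnChunkInfo const& chunk = chunks[rg];
    assert(chunk.min_value <= chunk.max_value);
    if (literal < chunk.min_value || literal > chunk.max_value) {
      ++result.num_filtered_by_stats;  // literal outside [min, max]
      continue;
    }
    if (chunk.bloom_may_contain && !chunk.bloom_may_contain(literal)) {
      ++result.num_filtered_by_bloom;  // bloom filter says definitely absent
      continue;
    }
    result.surviving_row_groups.push_back(rg);  // must be read and filtered
  }
  return result;
}
```

Checking stats before the bloom filter means a filter is consulted only for chunks that the cheaper min/max test could not rule out, which matches the goal of reading bloom filters only where they are relevant.
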
mhaseeb123 added the cuco, cuIO, feature request, improvement, and libcudf labels Oct 24, 2024
sleeepyjack pushed a commit to NVIDIA/cuCollections that referenced this issue Oct 30, 2024
This PR adds a new Bloom Filter policy implementing the Arrow BF
algorithm. This PR is a part of
rapidsai/cudf#17164. A follow-up PR will add
tests for bitwise validation of bloom filter using arrow policy.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Yunsong Wang <[email protected]>
sleeepyjack pushed a commit to NVIDIA/cuCollections that referenced this issue Nov 1, 2024
This PR adds tests to validate the bitset produced by inserting specific keys
into a `cuco::bloom_filter` with `cuco::arrow_filter_policy` against the
one generated by inserting the same keys into the Arrow implementation.

Related to #625. Part of rapidsai/cudf#17164.
Reference bitset gen with arrow here: https://godbolt.org/z/ebdddezbP

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
mhaseeb123 self-assigned this Dec 4, 2024
mhaseeb123 moved this to In progress in libcudf Dec 4, 2024
rapids-bot bot pushed a commit that referenced this issue Dec 20, 2024
…st::tree` (#17587)

This PR simplifies the StatsAST expression transformer in Parquet reader's predicate pushdown using `ast::tree` from (#17156). 

This PR is a follow up to @bdice's comment at #17289 (comment). Similar changes for the `BloomfilterAST` expression converter have been incorporated in the PR #17289.

Related to #17164

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #17587
rapids-bot bot pushed a commit that referenced this issue Jan 14, 2025
…s using them (#17289)

This PR adds support to read bloom filters from Parquet files and use them to filter row groups based on predicates of the form `col == literal`, if provided.

Related to #17164

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Karthikeyan (https://github.com/karthikeyann)
  - Bradley Dice (https://github.com/bdice)

URL: #17289
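
To make the `col == literal` matching concrete, the sketch below walks a toy expression tree and collects the (column, literal) pairs that would need a bloom filter lookup. It is only an illustration under simplifying assumptions: `Expr`, `Kind`, `Op`, and `collect_equalities` are hypothetical stand-ins and not the cudf AST types, and literals are limited to 64-bit integers.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

enum class Kind { Column, Literal, Operation };
enum class Op { Equal, Less, And, Or };

// Hypothetical expression node: a column reference, a literal, or a binary
// operation over two child expressions.
struct Expr {
  Kind kind = Kind::Literal;
  std::string column;      // used when kind == Kind::Column
  std::int64_t value = 0;  // used when kind == Kind::Literal
  Op op = Op::And;         // used when kind == Kind::Operation
  std::unique_ptr<Expr> lhs;
  std::unique_ptr<Expr> rhs;
};

using ExprPtr = std::unique_ptr<Expr>;

inline ExprPtr col(std::string name)
{
  auto e    = std::make_unique<Expr>();
  e->kind   = Kind::Column;
  e->column = std::move(name);
  return e;
}

inline ExprPtr lit(std::int64_t v)
{
  auto e   = std::make_unique<Expr>();
  e->kind  = Kind::Literal;
  e->value = v;
  return e;
}

inline ExprPtr binary(Op o, ExprPtr l, ExprPtr r)
{
  auto e  = std::make_unique<Expr>();
  e->kind = Kind::Operation;
  e->op   = o;
  e->lhs  = std::move(l);
  e->rhs  = std::move(r);
  return e;
}

using Equality = std::pair<std::string, std::int64_t>;

// Collect every `column == literal` leaf; only these columns need their
// bloom filters read.
inline void collect_equalities(Expr const& e, std::vector<Equality>& out)
{
  if (e.kind != Kind::Operation) { return; }
  assert(e.lhs && e.rhs);
  if (e.op == Op::Equal) {
    if (e.lhs->kind == Kind::Column && e.rhs->kind == Kind::Literal) {
      out.emplace_back(e.lhs->column, e.rhs->value);
    }
    return;
  }
  collect_equalities(*e.lhs, out);
  collect_equalities(*e.rhs, out);
}
```

For the expression `(a == 5) AND ((b < 3) OR (c == 7))`, this collects the pairs for `a` and `c`, so no bloom filter would need to be read for `b`.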