[FEA] Use bloom filters in Parquet reader to filter row groups with equality predicates #17164
Labels
cuco
cuCollections related issue
cuIO
cuIO issue
feature request
New feature or request
improvement
Improvement / enhancement to an existing function
libcudf
Affects libcudf (C++/CUDA) code.
Is your feature request related to a problem? Please describe.
In the Parquet reader, we can use cuco::bloom_filter_ref with a custom cuco::bloom_filter_policy to filter row groups when we have an equality predicate. This would allow us to potentially reduce I/O.
The custom cuco::bloom_filter_policy would need to implement Arrow's logic for generating the bit pattern and selecting the bloom filter block for a given key. It would also be used to write our own bloom filters to Parquet (on the writer's side) in the future.
Describe the solution you'd like
Use cuco::bloom_filter with a custom cuco::bloom_filter_policy to implement Arrow's BF logic in the Parquet reader to filter row groups.
Additional context
The 1:1 Arrow BF policy may be implemented directly in cuco or upstreamed later on from cudf for exposure to broader RAPIDS.
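For reference, the block selection and bit-pattern generation that such a policy would need to reproduce can be sketched on the host as below. This is a minimal sketch based on the split-block bloom filter layout in the Parquet format specification (256-bit blocks of eight 32-bit words, eight salt constants); it is not the cuco or libcudf implementation. The names sbbf_block, make_mask, block_index, sbbf_insert and sbbf_may_contain are hypothetical, and the 64-bit hash is assumed to be computed elsewhere (the specification uses xxHash64 with seed 0).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// One 256-bit block of a split-block bloom filter: 8 x 32-bit words.
using sbbf_block = std::array<std::uint32_t, 8>;

// Salt constants from the Parquet bloom filter specification.
constexpr std::array<std::uint32_t, 8> kSalt = {
  0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
  0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

// Bit pattern: one bit per word, chosen from the low 32 bits of the hash.
inline sbbf_block make_mask(std::uint32_t x)
{
  sbbf_block mask{};
  for (std::size_t i = 0; i < mask.size(); ++i) {
    std::uint32_t const y = x * kSalt[i];     // wraps modulo 2^32
    mask[i] = std::uint32_t{1} << (y >> 27);  // top 5 bits pick one of 32 bits
  }
  return mask;
}

// Block selection: scale the high 32 bits of the hash into [0, num_blocks).
inline std::size_t block_index(std::uint64_t hash, std::size_t num_blocks)
{
  return static_cast<std::size_t>(
    ((hash >> 32) * static_cast<std::uint64_t>(num_blocks)) >> 32);
}

// Writer side: set the key's bit pattern in its block.
inline void sbbf_insert(std::vector<sbbf_block>& filter, std::uint64_t hash)
{
  auto& block     = filter[block_index(hash, filter.size())];
  auto const mask = make_mask(static_cast<std::uint32_t>(hash));
  for (std::size_t i = 0; i < block.size(); ++i) { block[i] |= mask[i]; }
}

// Reader side: false means the key is definitely absent;
// true means the key may be present (false positives are possible).
inline bool sbbf_may_contain(std::vector<sbbf_block> const& filter, std::uint64_t hash)
{
  auto const& block = filter[block_index(hash, filter.size())];
  auto const mask   = make_mask(static_cast<std::uint32_t>(hash));
  for (std::size_t i = 0; i < block.size(); ++i) {
    if ((block[i] & mask[i]) == 0) { return false; }
  }
  return true;
}
```

A real policy would run this per key on the device and operate directly on the bitset read from the file; the sketch only illustrates the arithmetic that has to match Arrow bit for bit.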
Associated Subtasks
cuco::bloom_filter_policy to mimic Arrow BF policy
cuco::bloom_filter with the read BF bitset and policy in Parquet reader
* check min/max stats and bloom filter simultaneously to prune column chunks
* identify which columns have equality conditions
* read the bloom filters only for the relevant column chunks
✅ NVIDIA/cuCollections#654 updates arrow_filter_policy to not rely on xxhash64's member types to be consistent with STL
table_with_metadata
ast::tree
✅ rapidsai/rapids-cmake#735 bumps cuco to include changes from NVIDIA/cuCollections#654
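The reader-side subtasks (identify equality columns, consult only the relevant column chunks, and check min/max stats together with the bloom filter) could be combined roughly as in the host-side sketch below. All types and names here (column_chunk_info, equality_predicate, row_group_may_match, surviving_row_groups) are hypothetical and are not libcudf APIs. The sketch assumes a conjunction of equality predicates over int64 columns and models the bloom filter probe as an optional callback.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical per-column-chunk metadata for one row group.
struct column_chunk_info {
  std::int64_t min_value;  // min statistic
  std::int64_t max_value;  // max statistic
  // Bloom filter probe; left empty when the chunk has no bloom filter.
  std::function<bool(std::int64_t)> bloom_may_contain;
};

// Hypothetical equality predicate: column == literal.
struct equality_predicate {
  std::size_t column_index;
  std::int64_t literal;
};

using row_group_info = std::vector<column_chunk_info>;

// Returns false only when some predicate provably cannot match in this row group.
inline bool row_group_may_match(row_group_info const& rg,
                                std::vector<equality_predicate> const& preds)
{
  for (auto const& p : preds) {
    // Only chunks of columns that appear in an equality predicate are inspected.
    auto const& chunk = rg[p.column_index];
    // Cheap check first: the literal must lie within [min, max].
    if (p.literal < chunk.min_value || p.literal > chunk.max_value) { return false; }
    // Then the bloom filter, if present: a negative answer is definitive.
    if (chunk.bloom_may_contain && !chunk.bloom_may_contain(p.literal)) { return false; }
  }
  return true;
}

// Indices of row groups that still need to be read.
inline std::vector<std::size_t> surviving_row_groups(
  std::vector<row_group_info> const& row_groups,
  std::vector<equality_predicate> const& preds)
{
  std::vector<std::size_t> out;
  for (std::size_t i = 0; i < row_groups.size(); ++i) {
    if (row_group_may_match(row_groups[i], preds)) { out.push_back(i); }
  }
  return out;
}
```

Checking statistics before the bloom filter keeps the filter read optional: a row group already pruned by min/max never needs its bloom filter bytes fetched.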