Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support reading bloom filters from Parquet files and filter row groups using them #17289

Merged
Merged
Changes from 1 commit
Commits
Show all changes
112 commits
Select commit Hold shift + click to select a range
95fe8e8
Initial stuff for reading bloom filter from PQ files
mhaseeb123 Nov 9, 2024
4f0e7ab
Minor bug fix
mhaseeb123 Nov 9, 2024
48a50c4
Apply style fix
mhaseeb123 Nov 9, 2024
9a85d08
Merge branch 'branch-24.12' into fea/extract-pq-bloom-filter-data
mhaseeb123 Nov 14, 2024
b71cf9b
Merge branch 'branch-24.12' into fea/extract-pq-bloom-filter-data
mhaseeb123 Nov 15, 2024
68be24f
Some updates
mhaseeb123 Nov 16, 2024
f848251
Move contents to a separate file
mhaseeb123 Nov 16, 2024
0b65233
Revert erroneous changes
mhaseeb123 Nov 16, 2024
cf7d762
Style and doc fix
mhaseeb123 Nov 16, 2024
81efad2
Get equality predicate col indices
mhaseeb123 Nov 19, 2024
088377b
Enable `arrow_filter_policy` and `span` types in bloom filter.
mhaseeb123 Nov 20, 2024
0435bff
Merge branch 'branch-24.12' into fea/extract-pq-bloom-filter-data
mhaseeb123 Nov 20, 2024
3dff590
Successfully search bloom filter
mhaseeb123 Nov 21, 2024
71e1d33
style fix
mhaseeb123 Nov 21, 2024
aa65a2b
Code cleanup
mhaseeb123 Nov 22, 2024
c52821b
add tests
mhaseeb123 Nov 25, 2024
3a20a98
Initial stuff for reading bloom filter from PQ files
mhaseeb123 Nov 9, 2024
d67e4b5
Minor bug fix
mhaseeb123 Nov 9, 2024
10471d4
Apply style fix
mhaseeb123 Nov 9, 2024
1e12662
Some updates
mhaseeb123 Nov 16, 2024
ee7217c
Move contents to a separate file
mhaseeb123 Nov 16, 2024
f8e6159
Revert erroneous changes
mhaseeb123 Nov 16, 2024
1886cab
Style and doc fix
mhaseeb123 Nov 16, 2024
be228b3
Get equality predicate col indices
mhaseeb123 Nov 19, 2024
aaf355e
Enable `arrow_filter_policy` and `span` types in bloom filter.
mhaseeb123 Nov 20, 2024
e92324e
Successfully search bloom filter
mhaseeb123 Nov 21, 2024
0b1719d
style fix
mhaseeb123 Nov 21, 2024
ef3a262
Code cleanup
mhaseeb123 Nov 22, 2024
051be2d
add tests
mhaseeb123 Nov 25, 2024
a12c90e
Merge branch 'fea/extract-pq-bloom-filter-data' of https://github.com…
mhaseeb123 Nov 25, 2024
fb55c3f
Major cleanups
mhaseeb123 Nov 26, 2024
b477d2d
Significant code refactoring
mhaseeb123 Nov 26, 2024
f9f1746
minor style fix
mhaseeb123 Nov 26, 2024
bad484f
refactoring
mhaseeb123 Nov 26, 2024
ce09d43
Minor refactoring
mhaseeb123 Nov 26, 2024
dddee6c
Minor improvements
mhaseeb123 Nov 26, 2024
0cfeb80
Add gtest
mhaseeb123 Nov 26, 2024
9137585
Improvements
mhaseeb123 Nov 26, 2024
77152b4
Support int96 in bloom filter
mhaseeb123 Nov 27, 2024
3984291
Cleanup
mhaseeb123 Nov 27, 2024
9a39aa4
Minor improvements
mhaseeb123 Nov 27, 2024
1def801
Fix minor bug
mhaseeb123 Nov 27, 2024
6edc248
MInor bug fixing
mhaseeb123 Nov 28, 2024
2925f1e
Add python tests
mhaseeb123 Nov 28, 2024
efc6ec0
Correct parquet files
mhaseeb123 Nov 28, 2024
df84aca
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 Nov 28, 2024
a2fa784
minor spelling fix
mhaseeb123 Dec 2, 2024
1f5da37
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 Dec 2, 2024
fa0cec8
Apply suggestions from code review
mhaseeb123 Dec 2, 2024
7a309c6
Minor bug fix
mhaseeb123 Dec 2, 2024
bcc68c0
Convert to enum class
mhaseeb123 Dec 2, 2024
2dce9b1
Apply suggestion from code review
mhaseeb123 Dec 3, 2024
e03bea0
Suggestions from code reviews
mhaseeb123 Dec 3, 2024
059a9d8
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 Dec 3, 2024
4b0b5ed
Apply suggestions from code reviews
mhaseeb123 Dec 4, 2024
c1256b1
Refactor into single table for cudf::compute_column
mhaseeb123 Dec 4, 2024
88bf491
Minor, add const
mhaseeb123 Dec 4, 2024
9ca42c6
Move bloom filter test to parquet test
mhaseeb123 Dec 4, 2024
84c24c1
Minor updates
mhaseeb123 Dec 4, 2024
0c05031
Minor
mhaseeb123 Dec 4, 2024
09560c5
Logical and between bloom filter and stats
mhaseeb123 Dec 4, 2024
21f4412
Revert merging converted AST tables.
mhaseeb123 Dec 4, 2024
442de80
Revert an extra eol
mhaseeb123 Dec 4, 2024
f7952d4
Revert extra eol
mhaseeb123 Dec 4, 2024
4d0c570
Read bloom filter data sync
mhaseeb123 Dec 4, 2024
67c6247
Update cpp/src/io/parquet/bloom_filter_reader.cu
mhaseeb123 Dec 4, 2024
40c80b7
strong type for int96 timestamp
mhaseeb123 Dec 4, 2024
690c165
Merge branch 'fea/extract-pq-bloom-filter-data' of https://github.com…
mhaseeb123 Dec 4, 2024
c5f8150
Remove unused header
mhaseeb123 Dec 4, 2024
7a21a6e
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 Dec 6, 2024
4465277
Apply suggestions from code review
mhaseeb123 Dec 9, 2024
3888732
Apply suggestions
mhaseeb123 Dec 9, 2024
8bc8927
Update cpp/src/io/parquet/reader_impl_helpers.hpp
mhaseeb123 Dec 9, 2024
d719e65
Update cpp/src/io/parquet/reader_impl_helpers.hpp
mhaseeb123 Dec 9, 2024
03cf07f
Move equality_literals instead of copying
mhaseeb123 Dec 9, 2024
de94168
Merge branch 'fea/extract-pq-bloom-filter-data' of https://github.com…
mhaseeb123 Dec 9, 2024
c92d326
Minor
mhaseeb123 Dec 9, 2024
82083f9
Use spans instead of passing around vectors
mhaseeb123 Dec 10, 2024
6918a40
Minor
mhaseeb123 Dec 10, 2024
85cdc00
Make `get_equality_literals()` safe again
mhaseeb123 Dec 10, 2024
aa1a909
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 Dec 10, 2024
fdf8fc8
Update counting_iterator
mhaseeb123 Dec 10, 2024
10a8f5a
Minor changes
mhaseeb123 Dec 10, 2024
d46504f
Minor
mhaseeb123 Dec 10, 2024
c94ce86
Sync arrow filter policy with cuco
mhaseeb123 Dec 10, 2024
69aa685
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 Dec 11, 2024
d95a178
Address partial reviewer comments and fix new logger header
mhaseeb123 Dec 12, 2024
840c6e7
Revert to direct dtype check until I find a way to get scalar from li…
mhaseeb123 Dec 12, 2024
9d8c071
Create a dummy scalar of type T and compare with dtype
mhaseeb123 Dec 12, 2024
3b8aea0
Use a temporary scalar
mhaseeb123 Dec 12, 2024
0c859db
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 Dec 12, 2024
c385537
Recalculate `total_row_groups` in apply_bloom_filter
mhaseeb123 Dec 13, 2024
3693ad1
Simplify bloom filter expression with ast::tree and handle non-equali…
mhaseeb123 Dec 13, 2024
c2de9fb
Apply suggestions from code review
mhaseeb123 Dec 13, 2024
344851c
Minor optimization: Set `have_bloom_filters` while populating `bloom_…
mhaseeb123 Dec 14, 2024
96fb7c2
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 Dec 14, 2024
4522afa
Add pytest to test logical or with non == expr
mhaseeb123 Dec 16, 2024
ed66593
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 Dec 17, 2024
f509148
Remove temporary arrow_filter_policy.cuh and use cuco directly.
mhaseeb123 Dec 17, 2024
c8cd646
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 Dec 17, 2024
4194d30
MInor style fix
mhaseeb123 Dec 17, 2024
8b7baff
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 Dec 18, 2024
2fce902
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 Jan 6, 2025
7b0735d
Update copyright year to 2025
mhaseeb123 Jan 6, 2025
4807d1a
Apply suggestions from code review
mhaseeb123 Jan 13, 2025
39b3412
Minor refactor, early exit if no bloom filters
mhaseeb123 Jan 13, 2025
a2ede06
Merge branch 'fea/extract-pq-bloom-filter-data' of https://github.com…
mhaseeb123 Jan 13, 2025
86f0c12
Remove use of unnecessary mr parameter
mhaseeb123 Jan 13, 2025
bd9aa04
Disable boolean types in bloom filters.
mhaseeb123 Jan 13, 2025
cb0b844
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 Jan 13, 2025
fa67675
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 Jan 13, 2025
478a66f
Merge branch 'branch-25.02' into fea/extract-pq-bloom-filter-data
mhaseeb123 Jan 14, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Minor changes
  • Loading branch information
mhaseeb123 committed Dec 10, 2024
commit 10a8f5a0b2d45761401551125f89f76bbba6bb58
3 changes: 1 addition & 2 deletions cpp/include/cudf/hashing/detail/xxhash_64.cuh
Original file line number Diff line number Diff line change
@@ -29,8 +29,7 @@ namespace cudf::hashing::detail {

template <typename Key>
struct XXHash_64 {
using argument_type = Key;
using result_type = std::uint64_t;
using result_type = std::uint64_t;

CUDF_HOST_DEVICE constexpr XXHash_64(uint64_t seed = cudf::DEFAULT_HASH_SEED) : _impl{seed} {}

16 changes: 7 additions & 9 deletions cpp/src/io/parquet/arrow_filter_policy.cuh
Original file line number Diff line number Diff line change
@@ -30,8 +30,6 @@ namespace cuco {
* @brief A policy that defines how Arrow Block-Split Bloom Filter generates and stores a key's
* fingerprint.
*
* @note: This file is a part of cuCollections. Copied here until we get a cuco bump for cudf.
*
* Reference:
* https://github.com/apache/arrow/blob/be1dcdb96b030639c0b56955c4c62f9d6b03f473/cpp/src/parquet/bloom_filter.cc#L219-L230
*
@@ -41,14 +39,14 @@ namespace cuco {
* void bulk_insert_and_eval_arrow_policy_bloom_filter(device_vector<KeyType> const& positive_keys,
* device_vector<KeyType> const& negative_keys)
* {
* using policy_type = cuco::arrow_filter_policy<key_type>;
* using policy_type = cuco::arrow_filter_policy<KeyType, cuco::xxhash_64>;
*
* // Warn or throw if the number of filter blocks is greater than maximum used by Arrow policy.
* static_assert(NUM_FILTER_BLOCKS <= policy_type::max_filter_blocks, "NUM_FILTER_BLOCKS must be
* in range: [1, 4194304]");
*
* // Create a bloom filter with Arrow policy
* cuco::bloom_filter<key_type, cuco::extent<size_t>,
* cuco::bloom_filter<KeyType, cuco::extent<size_t>,
* cuda::thread_scope_device, policy_type> filter{NUM_FILTER_BLOCKS};
*
* // Add positive keys to the bloom filter
@@ -79,15 +77,15 @@ namespace cuco {
* @endcode
*
* @tparam Key The type of the values to generate a fingerprint for.
* @tparam XXHash64 64-bit XXHash hasher implementation for fingerprint generation.
*/
template <class Key, class Hash>
template <class Key, template <typename> class XXHash64>
class arrow_filter_policy {
public:
using hasher = Hash; ///< Hash function for Arrow bloom filter policy
using hasher = XXHash64<Key>; ///< 64-bit XXHash hasher for Arrow bloom filter policy
using word_type = std::uint32_t; ///< uint32_t for Arrow bloom filter policy
using hash_argument_type = typename hasher::argument_type; ///< Hash function input type
using hash_result_type = decltype(std::declval<hasher>()(
std::declval<hash_argument_type>())); ///< hash function output type
using hash_argument_type = Key; ///< Hash function input type
using hash_result_type = std::uint64_t; ///< hash function output type

static constexpr uint32_t bits_set_per_block = 8; ///< hardcoded bits set per Arrow filter block
static constexpr uint32_t words_per_block = 8; ///< hardcoded words per Arrow filter block
Loading