Support reading bloom filters from Parquet files and filter row groups using them #17289

Merged

Conversation

Member

@mhaseeb123 mhaseeb123 commented Nov 9, 2024

Description

This PR adds support for reading bloom filters from Parquet files and using them to filter row groups based on `col == literal`-style predicates, if provided.

Related to #17164
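For context, here is a minimal sketch (not code from this PR) of how a caller might supply such a predicate to the Parquet reader so that bloom filters can be consulted. It assumes the existing `parquet_reader_options` filter API; the file name, column index, and literal value are illustrative.

```cpp
#include <cudf/ast/expressions.hpp>
#include <cudf/io/parquet.hpp>
#include <cudf/scalar/scalar.hpp>

// Build the predicate `column 0 == 42` (column index and value are illustrative).
auto literal_value = cudf::numeric_scalar<int32_t>(42);
auto literal       = cudf::ast::literal(literal_value);
auto col_ref       = cudf::ast::column_reference(0);
auto predicate     = cudf::ast::operation(cudf::ast::ast_operator::EQUAL, col_ref, literal);

// Row groups whose statistics or bloom filters show the predicate cannot match
// can then be skipped by the reader.
auto options = cudf::io::parquet_reader_options::builder(
                 cudf::io::source_info{"example.parquet"})
                 .filter(predicate)
                 .build();
auto result = cudf::io::read_parquet(options);
```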

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Nov 9, 2024
@mhaseeb123 mhaseeb123 added 2 - In Progress Currently a work in progress cuIO cuIO issue cuco cuCollections related issue feature request New feature or request non-breaking Non-breaking change labels Nov 9, 2024
@mhaseeb123 mhaseeb123 removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Dec 18, 2024
rapids-bot bot pushed a commit that referenced this pull request Dec 20, 2024
…st::tree` (#17587)

This PR simplifies the StatsAST expression transformer in Parquet reader's predicate pushdown using `ast::tree` from (#17156). 

This PR is a follow-up to @bdice's comment at #17289 (comment). Similar changes for the `BloomfilterAST` expression converter have been incorporated in PR #17289.

Related to #17164

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #17587
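As an aside, here is a minimal sketch of the owning-tree pattern that #17587 adopts, assuming the `push` interface of `cudf::ast::tree` from #17156; the column index and literal are illustrative.

```cpp
#include <cudf/ast/expressions.hpp>
#include <cudf/scalar/scalar.hpp>

// Build `col0 == 5` while letting the tree own every sub-expression, instead of
// keeping separate containers of columns, literals, and operations alive.
cudf::numeric_scalar<int32_t> five{5};
cudf::ast::tree expr_tree;
auto const& col  = expr_tree.push(cudf::ast::column_reference(0));
auto const& lit  = expr_tree.push(cudf::ast::literal(five));
auto const& pred = expr_tree.push(
  cudf::ast::operation(cudf::ast::ast_operator::EQUAL, col, lit));
// `pred` (or expr_tree.back()) can then be handed to the expression converter.
```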
Contributor

@karthikeyann karthikeyann left a comment

LGTM 👍
Great work!

@mhaseeb123
Member Author

/ok to test

Contributor

@bdice bdice left a comment

I have a few comments (in response to the request for another round of feedback), but nothing is serious enough to block this PR further. I know it's been a long time in the works, and congrats on implementing such a complex feature!

cpp/src/io/parquet/bloom_filter_reader.cu
using policy_type = cuco::arrow_filter_policy<key_type, cudf::hashing::detail::XXHash_64>;
using word_type = typename policy_type::word_type;

// List, Struct, Dictionary types are not supported
Contributor

Are these types supported by other readers/writers? I would love to know if there is a specification for hashing compound types somewhere.


cpp/src/io/parquet/bloom_filter_reader.cu
CUDF_EXPECTS(total_row_groups <= std::numeric_limits<cudf::size_type>::max(),
"Total number of row groups exceed the size_type's limit");

auto mr = cudf::get_current_device_resource_ref();
Contributor

Are we using memory resources properly?

  1. Let's define mr closer to where it is used
  2. This returns data to the caller that is allocated with a memory resource that wasn't passed in. Do we need to accept an mr parameter in this function and use that for the data returned to the caller? I can't recall how this rule applies to detail functions if the memory is not returned to the user but only an internal caller. https://github.com/rapidsai/cudf/blob/branch-25.02/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md#output-memory_

Contributor

After submitting this review, I scrolled a little further in the link I posted. It does clarify that detail APIs should accept an mr parameter and use that for data returned to the caller.

This rule automatically applies to all detail APIs that allocate memory. Any detail API may be called by any public API, and therefore could be allocating memory that is returned to the user. To support such use cases, all detail APIs allocating memory resources should accept an mr parameter. Callers are responsible for either passing through a provided mr or cudf::get_current_device_resource_ref() as needed.
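For illustration, a minimal sketch of that convention; the function name and signature here are hypothetical, not from this PR. Data returned to the caller is allocated with the provided `mr`, while temporaries use the current device resource.

```cpp
#include <cudf/utilities/memory_resource.hpp>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/resource_ref.hpp>

#include <cstddef>
#include <cstdint>

// Hypothetical detail function illustrating the rule quoted above.
rmm::device_uvector<bool> filter_row_groups_detail(std::size_t num_row_groups,
                                                   rmm::cuda_stream_view stream,
                                                   rmm::device_async_resource_ref mr)
{
  // Temporary scratch space: freed before returning, so the current device
  // resource is appropriate here.
  auto scratch = rmm::device_uvector<std::uint32_t>(
    num_row_groups, stream, cudf::get_current_device_resource_ref());

  // Result handed back to the caller: allocated with the provided `mr`.
  auto keep_mask = rmm::device_uvector<bool>(num_row_groups, stream, mr);
  // ... populate keep_mask using scratch ...
  return keep_mask;
}
```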

Member Author

@mhaseeb123 mhaseeb123 Jan 13, 2025

I think we are okay here, as we aren't returning any data allocated with mr from apply_bloom_filter() and its sub-functions; we only allocate temporary buffers with cudf::get_current_device_resource_ref(), and those are automatically destroyed when we return.

Member Author

I have removed the mr input parameter from all functions in this file and replaced it with cudf::get_current_device_resource_ref() to allocate temporary memory where needed.

@revans2
Contributor

revans2 commented Jan 13, 2025

Is there a way to disable this? Or other parts of predicate pushdown?

In parquet-java, which is what Spark uses, there are configs to enable/disable a lot of things at a granular level related to predicate pushdown.

https://github.com/apache/parquet-java/blob/7f77908338192105a5adbfc420a7281d919e8596/parquet-hadoop/src/main/java/org/apache/parquet/ParquetReadOptions.java#L278-L284

I am not saying that we have to implement all of these. Just curious if it is something you are planning to do or not.

@mhaseeb123 mhaseeb123 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs Review Waiting for reviewer to review or respond labels Jan 13, 2025
@mhaseeb123
Member Author

/ok to test

@mhaseeb123
Member Author

Is there a way to disable this? Or other parts of predicate pushdown?

In parquet-java, which is what Spark uses, there are configs to enable/disable a lot of things at a granular level related to predicate pushdown.

https://github.com/apache/parquet-java/blob/7f77908338192105a5adbfc420a7281d919e8596/parquet-hadoop/src/main/java/org/apache/parquet/ParquetReadOptions.java#L278-L284

@revans2 AFAIK, we currently do predicate pushdown with stats and bloom filters whenever an input filter is available; there isn't a particular option that controls it.

I am not saying that we have to implement all of these. Just curious if it is something you are planning to do or not.

Maybe @GregoryKimball can answer this better, but if it's needed, a quick PR adding new options to control these can be added in 25.02.

@revans2
Contributor

revans2 commented Jan 13, 2025

Maybe @GregoryKimball can answer this better, but if it's needed, a quick PR adding new options to control these can be added in 25.02.

Sorry for any confusion. This is not needed right now. I am thinking more about the future, as Spark wants to try to move towards using cuDF for predicate pushdown. Eventually we might want something like this. I just wanted to be sure that I understood the code and its intentions.

@mhaseeb123
Member Author

I am thinking more about the future, as Spark wants to try to move towards using cuDF for predicate pushdown.

In that case, I think it would be trivial to add these options to libcudf's predicate pushdown at any time! 🙂
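For illustration only, here is a hypothetical sketch of what such granular switches could look like on top of the existing reader options. The commented-out setters do not exist in libcudf; `predicate` is a previously built `cudf::ast` filter expression, as in the sketch under the Description above.

```cpp
#include <cudf/io/parquet.hpp>

// Hypothetical: granular predicate-pushdown switches layered on top of
// cudf::io::parquet_reader_options, similar in spirit to parquet-java's
// ParquetReadOptions. None of the commented setters exist in libcudf today.
auto opts = cudf::io::parquet_reader_options::builder(
              cudf::io::source_info{"example.parquet"})
              .filter(predicate)         // existing: filter expression for pushdown
              // .stats_filtering(true)  // hypothetical: row-group stats pushdown
              // .bloom_filtering(true)  // hypothetical: bloom-filter pushdown
              .build();
auto table = cudf::io::read_parquet(opts);
```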

@mhaseeb123
Member Author

CI is failing due to an unrelated cudf-polars test failure caused by fastexcel. The PR should be able to merge once that is resolved.

@mhaseeb123
Member Author

/ok to test

@mhaseeb123
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 41215e2 into rapidsai:branch-25.02 Jan 14, 2025
108 of 109 checks passed
@mhaseeb123 mhaseeb123 deleted the fea/extract-pq-bloom-filter-data branch January 14, 2025 21:06