-
Notifications
You must be signed in to change notification settings - Fork 919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support reading bloom filters from Parquet files and filter row groups using them #17289
Support reading bloom filters from Parquet files and filter row groups using them #17289
Conversation
…st::tree` (#17587) This PR simplifies the StatsAST expression transformer in Parquet reader's predicate pushdown using `ast::tree` from (#17156). This PR is a follow up to @bdice's comment at #17289 (comment). Similar changes for the `BloomfilterAST` expression converter have been incorporated in the PR #17289. Related to #17164 Authors: - Muhammad Haseeb (https://github.com/mhaseeb123) Approvers: - Karthikeyan (https://github.com/karthikeyann) - Vukasin Milovanovic (https://github.com/vuule) - Bradley Dice (https://github.com/bdice) URL: #17587
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
Great work!
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few comments (in response to the request for another round of feedback), but nothing is serious enough to block this PR further. I know it's been a long time in the works, and congrats on implementing such a complex feature!
using policy_type = cuco::arrow_filter_policy<key_type, cudf::hashing::detail::XXHash_64>; | ||
using word_type = typename policy_type::word_type; | ||
|
||
// List, Struct, Dictionary types are not supported |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these types supported by other readers/writers? I would love to know if there is a specification for hashing compound types somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like nested types aren't supported by arrow either. Also reminds me to disable boolean
cols for bloom filters as well.
Edit: All done
CUDF_EXPECTS(total_row_groups <= std::numeric_limits<cudf::size_type>::max(), | ||
"Total number of row groups exceed the size_type's limit"); | ||
|
||
auto mr = cudf::get_current_device_resource_ref(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we using memory resources properly?
- Let's define
mr
closer to where it is used - This returns data to the caller that is allocated with a memory resource that wasn't passed in. Do we need to accept an
mr
parameter in this function and use that for the data returned to the caller? I can't recall how this rule applies to detail functions if the memory is not returned to the user but only an internal caller. https://github.com/rapidsai/cudf/blob/branch-25.02/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md#output-memory_
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After submitting this review, I scrolled a little further in the link I posted. It does clarify that detail APIs should accept an mr
parameter and use that for data returned to the caller.
This rule automatically applies to all detail APIs that allocate memory. Any detail API may be called by any public API, and therefore could be allocating memory that is returned to the user. To support such uses cases, all detail APIs allocating memory resources should accept an
mr
parameter. Callers are responsible for either passing through a providedmr
orcudf::get_current_device_resource_ref()
as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are okay here as we aren't returning any data allocated with mr
in apply_bloom_filter()
and sub-functions, and only allocating temporary buffers with cudf::get_current_device_resource_ref();
which should be automatically destroyed when we return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed the mr
input parameter from all functions in this file and replaced with cudf::get_current_device_resource_ref()
to allocate temp memory where needed
Co-authored-by: Bradley Dice <[email protected]>
…/mhaseeb123/cudf into fea/extract-pq-bloom-filter-data
Is there a way to disable this? Or other parts of predicate push down? in Parquet-java, which is what spark used there are configs to enable/disable lot of things at a granular level related to predicate push down. I am not saying that we have to implement all of these. Just curious if it is something you are planning to do or not. |
/ok to test |
@revans2 AFAIK, currently we do predicate pushdown with stats and bloom filters if we have an input
Maybe @GregoryKimball can better answer this but if it's needed, a quick PR adding new options to control these can be added in 25.02 |
Sorry for any confusion. This is not needed right now. I am thinking more about in the future as Spark wants to try and move towards using CUDF for predicate push down. Eventually we might want something like this. I just wanted to be sure that I understood the code and its intentions. |
In that case, I think it would be trivial to add these options to libcudf's predicate pushdown at any time! 🙂 |
Tests failing due to an unrelated cudf-polars test failing due to |
/ok to test |
/merge |
41215e2
into
rapidsai:branch-25.02
Description
This PR adds support to read bloom filters from Parquet files and use them to filter row groups based on
col == literal
like predicate(s), if provided.Related to #17164
Checklist