Support reading bloom filters from Parquet files and filter row groups using them #17289

Merged

Conversation

Member

@mhaseeb123 mhaseeb123 commented Nov 9, 2024

Description

This PR adds support for reading bloom filters from Parquet files and using them to filter row groups based on `col == literal`-style predicates, if provided.

Related to #17164
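For context, here is a minimal sketch (not code from this PR) of how a caller might supply such a predicate to the Parquet reader so that bloom filters can be consulted. It assumes the existing `parquet_reader_options` filter API; the file name, column index, and literal value are illustrative.

```cpp
#include <cudf/ast/expressions.hpp>
#include <cudf/io/parquet.hpp>
#include <cudf/scalar/scalar.hpp>

// Build the predicate `column 0 == 42` (column index and value are illustrative).
auto literal_value = cudf::numeric_scalar<int32_t>(42);
auto literal       = cudf::ast::literal(literal_value);
auto col_ref       = cudf::ast::column_reference(0);
auto predicate     = cudf::ast::operation(cudf::ast::ast_operator::EQUAL, col_ref, literal);

// Row groups whose statistics or bloom filters show the predicate cannot match
// can then be skipped by the reader.
auto options = cudf::io::parquet_reader_options::builder(
                 cudf::io::source_info{"example.parquet"})
                 .filter(predicate)
                 .build();
auto result = cudf::io::read_parquet(options);
```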

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Nov 9, 2024
@mhaseeb123 mhaseeb123 added 2 - In Progress Currently a work in progress cuIO cuIO issue cuco cuCollections related issue feature request New feature or request non-breaking Non-breaking change labels Nov 9, 2024
@mhaseeb123 mhaseeb123 removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Dec 18, 2024
rapids-bot bot pushed a commit that referenced this pull request Dec 20, 2024
…st::tree` (#17587)

This PR simplifies the StatsAST expression transformer in Parquet reader's predicate pushdown using `ast::tree` from (#17156). 

This PR is a follow-up to @bdice's comment at #17289 (comment). Similar changes for the `BloomfilterAST` expression converter have been incorporated in PR #17289.

Related to #17164

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #17587
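As an aside, here is a minimal sketch of the owning-tree pattern that #17587 adopts, assuming the `push` interface of `cudf::ast::tree` from #17156; the column index and literal are illustrative.

```cpp
#include <cudf/ast/expressions.hpp>
#include <cudf/scalar/scalar.hpp>

// Build `col0 == 5` while letting the tree own every sub-expression, instead of
// keeping separate containers of columns, literals, and operations alive.
cudf::numeric_scalar<int32_t> five{5};
cudf::ast::tree expr_tree;
auto const& col  = expr_tree.push(cudf::ast::column_reference(0));
auto const& lit  = expr_tree.push(cudf::ast::literal(five));
auto const& pred = expr_tree.push(
  cudf::ast::operation(cudf::ast::ast_operator::EQUAL, col, lit));
// `pred` (or expr_tree.back()) can then be handed to the expression converter.
```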
Contributor

@karthikeyann karthikeyann left a comment

LGTM 👍
Great work!

@mhaseeb123
Member Author

/ok to test

Contributor

@bdice bdice left a comment

I have a few comments (in response to the request for another round of feedback), but nothing is serious enough to block this PR further. I know it's been a long time in the works, and congrats on implementing such a complex feature!

cpp/src/io/parquet/bloom_filter_reader.cu
using policy_type = cuco::arrow_filter_policy<key_type, cudf::hashing::detail::XXHash_64>;
using word_type = typename policy_type::word_type;

// List, Struct, Dictionary types are not supported
Contributor

Are these types supported by other readers/writers? I would love to know if there is a specification for hashing compound types somewhere.


cpp/src/io/parquet/bloom_filter_reader.cu
CUDF_EXPECTS(total_row_groups <= std::numeric_limits<cudf::size_type>::max(),
"Total number of row groups exceed the size_type's limit");

auto mr = cudf::get_current_device_resource_ref();
Contributor

Are we using memory resources properly?

  1. Let's define mr closer to where it is used
  2. This returns data to the caller that is allocated with a memory resource that wasn't passed in. Do we need to accept an mr parameter in this function and use that for the data returned to the caller? I can't recall how this rule applies to detail functions if the memory is not returned to the user but only an internal caller. https://github.com/rapidsai/cudf/blob/branch-25.02/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md#output-memory_

Contributor

After submitting this review, I scrolled a little further in the link I posted. It does clarify that detail APIs should accept an mr parameter and use that for data returned to the caller.

This rule automatically applies to all detail APIs that allocate memory. Any detail API may be called by any public API, and therefore could be allocating memory that is returned to the user. To support such use cases, all detail APIs allocating memory resources should accept an mr parameter. Callers are responsible for either passing through a provided mr or cudf::get_current_device_resource_ref() as needed.
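For illustration, a minimal sketch of that convention; the function name and signature here are hypothetical, not from this PR. Data returned to the caller is allocated with the provided `mr`, while temporaries use the current device resource.

```cpp
#include <cudf/utilities/memory_resource.hpp>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/resource_ref.hpp>

#include <cstddef>
#include <cstdint>

// Hypothetical detail function illustrating the rule quoted above.
rmm::device_uvector<bool> filter_row_groups_detail(std::size_t num_row_groups,
                                                   rmm::cuda_stream_view stream,
                                                   rmm::device_async_resource_ref mr)
{
  // Temporary scratch space: freed before returning, so the current device
  // resource is appropriate here.
  auto scratch = rmm::device_uvector<std::uint32_t>(
    num_row_groups, stream, cudf::get_current_device_resource_ref());

  // Result handed back to the caller: allocated with the provided `mr`.
  auto keep_mask = rmm::device_uvector<bool>(num_row_groups, stream, mr);
  // ... populate keep_mask using scratch ...
  return keep_mask;
}
```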

Member Author

@mhaseeb123 mhaseeb123 Jan 13, 2025

I think we are okay here, as we aren't returning any data allocated with mr from apply_bloom_filter() and its sub-functions; we only allocate temporary buffers with cudf::get_current_device_resource_ref(), and those are automatically destroyed when we return.

Member Author

I have removed the mr input parameter from all functions in this file and replaced it with cudf::get_current_device_resource_ref() to allocate temporary memory where needed.

@revans2
Contributor

revans2 commented Jan 13, 2025

Is there a way to disable this? Or other parts of predicate pushdown?

In parquet-java, which is what Spark uses, there are configs to enable/disable a lot of things at a granular level related to predicate pushdown.

https://github.com/apache/parquet-java/blob/7f77908338192105a5adbfc420a7281d919e8596/parquet-hadoop/src/main/java/org/apache/parquet/ParquetReadOptions.java#L278-L284

I am not saying that we have to implement all of these. Just curious if it is something you are planning to do or not.

@mhaseeb123 mhaseeb123 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs Review Waiting for reviewer to review or respond labels Jan 13, 2025
@mhaseeb123
Member Author

/ok to test

@mhaseeb123
Member Author

Is there a way to disable this? Or other parts of predicate pushdown?

In parquet-java, which is what Spark uses, there are configs to enable/disable a lot of things at a granular level related to predicate pushdown.

https://github.com/apache/parquet-java/blob/7f77908338192105a5adbfc420a7281d919e8596/parquet-hadoop/src/main/java/org/apache/parquet/ParquetReadOptions.java#L278-L284

@revans2 AFAIK, we currently do predicate pushdown with stats and bloom filters whenever an input filter is available; there isn't a particular option that controls it.

I am not saying that we have to implement all of these. Just curious if it is something you are planning to do or not.

Maybe @GregoryKimball can answer this better, but if it's needed, a quick PR adding new options to control these can be added in 25.02.

@revans2
Contributor

revans2 commented Jan 13, 2025

Maybe @GregoryKimball can answer this better, but if it's needed, a quick PR adding new options to control these can be added in 25.02.

Sorry for any confusion. This is not needed right now. I am thinking more about the future, as Spark wants to try to move towards using cuDF for predicate pushdown. Eventually we might want something like this. I just wanted to be sure that I understood the code and its intentions.

@mhaseeb123
Member Author

I am thinking more about the future, as Spark wants to try to move towards using cuDF for predicate pushdown.

In that case, I think it would be trivial to add these options to libcudf's predicate pushdown at any time! 🙂
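For illustration only, here is a hypothetical sketch of what such granular switches could look like on top of the existing reader options. The commented-out setters do not exist in libcudf; `predicate` is a previously built `cudf::ast` filter expression, as in the sketch under the Description above.

```cpp
#include <cudf/io/parquet.hpp>

// Hypothetical: granular predicate-pushdown switches layered on top of
// cudf::io::parquet_reader_options, similar in spirit to parquet-java's
// ParquetReadOptions. None of the commented setters exist in libcudf today.
auto opts = cudf::io::parquet_reader_options::builder(
              cudf::io::source_info{"example.parquet"})
              .filter(predicate)         // existing: filter expression for pushdown
              // .stats_filtering(true)  // hypothetical: row-group stats pushdown
              // .bloom_filtering(true)  // hypothetical: bloom-filter pushdown
              .build();
auto table = cudf::io::read_parquet(opts);
```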

@mhaseeb123
Member Author

CI is failing due to an unrelated cudf-polars test failure caused by fastexcel. The PR should be able to merge once that is resolved.

@mhaseeb123
Member Author

/ok to test

@mhaseeb123
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 41215e2 into rapidsai:branch-25.02 Jan 14, 2025
108 of 109 checks passed
@mhaseeb123 mhaseeb123 deleted the fea/extract-pq-bloom-filter-data branch January 14, 2025 21:06