Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: DH-18399: Add ParquetColumnResolver #6558

Merged

Conversation

devinrsmith
Copy link
Member

No description provided.

@devinrsmith devinrsmith added parquet Related to the Parquet integration NoDocumentationNeeded ReleaseNotesNeeded Release notes are needed labels Jan 13, 2025
@devinrsmith devinrsmith added this to the 0.38.0 milestone Jan 13, 2025
@devinrsmith devinrsmith self-assigned this Jan 13, 2025
} else {
final ColumnDescriptor columnDescriptor = resolver.mapping().get(columnName);
if (columnDescriptor == null) {
nameList = List.of(); // empty, will not resolve
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make sure this actually results in supply ColumnRegion.Null implementations when a user tries to access the column.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, verified.

// There are different resolution strategies that could all be reasonable. We could consider using only the
// field id closest to the leaf. This version, however, takes the most general approach and considers field
// ids wherever they appear; ultimately, only being resolvable if the field id mapping is unambiguous.
for (Type type : path) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we do encounter a nested type here? That is, what's the current outcome?

Copy link
Member Author

@devinrsmith devinrsmith Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the question. There may definitely be "nested types" here; but path represents the full path to a single leaf (primitive) field.

* @param tableLocationKey the Parquet TLK
* @return the Parquet column resolver
*/
ParquetColumnResolver init(TableKey tableKey, ParquetTableLocationKey tableLocationKey);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

init -> resolver?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

used generic naming of

* The map from Deephaven column name to {@link ColumnDescriptor}. The {@link #schema()} must contain each column
* descriptor.
*/
public abstract Map<String, ColumnDescriptor> mapping();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this to keep ColumnDescriptor as part of the interface of the implementation? Do we want the Iceberg implementation using this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very much in favor of keeping this implementation strongly-typed with the safety checks. Iceberg should be able to use this implementation (and even use the same Factory to create it) in some situations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed my mind; the map implementation is now very generic, no Parquet specific types.

Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

@devinrsmith devinrsmith marked this pull request as ready for review January 16, 2025 01:57
Copy link
Contributor

@malhotrashivam malhotrashivam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments

Also update relevant tests to check against full MessageType schema instead of pulling out the ColumnDescriptors
Comment on lines 58 to 59
public static List<String[]> getPaths(MessageType schema) {
final List<String[]> out = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to add a comment here what these method actually do because there's no comments in the MessageType methods and they are a bit tricky.

return null;
}
primitiveType = field.asPrimitiveType();
if (isRepeated(primitiveType)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be able to merge the two if branches because isRequired and isRepeated only take Type, and not Primitive or Group type

}

/**
* A more efficient implementation of {@link MessageType#getColumnDescription(String[])}.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand the -1 in the calculation of max repetition level and max definition level in MessageType class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The -1 in the upstream code is because the are counting the MessageType as part of the calculation (and it always appears to be REPEATED).

@devinrsmith
Copy link
Member Author

nightlies all pass

Copy link
Contributor

@malhotrashivam malhotrashivam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments

@devinrsmith devinrsmith merged commit 4b3ea4b into deephaven:main Jan 22, 2025
17 checks passed
@devinrsmith devinrsmith deleted the DH-18399-parquet-column-resolver branch January 22, 2025 18:07
@github-actions github-actions bot locked and limited conversation to collaborators Jan 22, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
NoDocumentationNeeded parquet Related to the Parquet integration ReleaseNotesNeeded Release notes are needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants