-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: DH-18399: Add ParquetColumnResolver #6558
feat: DH-18399: Add ParquetColumnResolver #6558
Conversation
} else { | ||
final ColumnDescriptor columnDescriptor = resolver.mapping().get(columnName); | ||
if (columnDescriptor == null) { | ||
nameList = List.of(); // empty, will not resolve |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make sure this actually results in supply ColumnRegion.Null
implementations when a user tries to access the column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, verified.
...s/parquet/table/src/main/java/io/deephaven/parquet/table/location/ParquetColumnResolver.java
Outdated
Show resolved
Hide resolved
// There are different resolution strategies that could all be reasonable. We could consider using only the | ||
// field id closest to the leaf. This version, however, takes the most general approach and considers field | ||
// ids wherever they appear; ultimately, only being resolvable if the field id mapping is unambiguous. | ||
for (Type type : path) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if we do encounter a nested type here? That is, what's the current outcome?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand the question. There may definitely be "nested types" here; but path
represents the full path to a single leaf (primitive) field.
* @param tableLocationKey the Parquet TLK | ||
* @return the Parquet column resolver | ||
*/ | ||
ParquetColumnResolver init(TableKey tableKey, ParquetTableLocationKey tableLocationKey); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
init
-> resolver
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
used generic naming of
...s/parquet/table/src/main/java/io/deephaven/parquet/table/location/ParquetColumnResolver.java
Show resolved
Hide resolved
* The map from Deephaven column name to {@link ColumnDescriptor}. The {@link #schema()} must contain each column | ||
* descriptor. | ||
*/ | ||
public abstract Map<String, ColumnDescriptor> mapping(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this to keep ColumnDescriptor
as part of the interface of the implementation? Do we want the Iceberg implementation using this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very much in favor of keeping this implementation strongly-typed with the safety checks. Iceberg should be able to use this implementation (and even use the same Factory to create it) in some situations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed my mind; the map implementation is now very generic, no Parquet specific types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments
...ns/parquet/table/src/main/java/io/deephaven/parquet/table/location/ParquetTableLocation.java
Outdated
Show resolved
Hide resolved
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/location/ParquetUtil.java
Outdated
Show resolved
Hide resolved
Also update relevant tests to check against full MessageType schema instead of pulling out the ColumnDescriptors
extensions/iceberg/src/test/java/io/deephaven/iceberg/junit5/SqliteCatalogBase.java
Show resolved
Hide resolved
public static List<String[]> getPaths(MessageType schema) { | ||
final List<String[]> out = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to add a comment here what these method actually do because there's no comments in the MessageType methods and they are a bit tricky.
extensions/parquet/base/src/main/java/io/deephaven/parquet/impl/ParquetSchemaUtil.java
Outdated
Show resolved
Hide resolved
extensions/parquet/base/src/main/java/io/deephaven/parquet/impl/ParquetSchemaUtil.java
Outdated
Show resolved
Hide resolved
return null; | ||
} | ||
primitiveType = field.asPrimitiveType(); | ||
if (isRepeated(primitiveType)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might be able to merge the two if branches because isRequired and isRepeated only take Type
, and not Primitive or Group type
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/RowGroupWriterImpl.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
/** | ||
* A more efficient implementation of {@link MessageType#getColumnDescription(String[])}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand the -1 in the calculation of max repetition level and max definition level in MessageType class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The -1 in the upstream code is because the are counting the MessageType as part of the calculation (and it always appears to be REPEATED).
nightlies all pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments
No description provided.