Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for reading legacy date in Hive for Parquet #24611

Draft
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

marcinsbd
Copy link
Contributor

Description

#23904

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Jan 2, 2025
@github-actions github-actions bot added the hive Hive connector label Jan 2, 2025
@marcinsbd marcinsbd requested a review from Praveen2112 January 2, 2025 11:41
@@ -281,7 +281,7 @@ public static ReaderPageSource createPageSource(
exception -> handleException(dataSourceId, exception),
// We avoid using disjuncts of parquetPredicate for page pruning in ParquetReader as currently column indexes
// are not present in the Parquet files which are read with disjunct predicates.
parquetPredicates.size() == 1 ? Optional.of(parquetPredicates.get(0)) : Optional.empty(),
parquetPredicates.size() == 1 ? Optional.of(parquetPredicates.getFirst()) : Optional.empty(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: How about getOnly

@@ -265,7 +264,7 @@ public static ReaderPageSource createPageSource(
List<HiveColumnHandle> baseColumns = readerProjections.map(projection ->
projection.get().stream()
.map(HiveColumnHandle.class::cast)
.collect(toUnmodifiableList()))
.toList())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toImmutableList

@@ -255,38 +255,27 @@ public Block getBlock(Page sourcePage, long startRowId)
}
}

private static class SourceColumn
private record SourceColumn(int sourceChannel)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general record can be used for POJO but this implements ColumnAdaptation which encapsulates function like details like getBlock which is something beyond POJO so do we need this record conversion ?

Comment on lines 298 to 299
private record CoercedColumn(SourceColumn sourceColumn, TypeCoercer<?, ?> typeCoercer)
implements ColumnAdaptation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@@ -125,7 +125,7 @@ public ColumnReader create(PrimitiveField field, AggregatedMemoryContext aggrega
}
if (DATE.equals(type) && primitiveType == INT32) {
if (annotation == null || annotation instanceof DateLogicalTypeAnnotation) {
return createColumnReader(field, valueDecoders::getIntDecoder, INT_ADAPTER, memoryContext);
return createColumnReader(field, valueDecoders::getInt32Decoder, INT_ADAPTER, memoryContext);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for date type right ?

@@ -36,6 +36,7 @@ public class ParquetReaderOptions
private final boolean useBloomFilter;
private final DataSize smallFileThreshold;
private final boolean vectorizedDecodingEnabled;
private final boolean hybridCalendarEnabled;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: useHybridCalendar

return julianDays;
}

public static int convertDaysToHybridCalendar(int prolepticDays)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used only for testing ? In that case - can we move to test class ?

Comment on lines 29 to 34
private static final int[] JULIAN_GREGORIAN_DIFFS = {2, 1, 0, -1, -2, -3, -4, -5, -6, -7, -8, -9, -10, 0};
private static final int[] JULIAN_GREGORIAN_DIFF_SWITCH_DAY = {
-719164, -682945, -646420, -609895, -536845,
-500320, -463795, -390745, -354220, -317695,
-244645, -208120, -171595, -141427
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we build a RangeMap and maintain it ? It would allow us to identify on which set of ranges which dates should we consider ?

Comment on lines 79 to 81
else if (julianDays < LAST_SWITCH_JULIAN_DAY) {
return rebaseDays(JULIAN_GREGORIAN_DIFF_SWITCH_DAY, JULIAN_GREGORIAN_DIFFS, julianDays);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a dedicated conversion for rebase - Can we use convertHybridDaysToProlepticDays ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we can, however rebaseDays method will be faster

Comment on lines 136 to 141
if (shouldUseHybridCalendarForDate) {
return createColumnReader(field, encoding -> getDateInt32Decoder(valueDecoders.getInt32Decoder(encoding)), INT_ADAPTER, memoryContext);
}
else {
return createColumnReader(field, valueDecoders::getInt32Decoder, INT_ADAPTER, memoryContext);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we implement it as a ColumnAdaptation ? In this case the Date/Timestamp conversion can be unified and we could use them for ORC as well.

@marcinsbd marcinsbd force-pushed the marcinsbd/support-legacy-date branch from 9c61f71 to 27ef0fc Compare January 8, 2025 11:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector
Development

Successfully merging this pull request may close these issues.

2 participants