Fix datetime incremental tables scanning more of table than expected #993

Closed
wants to merge 8 commits into from

@tnk-ysk tnk-ysk commented Oct 28, 2023

resolves #717
docs dbt-labs/docs.getdbt.com/#

Problem

The query cannot use the partition to limit the scan, because the filter applies a transformation to the partition column.

Solution

The simplest solution to this issue would be to change datetime_trunc to date_trunc.
It appears to work as expected, but date_trunc(datetime, hour) is not in the documentation.
I checked with GCP support, and they replied that date_trunc(datetime, hour) is not supported.

| data_type | granularity | transform | description |
| --- | --- | --- | --- |
| datetime | month | datetime_trunc(column, month) | full scan |
| datetime | day | datetime_trunc(column, day) | full scan |
| datetime | hour | datetime_trunc(column, hour) | full scan |
| datetime | month | date_trunc(column, month) | valid partition filter |
| datetime | day | date_trunc(column, day) | valid partition filter |
| datetime | hour | date_trunc(column, hour) | valid partition filter, but not supported |

Therefore, I changed the filter to an approach that does not transform the partition column.
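
To illustrate the difference between the two filter shapes, here is a minimal Python sketch that renders both as BigQuery SQL text. The helper names (`period_bounds`, `transformed_filter`, `range_filter`) are hypothetical and are not taken from dbt-bigquery. The half-open range is only one assumed way to avoid transforming the column and may differ from what this PR implements.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def period_bounds(value, granularity):
    """Return (start, end) of the half-open period [start, end) containing `value`."""
    if granularity == "hour":
        start = value.replace(minute=0, second=0, microsecond=0)
        return start, start + timedelta(hours=1)
    if granularity == "day":
        start = value.replace(hour=0, minute=0, second=0, microsecond=0)
        return start, start + timedelta(days=1)
    if granularity == "month":
        start = value.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
        # Roll over to January of the next year after December.
        if start.month == 12:
            return start, start.replace(year=start.year + 1, month=1)
        return start, start.replace(month=start.month + 1)
    raise ValueError(f"unsupported granularity: {granularity}")


def transformed_filter(column, value, granularity):
    """Filter that wraps the partition column in datetime_trunc."""
    start, _ = period_bounds(value, granularity)
    return f"datetime_trunc({column}, {granularity}) = datetime '{start.strftime(FMT)}'"


def range_filter(column, value, granularity):
    """Filter that leaves the partition column untransformed (assumed alternative)."""
    start, end = period_bounds(value, granularity)
    return (
        f"{column} >= datetime '{start.strftime(FMT)}' "
        f"and {column} < datetime '{end.strftime(FMT)}'"
    )


if __name__ == "__main__":
    value = datetime(2023, 1, 15, 10, 30)
    print(transformed_filter("created_at", value, "month"))
    print(range_filter("created_at", value, "month"))
```

Both strings select the same rows. The second one compares the bare column against constants, which is the property the solution relies on.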

Caution

Until now, a model with the following settings was not overwritten, but with this change it will be overwritten.

{{ config(
    materialized='incremental',
    unique_key='id',
    partition_by={
      "field": "created_at",
      "data_type": "datetime",
      "granularity": "month",
    },
    incremental_strategy = 'insert_overwrite',
    partitions = ['"2023-01-02"'],
  )
}}

with a as (
  select 1 as id, datetime("2023-01-01 00:00:00") as created_at
)
select * from a
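
One possible reading of why this configuration behaves differently, sketched in Python. It assumes that the old filter compared the truncated column with the configured value, and that the new filter checks whether the untransformed column falls inside the month containing the configured value. That mechanism is an inference from the description above, not something the PR confirms, and the variable and function names are illustrative.

```python
from datetime import datetime

row_created_at = datetime(2023, 1, 1)    # created_at of the single row in the model
configured_value = datetime(2023, 1, 2)  # from partitions = ['"2023-01-02"']


def month_start(value):
    """Truncate a datetime to the first instant of its month."""
    return value.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


def next_month_start(value):
    """Return the first instant of the month after `value`."""
    start = month_start(value)
    if start.month == 12:
        return start.replace(year=start.year + 1, month=1)
    return start.replace(month=start.month + 1)


# Assumed old shape: truncate the column, then compare with the configured value.
old_match = month_start(row_created_at) == configured_value

# Assumed new shape: leave the column alone and test it against the month's range.
new_match = (
    month_start(configured_value) <= row_created_at < next_month_start(configured_value)
)

print(old_match, new_match)  # False True
```

Under this assumption, a configured value that is not aligned to the start of a month never matched before, but now selects the whole month.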

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

@cla-bot cla-bot bot added the cla:yes label Oct 28, 2023
@tnk-ysk tnk-ysk force-pushed the fix-datetime-scanning branch 2 times, most recently from bbbfeed to 896074c Compare October 31, 2023 02:29
@tnk-ysk tnk-ysk force-pushed the fix-datetime-scanning branch from 896074c to bb680c5 Compare October 31, 2023 02:32
@tnk-ysk tnk-ysk marked this pull request as ready for review October 31, 2023 02:38
@tnk-ysk tnk-ysk requested a review from a team as a code owner October 31, 2023 02:38
@tnk-ysk tnk-ysk requested a review from mikealfare October 31, 2023 02:38
@tnk-ysk tnk-ysk marked this pull request as draft November 5, 2023 06:07
@tnk-ysk tnk-ysk marked this pull request as ready for review November 5, 2023 06:32
@tnk-ysk tnk-ysk marked this pull request as draft November 5, 2023 07:46
@tnk-ysk tnk-ysk marked this pull request as ready for review November 5, 2023 09:38
Contributor Author

tnk-ysk commented Dec 24, 2023

Performance test

model

{{ config(
    materialized='incremental',
    partition_by={
      "field": "created_at",
      "data_type": "datetime",
      "granularity": "day"
    },
    incremental_strategy = 'insert_overwrite'
  )
}}

with r as (
  SELECT n FROM UNNEST(GENERATE_ARRAY(1, 1000000)) AS n ORDER BY n  
),
p as (
{% if is_incremental() -%}
  SELECT n FROM UNNEST(GENERATE_ARRAY(1, 10)) AS n ORDER BY n  
{%- else -%}
  SELECT n FROM UNNEST(GENERATE_ARRAY(1, 1000)) AS n ORDER BY n  
{%- endif %}
)
select
  p.n * 100000 + r.n as id,
  LPAD('', 4096, '.') as val1,
  LPAD('', 4096, '.') as val2,
  LPAD('', 4096, '.') as val3,
  datetime_add(datetime '2020-01-01 00:00:00', interval p.n day) as created_at
from p
cross join r

results

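| type | data_type | granularity | elapsed time | bytes scanned |
| --- | --- | --- | --- | --- |
| before | datetime | day | 29.00s | 11.31 TB |
| after | datetime | day | 28.76s | 229.37 GB |
| before | date | day | 29.98s | 229.37 GB |
| after | date | day | 26.99s | 229.37 GB |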
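
As a rough sanity check on these scan sizes, the following Python sketch estimates how many bytes the test model's rows occupy. It rests on several assumptions that the PR does not state: BigQuery's documented logical sizes (8 bytes each for INT64 and DATETIME, 2 bytes plus the UTF-8 length for STRING), binary units in the reported figures (1 GB = 2^30 bytes), a "before" incremental run that reads all 1000 existing daily partitions plus the 10 new ones, and an "after" run that reads only the 10 matching partitions plus the 10 new ones.

```python
ROWS_PER_PARTITION = 1_000_000        # GENERATE_ARRAY(1, 1000000) in the model
STRING_BYTES = 2 + 4096               # assumed: 2 bytes overhead + UTF-8 length
ROW_BYTES = 8 + 3 * STRING_BYTES + 8  # id + val1..val3 + created_at

partition_bytes = ROWS_PER_PARTITION * ROW_BYTES

# Assumed partition counts: existing partitions read + newly generated partitions.
before_tib = (1000 + 10) * partition_bytes / 2**40
after_gib = (10 + 10) * partition_bytes / 2**30

# Ratio of the two reported figures, again assuming binary units.
reported_reduction = 11.31 * 1024 / 229.37

print(f"estimated before: {before_tib:.2f} TB, after: {after_gib:.2f} GB")
print(f"reported reduction: about {reported_reduction:.1f}x")
```

Under these assumptions the estimates land close to the reported 11.31 TB and 229.37 GB, which is consistent with the change limiting the scan to the affected partitions for the datetime case.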


github-actions bot commented Sep 4, 2024

This PR has been marked as Stale because it has been open with no activity as of late. If you would like the PR to remain open, please comment on the PR or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Sep 4, 2024

Although we are closing this PR as stale, it can still be reopened to continue development. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this Sep 11, 2024
Development

Successfully merging this pull request may close these issues.

[ADAP-551] [Bug] Datetime incremental tables scanning more of table than expected
1 participant