Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAP-1051] [Feature] Drop the temporary table upon a successful incremental refresh #1036

Closed
3 tasks done
mikealfare opened this issue Nov 28, 2023 · 6 comments · Fixed by #1076
Closed
3 tasks done
Assignees
Labels
help_wanted Extra attention is needed type:enhancement New feature or request

Comments

@mikealfare
Copy link
Contributor

mikealfare commented Nov 28, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion

Background context

The current behavior is for dbt-bigquery to create a "temporary" table that expires after 12 hours when performing an incremental model update. This duration is hard-coded here.

Describe the feature

Drop the temporary table upon a successful incremental refresh. There are other issues in this repo that speak to that approach. And using that approach would resolve this issue.

Describe alternatives you've considered

An alternative approach would be to make the expiration duration configurable, particularly for incremental refreshes that occur more frequently than every 12 hours.

Who will this benefit?

This benefits folks who refresh incremental models more frequently than every 12 hours. It would also benefit folks who are conscious of incremental spend as a result of these tables.

Are you interested in contributing this feature?

No response

Anything else?

See also: #1032

@github-actions github-actions bot changed the title [Feature] Allow configuration of expiration of temp table used in incremental models [ADAP-1051] [Feature] Allow configuration of expiration of temp table used in incremental models Nov 28, 2023
@dbeatty10
Copy link
Contributor

It may be worth considering dropping the temporary table upon a successful incremental refresh

From a user perspective, it would be awesome if dbt just drops temporary table like you mentioned.

@mikealfare What do you think about the following acceptance criteria?

Proposed acceptance criteria

  • When materializing a model, any intermediate database objects that are created along the way are dropped by the end of the materialization

We can re-use this acceptance criteria for other types of materializations as well (not just incremental models).

@dbeatty10
Copy link
Contributor

Questions about temp table vs. expiring permanent table vs. non-expiring permanent table

@mikealfare is there any way to create the intermediate table so that it is droppable?

Ideally, we'd avoid any intermediate objects persisting beyond the materialization (even if they they expire after X hours).

Is there something about adding an expiration to a permanent table that makes it harder to drop? Or are permanent tables inherently hard to drop in BigQuery? Is using a real BigQuery temp table an option?

Current implementation

To the best that I can tell, we are not creating temp tables. Rather, it looks like the intermediate table is a permanent table with an expiration of 12 hours (like you mentioned).

Side note: To make the code easier to read, it would be nice to replace references to "temp" or "temporary" to be either "intermediate" or "expiring" instead (to the extent either/or are accurate).

Full code trace

Options for acceptance criteria

These are a mix of both behavior and implementation detail.

Option 1 includes the ideal state that all intermediate objects are dropped by the end of the materialization.

Options 2 and 3 explore what we could do if Option 1 isn't possible or practical for some reason.

Option 1

  1. Create an intermediate table using a permanent table without any expiration time.
  2. Try to drop it at the end.
  3. Block until time out, error, or success.
  4. Raise error if any, otherwise continue.

Alternatively, use a (non-permanent) temporary table somehow.

Option 2

  1. Set expiration time for "temp" table to minimum time possible (60 minutes?).
  2. Try to drop it at the end.
  3. On error, make a log entry and continue anyway.
    • Make a log entry that the intermediate table will expire after X minutes.

Option 3

Add detail to one of the GitHub issues that we tried dropping the "temp" table and it didn't work (for some unknown reason).

  1. Set expiration time for the intermediate table to minimum time possible (60 minutes?).
  2. Don't try to drop it at the end.
  3. Make a log entry that the intermediate table will expire after X minutes.

@vinit2107
Copy link
Contributor

@dbeatty10 @mikealfare
Can we also explore an option where the user can define which dataset the temp table gets materialized in? I understand the tables are already configured with an expiration policy, but if the user is refreshing the models at a cadence faster than the expiration, or the model takes a longer time to build, the table will be visible to users who aren't supposed to see it, which might cause some confusion (This will be true if the accesses are defined on the dataset level).

@dbeatty10
Copy link
Contributor

Thanks for this insight @vinit2107 🧠

It seems like there are a couple things dbt developers would like to avoid as it relates to these intermediate tables:

  1. Avoid cost of keeping an extra copy of the data set around (even if it is for "only" 12 hours)
  2. Avoid consumers being able to see this intermediate tables (in the case access is granted at the dataset level)

I think Option 1 outlined above would achieve both of these. And I think Option 1 is the thing we should pursue with full intent and effort unless it is "impossible" somehow.

But we can definitely add your idea as another alternative.

Here's the two options that we haven't numbered yet, in no particular order:

Option 4

Be able to define the dataset in which the intermediate table(s) gets materialized within.

Option 5

Make the expiration duration of the intermediate table so that it is not hard-coded to 12 hours but can be configured instead.

To fit in with the naming convention of hours_to_expiration, the configuration parameter could be named something like tmp_hours_to_expiration.

This is the option originally outlined in #1036

@dbeatty10
Copy link
Contributor

Per @Fleid in an internal Slack conversation here, we don't have the bandwidth to tackle this.

So labeling this as help_wanted for a motivated community member to take on.

Ideally, a PR that solves this would:

  1. Implement Option 1 by dropping any intermediate tables during the materialization so they don't persist beyond
  2. Be well-tested

See here for an implementation idea.

And see here for ideal acceptance criteria.

@dbeatty10 dbeatty10 added the help_wanted Extra attention is needed label Jan 16, 2024
@vinit2107
Copy link
Contributor

Hi @dbeatty10 ,
I have followed your suggestion here and created a PR. I have also added a variable that can be defined in dbt_project.yaml tmp_relation_dataset where developers can define which dataset the tmp relations need to be materialized in. The default behavior remains the same, and the temp relations are dropped in both scenarios.
I have not updated the default expiration as drop_relation is dropping the table after execution.

Please let me know if you need me to update anything else. Thanks!!

@dbeatty10 dbeatty10 changed the title [ADAP-1051] [Feature] Allow configuration of expiration of temp table used in incremental models [ADAP-1051] [Feature] Drop the temporary table upon a successful incremental refresh Jul 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help_wanted Extra attention is needed type:enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants