Allow incremental materialisation to handle batch full refreshes or introduce new materialisation to manage this behaviour. #8096
Replies: 3 comments 4 replies
This would be awesome. Would love it if it could be extended to parallelize backfills for tables that don't depend on themselves.
I would suggest using dbt project variables to define the wanted incremental time period. You can then use Jinja conditions to apply it (e.g. as sketched below).
Going further, you might define variables at the model level, e.g. on a long run, load the last 60 days for model A and the last 90 days for model B. (I am not sure whether custom per-model variables are possible; maybe an idea for a feature request.)
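A minimal sketch of that idea, assuming a Snowflake-style `dateadd`, a hypothetical `reload_days` project variable, and made-up source and column names:

```sql
{{ config(materialized='incremental', unique_key='unique_event_id') }}

select *
from {{ source('events', 'raw_events') }}  -- hypothetical source

{% if is_incremental() %}
  -- Normal runs reload a small rolling window; a "long run" widens it, e.g.:
  --   dbt run --select my_events --vars '{reload_days: 60}'
  where event_date >= dateadd(day, -{{ var('reload_days', 3) }}, current_date)
{% endif %}
```

With a `unique_key` set, the reloaded window is deduplicated against what is already in the table, so on adapters whose incremental strategy honours `unique_key` this reprocesses recent history without a full rebuild.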
Currently looking into this as well. I can't imagine this isn't a problem for more people and situations, especially if you have been running dbt for a while and your data is growing.
So what's the problem?
Currently we are quite limited when we need to backfill an incremental model: the only option is to `--full-refresh` the model. This is particularly a problem when you have very large event-driven tables, which require substantial amounts of resource and memory to backfill because of the all-or-nothing approach.
Ideas on how this could be handled:
Option 1: Allow incremental materialisations to have additional config variables
If we were able to pass additional variables to the `incremental` materialisation, then when running a model in full-refresh mode we could in theory batch the data by the `batch_reload_by` key and order the run of those batches by the `batch_order_by` key.

```sql
{{
  config(
    materialized='incremental',
    unique_key='unique_event_id',
    batch_reload_by='etl_month_year',
    batch_order_by='etl_month_year'
  )
}}
```
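Until something like this exists, a hedged workaround sketch (the variable name `batch`, the source, and the column names are assumptions; the model has to be invoked once per batch from the command line or an orchestrator):

```sql
-- Run once per batch, e.g.:
--   dbt run --select big_events --vars '{batch: "2023-06"}'
-- ('batch' has no default here, so --vars must be supplied)
{{ config(materialized='incremental', unique_key='unique_event_id') }}

select *
from {{ source('events', 'raw_events') }}  -- hypothetical source
where etl_month_year = '{{ var("batch") }}'
```

Because each run only scans and merges one `etl_month_year` slice, the memory and warehouse cost per invocation stays bounded, at the price of orchestrating one run per batch.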
Option 2: Introduce a new model materialisation
This is similar to the approach we saw with the now deprecated insert_by_period materialisation built for Redshift. Ideally the new materialisation would behave in the same way as Option 1, but I guess there is also an argument, in terms of simplicity, for a separate materialisation rather than enhancing the existing one.
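Conceptually, each batch of such a period-based materialisation would boil down to something like the following (plain SQL; schema, table, and period values are illustrative):

```sql
-- Replace one period's slice, then move on to the next period
delete from analytics.big_events
where etl_month_year = '2023-06';

insert into analytics.big_events
select *
from analytics.stg_big_events
where etl_month_year = '2023-06';
```

Wrapping that loop inside the materialisation itself, so a single `dbt run --full-refresh` works through the periods in order, is roughly the behaviour being asked for here.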
Option 3: Open to ideas!
Again, these are just my initial thoughts on how we can solve what is becoming quite a painful and expensive problem each time we need to backfill our largest tables.