Allow incremental materialisation to handle batch full refreshes or introduce new materialisation to manage this behaviour. #8096
Replies: 3 comments 4 replies
This would be awesome. Would love it if it could be extended to parallelize backfills for tables that don't depend on themselves.
I would suggest using dbt project variables to define the wanted incremental time period. You can then use Jinja conditions to apply it (e.g. as sketched below).
Going further, you might define variables at the model level, e.g. on a long run, load the last 60 days for model A and the last 90 days for model B. (I am not sure whether custom per-model variables are possible; maybe an idea for a feature request.)
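A minimal sketch of that idea, assuming a Snowflake-style `dateadd`, a hypothetical `reload_days` project variable, and made-up source and column names:

```sql
{{ config(materialized='incremental', unique_key='unique_event_id') }}

select *
from {{ source('events', 'raw_events') }}  -- hypothetical source

{% if is_incremental() %}
  -- Normal runs reload a small rolling window; a "long run" widens it, e.g.:
  --   dbt run --select my_events --vars '{reload_days: 60}'
  where event_date >= dateadd(day, -{{ var('reload_days', 3) }}, current_date)
{% endif %}
```

With a `unique_key` set, the reloaded window is deduplicated against what is already in the table, so on adapters whose incremental strategy honours `unique_key` this reprocesses recent history without a full rebuild.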
Currently looking into this as well. I can't imagine this isn't a problem for more people and situations, especially if you have been running dbt for a while and your data is growing.
So what's the problem?
Currently we are quite limited when we need to backfill an incremental model: the only option is to `--full-refresh` the model. This is particularly a problem when you have very large event-driven tables, which require substantial amounts of resource and memory to backfill because of the all-or-nothing approach.
Ideas on how this could be handled:
Option 1: Allow incremental materialisations to have additional config variables
If we were able to pass additional variables to the `incremental` materialisation, then when running a model in full-refresh mode we could in theory batch the data by the `batch_reload_by` key and order the run of those batches by the `batch_order_by` key.

```sql
{{
  config(
    materialized='incremental',
    unique_key='unique_event_id',
    batch_reload_by='etl_month_year',
    batch_order_by='etl_month_year'
  )
}}
```
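Until something like this exists, a hedged workaround sketch (the variable name `batch`, the source, and the column names are assumptions; the model has to be invoked once per batch from the command line or an orchestrator):

```sql
-- Run once per batch, e.g.:
--   dbt run --select big_events --vars '{batch: "2023-06"}'
-- ('batch' has no default here, so --vars must be supplied)
{{ config(materialized='incremental', unique_key='unique_event_id') }}

select *
from {{ source('events', 'raw_events') }}  -- hypothetical source
where etl_month_year = '{{ var("batch") }}'
```

Because each run only scans and merges one `etl_month_year` slice, the memory and warehouse cost per invocation stays bounded, at the price of orchestrating one run per batch.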
Option 2: Introduce a new model materialisation
This is similar to the approach we saw with the now deprecated insert_by_period materialisation built for Redshift. Ideally the new materialisation would behave in the same way as Option 1, but I guess there is also an argument, in terms of simplicity, for a separate materialisation rather than enhancing the existing one.
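Conceptually, each batch of such a period-based materialisation would boil down to something like the following (plain SQL; schema, table, and period values are illustrative):

```sql
-- Replace one period's slice, then move on to the next period
delete from analytics.big_events
where etl_month_year = '2023-06';

insert into analytics.big_events
select *
from analytics.stg_big_events
where etl_month_year = '2023-06';
```

Wrapping that loop inside the materialisation itself, so a single `dbt run --full-refresh` works through the periods in order, is roughly the behaviour being asked for here.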
Option 3: Open to ideas!
Again, these are just my initial thoughts on how we can solve what is becoming quite a painful and expensive problem each time we need to backfill our largest tables.