Replies: 4 comments 12 replies
- @QMalcolm Thanks a lot for this Quigley!
- Can this ship with support or workarounds for packages like dbt_utils? A problem I encountered with …
- Something to take into consideration for the different ways to sample is the SQL dialect. For example, Tsql doesn't support …
- What if you implemented sampling strategies as plugins? The default plugin if none is specified is time-based, so … (FWIW, I'm not a fan of the current experimental plugin architecture, but it wouldn't be too hard to improve it)
Have you ever wanted to run a smaller slice of your project during development? It would be faster and cheaper if you could 🤔 The answer, for many, is that yes, they have (and we have too). Blog posts have been written on the topic, forum discussions have emerged, and we've even previously considered the implementation. Well, now it's 2025 🎉, we've released our roadmap, and as one of our new year's resolutions we intend to make "Sample Mode" a built-in part of dbt-core.
Why sampling?
What is the purpose of sampling? Some of us (a lot of us) have REALLY BIG DATA. The primary reason incremental models exist, and why we just did a ton of work to bring microbatch models to 1.9, is that we all deal with a lot of big, huge, mountainous datasets.
The problem is that really big datasets result in really big models, and really big models are really painful to deal with when trying to develop quickly and iteratively. No one wants to wait 15 minutes to find out whether the change they made to a model will pass data tests and result in sensible data.
Make development faster
How do we make development faster? Well, what if we simply don't build the full model? After all, do we really need to run the entire model during development or CI? Probably not. If we simply want to ensure that valid data will be produced with our changes, then running only part of the model might be enough. That is, we can get away with just sampling it. Now, sampling won't be perfect: since we're inherently only producing a "sample" of a model, we lose some guarantees of completeness. For example, it might be the case that not all joins will be satisfied and populated, and different sampling strategies are better or worse at handling this.
Make development cheaper
Sampling has the added benefit of reducing cost. Most data warehouses charge based on some combination of compute usage and total capacity of stored data. For all data warehouses, sampling means you’ll be storing less data in your development environments. When it comes to compute costs, for:
Why not `--empty`?

Using the `--empty` flag is a great tool! It basically validates that your models are semantically correct. If your SQL is invalid (for instance, if a column you are trying to select doesn't exist), then running with `--empty` will catch it. Running with the `--empty` flag is fast because it runs models with `limit 0`, meaning no data is actually read or written. However, running with `--empty` doesn't let you inspect the resulting data to get a sense of whether things are working how you want, and data tests don't actually validate anything because there is no data.
Different ways to sample data

As mentioned previously, a ton of thought work has been put into the different ways sampling could work. Some of the proposed methods have been to sample by:

- a random selection of rows
- an arbitrary where clause (e.g. `where my_bool_column = true`)
- a representative segment (e.g. a specific set of customers)
- a row limit
- a time window

If we were trying to create the most flexible system, we'd implement them all. But truthfully, this would result in a rather complicated implementation, and we also prefer to be opinionated. So how do we choose which to implement? The good news, and bad news, is that these different sampling methods are not created equal.

To compare them, we have to keep in mind the things that are important to us: speed, simplicity for the end user, determinism, and data completeness (will joins still be satisfied?).
Random Sampling
Random sampling is fast. It could also be incredibly simple, as it wouldn't need anything more than `dbt run --sample`. Snowflake, BigQuery, and Databricks have all even gone as far as to implement a clause specifically for randomly sampling tables. However, there are problems. First, random samples are inherently non-deterministic. Second, random sampling gives no way to ensure that sampled data from different tables is correlated, so the chance of joins being satisfied when randomly sampling is extremely low. Thus, because most projects contain models with joins, random sampling probably isn't a good candidate at all.
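For reference, these are the warehouse-native sampling clauses being referred to. The table name is a placeholder, and the exact syntax and sampling semantics vary by warehouse:

```sql
-- Snowflake: sample roughly 10% of rows
select * from raw_orders sample (10);

-- BigQuery: sample roughly 10% of the table's storage blocks
select * from raw_orders tablesample system (10 percent);

-- Databricks (Spark SQL): sample roughly 10% of rows
select * from raw_orders tablesample (10 percent);
```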
Arbitrary Where Clause Sampling

This is by far the most flexible sampling method. It is guaranteed to cover any node. The overhead, however, is high. It is incredibly unlikely that a single arbitrary where clause will work for all of a project's nodes; it is much more likely that a where clause for sampling would have to be specified on a node-by-node basis. To change the sampling, one would need to alter the node specification itself, which isn't great. Its speed, data completeness, and determinism are all determined by the where clause written. Overall, arbitrary where clause sampling feels like it should be an escape hatch for when other sampling methods fail. As such, the necessity of implementing it seems dependent on whether the sample modes we do implement cover enough use cases.
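To make the per-node overhead concrete, a hypothetical configuration might look like the sketch below. Note that `sample_where` is not a real dbt config; it is invented here purely to illustrate why specifying an arbitrary where clause node by node gets tedious.

```sql
-- models/stg_orders.sql
-- `sample_where` is hypothetical, shown only to illustrate the overhead:
-- every model would need its own predicate, and changing the sample
-- means editing the model itself.
{{ config(sample_where="ordered_at >= '2024-12-01' and status != 'test'") }}

select * from {{ ref('raw_orders') }}
```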
Representative Segment Sampling
Representative segment sampling would be useful for things like sampling by `customer_id`. Assuming tables are partitioned by the column containing the representative segment information, it should be relatively fast. Additionally, assuming joins are done via the segment information, the data should be fairly complete. The drawback is that representative segment sampling is very narrow. Hopefully the representative segment is truly representative, but if it's not, then the risk of falsely believing everything should work is high.
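A sketch of what sampling by a representative segment could look like, with illustrative table and column names, showing why joins tend to stay satisfied when both sides are filtered by the same segment:

```sql
-- Both sides of the join are filtered to the same customer segment,
-- so the join keys line up and the sampled data stays internally consistent.
with sampled_customers as (
    select * from raw_customers
    where customer_id in (42, 1337, 2024)  -- the "representative" segment
)

select orders.*
from raw_orders as orders
inner join sampled_customers
    on orders.customer_id = sampled_customers.customer_id
```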
Limit Sampling
Limit sampling is one of the most popularly proposed sampling methods. It can be fast, depending on the data warehouse (the `LIMIT` clause doesn't necessarily avoid full table scans in all data warehouses). Additionally, it should be pretty simple from the end-user perspective: `dbt run --sample --limit=1000`, no other configuration necessary! However, limit sampling is not perfect. The `LIMIT` clause is generally non-deterministic if not coupled with an `ORDER BY`, so two sample runs back to back might produce wildly different results. Additionally, although limit sampling should have a recency bias, which should help with the data completeness of joins, the non-determinism means the correlation of sampled data is weak.
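A sketch of limit sampling, with illustrative names, showing the determinism trade-off called out above:

```sql
-- Without an order by, the 1000 rows returned can differ between runs
select * from raw_orders limit 1000;

-- Coupling the limit with an order by makes the sample deterministic
-- (and gives it a recency bias), at the cost of a potentially expensive sort
select * from raw_orders order by ordered_at desc limit 1000;
```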
Time Based Sampling

Time based sampling should be fast, especially if one is partitioning by the relevant time column. We could even reuse the `event_time` config introduced by microbatch to declare the time column. Then, for the end user, sampling would maybe look something like `dbt run --sample --time="3 days"`. It is deterministic, as two runs with the same time window should produce the same results. Finally, the data completeness is moderate to strong (but not perfect). In a majority of implementations, the data within a given window is related, so if one is sampling 3 days of data, they can reasonably assume most joins should be successful.
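As a sketch, here is what a 3-day time-based sample could compile to. The table and column names are illustrative, and `dateadd` is shown in Snowflake/Redshift style; the actual rendering would be up to the implementation.

```sql
-- Only the most recent 3 days of data are read from the upstream relation,
-- filtering on the column declared as its event_time
select *
from (
    select * from raw_orders
    where ordered_at >= dateadd('day', -3, current_timestamp)
) as _dbt_sample
```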
Proposal: Time Based Sampling (at least for now)

Most, but not all, big datasets have some time information (`created_at`, `ingested_at`, `updated_at`, `adopted_a_cat_at`). Additionally, if we use the already existing `event_time` config that microbatch introduced (oh hey look, there was foreshadowing of this), then the end-user experience is actually incredibly straightforward. It should be fast, and there is a strong-ish guarantee of data completeness for joins. Time based sampling should cover most use cases, and is probably the best bang for buck if we were to implement only one.
How Would Time Based Sampling Work?

Quite simply, one would need to specify an `event_time` on any direct upstream nodes they want to be sampled when running in sample mode. Then, to run in sample mode, all one would need to do is `dbt run --sample --time="3 days"`. Doing so would run the project in sample mode, and sampled models would only have data for the specified time window. Interestingly, this has the added benefit that dbt tests downstream of a sampled run would only test the newest data.
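As a sketch of the setup: the declaration below uses the existing `event_time` config from microbatch, while the `--sample --time` flags are the proposal in this post. Model and column names are illustrative.

```sql
-- models/stg_orders.sql
-- Declaring the time column tells sample mode what to filter on
-- when this node is read during a sampled run.
{{ config(
    materialized='table',
    event_time='ordered_at'
) }}

select * from {{ ref('raw_orders') }}

-- Proposed invocation (from this post): dbt run --sample --time="3 days"
```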
Discussion questions

Closing Meme