Replies: 4 comments 12 replies
- @QMalcolm Thanks a lot for this Quigley!
- Can this ship with support or workarounds for packages like dbt_utils? A problem I encountered with …
- Something to take into consideration for the different ways to sample is the SQL dialect. For example, Tsql doesn't support …
- What if you implemented sampling strategies as plugins? The default plugin if none is specified is time-based, so … (FWIW, I'm not a fan of the current experimental plugin architecture, but it wouldn't be too hard to improve it)
Have you ever wanted to run a smaller slice of your project during development? It would be faster and cheaper if you could 🤔 The answer, for many, is that yes, they have (and we have too). Blog posts have been written on the topic, forum discussions have emerged, and we've even previously considered the implementation. Well, now it's 2025 🎉, we've released our roadmap, and as one of our new year's resolutions we intend to make "Sample Mode" a built-in part of dbt-core.
Why sampling?
What is the purpose of sampling? Some of us (a lot of us) have REALLY BIG DATA. The primary reason incremental models exist, and why we just did a ton of work to bring microbatch models to 1.9, is that we all deal with a lot of big, huge, mountainous datasets.
The problem is that really big datasets result in really big models, and really big models are really painful to deal with when trying to develop quickly and iteratively. No one wants to wait 15 minutes to find out whether the change they made to a model will pass data tests and result in sensible data.
Make development faster
How do we make development faster? Well, what if we simply don't build the full model? After all, do we really need to run the entire model during development or CI? Probably not. If we simply want to ensure that valid data will be produced with our changes, then running only part of the model might be enough. That is, we can get away with just sampling it. Now, sampling won't be perfect: since we're inherently only producing a "sample" of a model, we lose some guarantees of completeness. For example, it might be the case that not all joins will be satisfied and populated, and different sampling strategies are better or worse at handling this.
Make development cheaper
Sampling has the added benefit of reducing cost. Most data warehouses charge based on some combination of compute usage and total capacity of stored data. For all data warehouses, sampling means you’ll be storing less data in your development environments. When it comes to compute costs, for:
Why not `--empty`?

Using the `--empty` flag is a great tool! It basically validates that your models are semantically correct. If your SQL is invalid (for instance, if a column you are trying to select doesn't exist), then running with `--empty` will catch it. Running with the `--empty` flag is fast because it runs models with `limit 0`, meaning no data is actually read or written. However, running with `--empty` doesn't let you inspect the resulting data to get a sense of whether things are working how you want, and data tests don't actually validate anything because there is no data.
Different ways to sample data

As mentioned previously, a ton of thought work has been put into the different ways sampling could work. Some of the proposed methods have been to sample by:

- a random selection of rows
- an arbitrary where clause (e.g. `where my_bool_column = true`)
- a representative segment (e.g. a specific set of customers)
- a row limit
- a time window

If we were trying to create the most flexible system, we'd implement them all. But truthfully, this would result in a rather complicated implementation, and we also prefer to be opinionated. So how do we choose which to implement? The good news, and bad news, is that these different sampling methods are not created equal.

To compare them, we have to keep in mind the things that are important to us: speed, simplicity for the end user, determinism, and data completeness (will joins still be satisfied?).
Random Sampling
Random sampling is fast. It could also be incredibly simple, as it wouldn't need anything more than `dbt run --sample`. Snowflake, BigQuery, and Databricks have all even gone as far as to implement a clause specifically for randomly sampling tables. However, there are problems. First, random samples are inherently non-deterministic. Second, random sampling gives no way to ensure that sampled data from different tables is correlated, so the chance of joins being satisfied when randomly sampling is extremely low. Thus, because most projects contain models with joins, random sampling probably isn't a good candidate at all.
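For reference, these are the warehouse-native sampling clauses being referred to. The table name is a placeholder, and the exact syntax and sampling semantics vary by warehouse:

```sql
-- Snowflake: sample roughly 10% of rows
select * from raw_orders sample (10);

-- BigQuery: sample roughly 10% of the table's storage blocks
select * from raw_orders tablesample system (10 percent);

-- Databricks (Spark SQL): sample roughly 10% of rows
select * from raw_orders tablesample (10 percent);
```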
Arbitrary Where Clause Sampling

This is by far the most flexible sampling method. It is guaranteed to cover any node. The overhead, however, is high. It is incredibly unlikely that a single arbitrary where clause will work for all of a project's nodes; it is much more likely that a where clause for sampling would have to be specified on a node-by-node basis. To change the sampling, one would need to alter the node specification itself, which isn't great. Its speed, data completeness, and determinism are all determined by the where clause written. Overall, arbitrary where clause sampling feels like it should be an escape hatch for when other sampling methods fail. As such, the necessity of implementing it seems dependent on whether the sample modes we do implement cover enough use cases.
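To make the per-node overhead concrete, a hypothetical configuration might look like the sketch below. Note that `sample_where` is not a real dbt config; it is invented here purely to illustrate why specifying an arbitrary where clause node by node gets tedious.

```sql
-- models/stg_orders.sql
-- `sample_where` is hypothetical, shown only to illustrate the overhead:
-- every model would need its own predicate, and changing the sample
-- means editing the model itself.
{{ config(sample_where="ordered_at >= '2024-12-01' and status != 'test'") }}

select * from {{ ref('raw_orders') }}
```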
Representative Segment Sampling
Representative segment sampling would be useful for things like sampling by `customer_id`. Assuming tables are partitioned by the column containing the representative segment information, it should be relatively fast. Additionally, assuming joins are done via the segment information, the data should be fairly complete. The drawback is that representative segment sampling is very narrow. Hopefully the representative segment is truly representative, but if it's not, then the risk of falsely believing everything should work is high.
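A sketch of what sampling by a representative segment could look like, with illustrative table and column names, showing why joins tend to stay satisfied when both sides are filtered by the same segment:

```sql
-- Both sides of the join are filtered to the same customer segment,
-- so the join keys line up and the sampled data stays internally consistent.
with sampled_customers as (
    select * from raw_customers
    where customer_id in (42, 1337, 2024)  -- the "representative" segment
)

select orders.*
from raw_orders as orders
inner join sampled_customers
    on orders.customer_id = sampled_customers.customer_id
```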
Limit Sampling
Limit sampling is one of the most popularly proposed sampling methods. It can be fast, depending on the data warehouse (the `LIMIT` clause doesn't necessarily avoid full table scans in all data warehouses). Additionally, it should be pretty simple from the end-user perspective: `dbt run --sample --limit=1000`, no other configuration necessary! However, limit sampling is not perfect. The `LIMIT` clause is generally non-deterministic if not coupled with an `ORDER BY`, so two sample runs back to back might produce wildly different results. Additionally, although limit sampling should have a recency bias, which should help with the data completeness of joins, the non-determinism means the correlation of sampled data is weak.
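A sketch of limit sampling, with illustrative names, showing the determinism trade-off called out above:

```sql
-- Without an order by, the 1000 rows returned can differ between runs
select * from raw_orders limit 1000;

-- Coupling the limit with an order by makes the sample deterministic
-- (and gives it a recency bias), at the cost of a potentially expensive sort
select * from raw_orders order by ordered_at desc limit 1000;
```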
Time Based Sampling

Time based sampling should be fast, especially if one is partitioning by the relevant time column. We could even reuse the `event_time` config introduced by microbatch to declare the time column. Then, for the end user, sampling would maybe look something like `dbt run --sample --time="3 days"`. It is deterministic, as two runs with the same time window should produce the same results. Finally, the data completeness is moderate to strong (but not perfect). In a majority of implementations, the data within a given window is related, so if one is sampling 3 days of data, they can reasonably assume most joins should be successful.
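As a sketch, here is what a 3-day time-based sample could compile to. The table and column names are illustrative, and `dateadd` is shown in Snowflake/Redshift style; the actual rendering would be up to the implementation.

```sql
-- Only the most recent 3 days of data are read from the upstream relation,
-- filtering on the column declared as its event_time
select *
from (
    select * from raw_orders
    where ordered_at >= dateadd('day', -3, current_timestamp)
) as _dbt_sample
```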
Proposal: Time Based Sampling (at least for now)

Most, but not all, big datasets have some time information (`created_at`, `ingested_at`, `updated_at`, `adopted_a_cat_at`). Additionally, if we use the already existing `event_time` config that microbatch introduced (oh hey look, there was foreshadowing of this), then the end-user experience is actually incredibly straightforward. It should be fast, and there is a strong-ish guarantee of data completeness for joins. Time based sampling should cover most use cases, and is probably the best bang for buck if we were to implement only one.
How Would Time Based Sampling Work?

Quite simply, one would need to specify an `event_time` on any direct upstream nodes they want to be sampled when running in sample mode. Then, to run in sample mode, all one would need to do is `dbt run --sample --time="3 days"`. Doing so would run the project in sample mode, and sampled models would only have data for the specified time window. Interestingly, this has the added benefit that dbt tests downstream of a sampled run would only test the newest data.
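As a sketch of the setup: the declaration below uses the existing `event_time` config from microbatch, while the `--sample --time` flags are the proposal in this post. Model and column names are illustrative.

```sql
-- models/stg_orders.sql
-- Declaring the time column tells sample mode what to filter on
-- when this node is read during a sampled run.
{{ config(
    materialized='table',
    event_time='ordered_at'
) }}

select * from {{ ref('raw_orders') }}

-- Proposed invocation (from this post): dbt run --sample --time="3 days"
```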
Discussion questions

Closing Meme