Use cache when converting dataflow to SQL #1637

Closed
wants to merge 3 commits into from

Contributor

@courtneyholcomb courtneyholcomb commented Jan 25, 2025

While building the dataflow plan, we sometimes call `node_data_set_resolver.get_output_data_set(node)`. This resolves the data set for a given node and uses a cache if the exact same node has already been resolved. Later, we resolve the same nodes again when we go through the dataflow-to-SQL DAG, which does not use the cache.
This PR updates the dataflow-to-SQL DAG to use the cache so that we don't resolve the exact same node twice.
This produces a lot of snapshot changes, all of which are only changes to subquery aliases. Previously, building the dataflow plan generated a number of aliases that were never used, which pushed the aliases that were actually used to higher increments.
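
To illustrate why sharing the cache shifts alias numbers, here is a toy sketch. It is not the project's actual code: `Node`, `CachingResolver`, `resolve_uncached`, and the `subq_N` alias format are hypothetical names chosen for illustration, and only `get_output_data_set` comes from the description above. It assumes each uncached resolution consumes one alias from a shared counter.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a dataflow plan node."""

    name: str


class CachingResolver:
    """Toy resolver: each uncached resolution consumes one subquery alias."""

    def __init__(self) -> None:
        self._alias_ids = count()
        self._cache: Dict[Node, str] = {}

    def get_output_data_set(self, node: Node) -> str:
        # Return the cached result if this exact node was resolved before.
        if node in self._cache:
            return self._cache[node]
        result = self.resolve_uncached(node)
        self._cache[node] = result
        return result

    def resolve_uncached(self, node: Node) -> str:
        # Every fresh resolution increments the alias counter.
        return f"{node.name} AS subq_{next(self._alias_ids)}"


node = Node("source")

# Without the cache: plan building resolves the node, then the SQL
# conversion resolves it again and receives a higher alias increment.
uncached = CachingResolver()
uncached.resolve_uncached(node)  # alias generated here is never used
sql_without_cache = uncached.resolve_uncached(node)

# With the cache: the SQL conversion reuses the earlier result.
cached = CachingResolver()
cached.get_output_data_set(node)
sql_with_cache = cached.get_output_data_set(node)

print(sql_without_cache)  # source AS subq_1
print(sql_with_cache)     # source AS subq_0
```

In this sketch the rendered SQL differs only in the alias suffix, which is the kind of snapshot diff described above.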

@cla-bot cla-bot bot added the cla:yes label Jan 25, 2025

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@courtneyholcomb courtneyholcomb force-pushed the court/cache-node-to-dataset branch from 9385c72 to 76b27f0 Compare January 25, 2025 01:34