Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAP-492] Support partition_by and cluster_by on python models (#680) #681

Merged

Conversation

kalanyuz
Copy link
Contributor

@kalanyuz kalanyuz commented Apr 26, 2023

resolves #680
resolves #984

Description

Currently, Python models do not support table creation with partition_by and cluster_by when supplied to the model configuration, despite the spark-bigquery-connector having support for it with indirect save mode.

For more details, see #680

Checklist

@kalanyuz
Copy link
Contributor Author

kalanyuz commented May 9, 2023

@dbeatty10 Bump in case this is forgotten 🙇🏼

@dbeatty10 dbeatty10 added the triage:ready-for-review Externally contributed PR has functional approval, ready for code review from Core engineering label May 9, 2023
@dbeatty10
Copy link
Contributor

Not forgotten @kalanyuz, but thank you for reaching out! ❤️

Next step is that this will be assigned to one of our engineers for review in an upcoming sprint.

@007vasy
Copy link

007vasy commented Jun 19, 2023

wen merge?

@kalanyuz
Copy link
Contributor Author

@dbeatty10 Sorry for reaching out again on this one but our team is really looking forward to this patch being merged. Any update on this one? We are very close to patching this and link a local package in our pipeline at this point since it's been quite a while 😢

@mikealfare
Copy link
Contributor

Thanks for your contribution @kalanyuz! I just left one comment on the functional code. Could you please also provide a test, or tests, for this new functionality? That would demonstrate this functionality is working as expected and help us ensure it doesn't break with future changes.

@kalanyuz
Copy link
Contributor Author

@mikealfare I don't think the e2e tests that connects with BigQuery is being exposed to contributors and will be handled by the dbt-bigquery team? Otherwise I can take a shot when I have time.

@mikealfare
Copy link
Contributor

There's a template test environment file here: https://github.com/dbt-labs/dbt-bigquery/blob/main/test.env.example. If you copy it to a file called test.env and supply the contents for a BQ instance to which you have access, the tests will run against that BQ instance. For more information, please refer to the Testing section of our contributor's guide here: https://github.com/dbt-labs/dbt-bigquery/blob/main/CONTRIBUTING.md.

@kalanyuz kalanyuz force-pushed the feature/python_models_partioning_clustering branch from 97af71c to c4cef68 Compare October 2, 2023 08:25
@kalanyuz
Copy link
Contributor Author

kalanyuz commented Oct 3, 2023

@mikealfare
Finally had the time to come back to this!
Added an e2e test for partitioned python models.

Have you considered breaking out the partition and cluster options as their own variables prior to inserting them into the df.write flow? I think it would make it more readable, especially given that we can't indent on python models.

Unfortunately these parameters (partitionField, partitionType, and clusteredFields) , cannot be set separately via spark.conf like bucket option. Or do you have some other ideas in mind?

@kalanyuz kalanyuz force-pushed the feature/python_models_partioning_clustering branch from b254b10 to 8cc7248 Compare October 3, 2023 10:23
@kalanyuz kalanyuz requested a review from mikealfare October 3, 2023 23:55
@kalanyuz kalanyuz force-pushed the feature/python_models_partioning_clustering branch from 2d21d5d to c1eeae9 Compare October 11, 2023 01:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes ok to test triage:ready-for-review Externally contributed PR has functional approval, ready for code review from Core engineering
Projects
None yet
5 participants