feat: adds Schema class and modifies schema handling #2081

chalmerlowe · 2024-11-27T19:00:15Z

This PR adds a Schema class to python-bigquery as part of the Google internal PANGEA work.

Historically, table.schema has been a list of SchemaFields.

Schema objects will now include not only a list of fields, but will have an attribute called foreign_type_info.

The aim of the work here is create, as much as possible, a drop-in replacement for schema.

The end user should not need to know whether they are looking at a list-based schema OR a class Schema-based object unless they need access to the foreign_type_info attr.

They should still be able to:

iterate over a Schema just like you would a list
index and slice the Schema just like a list

For users who already have a list in their code, nothing should break.

For future users who will need a Schema object, they should be able to instantiate one and it should just work.

TODO:

add docstrings
add type hints
remove or modify comment strings in the code to help future maintainers

conventional-commit-lint-gcf · 2024-11-27T19:00:19Z

🤖 I detect that the PR title and the commit message differ and there's only one commit. To use the PR title for the commit history, you can use Github's automerge feature with squashing, or use automerge label. Good luck human!

-- conventional-commit-lint bot
https://conventionalcommits.org/

google/cloud/bigquery/schema.py

tswast · 2024-12-06T22:11:51Z

google/cloud/bigquery/client.py

@@ -3680,7 +3680,7 @@ def insert_rows(
        if selected_fields is not None:
            schema = selected_fields

-        if len(schema) == 0:
+        if not schema:


I'm curious, could you make len(schema_object) work? I guess this is to allow None as schema now, too? If table.schema could indeed return None now, that might be considered a breaking change, depending on how strict we want to be about types.

Either way, folks might have to change their code to deal with missing schema, so something to consider. Maybe we return an object of len(...) == 0 for now and have a warning that it will be changed to None in future?

I dislike the idiom of schema-less tables, but it is indeed something we've seen people do intentionally in the wild. My recollection is that it was used historically as a signal for some workflow state (e.g. some kind of is-table-present check).

I was under the impression that Table.schema can currently be set to None. It appears that we allow for that in the existing code:

https://github.com/googleapis/python-bigquery/blob/d4d39acb8574f0d06d4e490b859e5fe6b57d0d9e/google/cloud/bigquery/table.py#L458C1-L466C84

@schema.setter def schema(self, value): api_field = self._PROPERTY_TO_API_FIELD["schema"] if value is None: self._properties[api_field] = None else: value = _to_schema_fields(value) self._properties[api_field] = {"fields": _build_schema_resource(value)}

I switched it back to the old check.

tswast · 2024-12-06T22:15:29Z

google/cloud/bigquery/client.py

@@ -4029,7 +4029,7 @@ def list_rows(

        # No schema, but no selected_fields. Assume the developer wants all
        # columns, so get the table resource for them rather than failing.
-        elif len(schema) == 0:
+        elif not schema:


Same here. Let's be careful about breaking changes regarding table.schema allowing None.

I also switched this back to the old check.

tswast · 2024-12-06T22:15:58Z

google/cloud/bigquery/schema.py

-        Exception: If ``schema`` is not a sequence, or if any item in the
-        sequence is not a :class:`~google.cloud.bigquery.schema.SchemaField`
-        instance or a compatible mapping representation of the field.
+    """TODO docstring


Flagging with a comment just so we don't lose track of this TODO.

tswast · 2024-12-06T22:16:39Z

google/cloud/bigquery/schema.py

@@ -796,8 +792,6 @@ def serde_info(self) -> Any:
        prop = _get_sub_prop(self._properties, ["serDeInfo"])
        if prop is not None:
            prop = StorageDescriptor().from_api_repr(prop)
-            print(f"DINOSAUR prop: {prop}")


Lol, oops! 😆 Could be worth a separate cleanup PR to have something small and easy to approve.

tswast · 2024-12-06T22:22:03Z

google/cloud/bigquery/schema.py

+        self._properties["foreignTypeInfo"] = value
+
+    @property
+    def _fields(self) -> Any:


Why private? I guess because we're trying to emulate a list and want to support mutation that way?

Would be worth an explanation.

You are correct, the idea behind having it be private was to try and emulate a list.
I can add a note to that effect when I add docstrings.

tswast · 2024-12-06T22:28:43Z

google/cloud/bigquery/table.py

-            return _parse_schema_resource(prop)
+        elif isinstance(prop, Schema):
+            return prop
+        return _parse_schema_resource(prop)


Presumably we want to return a Schema object now.

Suggested change

return _parse_schema_resource(prop)

return Schema.from_api_repr(prop)

Aside: Is _parse_schema_resource dead code? I assume it still returns a list of SchemaField?

_parse_schema_resource still returns a list of SchemaFields for backwards compatibility.

Where possible I -tried- to stay consist about what the user provides versus gets back: if someone handed the library a list of SchemaFields (i.e. cause that is what they are used to) then we maintained that state. I tried to avoid randomly converting lists into Schema objects to avoid user surprise.

tswast · 2024-12-06T22:28:57Z

google/cloud/bigquery/table.py


    @schema.setter
    def schema(self, value):
        api_field = self._PROPERTY_TO_API_FIELD["schema"]

        if value is None:
            self._properties[api_field] = None
+        elif isinstance(value, Schema):
+            self._properties[api_field] = value


Suggested change

self._properties[api_field] = value

self._properties[api_field] = value.to_api_repr()

tswast · 2024-12-06T22:30:18Z

google/cloud/bigquery/schema.py

+            Dict[str, Any]:
+                A dictionary in the format used by the BigQuery API.
+        """
+        return copy.deepcopy(self._properties)


It might be a bit wasteful to make deepcopy here and in from_api_repr. Indeed it's safer, but could add a lot of overhead. IIRC we actually removed some deepcopy calls from SchemaField because it was slowing down customers who build dynamic schemas in their code.

See: #6 and #26

tswast · 2024-12-06T22:33:17Z

google/cloud/bigquery/schema.py

+            An instance of the class initialized with data from 'resource'.
+        """
+        config = cls("")
+        config._properties = copy.deepcopy(resource)


Same here. May be better to omit the deepcopy.

tswast · 2024-12-06T22:34:37Z

google/cloud/bigquery/table.py

@@ -452,17 +453,20 @@ def schema(self):
                instance or a compatible mapping representation of the field.
        """
        prop = self._properties.get(self._PROPERTY_TO_API_FIELD["schema"])
-        if not prop:
+        if not prop:  # if empty Schema, empty list, None
            return []


Could we return an "empty" Schema here for consistency?

Suggested change

return []

return Schema()

@tswast

Does the following influence your thoughts on this comment?

Historically, we have returned an empty list.
An existing user would still expect a list.
A new user should probably not see an empty Schema object, since they are only needed if the user is gonna try to access an external data source (where they have to identify the foreign_type_info).

My general principal, focused on backwards compatibility is:

avoid as many changes to what the code did before, unless necessary

as much as possible, force the choice to move to Schema to be an explicit decision by any user (especially since external data sources are likely to be the exception, not the rule).

I could go either way on this one. One one hand, using Schema everywhere would make the new Schema class more discoverable and fewer cases to consider when type checking. On the other hand, providing a real list in all cases where Schema is not needed would provide the most backwards compatible option.

On further consideration, I agree with your assessment that avoiding breaking existing users is preferable. +1 on empty list in this case.

@tswast

this is a related concept.
Right now, if someone gives me a Schema object, I have been storing that in _properties and I see your point about avoiding that.

That complicates getting a Schema object back out, if/when you need it.

What are your thoughts on the following approach? (I seem to recall you mention keeping objects in a separate place and ensuring that _properties is a view into the object)

@schema.setter def schema(self, value): api_field = self._PROPERTY_TO_API_FIELD["schema"] if value is None: self._properties[api_field] = None elif isinstance(value, Schema): # store original as selt.<original_schema_object> # TODO find a better name self.<original_schema_object> = value self.properties[api_field] = self.<original_schema_object>.to_api_repr() else: value = _to_schema_fields(value) self._properties[api_field] = {"fields": _build_schema_resource(value)} @property def schema(self): # if user stored a Schema object, short circuit and return early. <original_schema_object> = self.get(<original_schema_object>, None) if isinstance(original_schema_object, Schema): return self.<original_schema_object> # if user did not store a Schema object, behave as normal prop = self._properties.get(self._PROPERTY_TO_API_FIELD["schema"]) if not prop: # either None OR empty list return [] return _parse_schema_resource(prop)

My worry with self.<original_schema_object> = value is that it can get out of sync with the values stored in _properties, especially since Schema supports in-place mutation.

Also, there are other cases to consider besides "someone gave us a Schema object". Importantly, what if someone calls get_table("proj.dataset.some_table_with_pangea_features_enabled")? IMO this case is one where we should return a Schema object.

Alternatively, we introduce separate properties for the pangea fields to Table and forget all this Schema class business. Something like Table.schema_foreign_type_info could work. I believe we already do similar for certain statistics on job, for example. The only gotcha here is that we'd need to make sure we merge these two properties (schema and schema_foreign_type_info) rather than replace the whole _properties["schema"] when one sets either of these.

tswast · 2024-12-06T22:36:32Z

Thanks! Overall goals outlined in the description look good to me.

The aim of the work here is create, as much as possible, a drop-in replacement for schema.

The end user should not need to know whether they are looking at a list-based schema OR a class Schema-based object unless they need access to the foreign_type_info attr.

They should still be able to:

iterate over a Schema just like you would a list

index and slice the Schema just like a list

For users who already have a list in their code, nothing should break.

For future users who will need a Schema object, they should be able to instantiate one and it should just work.

google/cloud/bigquery/schema.py

tswast · 2024-12-26T18:43:48Z

google/cloud/bigquery/schema.py

+    @_fields.setter
+    def _fields(self, value: list) -> None:
+        value = _isinstance_or_raise(value, list, none_allowed=True)
+        self._properties["_fields"] = value


Why _fields. It's fields in the REST API. Also, this should convert to the list to API representation. https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#TableSchema

Changed _fields to fields. Changed tests to match.

* feat: add property for maxStaleness in table definitions Signed-off-by: Yu Ishikawa <[email protected]> * Update google/cloud/bigquery/table.py --------- Signed-off-by: Yu Ishikawa <[email protected]> Co-authored-by: Lingqing Gan <[email protected]>

* add type hints * Update client.py Moves import from being used solely during specific checks to being more universally available. * Update google/cloud/bigquery/client.py * Update client.py testing some minor changes to deal with mypy quirks * Update google/cloud/bigquery/client.py --------- Co-authored-by: Chalmer Lowe <[email protected]>

Source-Link: googleapis/synthtool@e808c98 Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:8e3e7e18255c22d1489258d0374c901c01f9c4fd77a12088670cd73d580aa737 Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>

Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.4 to 3.1.5. - [Release notes](https://github.com/pallets/jinja/releases) - [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst) - [Commits](pallets/jinja@3.1.4...3.1.5) --- updated-dependencies: - dependency-name: jinja2 dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…hemaField` (#2097) * feat: preserve unknown fields from the REST API representaton in `SchemaField` * 🦉 Updates from OwlBot post-processor See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md * remove unnecessary variable * remove unused private method * fix pytype --------- Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>

* fix: adds test of roundingmode as a str * 🦉 Updates from OwlBot post-processor See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md --------- Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>

tswast · 2025-01-06T15:42:25Z

google/cloud/bigquery/schema.py

+        schema = Schema(
+            [
+                field
+                if isinstance(field, SchemaField)
+                else SchemaField.from_api_repr(field)
+                for field in schema
+            ],
+            foreign_type_info=schema.foreign_type_info,
+        )
+        return schema


If we get a Schema object, the Schema object is responsible for all conversions. I don't think we need to make a copy of it or anything.

Returning the original object should be sufficient.

Suggested change

schema = Schema(

[

field

if isinstance(field, SchemaField)

else SchemaField.from_api_repr(field)

for field in schema

],

foreign_type_info=schema.foreign_type_info,

)

return schema

return schema

This also future-proofs us a bit. In case any additional properties are added to Schema, we won't have to update this code.

tswast · 2025-01-06T15:44:17Z

google/cloud/bigquery/schema.py

-        Exception: If ``schema`` is not a sequence, or if any item in the
-        sequence is not a :class:`~google.cloud.bigquery.schema.SchemaField`
-        instance or a compatible mapping representation of the field.
+        ValueError: If the items in ``schema`` are not valid.
    """

    for field in schema:


We don't need this check if we get a Schema object. Let's move that above.

Suggested change

for field in schema:

if isinstance(schema, Schema):

return schema

for field in schema:

Note: I just return schema if given a Schema. See my comment below.

Alternatively, we could call return Schema.from_api_repr(schema.to_api_repr()), but I don't see much reason to do so.

tswast · 2025-01-06T15:47:59Z

google/cloud/bigquery/schema.py

+
+    @property
+    def _fields(self) -> Optional[List[Any]]:
+        return self._properties.get("fields")


Please avoid storing non-REST API representation in _properties. Given the usage of UserList, I think just a regular instance variable would be all that's needed.

tswast · 2025-01-06T15:53:37Z

google/cloud/bigquery/schema.py

+    # TODO docstrings and type hints
+    def __init__(self, fields=None, foreign_type_info=None):
+        self._properties = {}
+        self._fields = [] if fields is None else list(fields)  # Internal List


Unresolving. I don't think _properties for "fields" makes sense in the context of UserList, because we want to support all sorts of mutations and such. For that, let's do something like this instead:

Suggested change

self._fields = [] if fields is None else list(fields) # Internal List

self.data = _to_schema_fields(fields) # Internal List

And remove the data and _fields properties. Just a data instance variable is needed. That also has the benefit of not documenting data, which isn't really part of our public API.

tswast · 2025-01-06T16:01:20Z

google/cloud/bigquery/schema.py

+        # If this is a RECORD type, then sub-fields are also included,
+        # add this to the serialized representation.
+        answer = self._properties.copy()
+        if self._fields == []:


This would have to be updated if we remove "fields" from _properties, as I suggest above.

Also, per go/pystyle#truefalse-evaluations-decision it is preferred to use the falsey nature of empty list if possible.

Suggested change

if self._fields == []:

if not self.data:

tswast · 2025-01-06T16:05:57Z

google/cloud/bigquery/schema.py

+        schemafields = any([isinstance(f, SchemaField) for f in self._fields])
+        if schemafields:
+            answer["fields"] = [f.to_api_repr() for f in self._fields]


This seems wrong. We should allow people to insert Mapping / dict into their Schema / UserList objects. Even if no remaining field is a SchemaField, we should still return something.

I would expect something more like this:

Suggested change

schemafields = any([isinstance(f, SchemaField) for f in self._fields])

if schemafields:

answer["fields"] = [f.to_api_repr() for f in self._fields]

answer["fields"] = [

# Users may inject the REST representation of schema fields into the list.

# If it's not a `SchemaField` object, then we can assume that's what they've

# done.

field.to_api_repr() if hasattr(field, "to_api_repr") else field

for field in self.data

]

I'd also like to see some tests where the user replaces all fields with the REST representation of different fields. Something like:

def test_schema_to_api_repr_with_rest_entries(): schema = Schema(fields=[SchemaField(), ....]) del schema[:] rest_fields = [{"name": "...", "type": "..."}, ...] schema.extend(rest_fields) got = schema.to_api_repr() assert got = {"fields": rest_fields}

Edit: Might be good to have another test such as above but with mixing and matching REST and SchemaField objects. Perhaps by not deleting all entries but instead inserting some REST fields.

tswast · 2025-01-06T16:13:13Z

google/cloud/bigquery/schema.py

+        config = cls([])
+        config._properties = copy.copy(resource)


With using UserList, you will need to pull out fields and pass them into the constructor.

Suggested change

config = cls([])

config._properties = copy.copy(resource)

resource = copy.copy(resource)

fields = resource.pop("fields", [])

config = cls(fields)

config._properties = resource

This makes sure .data has all SchemaField objects while preserving the future-proof ability to add _properties (other than fields).

tswast · 2025-01-06T16:14:51Z

google/cloud/bigquery/table.py

            return []
-        else:
-            return _parse_schema_resource(prop)
+        elif isinstance(prop, Schema):


This comment still stands. We shouldn't have Schema objects in _properties. Instead, check for the existence of any keys besides "fields" to determine if Schema or list should be used.

tswast · 2025-01-06T16:22:18Z

Re: CLA failure, might need a merge with pangea-v1alpha latest or even a rebase.

Adds Schema class and modifies schema handling

47474a9

chalmerlowe requested review from a team as code owners November 27, 2024 19:00

chalmerlowe requested a review from hongalex November 27, 2024 19:00

blunderbuss-gcf bot assigned obada-ab Nov 27, 2024

product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery API. labels Nov 27, 2024

chalmerlowe assigned chalmerlowe and unassigned obada-ab Dec 2, 2024

Updates some features of schema handle None more effectively.

f22246a

chalmerlowe commented Dec 4, 2024

View reviewed changes

google/cloud/bigquery/schema.py Outdated Show resolved Hide resolved

Update google/cloud/bigquery/schema.py

7670971

chalmerlowe added kokoro:run Add this label to force Kokoro to re-run the tests. kokoro:force-run Add this label to force Kokoro to re-run the tests. and removed kokoro:run Add this label to force Kokoro to re-run the tests. labels Dec 4, 2024

yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Dec 5, 2024

Updates Schema object and class related tests for Schema and Table

c37be67

product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Dec 6, 2024

chalmerlowe added 3 commits December 6, 2024 18:00

Updates Schema tests for coverage, fixes serdefinfo test

aba6c10

Updates Schema tests

83aacdf

Merge branch 'pangea-v1alpha' into feat-b358215039-update-schema-object

7c08ca8

tswast requested changes Dec 6, 2024

View reviewed changes

Linchin reviewed Dec 12, 2024

View reviewed changes

google/cloud/bigquery/schema.py Outdated Show resolved Hide resolved

fixes formatting

af02937

tswast reviewed Dec 26, 2024

View reviewed changes

chalmerlowe added 2 commits December 27, 2024 12:46

updates len() checks

ef2b95d

update tests based on Schema superclass UserList

cfa609c

removes whitespace

1f00e76

tswast mentioned this pull request Dec 27, 2024

feat: preserve unknown fields from the REST API representation in SchemaField #2097

Merged

4 tasks

yu-iskw and others added 12 commits January 2, 2025 18:37

more rebase and merge conflict resolution

1837f81

more conflict resolution

94e9e6e

update tests based on Schema superclass UserList

db6ef9e

Merge branch 'pangea-v1alpha' into feat-b358215039-update-schema-object

0b0b85a

yet another round of merge conflicts.

ba2795b

Clean up some failing tests

67c0182

chalmerlowe added kokoro:run Add this label to force Kokoro to re-run the tests. and removed kokoro:run Add this label to force Kokoro to re-run the tests. labels Jan 3, 2025

chalmerlowe added 2 commits January 3, 2025 17:57

adds tests to ensure code coverage

3fa949c

tweaks tests to increase code coverage

7c77bfb

tswast requested changes Jan 6, 2025

View reviewed changes

	return _parse_schema_resource(prop)
	return Schema.from_api_repr(prop)

	self._properties[api_field] = value
	self._properties[api_field] = value.to_api_repr()

	self._fields = [] if fields is None else list(fields) # Internal List
	self.data = _to_schema_fields(fields) # Internal List

-        config = cls([])
-        config._properties = copy.copy(resource)
+        resource = copy.copy(resource)
+        fields = resource.pop("fields", [])
+        config = cls(fields)
+        config._properties = resource

feat: adds Schema class and modifies schema handling #2081

Are you sure you want to change the base?

feat: adds Schema class and modifies schema handling #2081

Conversation

chalmerlowe commented Nov 27, 2024 • edited Loading

conventional-commit-lint-gcf bot commented Nov 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chalmerlowe Jan 6, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tswast commented Dec 6, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tswast Jan 6, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tswast Jan 6, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tswast commented Jan 6, 2025

chalmerlowe commented Nov 27, 2024 •

edited

Loading

conventional-commit-lint-gcf bot commented Nov 27, 2024 •

edited

Loading

chalmerlowe Jan 6, 2025 •

edited

Loading

tswast Jan 6, 2025 •

edited

Loading

tswast Jan 6, 2025 •

edited

Loading