Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: adds Schema class and modifies schema handling #2081

Open
wants to merge 25 commits into
base: pangea-v1alpha
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
47474a9
Adds Schema class and modifies schema handling
chalmerlowe Nov 27, 2024
f22246a
Updates some features of schema handle None more effectively.
chalmerlowe Dec 2, 2024
7670971
Update google/cloud/bigquery/schema.py
chalmerlowe Dec 4, 2024
c37be67
Updates Schema object and class related tests for Schema and Table
chalmerlowe Dec 6, 2024
aba6c10
Updates Schema tests for coverage, fixes serdefinfo test
chalmerlowe Dec 6, 2024
83aacdf
Updates Schema tests
chalmerlowe Dec 6, 2024
7c08ca8
Merge branch 'pangea-v1alpha' into feat-b358215039-update-schema-object
chalmerlowe Dec 6, 2024
af02937
fixes formatting
chalmerlowe Dec 19, 2024
ef2b95d
updates len() checks
chalmerlowe Dec 27, 2024
cfa609c
update tests based on Schema superclass UserList
chalmerlowe Dec 27, 2024
1f00e76
removes whitespace
chalmerlowe Dec 27, 2024
39d4c1d
feat: add property for maxStaleness in table definitions (#2087)
yu-iskw Dec 6, 2024
c4f5fd5
feat: add type hints to Client (#2044)
rinarakaki Dec 10, 2024
7006a31
chore(python): update dependencies in .kokoro/docker/docs (#2088)
gcf-owl-bot[bot] Dec 17, 2024
561f05f
chore(deps): bump jinja2 from 3.1.4 to 3.1.5 in /.kokoro (#2094)
dependabot[bot] Dec 26, 2024
e3f57a6
feat: preserve unknown fields from the REST API representation in `Sc…
tswast Dec 27, 2024
3280e79
Fix: add roundingmode as str test (#2098)
chalmerlowe Jan 2, 2025
1837f81
more rebase and merge conflict resolution
chalmerlowe Jan 2, 2025
94e9e6e
more conflict resolution
chalmerlowe Jan 2, 2025
db6ef9e
update tests based on Schema superclass UserList
chalmerlowe Dec 27, 2024
0b0b85a
Merge branch 'pangea-v1alpha' into feat-b358215039-update-schema-object
chalmerlowe Jan 2, 2025
ba2795b
yet another round of merge conflicts.
chalmerlowe Jan 2, 2025
67c0182
Clean up some failing tests
chalmerlowe Jan 3, 2025
3fa949c
adds tests to ensure code coverage
chalmerlowe Jan 3, 2025
7c77bfb
tweaks tests to increase code coverage
chalmerlowe Jan 3, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
183 changes: 164 additions & 19 deletions google/cloud/bigquery/schema.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@
import collections
import copy
import enum
from typing import Any, cast, Dict, Iterable, Optional, Union
from typing import Any, cast, Dict, Iterable, Optional, Union, List

from google.cloud.bigquery import _helpers
from google.cloud.bigquery import standard_sql
Expand Down Expand Up @@ -264,11 +264,11 @@ def __init__(
self._properties["fields"] = [field.to_api_repr() for field in fields]

@classmethod
def from_api_repr(cls, api_repr: dict) -> "SchemaField":
def from_api_repr(cls, api_repr: Dict[str, Any]) -> "SchemaField":
"""Return a ``SchemaField`` object deserialized from a dictionary.

Args:
api_repr (Mapping[str, str]): The serialized representation
api_repr (Dict[str, str]): The serialized representation
of the SchemaField, such as what is output by
:meth:`to_api_repr`.

Expand Down Expand Up @@ -489,7 +489,7 @@ def __repr__(self):


def _parse_schema_resource(info):
"""Parse a resource fragment into a schema field.
"""Parse a resource fragment into a sequence of schema fields.

Args:
info: (Mapping[str, Dict]): should contain a "fields" key to be parsed
Expand All @@ -513,25 +513,33 @@ def _build_schema_resource(fields):
return [field.to_api_repr() for field in fields]


def _to_schema_fields(schema):
"""Coerce `schema` to a list of schema field instances.
def _to_schema_fields(
schema: Union[Schema, List[Union[SchemaField, Dict[str, Any]]]]
) -> Union[Schema, List[SchemaField]]:
"""Convert the input to either a Schema object OR a list of SchemaField objects.

This helper method ensures that the fields in the schema are SchemaField objects.
It accepts:

* A :class:`~google.cloud.bigquery.schema.Schema` instance: It will
convert items that are mappings to
:class:`~google.cloud.bigquery.schema.SchemaField` instances and
preserve foreign_type_info.

* A list of
:class:`~google.cloud.bigquery.schema.SchemaField` instances.

* A list of mappings: It will convert each of the mapping items to
a :class:`~google.cloud.bigquery.schema.SchemaField` instance.

Args:
schema(Sequence[Union[ \
:class:`~google.cloud.bigquery.schema.SchemaField`, \
Mapping[str, Any] \
]]):
Table schema to convert. If some items are passed as mappings,
their content must be compatible with
:meth:`~google.cloud.bigquery.schema.SchemaField.from_api_repr`.
schema: The schema to convert.

Returns:
Sequence[:class:`~google.cloud.bigquery.schema.SchemaField`]
The schema as a list of SchemaField objects or a Schema object.

Raises:
Exception: If ``schema`` is not a sequence, or if any item in the
sequence is not a :class:`~google.cloud.bigquery.schema.SchemaField`
instance or a compatible mapping representation of the field.
ValueError: If the items in ``schema`` are not valid.
"""

for field in schema:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this check if we get a Schema object. Let's move that above.

Suggested change
for field in schema:
if isinstance(schema, Schema):
return schema
for field in schema:

Note: I just return schema if given a Schema. See my comment below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, we could call return Schema.from_api_repr(schema.to_api_repr()), but I don't see much reason to do so.

Expand All @@ -541,6 +549,17 @@ def _to_schema_fields(schema):
"mapping representations."
)

if isinstance(schema, Schema):
schema = Schema(
[
field
if isinstance(field, SchemaField)
else SchemaField.from_api_repr(field)
for field in schema
],
foreign_type_info=schema.foreign_type_info,
)
return schema
Comment on lines +553 to +562
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we get a Schema object, the Schema object is responsible for all conversions. I don't think we need to make a copy of it or anything.

Returning the original object should be sufficient.

Suggested change
schema = Schema(
[
field
if isinstance(field, SchemaField)
else SchemaField.from_api_repr(field)
for field in schema
],
foreign_type_info=schema.foreign_type_info,
)
return schema
return schema

This also future-proofs us a bit. In case any additional properties are added to Schema, we won't have to update this code.

return [
field if isinstance(field, SchemaField) else SchemaField.from_api_repr(field)
for field in schema
Expand Down Expand Up @@ -761,8 +780,6 @@ def serde_info(self) -> Any:
prop = _get_sub_prop(self._properties, ["serDeInfo"])
if prop is not None:
prop = StorageDescriptor().from_api_repr(prop)
print(f"DINOSAUR prop: {prop}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol, oops! 😆 Could be worth a separate cleanup PR to have something small and easy to approve.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.


return prop

@serde_info.setter
Expand Down Expand Up @@ -886,3 +903,131 @@ def from_api_repr(cls, resource: dict) -> SerDeInfo:
config = cls("")
config._properties = copy.deepcopy(resource)
return config


class Schema(collections.UserList):
"""
Represents a BigQuery schema, defining the structure and types of data.

This class manages a list of schema fields and provides methods for
serialization and deserialization with the BigQuery API. It extends the
`collections.UserList` class to allow for list-like behavior.

Args:
fields (Optional[List[Any]], optional): A list of SchemaField objects representing the fields
in the schema. Defaults to None, which creates an empty schema.
foreign_type_info (Optional[str], optional): Optional type information relevant for foreign
systems. Defaults to None.
"""

def __init__(
self,
fields: Optional[List[Any]] = None,
foreign_type_info: Optional[str] = None,
):
self._properties: Dict[str, Any] = {}
self._fields = [] if fields is None else list(fields) # Internal List
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to convert the items of fields from SchemaField into dictionaries here? Ideally _properties would directly reflect the REST representation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolving. I missed that _fields was a property with a setter. I think your code does actually convert Iterable of SchemaField or dict to the correct API representation in _properties

Copy link
Contributor

@tswast tswast Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unresolving. I don't think _properties for "fields" makes sense in the context of UserList, because we want to support all sorts of mutations and such. For that, let's do something like this instead:

Suggested change
self._fields = [] if fields is None else list(fields) # Internal List
self.data = _to_schema_fields(fields) # Internal List

And remove the data and _fields properties. Just a data instance variable is needed. That also has the benefit of not documenting data, which isn't really part of our public API.

self.foreign_type_info = foreign_type_info

@property
def foreign_type_info(self) -> Optional[str]:
return self._properties.get("foreignTypeInfo")

@foreign_type_info.setter
def foreign_type_info(self, value: Optional[str]) -> None:
"""
Sets the foreign type information for this schema.

Args:
value (Optional[str]): The foreign type information, can be set to None.
"""
value = _isinstance_or_raise(value, str, none_allowed=True)
self._properties["foreignTypeInfo"] = value

@property
def _fields(self) -> Optional[List[Any]]:
return self._properties.get("fields")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid storing non-REST API representation in _properties. Given the usage of UserList, I think just a regular instance variable would be all that's needed.


@_fields.setter
def _fields(self, value: Optional[List[Any]]) -> None:
"""
Sets the internal list of field definitions.

NOTE: In the API representation the 'fields' key points to a sequence of schema fields.
To maintain a similarity in names, the 'Schema._fields' attribute points to the
'_properties["fields"]' key. Schema class is superclassed by UserList, which requires the
use of a '.data' attribute. The decision was made to have both of these attributes point to
the same key "fields" in the '_properties' dictionary. Thus '.data' is an alias
for '_fields'.

Args:
value (Optional[List[Any]]): A list of schema fields, can be set to None.
"""
value = _isinstance_or_raise(value, list, none_allowed=True)
value = _build_schema_resource(value)
self._properties["fields"] = value

@property
def data(self) -> Any:
return self._properties.get("fields")

@data.setter
def data(self, value) -> None:
"""
Sets the list of schema fields.

NOTE: In the API representation the 'fields' key points to a sequence of schema fields.
To maintain a similarity in names, the 'Schema._fields' attribute points to the
'_properties["fields"]' key. Schema class is superclassed by UserList, which requires the
use of a '.data' attribute. The decision was made to have both of these attributes point to
the same key "fields" in the '_properties' dictionary. Thus '.data' is an alias
for '_fields'.

Args:
value (Optional[List[Any]]): A list of schema fields, can be set to None.
"""

value = _isinstance_or_raise(value, list, none_allowed=True)
value = _build_schema_resource(value)
self._properties["fields"] = value

def __str__(self) -> str:
return f"Schema({self._fields}, {self.foreign_type_info})"

def __repr__(self) -> str:
return f"Schema({self._fields!r}, {self.foreign_type_info!r})"

def to_api_repr(self) -> Dict[str, Any]:
"""Build an API representation of this object.

Returns:
Dict[str, Any]:
A dictionary in the format used by the BigQuery API.
If the schema contains SchemaField objects, the fields are
also converted to their API representations.
"""
# If this is a RECORD type, then sub-fields are also included,
# add this to the serialized representation.
answer = self._properties.copy()
if self._fields == []:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would have to be updated if we remove "fields" from _properties, as I suggest above.

Also, per go/pystyle#truefalse-evaluations-decision it is preferred to use the falsey nature of empty list if possible.

Suggested change
if self._fields == []:
if not self.data:

return answer
schemafields = any([isinstance(f, SchemaField) for f in self._fields])
if schemafields:
answer["fields"] = [f.to_api_repr() for f in self._fields]
Comment on lines +1014 to +1016
Copy link
Contributor

@tswast tswast Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems wrong. We should allow people to insert Mapping / dict into their Schema / UserList objects. Even if no remaining field is a SchemaField, we should still return something.

I would expect something more like this:

Suggested change
schemafields = any([isinstance(f, SchemaField) for f in self._fields])
if schemafields:
answer["fields"] = [f.to_api_repr() for f in self._fields]
answer["fields"] = [
# Users may inject the REST representation of schema fields into the list.
# If it's not a `SchemaField` object, then we can assume that's what they've
# done.
field.to_api_repr() if hasattr(field, "to_api_repr") else field
for field in self.data
]

I'd also like to see some tests where the user replaces all fields with the REST representation of different fields. Something like:

def test_schema_to_api_repr_with_rest_entries():
    schema = Schema(fields=[SchemaField(), ....])
    del schema[:]
    rest_fields = [{"name": "...", "type": "..."}, ...]
    schema.extend(rest_fields)
    got = schema.to_api_repr()
    assert got = {"fields": rest_fields}

Edit: Might be good to have another test such as above but with mixing and matching REST and SchemaField objects. Perhaps by not deleting all entries but instead inserting some REST fields.

return answer

@classmethod
def from_api_repr(cls, resource: Dict[str, Any]) -> "Schema":
"""Factory: constructs an instance of the class (cls)
given its API representation.

Args:
resource (Dict[str, Any]):
API representation of the object to be instantiated.

Returns:
An instance of the class initialized with data from 'resource'.
"""
config = cls([])
config._properties = copy.copy(resource)
Comment on lines +1031 to +1032
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With using UserList, you will need to pull out fields and pass them into the constructor.

Suggested change
config = cls([])
config._properties = copy.copy(resource)
resource = copy.copy(resource)
fields = resource.pop("fields", [])
config = cls(fields)
config._properties = resource

This makes sure .data has all SchemaField objects while preserving the future-proof ability to add _properties (other than fields).

return config
13 changes: 9 additions & 4 deletions google/cloud/bigquery/table.py
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,7 @@
from google.cloud.bigquery.schema import _build_schema_resource
from google.cloud.bigquery.schema import _parse_schema_resource
from google.cloud.bigquery.schema import _to_schema_fields
from google.cloud.bigquery.schema import Schema
from google.cloud.bigquery.external_config import ExternalCatalogTableOptions

if typing.TYPE_CHECKING: # pragma: NO COVER
Expand Down Expand Up @@ -453,17 +454,20 @@ def schema(self):
instance or a compatible mapping representation of the field.
"""
prop = self._properties.get(self._PROPERTY_TO_API_FIELD["schema"])
if not prop:
if not prop: # if empty Schema, empty list, None
return []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we return an "empty" Schema here for consistency?

Suggested change
return []
return Schema()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tswast

Does the following influence your thoughts on this comment?

Historically, we have returned an empty list.
An existing user would still expect a list.
A new user should probably not see an empty Schema object, since they are only needed if the user is gonna try to access an external data source (where they have to identify the foreign_type_info).

My general principal, focused on backwards compatibility is:

  • avoid as many changes to what the code did before, unless necessary
  • as much as possible, force the choice to move to Schema to be an explicit decision by any user (especially since external data sources are likely to be the exception, not the rule).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could go either way on this one. One one hand, using Schema everywhere would make the new Schema class more discoverable and fewer cases to consider when type checking. On the other hand, providing a real list in all cases where Schema is not needed would provide the most backwards compatible option.

On further consideration, I agree with your assessment that avoiding breaking existing users is preferable. +1 on empty list in this case.

Copy link
Collaborator Author

@chalmerlowe chalmerlowe Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tswast

this is a related concept.
Right now, if someone gives me a Schema object, I have been storing that in _properties and I see your point about avoiding that.

That complicates getting a Schema object back out, if/when you need it.

What are your thoughts on the following approach? (I seem to recall you mention keeping objects in a separate place and ensuring that _properties is a view into the object)


    @schema.setter
    def schema(self, value):
        api_field = self._PROPERTY_TO_API_FIELD["schema"]

        if value is None:
            self._properties[api_field] = None
        elif isinstance(value, Schema):
            # store original as selt.<original_schema_object> # TODO find a better name
            self.<original_schema_object> = value
            self.properties[api_field] = self.<original_schema_object>.to_api_repr()
        else:
            value = _to_schema_fields(value)
            self._properties[api_field] = {"fields": _build_schema_resource(value)}


    @property
    def schema(self):
        # if user stored a Schema object, short circuit and return early.
        <original_schema_object> = self.get(<original_schema_object>, None)
        if isinstance(original_schema_object, Schema):
            return self.<original_schema_object>

        # if user did not store a Schema object, behave as normal
        prop = self._properties.get(self._PROPERTY_TO_API_FIELD["schema"])
        if not prop: # either None OR empty list  
            return []
        
        return _parse_schema_resource(prop)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My worry with self.<original_schema_object> = value is that it can get out of sync with the values stored in _properties, especially since Schema supports in-place mutation.

Also, there are other cases to consider besides "someone gave us a Schema object". Importantly, what if someone calls get_table("proj.dataset.some_table_with_pangea_features_enabled")? IMO this case is one where we should return a Schema object.

Alternatively, we introduce separate properties for the pangea fields to Table and forget all this Schema class business. Something like Table.schema_foreign_type_info could work. I believe we already do similar for certain statistics on job, for example. The only gotcha here is that we'd need to make sure we merge these two properties (schema and schema_foreign_type_info) rather than replace the whole _properties["schema"] when one sets either of these.

else:
return _parse_schema_resource(prop)
elif isinstance(prop, Schema):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be concerning if this were ever true. _properties should always be equivalent to the REST representation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tswast Question for you.

This response relates to setting the schema on the table as a whole and thus applies to several comments in this review:

In general, I see the reasoning behind maintaining the _properties attribute as a dict that is pretty close to the API resource representation.

If the user provides a Schema object based on the Schema class, does it make sense to maintain this as a Schema object until we need to explicitly convert it back into an API resource using .to_api_repr()?

Copy link
Contributor

@tswast tswast Dec 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user provides a Schema object based on the Schema class, does it make sense to maintain this as a Schema object

If so, I wouldn't put it in _properties.

Ideally, the objects we provide would just be a "view" on top of the REST resource (_properties). I wouldn't put anything that requires conversion before sending it to the BigQuery API into such objects.

In some cases, we have to store state outside of _properties, in which case we populate the necessary dictionaries and such in to_api_repr(). The UserList approach may be such a case. If so, then don't store the backing list of SchemaField in _properties. Store it in some other instance variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment still stands. We shouldn't have Schema objects in _properties. Instead, check for the existence of any keys besides "fields" to determine if Schema or list should be used.

return prop
return _parse_schema_resource(prop)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably we want to return a Schema object now.

Suggested change
return _parse_schema_resource(prop)
return Schema.from_api_repr(prop)

Aside: Is _parse_schema_resource dead code? I assume it still returns a list of SchemaField?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_parse_schema_resource still returns a list of SchemaFields for backwards compatibility.

Where possible I -tried- to stay consist about what the user provides versus gets back: if someone handed the library a list of SchemaFields (i.e. cause that is what they are used to) then we maintained that state. I tried to avoid randomly converting lists into Schema objects to avoid user surprise.


@schema.setter
def schema(self, value):
api_field = self._PROPERTY_TO_API_FIELD["schema"]

if value is None:
self._properties[api_field] = None
elif isinstance(value, Schema):
self._properties[api_field] = value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self._properties[api_field] = value
self._properties[api_field] = value.to_api_repr()

else:
value = _to_schema_fields(value)
self._properties[api_field] = {"fields": _build_schema_resource(value)}
Expand Down Expand Up @@ -1394,7 +1398,8 @@ def _row_from_mapping(mapping, schema):
Raises:
ValueError: If schema is empty.
"""
if len(schema) == 0:

if not schema:
raise ValueError(_TABLE_HAS_NO_SCHEMA)

row = []
Expand Down
4 changes: 3 additions & 1 deletion tests/unit/test_client.py
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,7 @@
from google.cloud.bigquery import ParquetOptions
import google.cloud.bigquery.retry
from google.cloud.bigquery.retry import DEFAULT_TIMEOUT
from google.cloud.bigquery.schema import Schema
import google.cloud.bigquery.table

from test_utils.imports import maybe_fail_import
Expand Down Expand Up @@ -2608,7 +2609,8 @@ def test_update_table_w_schema_None(self):
sent = {"schema": None}
self.assertEqual(req[1]["data"], sent)
self.assertEqual(req[1]["path"], "/%s" % path)
self.assertEqual(len(updated_table.schema), 0)
valid_options = [Schema(), [], None]
self.assertIn(updated_table.schema, valid_options)

def test_update_table_delete_property(self):
from google.cloud.bigquery.table import Table
Expand Down
Loading
Loading