feat: adds Schema class and modifies schema handling #2081

Open · wants to merge 25 commits into base: pangea-v1alpha

Changes from 7 commits (25 commits total)

Commits
47474a9
Adds Schema class and modifies schema handling
chalmerlowe Nov 27, 2024
f22246a
Updates some features of schema to handle None more effectively.
chalmerlowe Dec 2, 2024
7670971
Update google/cloud/bigquery/schema.py
chalmerlowe Dec 4, 2024
c37be67
Updates Schema object and class related tests for Schema and Table
chalmerlowe Dec 6, 2024
aba6c10
Updates Schema tests for coverage, fixes serdefinfo test
chalmerlowe Dec 6, 2024
83aacdf
Updates Schema tests
chalmerlowe Dec 6, 2024
7c08ca8
Merge branch 'pangea-v1alpha' into feat-b358215039-update-schema-object
chalmerlowe Dec 6, 2024
af02937
fixes formatting
chalmerlowe Dec 19, 2024
ef2b95d
updates len() checks
chalmerlowe Dec 27, 2024
cfa609c
update tests based on Schema superclass UserList
chalmerlowe Dec 27, 2024
1f00e76
removes whitespace
chalmerlowe Dec 27, 2024
39d4c1d
feat: add property for maxStaleness in table definitions (#2087)
yu-iskw Dec 6, 2024
c4f5fd5
feat: add type hints to Client (#2044)
rinarakaki Dec 10, 2024
7006a31
chore(python): update dependencies in .kokoro/docker/docs (#2088)
gcf-owl-bot[bot] Dec 17, 2024
561f05f
chore(deps): bump jinja2 from 3.1.4 to 3.1.5 in /.kokoro (#2094)
dependabot[bot] Dec 26, 2024
e3f57a6
feat: preserve unknown fields from the REST API representation in `Sc…
tswast Dec 27, 2024
3280e79
Fix: add roundingmode as str test (#2098)
chalmerlowe Jan 2, 2025
1837f81
more rebase and merge conflict resolution
chalmerlowe Jan 2, 2025
94e9e6e
more conflict resolution
chalmerlowe Jan 2, 2025
db6ef9e
update tests based on Schema superclass UserList
chalmerlowe Dec 27, 2024
0b0b85a
Merge branch 'pangea-v1alpha' into feat-b358215039-update-schema-object
chalmerlowe Jan 2, 2025
ba2795b
yet another round of merge conflicts.
chalmerlowe Jan 2, 2025
67c0182
Clean up some failing tests
chalmerlowe Jan 3, 2025
3fa949c
adds tests to ensure code coverage
chalmerlowe Jan 3, 2025
7c77bfb
tweaks tests to increase code coverage
chalmerlowe Jan 3, 2025
4 changes: 2 additions & 2 deletions google/cloud/bigquery/client.py
@@ -3680,7 +3680,7 @@ def insert_rows(
if selected_fields is not None:
schema = selected_fields

if len(schema) == 0:
if not schema:
Contributor:

I'm curious, could you make len(schema_object) work? I guess this is to allow None as schema now, too? If table.schema could indeed return None now, that might be considered a breaking change, depending on how strict we want to be about types.

Either way, folks might have to change their code to deal with missing schema, so something to consider. Maybe we return an object of len(...) == 0 for now and have a warning that it will be changed to None in future?
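For illustration, a minimal sketch of that interim idea (the helper name and warning text here are hypothetical, not part of this PR):

    import warnings

    def _empty_schema_placeholder():
        # Hypothetical helper: keep returning an object with len(...) == 0 for
        # now, while warning that a future release may return None instead.
        warnings.warn(
            "Tables without a schema currently return an empty sequence from "
            "table.schema; a future release may return None instead.",
            FutureWarning,
            stacklevel=2,
        )
        return []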

Contributor:

I dislike the idiom of schema-less tables, but it is indeed something we've seen people do intentionally in the wild. My recollection is that it was used historically as a signal for some workflow state (e.g. some kind of is-table-present check).

Collaborator Author:

I was under the impression that Table.schema can currently be set to None. It appears that we allow for that in the existing code:

https://github.com/googleapis/python-bigquery/blob/d4d39acb8574f0d06d4e490b859e5fe6b57d0d9e/google/cloud/bigquery/table.py#L458C1-L466C84

@schema.setter
def schema(self, value):
    api_field = self._PROPERTY_TO_API_FIELD["schema"]
    if value is None:
        self._properties[api_field] = None
    else:
        value = _to_schema_fields(value)
        self._properties[api_field] = {"fields": _build_schema_resource(value)}

Collaborator Author:

I switched it back to the old check.

raise ValueError(
(
"Could not determine schema for table '{}'. Call client.get_table() "
@@ -4029,7 +4029,7 @@ def list_rows(

# No schema, but no selected_fields. Assume the developer wants all
# columns, so get the table resource for them rather than failing.
elif len(schema) == 0:
elif not schema:
Contributor:

Same here. Let's be careful about breaking changes regarding table.schema allowing None.

Collaborator Author:

I also switched this back to the old check.

table = self.get_table(table.reference, retry=retry, timeout=timeout)
schema = table.schema

116 changes: 95 additions & 21 deletions google/cloud/bigquery/schema.py
@@ -549,33 +549,29 @@ def _build_schema_resource(fields):


def _to_schema_fields(schema):
"""Coerce `schema` to a list of schema field instances.

Args:
schema(Sequence[Union[ \
:class:`~google.cloud.bigquery.schema.SchemaField`, \
Mapping[str, Any] \
]]):
Table schema to convert. If some items are passed as mappings,
their content must be compatible with
:meth:`~google.cloud.bigquery.schema.SchemaField.from_api_repr`.

Returns:
Sequence[:class:`~google.cloud.bigquery.schema.SchemaField`]

Raises:
Exception: If ``schema`` is not a sequence, or if any item in the
sequence is not a :class:`~google.cloud.bigquery.schema.SchemaField`
instance or a compatible mapping representation of the field.
"""TODO docstring
Contributor:

Flagging with a comment just so we don't lose track of this TODO.

Collaborator Author:

Done.

CAST a list of elements to either:
* a Schema object with SchemaFields and an attribute
* a list of SchemaFields but no attribute
"""

for field in schema:
Contributor:

We don't need this check if we get a Schema object. Let's move that above.

Suggested change
- for field in schema:
+ if isinstance(schema, Schema):
+     return schema
+ for field in schema:

Note: I just return schema if given a Schema. See my comment below.

Contributor:

Alternatively, we could call return Schema.from_api_repr(schema.to_api_repr()), but I don't see much reason to do so.

if not isinstance(field, (SchemaField, collections.abc.Mapping)):
raise ValueError(
"Schema items must either be fields or compatible "
"mapping representations."
)

if isinstance(schema, Schema):
schema = Schema(
[
field
if isinstance(field, SchemaField)
else SchemaField.from_api_repr(field)
for field in schema
],
foreign_type_info=schema.foreign_type_info,
)
return schema
Comment on lines +553 to +562
Contributor:

If we get a Schema object, the Schema object is responsible for all conversions. I don't think we need to make a copy of it or anything.

Returning the original object should be sufficient.

Suggested change
- schema = Schema(
-     [
-         field
-         if isinstance(field, SchemaField)
-         else SchemaField.from_api_repr(field)
-         for field in schema
-     ],
-     foreign_type_info=schema.foreign_type_info,
- )
- return schema
+ return schema

This also future-proofs us a bit. In case any additional properties are added to Schema, we won't have to update this code.
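Taken together, the two suggestions would leave _to_schema_fields looking roughly like the sketch below, combining the early return with the existing fallback shown in the diff (illustrative only; SchemaField and collections are already imported in this module):

    def _to_schema_fields(schema):
        # A Schema object is responsible for its own conversions, so return it
        # unchanged rather than copying it.
        if isinstance(schema, Schema):
            return schema

        for field in schema:
            if not isinstance(field, (SchemaField, collections.abc.Mapping)):
                raise ValueError(
                    "Schema items must either be fields or compatible "
                    "mapping representations."
                )

        return [
            field if isinstance(field, SchemaField) else SchemaField.from_api_repr(field)
            for field in schema
        ]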

return [
field if isinstance(field, SchemaField) else SchemaField.from_api_repr(field)
for field in schema
@@ -796,8 +792,6 @@ def serde_info(self) -> Any:
prop = _get_sub_prop(self._properties, ["serDeInfo"])
if prop is not None:
prop = StorageDescriptor().from_api_repr(prop)
print(f"DINOSAUR prop: {prop}")
Contributor:

Lol, oops! 😆 Could be worth a separate cleanup PR to have something small and easy to approve.

Collaborator Author:

Removed.


return prop

@serde_info.setter
@@ -921,3 +915,83 @@ def from_api_repr(cls, resource: dict) -> SerDeInfo:
config = cls("")
config._properties = copy.deepcopy(resource)
return config


class Schema:
# TODO docstrings and type hints
def __init__(self, fields=None, foreign_type_info=None):
tswast marked this conversation as resolved.
self._properties = {}
self._fields = [] if fields is None else list(fields) # Internal List
Contributor:

Do we need to convert the items of fields from SchemaField into dictionaries here? Ideally _properties would directly reflect the REST representation.

Contributor:

Resolving. I missed that _fields was a property with a setter. I think your code does actually convert Iterable of SchemaField or dict to the correct API representation in _properties

Contributor (@tswast, Jan 6, 2025):

Unresolving. I don't think _properties for "fields" makes sense in the context of UserList, because we want to support all sorts of mutations and such. For that, let's do something like this instead:

Suggested change
- self._fields = [] if fields is None else list(fields) # Internal List
+ self.data = _to_schema_fields(fields) # Internal List

And remove the data and _fields properties. Just a data instance variable is needed. That also has the benefit of not documenting data, which isn't really part of our public API.
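As a point of reference, a minimal sketch of the UserList-based shape described here (illustrative only; it assumes _to_schema_fields and SchemaField.to_api_repr from this module):

    import collections

    class Schema(collections.UserList):
        def __init__(self, fields=None, foreign_type_info=None):
            # UserList keeps the items in self.data, so no separate _fields
            # property (or "fields" entry in _properties) is needed.
            super().__init__(_to_schema_fields(fields) if fields is not None else [])
            self.foreign_type_info = foreign_type_info

        def to_api_repr(self) -> dict:
            # Assemble the REST representation from the live list on demand.
            resource = {"fields": [field.to_api_repr() for field in self.data]}
            if self.foreign_type_info is not None:
                resource["foreignTypeInfo"] = self.foreign_type_info
            return resource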

self.foreign_type_info = foreign_type_info

@property
def foreign_type_info(self) -> Any:
"""TODO: docstring"""
return self._properties.get("foreignTypeInfo")

@foreign_type_info.setter
def foreign_type_info(self, value: str) -> None:
value = _isinstance_or_raise(value, str, none_allowed=True)
self._properties["foreignTypeInfo"] = value

@property
def _fields(self) -> Any:
Contributor:

Why private? I guess because we're trying to emulate a list and want to support mutation that way?

Would be worth an explanation.

Collaborator Author:

You are correct, the idea behind having it be private was to try and emulate a list.
I can add a note to that effect when I add docstrings.

"""TODO: docstring"""
return self._properties.get("_fields")

@_fields.setter
def _fields(self, value: list) -> None:
value = _isinstance_or_raise(value, list, none_allowed=True)
self._properties["_fields"] = value
Contributor:

Why _fields? It's fields in the REST API. Also, this should convert the list to its API representation. https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#TableSchema

Collaborator Author:

Changed _fields to fields. Changed tests to match.


def __len__(self):
chalmerlowe marked this conversation as resolved.
return len(self._fields)

def __getitem__(self, index):
return self._fields[index]

def __setitem__(self, index, value):
self._fields[index] = value

def __delitem__(self, index):
del self._fields[index]

def __iter__(self):
return iter(self._fields)

def __str__(self):
return f"Schema({self._fields}, {self.foreign_type_info})"

def __repr__(self):
return f"Schema({self._fields!r}, {self.foreign_type_info!r})"

def append(self, item):
self._fields.append(item)

def extend(self, iterable):
self._fields.extend(iterable)

def to_api_repr(self) -> dict:
"""Build an API representation of this object.

Returns:
Dict[str, Any]:
A dictionary in the format used by the BigQuery API.
"""
return copy.deepcopy(self._properties)
Contributor:

It might be a bit wasteful to make deepcopy here and in from_api_repr. Indeed it's safer, but could add a lot of overhead. IIRC we actually removed some deepcopy calls from SchemaField because it was slowing down customers who build dynamic schemas in their code.

See: #6 and #26
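A no-copy variant of the two methods might look like this (a sketch; callers would then share the underlying resource dict rather than receiving an isolated copy):

    def to_api_repr(self) -> dict:
        # Hand back the stored resource directly instead of deep-copying it.
        return self._properties

    @classmethod
    def from_api_repr(cls, resource: dict) -> "Schema":
        config = cls()
        # Keep a reference to the caller's dict rather than a deep copy.
        config._properties = resource
        return config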


@classmethod
def from_api_repr(cls, resource: dict) -> Schema:
"""Factory: constructs an instance of the class (cls)
given its API representation.

Args:
resource (Dict[str, Any]):
API representation of the object to be instantiated.

Returns:
An instance of the class initialized with data from 'resource'.
"""
config = cls("")
config._properties = copy.deepcopy(resource)
Contributor:

Same here. May be better to omit the deepcopy.

return config
13 changes: 9 additions & 4 deletions google/cloud/bigquery/table.py
@@ -70,6 +70,7 @@
from google.cloud.bigquery.schema import _build_schema_resource
from google.cloud.bigquery.schema import _parse_schema_resource
from google.cloud.bigquery.schema import _to_schema_fields
from google.cloud.bigquery.schema import Schema
from google.cloud.bigquery.external_config import ExternalCatalogTableOptions

if typing.TYPE_CHECKING: # pragma: NO COVER
@@ -452,17 +453,20 @@ def schema(self):
instance or a compatible mapping representation of the field.
"""
prop = self._properties.get(self._PROPERTY_TO_API_FIELD["schema"])
if not prop:
if not prop: # if empty Schema, empty list, None
return []
Contributor:

Could we return an "empty" Schema here for consistency?

Suggested change
- return []
+ return Schema()

Collaborator Author:

@tswast

Does the following influence your thoughts on this comment?

Historically, we have returned an empty list.
An existing user would still expect a list.
A new user should probably not see an empty Schema object, since one is only needed if the user is going to access an external data source (where they have to identify the foreign_type_info).

My general principle, focused on backwards compatibility, is:

  • avoid as many changes to what the code did before, unless necessary
  • as much as possible, force the choice to move to Schema to be an explicit decision by any user (especially since external data sources are likely to be the exception, not the rule).

Contributor:

I could go either way on this one. On one hand, using Schema everywhere would make the new Schema class more discoverable and leave fewer cases to consider when type checking. On the other hand, providing a real list in all cases where Schema is not needed would be the most backwards compatible option.

On further consideration, I agree with your assessment that avoiding breaking existing users is preferable. +1 on empty list in this case.

Collaborator Author (@chalmerlowe, Jan 6, 2025):

@tswast

This is a related concept.
Right now, if someone gives me a Schema object, I have been storing that in _properties and I see your point about avoiding that.

That complicates getting a Schema object back out, if/when you need it.

What are your thoughts on the following approach? (I seem to recall you mention keeping objects in a separate place and ensuring that _properties is a view into the object)


    @schema.setter
    def schema(self, value):
        api_field = self._PROPERTY_TO_API_FIELD["schema"]

        if value is None:
            self._properties[api_field] = None
        elif isinstance(value, Schema):
            # store original as self.<original_schema_object>  # TODO find a better name
            self.<original_schema_object> = value
            self._properties[api_field] = self.<original_schema_object>.to_api_repr()
        else:
            value = _to_schema_fields(value)
            self._properties[api_field] = {"fields": _build_schema_resource(value)}


    @property
    def schema(self):
        # if user stored a Schema object, short circuit and return early.
        original_schema_object = getattr(self, "<original_schema_object>", None)
        if isinstance(original_schema_object, Schema):
            return original_schema_object

        # if user did not store a Schema object, behave as normal
        prop = self._properties.get(self._PROPERTY_TO_API_FIELD["schema"])
        if not prop: # either None OR empty list  
            return []
        
        return _parse_schema_resource(prop)

Contributor:

My worry with self.<original_schema_object> = value is that it can get out of sync with the values stored in _properties, especially since Schema supports in-place mutation.

Also, there are other cases to consider besides "someone gave us a Schema object". Importantly, what if someone calls get_table("proj.dataset.some_table_with_pangea_features_enabled")? IMO this case is one where we should return a Schema object.

Alternatively, we introduce separate properties for the pangea fields to Table and forget all this Schema class business. Something like Table.schema_foreign_type_info could work. I believe we already do similar for certain statistics on job, for example. The only gotcha here is that we'd need to make sure we merge these two properties (schema and schema_foreign_type_info) rather than replace the whole _properties["schema"] when one sets either of these.
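A rough sketch of that alternative on Table (the property name schema_foreign_type_info is hypothetical; note the merge into the existing schema resource rather than replacing it):

    @property
    def schema_foreign_type_info(self):
        # Read from the same "schema" sub-resource that the field list lives in.
        prop = self._properties.get(self._PROPERTY_TO_API_FIELD["schema"]) or {}
        return prop.get("foreignTypeInfo")

    @schema_foreign_type_info.setter
    def schema_foreign_type_info(self, value):
        api_field = self._PROPERTY_TO_API_FIELD["schema"]
        # Merge into the existing schema resource so setting this property does
        # not clobber "fields" (and vice versa).
        resource = self._properties.get(api_field) or {}
        resource["foreignTypeInfo"] = value
        self._properties[api_field] = resource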

else:
return _parse_schema_resource(prop)
elif isinstance(prop, Schema):
Contributor:

This would be concerning if this were ever true. _properties should always be equivalent to the REST representation.

Collaborator Author:

@tswast Question for you.

This response relates to setting the schema on the table as a whole and thus applies to several comments in this review:

In general, I see the reasoning behind maintaining the _properties attribute as a dict that is pretty close to the API resource representation.

If the user provides a Schema object based on the Schema class, does it make sense to maintain this as a Schema object until we need to explicitly convert it back into an API resource using .to_api_repr()?

Contributor (@tswast, Dec 27, 2024):

If the user provides a Schema object based on the Schema class, does it make sense to maintain this as a Schema object

If so, I wouldn't put it in _properties.

Ideally, the objects we provide would just be a "view" on top of the REST resource (_properties). I wouldn't put anything that requires conversion before sending it to the BigQuery API into such objects.

In some cases, we have to store state outside of _properties, in which case we populate the necessary dictionaries and such in to_api_repr(). The UserList approach may be such a case. If so, then don't store the backing list of SchemaField in _properties. Store it in some other instance variable.

Contributor:

This comment still stands. We shouldn't have Schema objects in _properties. Instead, check for the existence of any keys besides "fields" to determine if Schema or list should be used.
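That check could look roughly like this in the getter (sketch only; Schema, _parse_schema_resource, and _PROPERTY_TO_API_FIELD are the existing names in this codebase):

    prop = self._properties.get(self._PROPERTY_TO_API_FIELD["schema"])
    if not prop:
        return []
    # Anything beyond plain "fields" (e.g. "foreignTypeInfo") signals that the
    # richer Schema object is needed; otherwise keep returning a plain list of
    # SchemaField for backwards compatibility.
    if any(key != "fields" for key in prop):
        return Schema.from_api_repr(prop)
    return _parse_schema_resource(prop)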

return prop
return _parse_schema_resource(prop)
Contributor:

Presumably we want to return a Schema object now.

Suggested change
- return _parse_schema_resource(prop)
+ return Schema.from_api_repr(prop)

Aside: Is _parse_schema_resource dead code? I assume it still returns a list of SchemaField?

Collaborator Author:

_parse_schema_resource still returns a list of SchemaFields for backwards compatibility.

Where possible I tried to stay consistent about what the user provides versus gets back: if someone handed the library a list of SchemaFields (i.e. because that is what they are used to) then we maintained that state. I tried to avoid randomly converting lists into Schema objects to avoid user surprise.


@schema.setter
def schema(self, value):
api_field = self._PROPERTY_TO_API_FIELD["schema"]

if value is None:
self._properties[api_field] = None
elif isinstance(value, Schema):
self._properties[api_field] = value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- self._properties[api_field] = value
+ self._properties[api_field] = value.to_api_repr()

else:
value = _to_schema_fields(value)
self._properties[api_field] = {"fields": _build_schema_resource(value)}
@@ -1359,7 +1363,8 @@ def _row_from_mapping(mapping, schema):
Raises:
ValueError: If schema is empty.
"""
if len(schema) == 0:

if not schema:
raise ValueError(_TABLE_HAS_NO_SCHEMA)

row = []
4 changes: 3 additions & 1 deletion tests/unit/test_client.py
@@ -60,6 +60,7 @@
from google.cloud.bigquery import ParquetOptions
import google.cloud.bigquery.retry
from google.cloud.bigquery.retry import DEFAULT_TIMEOUT
from google.cloud.bigquery.schema import Schema
import google.cloud.bigquery.table

from test_utils.imports import maybe_fail_import
@@ -2608,7 +2609,8 @@ def test_update_table_w_schema_None(self):
sent = {"schema": None}
self.assertEqual(req[1]["data"], sent)
self.assertEqual(req[1]["path"], "/%s" % path)
self.assertEqual(len(updated_table.schema), 0)
valid_options = [Schema(), [], None]
self.assertIn(updated_table.schema, valid_options)

def test_update_table_delete_property(self):
from google.cloud.bigquery.table import Table